I think you'll find that on that card, most models that come close to filling its 16 GB of memory will still be more than fast enough for chat. You're in the happy position where your requirements would have to get steeper before you'd need faster hardware! :D
Ollama is the easiest way to get started trying things out IMO: https://ollama.com/
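For example, once it's running and you've pulled a model (e.g. ollama pull llama3), it serves a local HTTP API on port 11434 that any client can hit. A minimal sketch in Python (the model name is just whatever you pulled, nothing special):

    # Minimal chat call against a local Ollama server (default port 11434).
    # Assumes you've already done `ollama pull llama3` -- swap in your own model.
    # pip install requests
    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "stream": False,  # one JSON response instead of a token stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["message"]["content"])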
I found LM Studio so much easier than ollama, given it has a UI: https://lmstudio.ai/ Did you know about LM Studio? Why is ollama still recommended when it's just a CLI with worse UX?
I recommended ollama because IMO that is the easiest way to get started (as I said).
LM Studio is closed source.
Any FOSS solutions that let you browse models and guesstimate whether you have enough VRAM to fully load them? That's the only selling point of LM Studio for me.
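In the meantime, the back-of-the-envelope check I do by hand is just weights ≈ parameter count × bits-per-weight / 8, plus a couple of GB for KV cache and overhead. Very rough sketch (all the constants are my own guesses, nothing authoritative):

    # Rough heuristic: will a quantized model plausibly fit on the card?
    # The overhead number is a guess for KV cache / runtime context, not a measurement.
    def fits_in_vram(params_billions, bits_per_weight, vram_gb, overhead_gb=2.0):
        weights_gb = params_billions * bits_per_weight / 8  # billions of params * bytes/param ~= GB
        needed_gb = weights_gb + overhead_gb
        print(f"{params_billions:g}B @ {bits_per_weight}-bit ~ {needed_gb:.1f} GB needed")
        return needed_gb <= vram_gb

    print(fits_in_vram(13, 4.5, vram_gb=16))  # ~9.3 GB  -> True
    print(fits_in_vram(70, 4.5, vram_gb=16))  # ~41.4 GB -> False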
Ollama's default context length is frustratingly short in the era of 100k+ context windows.
My solution so far has been to boot up LM Studio to check whether a model will work well on my machine, manually download the model myself through huggingface, run llama.cpp, and hook it up to open-webui. That's less than ideal, and it means LM Studio's proprietary code has access to my machine specs.
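At least the download step is scriptable with huggingface_hub; a sketch of how I do it (the repo and filename are placeholders, and the llama-server flags are from memory, so check them against your llama.cpp build):

    # Fetch a GGUF from Hugging Face, then point llama.cpp's server at it.
    # Repo and filename are placeholders -- substitute the model you actually want.
    # pip install huggingface_hub
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="TheBloke/SomeModel-GGUF",   # placeholder repo
        filename="somemodel.Q4_K_M.gguf",    # placeholder quant file
    )
    print(path)  # local cache path to hand to llama.cpp, roughly:
    # llama-server -m <that path> -c 8192 --port 8080
    # then point open-webui (or any OpenAI-compatible client) at it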
> Ollama's default context length is frustratingly short in the era of 100k+ context windows.
Nobody uses Ollama as is. It's a model server. In clients you can specify the proper context lengths. This has never been a problem.
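For example, with Ollama's HTTP API you can pass the context length per request through the options field (or bake it into a Modelfile with PARAMETER num_ctx). Rough sketch, assuming the model actually supports the window you ask for:

    # Override Ollama's default context length per request via options.num_ctx.
    # (Alternatively, create a derived model with a Modelfile: PARAMETER num_ctx 32768)
    # pip install requests
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",               # whatever model you've pulled
            "prompt": "Hello!",
            "options": {"num_ctx": 32768},   # ask for a 32k-token context window
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(resp.json()["response"])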
For sure, though it's tripped me up a few times with clients that don't pass a reasonable context length with each call.
https://huggingface.co/docs/accelerate/v0.32.0/en/usage_guid...
Thanks! That's really helpful.
And I think LM Studio has restrictions on commercial use.