Choose local GGUF models for desktop AI agents.
A practical decision framework for picking a local GGUF model that actually holds up under agent workloads on Mac or Windows. It covers quantization, RAM, VRAM, context length, instruction following, and a quick test routine you can repeat for every new model.
What is GGUF, exactly?
GGUF is the file format used by llama.cpp and the runtimes built on top of it (Ollama, LM Studio, GPT4All, llamafile). A single .gguf file contains the quantized weights, the tokenizer, the prompt template, and a small amount of metadata. That is why a Q4_K_M Mistral 7B from one tool just works in another.
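To make the "single file" point concrete, here is a minimal sketch that reads the fixed header at the start of a .gguf file. It assumes the version 2 or later layout (a 4-byte `GGUF` magic, a little-endian 32-bit version, then 64-bit tensor and metadata counts); the function name `read_gguf_header` is made up for this illustration and is not part of any runtime.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the v2+ header layout, where both counts are 64-bit integers.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        if version < 2:
            # v1 stored the counts as 32-bit values, so bail out instead of guessing.
            raise ValueError("GGUF v1 header layout is not handled by this sketch")
        tensor_count, metadata_kv_count = struct.unpack("<QQ", f.read(16))
    return version, tensor_count, metadata_kv_count
```

Calling it on a downloaded model is a cheap sanity check that the file is a real GGUF and not a truncated download or an HTML error page saved under a .gguf name.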
1. Start with the task, not the leaderboard
The biggest mistake in local model selection is reaching for the top of a benchmark leaderboard. For agent work, leaderboards rarely measure what matters: tool-call reliability, structured output, refusal calibration, and steady latency.
Pick the task category first:
- Chat & summarisation. Any current 7B instruction-tuned model.
- Code reasoning. Qwen 2.5 Coder, Llama 3.1, DeepSeek Coder.
- Tool use / agents. Qwen 2.5 7B/14B Instruct, Llama 3.1 8B, larger if RAM allows.
- Long-document Q&A. Models with proven 32K+ context, e.g. Qwen 2.5 with extended context.
2. Match memory to the machine
Use these as fast estimates for Q4_K_M quantization, which is the most common sweet spot:
| Model size | RAM needed | Typical fit |
|---|---|---|
| 3B | ~3 GB | Light chat, classification, fast subagents |
| 7B / 8B | ~5-6 GB | General-purpose agent on 16 GB machines |
| 13B / 14B | ~9-10 GB | Stronger reasoning on 32 GB machines |
| 32B | ~20 GB | Workstations, 64 GB systems |
| 70B | ~40-45 GB | Apple Silicon with unified memory or multi-GPU |
Leave headroom. Agents load files, screenshots, and tool outputs into the context window, and every token of context adds to the KV cache, which sits in memory next to the weights.
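The table and the headroom advice can be turned into a rough back-of-envelope calculation. The sketch below is an estimate only: the 4.85 bits per weight figure for Q4_K_M is an approximation, the cache is assumed to be stored as 16-bit values, and the example architecture numbers (32 layers, 8 KV heads, head dimension 128, about 8.03 billion parameters) are assumed to describe a Llama 3.1 8B class model. Runtime overhead is not included.

```python
GIB = 1024 ** 3


def weights_gib(n_params, bits_per_weight=4.85):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in GiB.

    Each layer stores one key and one value vector per KV head for every token,
    hence the leading factor of 2.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / GIB


# Assumed figures for an 8B model at an 8K context window.
w = weights_gib(8.03e9)
kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
print(f"weights ~{w:.1f} GiB, KV cache {kv:.2f} GiB")
```

Under these assumptions the weights come to about 4.5 GiB and the cache to 1 GiB at 8K tokens. The cache term grows linearly with context length, which is why a long-document agent needs noticeably more memory than the table alone suggests.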
3. Pick the right quantization
Quantization compresses weights at the cost of some fidelity. Common GGUF quants and when to use them:
- Q4_K_M. Default. The smallest of the common quants that still keeps quality loss low. Use unless you have a reason not to.
- Q5_K_M. Slightly larger, slightly more accurate. Worth it if RAM is plentiful.
- Q6_K. Closer to fp16 quality. Good for code or math-heavy agents.
- Q8_0. Near-lossless. Use on workstations only.
- IQ2/IQ3 (imatrix). Aggressive compression. Useful when the alternative is "doesn't fit."
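One way to apply the list above is to pick the highest-fidelity quant whose weights fit a memory budget. This is a sketch under stated assumptions: the bits-per-weight values are approximations (they vary slightly by model), the 2 GB reserve for context and runtime overhead is an arbitrary placeholder, and `pick_quant` is a hypothetical helper, not part of llama.cpp or any runtime.

```python
# Approximate bits per weight, ordered from highest to lowest fidelity.
APPROX_BITS_PER_WEIGHT = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.85),
]


def pick_quant(n_params, budget_gb, reserve_gb=2.0):
    """Return the highest-fidelity quant that fits, or None if nothing fits.

    budget_gb is the memory you are willing to give the model;
    reserve_gb is held back for the KV cache and runtime overhead.
    """
    usable_bytes = (budget_gb - reserve_gb) * 1e9
    for name, bits in APPROX_BITS_PER_WEIGHT:
        if n_params * bits / 8 <= usable_bytes:
            return name
    return None


print(pick_quant(7e9, budget_gb=16))   # a 7B model with 16 GB to spare
print(pick_quant(14e9, budget_gb=11))  # a 14B model on a tighter budget
print(pick_quant(70e9, budget_gb=16))  # nothing in the list fits
```

A `None` result is the "doesn't fit" case from the list: either drop to an IQ2/IQ3 quant or choose a smaller model.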
4. Test instruction following before trusting tools
Before enabling any tools or MCP servers, run a five-minute checklist:
- Structured output. Ask for a JSON object with three fields, then four. Does the model produce valid JSON every time?
- Stop conditions. Tell the model to reply with a single sentence and stop. Does it actually stop?
- Honest gaps. Ask about a fact it cannot know. Does it admit uncertainty or hallucinate?
- Tool grammar. Give it a simple tool schema and ask it to call the tool. Does it follow the format?
- Refusal calibration. Ask something benign that small models often over-refuse. Does it cooperate?
If the model fails any of these on simple prompts, it will fail harder once you add real tools and files.
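Two of the checks above, structured output and stop conditions, are easy to automate. The sketch below is illustrative only: `ask` stands for whatever function sends a prompt to your local model and returns its text reply, and the helper names are invented for this example. The JSON check is deliberately strict, so a reply wrapped in a markdown code fence counts as a failure.

```python
import json
import re


def is_valid_json_object(text, required_keys):
    """True if text parses as a JSON object with exactly the required keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(obj) == set(required_keys)


def is_single_sentence(text):
    """True if text ends with a terminator and contains only one sentence end."""
    stripped = text.strip()
    if not stripped or stripped[-1] not in ".!?":
        return False
    # A terminator counts only when followed by whitespace or the end of the text,
    # so decimals such as 3.14 are not treated as sentence breaks.
    return len(re.findall(r"[.!?](?:\s|$)", stripped)) == 1


def pass_rate(ask, prompt, check, runs=5):
    """Send the same prompt several times and return the fraction of passes."""
    return sum(bool(check(ask(prompt))) for _ in range(runs)) / runs
```

Running each prompt several times matters because "every time" is the bar: a model that returns valid JSON four times out of five will still break an agent loop.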
5. Keep an API fallback
Local-first does not mean local-only. The best desktop agent setups treat the local model as the default and the API model as the fallback. In MultiAgentOS, route the bulk of work through the local connection and switch to OpenAI or Anthropic for the few prompts that genuinely need a frontier model.
See "Add an OpenAI API key to a desktop AI agent" for the routing setup.
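The local-default, API-fallback pattern can be sketched in a few lines. This is not MultiAgentOS's routing API; `local`, `fallback`, and `accept` are hypothetical callables you would supply yourself (for example, a function that calls your local server, one that calls a hosted model, and a validator for the reply).

```python
def route(prompt, local, fallback, accept):
    """Try the local model first; use the fallback if it errors or is rejected.

    Returns a (source, answer) pair so callers can log how often the
    fallback was actually needed.
    """
    try:
        answer = local(prompt)
        if accept(answer):
            return "local", answer
    except Exception:
        # Treat any local failure (server down, timeout, bad reply) as a miss.
        pass
    return "fallback", fallback(prompt)
```

Logging the returned source over a week of real use gives a simple measure of whether the local model is carrying the bulk of the work or whether a different local model is worth testing.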
6. Record a baseline
Whenever a model "works", write down:
- Model name and version.
- Quantization.
- Context length used.
- Endpoint or runtime (Ollama, LM Studio, or a local GGUF file path).
- The exact five-prompt test you ran.
Six months later, when you swap a model and something feels worse, that baseline saves hours of guessing.
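A baseline is easiest to keep if it is written in a fixed shape. The sketch below stores the fields listed above as one JSON line per record; the class and field names are illustrative choices, and the example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Baseline:
    model: str
    quantization: str
    context_length: int
    endpoint: str
    test_prompts: list = field(default_factory=list)


def to_json_line(baseline):
    """Serialize a baseline as a single JSON line, suitable for appending to a log."""
    return json.dumps(asdict(baseline), sort_keys=True)


record = Baseline(
    model="Qwen 2.5 7B Instruct",
    quantization="Q4_K_M",
    context_length=8192,
    endpoint="Ollama",
    test_prompts=["Return a JSON object with three fields."],
)
print(to_json_line(record))
```

Appending one line per working configuration to a plain text file is enough; the point is that the exact prompts travel with the model details.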
Recommended starting models (May 2026)
- Llama 3.1 8B Instruct — broad, well-known, strong tool following.
- Qwen 2.5 7B Instruct — currently the most reliable small tool-use model.
- Mistral 7B Instruct v0.3 — fast, low-RAM, classic agent fit.
- Qwen 2.5 Coder 14B — best small code-reasoning model that fits on 32 GB.
- DeepSeek V2 Lite (when GGUF is current) — efficient mixture-of-experts.
Hugging Face's GGUF tag is the most up-to-date directory — search there, then verify hash, license, and quant before downloading.
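Verifying the hash can be done with the standard library alone. The sketch below assumes the model page publishes a SHA-256 digest for the file; it reads the file in chunks so that multi-gigabyte models do not have to fit in memory. The function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with the published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A mismatch usually means an interrupted download, so re-fetch the file before suspecting anything worse.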
Related
- Set up Ollama for local AI agents — how to load a GGUF via Ollama.
- Connect MCP tools to desktop AI agents — extend a local model with real tools.
- Best local LLM models in 2026 — current recommendations.
- MultiAgentOS vs LM Studio — when each fits.