From Ollama to llama.cpp: running Claude Code locally with Qwen 3.6 on a 2021 MacBook Pro
A few months ago I wrote about running Claude Code locally with Ollama as a free alternative to the Anthropic API. That setup still works. This post is about the next step: swapping Ollama for
llama.cpp, swapping `gpt-oss:20b` for Qwen 3.6 27B, and swapping the Claude Code harness itself for Pi, a minimal MIT-licensed terminal coding agent. The result, on a 2021 M1 Max MacBook Pro with 32 GB of RAM, is a private coding stack that feels close enough to a cloud agent that I stopped reaching for the cloud one by default. This guide covers why the move from Ollama to llama.cpp is worth making in April 2026, how to set it up, and the real tradeoffs.
Why Move From Ollama to llama.cpp?
Ollama is excellent. It hides the messy parts (model formats, quantization, memory allocation) behind one command, and since v0.14 it speaks the Anthropic Messages API natively. For most readers of the previous post, it remains the right starting point.
So why move?
Three reasons converged in April 2026:
- Qwen 3.6 27B was released on 22 April 2026 under Apache 2.0, and it is the first open-weight model where I genuinely stopped noticing I was on local inference for ordinary coding work. It scores 77.2% on SWE-bench Verified, which is higher than its much larger 397B predecessor. On a 32 GB M1 Max, the Q4_K_M quantization fits comfortably with room for a real context window.
- Qwen 3.6 GGUFs do not currently run in Ollama. The model ships with a separate `mmproj` vision file that Ollama's loader does not yet handle. llama.cpp loads it directly. If you want this specific model today, the choice is made for you.
- llama.cpp's `llama-server` gained native Anthropic Messages API support (merged in PR #17570). It handles `/v1/messages`, streaming via SSE, token counting, tool use via `--jinja`, and vision inputs. Anything that points at the Anthropic API can point at it.
The general pattern: Ollama is the comfortable on-ramp. llama.cpp is what you graduate to when you want a specific model, more knobs, or fewer layers between you and the inference.
| | Ollama | llama.cpp |
|---|---|---|
| Setup | One installer, one command | Build from source (recommended on macOS) |
| Model format | Curated registry | Any GGUF on Hugging Face |
| Anthropic API | Native since v0.14 | Native since PR #17570 |
| Tool calling | Default-on for supported models | Requires --jinja flag |
| Qwen 3.6 27B | Not yet supported | Supported day one |
| Best fit | Getting started, multiple models | One model you care about, tuned hard |
What This Stack Looks Like
Three pieces, each replaceable:
- llama.cpp: the inference engine. Open source, MIT-licensed, built around the GGUF model format. Runs on Apple Silicon via Metal, on NVIDIA GPUs via CUDA, and on CPU as a fallback. Ships a binary called `llama-server` that exposes an HTTP API.
- Qwen 3.6 27B (Q4_K_M): the model. Apache 2.0, ~16.8 GB on disk, ~18 GB resident in memory at modest context. Strong on coding tasks, multilingual, with native tool-calling support. Q4_K_M is a 4-bit quantization: the original model weights are compressed to about a quarter of their full-precision size with a small accuracy cost. GGUF is the file format llama.cpp uses to ship those weights.
- Pi: the coding harness. A minimal terminal coding agent by Mario Zechner. MIT-licensed, single binary, no permission popups, full filesystem access. The Claude Code equivalent in spirit; about a tenth the surface area.
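The size arithmetic behind that "~16.8 GB on disk" figure is worth seeing once. The ~5.0 bits-per-weight average for Q4_K_M below is my assumption (4-bit blocks carry per-block scales, and a few tensors stay at higher precision), not a number from this post:

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk model size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

full_fp16 = quantized_size_gb(27e9, 16)  # full precision: ~54 GB, no chance on a 32 GB machine
q4_k_m = quantized_size_gb(27e9, 5.0)    # ~16.9 GB with the assumed bits-per-weight

print(f"fp16: {full_fp16:.1f} GB, Q4_K_M: {q4_k_m:.1f} GB")
```

The same arithmetic explains why the 4-bit quantization is roughly "a quarter" of full precision: 5 bits versus 16 bits per weight, plus a little overhead.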
Claude Code itself still works as the harness if you prefer it. Point it at llama-server exactly as in the Ollama post. I switched to Pi for reasons worth their own section.
Why Pi Instead of Claude Code This Time
The previous post used Claude Code, and that was the right call at the time. Claude Code is polished, well-instrumented, and ships with everything most developers expect from a modern coding agent: sub-agents, MCP, permission prompts, plan mode, built-in to-dos, background bash. On a fast cloud model where capability is abundant, those features are mostly free. On a local 27B model where every token of system prompt and every extra tool call eats real wall-clock time, the same features become a tax.
Pi takes the opposite stance. Mario Zechner's design notes (philosophy, original write-up) make three arguments that matter especially to a local-inference setup:
- Minimal system prompt. Pi's full system prompt plus tool definitions fits in under 1,000 tokens. The argument is that frontier models have been "RL-trained up the wazoo" and already know what a coding agent is, so reciting the obvious back at them is wasted context. On Opus 4.7 the difference is invisible. On a local 27B at 32K context, every saved token is real.
- Minimal toolset. Pi exposes four tools: `read`, `write`, `edit`, `bash`. The claim is that this is what models were trained against, and additional specialized tools add surface area without proportional benefit. Fewer tool schemas means smaller prompts and fewer ways for the model to wander.
- Minimal agent scaffold. No `max_steps`, no built-in to-do system ("they confuse models"), no plan mode, no sub-agents, no permission popups, no MCP, no background bash. Anything you want, you build as a Pi extension or a Skill (a directory with a README the agent reads on demand). The core stays small; the workflow stays yours.
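Since a Skill is just a directory with a README, scaffolding one is trivial. The sketch below assumes a `.pi/skills/<name>/README.md` layout for illustration; check Pi's documentation for the actual location and filename it expects:

```python
from pathlib import Path
import tempfile

def create_skill(project_root: str, name: str, instructions: str) -> Path:
    # A Skill, per Pi's design notes, is a directory with a README the agent
    # reads on demand. The .pi/skills/ path and README.md filename here are
    # illustrative guesses, not Pi's documented layout.
    skill_dir = Path(project_root) / ".pi" / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    readme = skill_dir / "README.md"
    readme.write_text(f"# Skill: {name}\n\n{instructions}\n")
    return readme

demo_root = tempfile.mkdtemp()  # throwaway dir for the demo
readme = create_skill(demo_root, "release", "Steps to cut and tag a release of this project.")
print(readme.read_text())
```

The point is the shape of the mechanism: workflow knowledge lives in files the agent can read, not in the agent's core.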
The slogan: Pi is aggressively extensible so it doesn't have to dictate your workflow.
Side-by-side:
| | Claude Code | Pi |
|---|---|---|
| License | Proprietary (Anthropic) | MIT, open source |
| Source availability | Closed; the obfuscated source has been decompiled and circulated publicly | Open since day one; has never leaked, on account of nothing to leak |
| System prompt + tool schemas | Several thousand tokens | Under 1,000 tokens |
| Core tools | A dozen-plus, including specialized ones | Four (read, write, edit, bash) |
| Sub-agents | Built in | Not built in (spawn Pi via tmux, or write an extension) |
| Plan mode | Built in | Not built in (write plans to a file) |
| Built-in to-dos | Yes | No (use a TODO.md) |
| Permission popups | Yes, by default | No (run inside a sandbox if you need them) |
| MCP support | Built in | Not built in (add via extension) |
| Background bash | Yes | No (use tmux for full observability) |
| Best fit | Cloud-hosted frontier model, broad team, default-safe ergonomics | Local model, single operator, tuning for token economy and observability |
The trade is honest. Pi gives up safety rails and built-in features in exchange for a smaller footprint, more control, and a codebase you can read in an afternoon. On a cloud model that is a wash. On Qwen 3.6 27B running locally, the smaller footprint is exactly the resource that was scarce, and the effect on responsiveness is the thing that made the stack feel calm rather than effortful.
If you want the Claude Code ergonomics with this server, that path is still in Step 5, Option B. If you want the lightest possible loop between you and a local model, Pi is the better fit.
Cost Comparison
| Setup | Monthly cost | Notes |
|---|---|---|
| Claude Code + Opus 4.7 API | ~$50 to $200+ | Scales with token usage |
| Claude Max subscription | $100 to $200 | Flat rate |
| Claude Code + Ollama (local) | $0 | Covered in the previous post |
| Pi + llama.cpp + Qwen 3.6 27B (this post) | $0 | Electricity only |
Annual savings remain in the $600 to $2,400 range. The thing that changed in April 2026 is not the price; it is that the local model is finally good enough that the cheaper option is also a reasonable option.
Requirements
| Component | Specification |
|---|---|
| OS | macOS 14+ (tested on macOS 26 "Tahoe", M1 Max). Linux and Windows also supported. |
| Chip | Apple Silicon recommended for unified memory; NVIDIA GPU also fine |
| RAM | 24 GB minimum for Qwen 3.6 27B Q4_K_M; 32 GB comfortable |
| Disk | ~25 GB free for the model and build artifacts |
| Tooling | Xcode Command Line Tools, Homebrew, cmake, aria2c (recommended for fast downloads) |
| Internet | Required for the initial model and source downloads only |
For machines with 16 GB of RAM, stay on the Ollama setup with gpt-oss:20b or qwen3-coder:14b. Qwen 3.6 27B is the wrong model for that tier; it will swap aggressively and feel worse than a smaller model running cleanly.
Step 1: Install Build Tools and Dependencies
On macOS, use Homebrew. The Homebrew llama.cpp formula trails upstream and has shipped with broken Metal builds during fast-moving release windows; building from source is more reliable.
```shell
xcode-select --install
brew install cmake aria2 git
```
aria2 (`aria2c`) is optional but cuts a 16 GB model download from tens of minutes to a few minutes. Hugging Face's CLI works too; its flag layout has changed twice in the last year, so I no longer rely on it.
Step 2: Build llama.cpp From Source
```shell
git clone https://github.com/ggml-org/llama.cpp ~/src/llama.cpp
cd ~/src/llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j
```
The Metal flag enables GPU-accelerated inference on Apple Silicon. On an M1 Max this build takes about three minutes.
If you hit an OpenSSL header conflict during the build, it almost always means a stray openssl@1.1 is on your PATH ahead of the system or openssl@3 Homebrew copy. Either remove the older one or prepend the correct prefix to CPATH and LIBRARY_PATH.
Verify:
```shell
./build/bin/llama-server --version
```
You should see a version string and a build date. Add ~/src/llama.cpp/build/bin to your PATH if you want llama-server available globally.
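A quick way to confirm the PATH change took effect is a lookup with Python's standard library; this is just `which` in script form:

```python
import shutil

# shutil.which returns the absolute path of the first matching executable
# on PATH, or None if the shell cannot see it yet.
location = shutil.which("llama-server")
print(location or "llama-server not on PATH -- add ~/src/llama.cpp/build/bin")
```

If this prints the fallback message, open a new shell after editing your PATH, or call the binary by its full path as the quick-reference section at the end does.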
Step 3: Download Qwen 3.6 27B
Use one of the community Q4_K_M quantizations from Hugging Face. Unsloth's repository is current and well-maintained:
```shell
mkdir -p ~/models/qwen3.6-27b
cd ~/models/qwen3.6-27b
aria2c -x 8 -s 8 \
  https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf
```
Two quick traps to avoid:
- `aria2c` writes to the current directory by default. Use `-d <path>` if you want it elsewhere; do not assume `-o` controls the directory.
- Some repositories ship the model as multiple shards (`*-00001-of-00002.gguf`, etc.). Download all shards to the same directory; llama.cpp will assemble them automatically when given the first shard.
You should end up with a single ~16.8 GB .gguf file (or a small set of shards summing to that).
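A small sanity check covers both cases, single file and shards, by summing whatever GGUF files landed in the directory. The path in the commented-out line assumes the download location used above:

```python
from pathlib import Path

def total_gguf_gb(model_dir: str) -> float:
    # Sum the sizes of all GGUF files (one file, or several shards) in GB.
    files = sorted(Path(model_dir).expanduser().glob("*.gguf"))
    if not files:
        raise FileNotFoundError(f"no .gguf files in {model_dir}")
    return sum(f.stat().st_size for f in files) / 1e9

# After the download finishes, expect roughly 16.8:
# print(f"{total_gguf_gb('~/models/qwen3.6-27b'):.1f} GB")
```

A total far below 16 GB usually means an interrupted download or a missing shard; re-run `aria2c` before blaming the server setup.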
Step 4: Run llama-server With the Anthropic API Enabled
```shell
llama-server \
  --model ~/models/qwen3.6-27b/Qwen3.6-27B-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --jinja \
  --alias qwen3.6-27b
```
What each flag does:
- `--ctx-size 32768`: 32K tokens of context. Comfortable on 32 GB. Drop to 16K on 24 GB. Larger context costs RAM linearly.
- `--n-gpu-layers 999`: push every layer onto Metal. On Apple Silicon's unified memory there is no penalty for this; on discrete GPUs, dial it down if you run out of VRAM.
- `--jinja`: required for tool calling. Without it, Claude Code and Pi will receive malformed tool-call schemas and silently fail.
- `--alias qwen3.6-27b`: what the model will call itself when clients ask. This is the string you pass as `--model` from the client side.
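The "costs RAM linearly" claim comes from the KV cache: the server stores one key vector and one value vector per layer per token of context. The sketch below shows the shape of that cost; every architecture number in the defaults is a placeholder, not Qwen 3.6's real configuration:

```python
def kv_cache_bytes(ctx: int, n_layers: int = 60, n_kv_heads: int = 4,
                   head_dim: int = 128, bytes_per_el: int = 2) -> int:
    # Factor of 2 = one key vector plus one value vector per layer per token.
    # The layer/head/dim defaults are placeholders; only the linear
    # dependence on ctx is the point.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el * ctx

print(f"~{kv_cache_bytes(32768) / 1e9:.1f} GB of KV cache at 32K (placeholder config)")
print(f"~{kv_cache_bytes(16384) / 1e9:.1f} GB at 16K -- exactly half")
```

This is why halving `--ctx-size` is the first lever to pull on a 24 GB machine: it halves the cache, while the model weights stay fixed.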
The Anthropic Messages endpoint will be live at http://127.0.0.1:8080/v1/messages. Confirm with:
```shell
curl -s http://127.0.0.1:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-27b","max_tokens":64,"messages":[{"role":"user","content":"hello"}]}'
```
A JSON response with a content array means everything below this layer is working.
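For reference, this is the shape a healthy reply takes. The sample below is a hand-written stand-in for a real server response, trimmed to the fields that matter; the field names follow the Anthropic Messages API:

```python
import json

sample = json.loads("""
{
  "id": "msg_local_001",
  "type": "message",
  "role": "assistant",
  "model": "qwen3.6-27b",
  "content": [{"type": "text", "text": "Hello! How can I help?"}],
  "stop_reason": "end_turn"
}
""")

def first_text(msg: dict) -> str:
    # A well-formed reply carries a list of content blocks; text blocks
    # hold the generated prose (tool-use blocks have type "tool_use").
    blocks = [b for b in msg["content"] if b.get("type") == "text"]
    return blocks[0]["text"] if blocks else ""

print(first_text(sample))
```

If your curl response is valid JSON but `content` is missing or empty, suspect the client request (bad `model` string, zero `max_tokens`) before suspecting the server.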
Step 5: Connect a Coding Agent
Two options. Pick one.
Option A: Pi (recommended for this stack)
Pi is a minimal terminal coding agent. No subscription, no telemetry, MIT-licensed. It speaks both the OpenAI-compatible and Anthropic-compatible APIs.
Pi ships on npm. With a recent Node.js installed:
```shell
npm install -g @mariozechner/pi-coding-agent
```
Configure it to talk to llama-server:
```shell
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
export ANTHROPIC_AUTH_TOKEN="local"
export ANTHROPIC_API_KEY=""
cd ~/my-project
pi --model qwen3.6-27b
```
Add the three export lines to ~/.zshrc to persist them.
Pi has fewer guardrails than Claude Code. It will run shell commands without prompting. Use it inside a project directory you are comfortable letting an agent edit. For sharper isolation, see Docker sandboxes for Claude Code; the same container pattern works for Pi.
Option B: Claude Code
If you already have Claude Code installed from the previous post, the configuration is almost identical to the Ollama setup; only the port changes:
```shell
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
export ANTHROPIC_AUTH_TOKEN="local"
export ANTHROPIC_API_KEY=""
claude --model qwen3.6-27b
```
There is one more setting worth knowing about. Claude Code injects a Claude Code Attribution header into every request. llama.cpp treats this header as part of the prompt prefix, which invalidates its KV cache (the saved intermediate state that lets the model skip recomputing tokens it has already seen) on every turn and slows inference by up to 90%. The fix lives in ~/.claude/settings.json:
```json
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8080",
    "ANTHROPIC_AUTH_TOKEN": "local",
    "ANTHROPIC_API_KEY": ""
  }
}
```
Skip this and you will conclude (wrongly) that local inference is unusable. Set it and the cache stays warm across turns.
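The mechanism is easy to see in miniature. The server can only reuse cached state for the longest prefix shared with the previous request, so anything that varies at the front of the prompt zeroes out the reuse. Token IDs are simulated with words here; the mechanism is the same:

```python
def reusable_prefix(prev: list[str], curr: list[str]) -> int:
    # Length of the shared leading run: the only part of the KV cache
    # the server can reuse between two requests.
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

history = "system-prompt tool-schemas turn-1 turn-2".split()

# Normal next turn: everything before the new turn is reusable.
assert reusable_prefix(history, history + ["turn-3"]) == len(history)

# A header that differs between requests sits at the very front of the
# prompt, so the shared prefix -- and the reusable cache -- drops to zero.
prev = ["attribution-header-v1"] + history
curr = ["attribution-header-v2"] + history + ["turn-3"]
assert reusable_prefix(prev, curr) == 0
```

The same reasoning explains why Pi's small, stable system prompt plays well with the cache: a prefix that never changes is a prefix that never has to be recomputed.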
Step 6: Use It
```shell
cd ~/my-project
pi --model qwen3.6-27b
```
A first prompt that exercises the full loop:
Read the project structure, summarize what this codebase does in three sentences, then add a CLI flag --dry-run to the main entry point that prints what would happen without making changes.
This forces filesystem reads, multi-turn reasoning, and at least one edit. If it returns coherent prose and a usable diff, the stack is healthy.
Offline check
Disconnect from the network and run another prompt. A successful response confirms there is no fallback path quietly reaching the cloud.
What the Performance Actually Feels Like
Numbers from real sessions on a 2021 M1 Max with 32 GB:
| Metric | Value | Notes |
|---|---|---|
| Decode speed (generation only) | ~11 tok/s | The number people usually quote |
| Blended throughput | ~41.5 tok/s | Includes prompt ingestion, which is much faster |
| First-token latency | 1–3 s | At 32K context, after warm-up |
| RAM at idle (model loaded) | ~18 GB | |
| RAM under load | ~22 GB | With 32K context filled |
Decode speed alone reads as mediocre. Blended throughput, what you actually feel during a coding session, sits in a range where multi-file edits resolve quickly enough to stay in flow. I notice the slowness compared to Opus 4.7 on the API. I do not resent it.
The other surprise was the Pi harness itself. Less ceremony, less permission theater, less wrapper overhead between request and response. On local models, where every layer of indirection costs visible time, the lighter harness is not just an aesthetic preference. It is a perceptible speedup.
Caveats and Limitations
These are real, not footnotes.
Model quality. Qwen 3.6 27B is excellent for an open-weight 27B model and it benchmarks above some much larger predecessors. It is not Opus 4.7. For deep multi-step reasoning, novel architecture decisions, or reasoning about a brand-new framework, the gap is still visible. Use it for ordinary coding work; reach for the cloud for the hard 10%.
Build complexity vs Ollama. Building llama.cpp from source is a real step up in friction. If you hit an OpenSSL header issue, a CMake mismatch, or a Metal compile failure, you will spend an hour. The setup sticks once it works, but the cost is paid up front.
Tool calling fragility. Without --jinja, tool calls silently degrade. With --jinja but a client that injects unexpected headers (the Claude Code attribution issue above), inference slows dramatically. Both failure modes look like "the model is dumb today" until you find the cause.
16 GB machines. Qwen 3.6 27B is the wrong model on 16 GB. Stay on the Ollama path with gpt-oss:20b or qwen3-coder:14b until you have more memory.
Maturity window. llama.cpp's Anthropic Messages API was merged recently and the Qwen 3.6 release is two days old at the time of writing. Expect rough edges in streaming, image inputs, and edge cases of the tool-calling schema. Pin a known-good commit if you find one that works for your workflow.
Privacy is a property of this stack, not a magic word. No data leaves the machine while inference runs. That guarantee disappears the moment you opt into a cloud model from the same harness, paste a snippet into a chat app, or share a screen. The setup is private; usage habits decide whether the privacy survives.
Quick Reference
```shell
# --- ONE-TIME SETUP ---
xcode-select --install
brew install cmake aria2 git
git clone https://github.com/ggml-org/llama.cpp ~/src/llama.cpp
cd ~/src/llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j
mkdir -p ~/models/qwen3.6-27b
aria2c -x 8 -s 8 -d ~/models/qwen3.6-27b \
  https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf
npm install -g @mariozechner/pi-coding-agent

# --- DAILY USE ---
# Terminal 1: inference server
~/src/llama.cpp/build/bin/llama-server \
  --model ~/models/qwen3.6-27b/Qwen3.6-27B-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 32768 --n-gpu-layers 999 \
  --jinja --alias qwen3.6-27b

# Terminal 2: coding agent
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
export ANTHROPIC_AUTH_TOKEN="local"
export ANTHROPIC_API_KEY=""
cd ~/my-project
pi --model qwen3.6-27b
```
Switching Back to Anthropic
```shell
unset ANTHROPIC_BASE_URL
unset ANTHROPIC_AUTH_TOKEN
unset ANTHROPIC_API_KEY
```
Then run claude (or pi) without the --model override to use the default Anthropic backend. A practical workflow on this hardware: Pi + Qwen 3.6 for routine work, the cloud for the hardest 10%. The split lands somewhere around 90/10 for me.
Where This Sits in the Bigger Picture
The previous post made the case that local inference had reached the point where you could plausibly use it for free. This post is the second iteration of the same idea: the model under the hood has improved enough that the experience feels normal, not novel. A 2021 laptop is not supposed to be the right tool for this. In April 2026 it is.
Both posts point at the same conclusion. Local AI is no longer a science project. It is an option, and on hardware most readers already own.
References
- Running Claude Code locally with Ollama (the previous post)
- llama.cpp: GitHub
- llama.cpp: Anthropic Messages API support (PR #17570)
- llama.cpp: Anthropic Messages API blog post
- Qwen3.6: GitHub
- Unsloth: Qwen 3.6 GGUFs and run guide
- Pi coding agent: write-up by Mario Zechner
- Offline agentic coding with llama-server: discussion
April 2026. Models and tooling evolve rapidly; verify versions against official documentation before following these steps.