From Ollama to llama.cpp: running Claude Code locally with Qwen 3.6 on a 2021 MacBook Pro
A few months ago I wrote about running Claude Code locally with Ollama as a free alternative to the Anthropic API. That setup still works. This post is about the next step: swapping Ollama for
llama.cpp, swapping `gpt-oss:20b` for Qwen 3.6 27B, and swapping the Claude Code harness itself for Pi, a minimal MIT-licensed terminal coding agent. The result, on a 2021 M1 Max MacBook Pro with 32 GB of RAM, is a private coding stack that feels close enough to a cloud agent that I stopped reaching for the cloud one by default. This guide covers why the move from Ollama to llama.cpp is worth making in April 2026, how to set it up, and the real tradeoffs.
Why Move From Ollama to llama.cpp?
Ollama is excellent. It hides the messy parts (model formats, quantization, memory allocation) behind one command, and since v0.14 it speaks the Anthropic Messages API natively. For most readers of the previous post, it remains the right starting point.
So why move?
Three reasons converged in April 2026:
- Qwen 3.6 27B was released on 22 April 2026 under Apache 2.0, and it is the first open-weight model where I genuinely stopped noticing I was on local inference for ordinary coding work. It scores 77.2% on SWE-bench Verified, which is higher than its much larger 397B predecessor. On a 32 GB M1 Max, the Q4_K_M quantization fits comfortably with room for a real context window.
- Qwen 3.6 GGUFs do not currently run in Ollama. The model ships with a separate `mmproj` vision file that Ollama's loader does not yet handle. llama.cpp loads it directly. If you want this specific model today, the choice is made for you.
- llama.cpp's `llama-server` gained native Anthropic Messages API support (merged in PR #17570). It handles `/v1/messages`, streaming via SSE, token counting, tool use via `--jinja`, and vision inputs. Anything that points at the Anthropic API can point at it.
The general pattern: Ollama is the comfortable on-ramp. llama.cpp is what you graduate to when you want a specific model, more knobs, or fewer layers between you and the inference.
| | Ollama | llama.cpp |
|---|---|---|
| Setup | One installer, one command | Build from source (recommended on macOS) |
| Model format | Curated registry | Any GGUF on Hugging Face |
| Anthropic API | Native since v0.14 | Native since PR #17570 |
| Tool calling | Default-on for supported models | Requires --jinja flag |
| Qwen 3.6 27B | Not yet supported | Supported day one |
| Best fit | Getting started, multiple models | One model you care about, tuned hard |
What This Stack Looks Like
Three pieces, each replaceable:
- llama.cpp: the inference engine. Open source, MIT-licensed, built around the GGUF model format. Runs on Apple Silicon via Metal, on NVIDIA GPUs via CUDA, and on CPU as a fallback. Ships a binary called `llama-server` that exposes an HTTP API.
- Qwen 3.6 27B (Q4_K_M): the model. Apache 2.0, ~16.8 GB on disk, ~18 GB resident in memory at modest context. Strong on coding tasks, multilingual, with native tool-calling support. Q4_K_M is a 4-bit quantization: the original model weights are compressed to about a quarter of their full-precision size with a small accuracy cost. GGUF is the file format llama.cpp uses to ship those weights.
- Pi: the coding harness. A minimal terminal coding agent by Mario Zechner. MIT-licensed, single binary, no permission popups, full filesystem access. The Claude Code equivalent in spirit; about a tenth the surface area.
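The size arithmetic behind that "~16.8 GB on disk" figure is worth seeing once. The ~5.0 bits-per-weight average for Q4_K_M below is my assumption (4-bit blocks carry per-block scales, and a few tensors stay at higher precision), not a number from this post:

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk model size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

full_fp16 = quantized_size_gb(27e9, 16)  # full precision: ~54 GB, no chance on a 32 GB machine
q4_k_m = quantized_size_gb(27e9, 5.0)    # ~16.9 GB with the assumed bits-per-weight

print(f"fp16: {full_fp16:.1f} GB, Q4_K_M: {q4_k_m:.1f} GB")
```

The same arithmetic explains why the 4-bit quantization is roughly "a quarter" of full precision: 5 bits versus 16 bits per weight, plus a little overhead.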
Claude Code itself still works as the harness if you prefer it. Point it at llama-server exactly as in the Ollama post. I switched to Pi for reasons worth their own section.
Why Pi Instead of Claude Code This Time
The previous post used Claude Code, and that was the right call at the time. Claude Code is polished, well-instrumented, and ships with everything most developers expect from a modern coding agent: sub-agents, MCP, permission prompts, plan mode, built-in to-dos, background bash. On a fast cloud model where capability is abundant, those features are mostly free. On a local 27B model where every token of system prompt and every extra tool call eats real wall-clock time, the same features become a tax.
Pi takes the opposite stance. Mario Zechner's design notes (philosophy, original write-up) make three arguments that matter especially to a local-inference setup:
- Minimal system prompt. Pi's full system prompt plus tool definitions fits in under 1,000 tokens. The argument is that frontier models have been "RL-trained up the wazoo" and already know what a coding agent is, so reciting the obvious back at them is wasted context. On Opus 4.7 the difference is invisible. On a local 27B at 32K context, every saved token is real.
- Minimal toolset. Pi exposes four tools: `read`, `write`, `edit`, `bash`. The claim is that this is what models were trained against, and additional specialized tools add surface area without proportional benefit. Fewer tool schemas means smaller prompts and fewer ways for the model to wander.
- Minimal agent scaffold. No `max_steps`, no built-in to-do system ("they confuse models"), no plan mode, no sub-agents, no permission popups, no MCP, no background bash. Anything you want, you build as a Pi extension or a Skill (a directory with a README the agent reads on demand). The core stays small; the workflow stays yours.
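Since a Skill is just a directory with a README, scaffolding one is trivial. The sketch below assumes a `.pi/skills/<name>/README.md` layout for illustration; check Pi's documentation for the actual location and filename it expects:

```python
from pathlib import Path
import tempfile

def create_skill(project_root: str, name: str, instructions: str) -> Path:
    # A Skill, per Pi's design notes, is a directory with a README the agent
    # reads on demand. The .pi/skills/ path and README.md filename here are
    # illustrative guesses, not Pi's documented layout.
    skill_dir = Path(project_root) / ".pi" / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    readme = skill_dir / "README.md"
    readme.write_text(f"# Skill: {name}\n\n{instructions}\n")
    return readme

demo_root = tempfile.mkdtemp()  # throwaway dir for the demo
readme = create_skill(demo_root, "release", "Steps to cut and tag a release of this project.")
print(readme.read_text())
```

The point is the shape of the mechanism: workflow knowledge lives in files the agent can read, not in the agent's core.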
The slogan: Pi is aggressively extensible so it doesn't have to dictate your workflow.
Side-by-side:
| | Claude Code | Pi |
|---|---|---|
| License | Proprietary (Anthropic) | MIT, open source |
| Source availability | Closed; the obfuscated source has been decompiled and circulated publicly | Open since day one; has never leaked, on account of nothing to leak |
| System prompt + tool schemas | Several thousand tokens | Under 1,000 tokens |
| Core tools | A dozen-plus, including specialized ones | Four (read, write, edit, bash) |
| Sub-agents | Built in | Not built in (spawn Pi via tmux, or write an extension) |
| Plan mode | Built in | Not built in (write plans to a file) |
| Built-in to-dos | Yes | No (use a TODO.md) |
| Permission popups | Yes, by default | No (run inside a sandbox if you need them) |
| MCP support | Built in | Not built in (add via extension) |
| Background bash | Yes | No (use tmux for full observability) |
| Best fit | Cloud-hosted frontier model, broad team, default-safe ergonomics | Local model, single operator, tuning for token economy and observability |
The trade is honest. Pi gives up safety rails and built-in features in exchange for a smaller footprint, more control, and a codebase you can read in an afternoon. On a cloud model that is a wash. On Qwen 3.6 27B running locally, the smaller footprint is exactly the resource that was scarce, and the effect on responsiveness is the thing that made the stack feel calm rather than effortful.
If you want the Claude Code ergonomics with this server, that path is still in Step 5, Option B. If you want the lightest possible loop between you and a local model, Pi is the better fit.
Cost Comparison
| Setup | Monthly cost | Notes |
|---|---|---|
| Claude Code + Opus 4.7 API | ~$50 to $200+ | Scales with token usage |
| Claude Max subscription | $100 to $200 | Flat rate |
| Claude Code + Ollama (local) | $0 | Covered in the previous post |
| Pi + llama.cpp + Qwen 3.6 27B (this post) | $0 | Electricity only |
Annual savings remain in the $600 to $2,400 range. The thing that changed in April 2026 is not the price; it is that the local model is finally good enough that the cheaper option is also a reasonable option.
Requirements
| Component | Specification |
|---|---|
| OS | macOS 14+ (tested on macOS 26 "Tahoe", M1 Max). Linux and Windows also supported. |
| Chip | Apple Silicon recommended for unified memory; NVIDIA GPU also fine |
| RAM | 24 GB minimum for Qwen 3.6 27B Q4_K_M; 32 GB comfortable |
| Disk | ~25 GB free for the model and build artifacts |
| Tooling | Xcode Command Line Tools, Homebrew, cmake, aria2c (recommended for fast downloads) |
| Internet | Required for the initial model and source downloads only |
For machines with 16 GB of RAM, stay on the Ollama setup with gpt-oss:20b or qwen3-coder:14b. Qwen 3.6 27B is the wrong model for that tier; it will swap aggressively and feel worse than a smaller model running cleanly.
Step 1: Install Build Tools and Dependencies
On macOS, use Homebrew. The Homebrew llama.cpp formula trails upstream and has shipped with broken Metal builds during fast-moving release windows; building from source is more reliable.
```shell
xcode-select --install
brew install cmake aria2 git
```
aria2 (`aria2c`) is optional but cuts a 16 GB model download from tens of minutes to a few minutes. Hugging Face's CLI works too; its flag layout has changed twice in the last year, so I no longer rely on it.
Step 2: Build llama.cpp From Source
```shell
git clone https://github.com/ggml-org/llama.cpp ~/src/llama.cpp
cd ~/src/llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j
```
The Metal flag enables GPU-accelerated inference on Apple Silicon. On an M1 Max this build takes about three minutes.
If you hit an OpenSSL header conflict during the build, it almost always means a stray openssl@1.1 is on your PATH ahead of the system or openssl@3 Homebrew copy. Either remove the older one or prepend the correct prefix to CPATH and LIBRARY_PATH.
Verify:
```shell
./build/bin/llama-server --version
```
You should see a version string and a build date. Add ~/src/llama.cpp/build/bin to your PATH if you want llama-server available globally.
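A quick way to confirm the PATH change took effect is a lookup with Python's standard library; this is just `which` in script form:

```python
import shutil

# shutil.which returns the absolute path of the first matching executable
# on PATH, or None if the shell cannot see it yet.
location = shutil.which("llama-server")
print(location or "llama-server not on PATH -- add ~/src/llama.cpp/build/bin")
```

If this prints the fallback message, open a new shell after editing your PATH, or call the binary by its full path as the quick-reference section at the end does.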
Step 3: Download Qwen 3.6 27B
Use one of the community Q4_K_M quantizations from Hugging Face. Unsloth's repository is current and well-maintained:
```shell
mkdir -p ~/models/qwen3.6-27b
cd ~/models/qwen3.6-27b
aria2c -x 8 -s 8 \
  https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf
```
Two quick traps to avoid:
- `aria2c` writes to the current directory by default. Use `-d <path>` if you want it elsewhere; do not assume `-o` controls the directory.
- Some repositories ship the model as multiple shards (`*-00001-of-00002.gguf`, etc.). Download all shards to the same directory; llama.cpp will assemble them automatically when given the first shard.
You should end up with a single ~16.8 GB .gguf file (or a small set of shards summing to that).
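A small sanity check covers both cases, single file and shards, by summing whatever GGUF files landed in the directory. The path in the commented-out line assumes the download location used above:

```python
from pathlib import Path

def total_gguf_gb(model_dir: str) -> float:
    # Sum the sizes of all GGUF files (one file, or several shards) in GB.
    files = sorted(Path(model_dir).expanduser().glob("*.gguf"))
    if not files:
        raise FileNotFoundError(f"no .gguf files in {model_dir}")
    return sum(f.stat().st_size for f in files) / 1e9

# After the download finishes, expect roughly 16.8:
# print(f"{total_gguf_gb('~/models/qwen3.6-27b'):.1f} GB")
```

A total far below 16 GB usually means an interrupted download or a missing shard; re-run `aria2c` before blaming the server setup.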
Step 4: Run llama-server With the Anthropic API Enabled
```shell
llama-server \
  --model ~/models/qwen3.6-27b/Qwen3.6-27B-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --jinja \
  --alias qwen3.6-27b
```
What each flag does:
- `--ctx-size 32768`: 32K tokens of context. Comfortable on 32 GB. Drop to 16K on 24 GB. Larger context costs RAM linearly.
- `--n-gpu-layers 999`: push every layer onto Metal. On Apple Silicon's unified memory there is no penalty for this; on discrete GPUs, dial it down if you run out of VRAM.
- `--jinja`: required for tool calling. Without it, Claude Code and Pi will receive malformed tool-call schemas and silently fail.
- `--alias qwen3.6-27b`: what the model will call itself when clients ask. This is the string you pass as `--model` from the client side.
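The "costs RAM linearly" claim comes from the KV cache: the server stores one key vector and one value vector per layer per token of context. The sketch below shows the shape of that cost; every architecture number in the defaults is a placeholder, not Qwen 3.6's real configuration:

```python
def kv_cache_bytes(ctx: int, n_layers: int = 60, n_kv_heads: int = 4,
                   head_dim: int = 128, bytes_per_el: int = 2) -> int:
    # Factor of 2 = one key vector plus one value vector per layer per token.
    # The layer/head/dim defaults are placeholders; only the linear
    # dependence on ctx is the point.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el * ctx

print(f"~{kv_cache_bytes(32768) / 1e9:.1f} GB of KV cache at 32K (placeholder config)")
print(f"~{kv_cache_bytes(16384) / 1e9:.1f} GB at 16K -- exactly half")
```

This is why halving `--ctx-size` is the first lever to pull on a 24 GB machine: it halves the cache, while the model weights stay fixed.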
The Anthropic Messages endpoint will be live at http://127.0.0.1:8080/v1/messages. Confirm with:
```shell
curl -s http://127.0.0.1:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-27b","max_tokens":64,"messages":[{"role":"user","content":"hello"}]}'
```
A JSON response with a content array means everything below this layer is working.
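For reference, this is the shape a healthy reply takes. The sample below is a hand-written stand-in for a real server response, trimmed to the fields that matter; the field names follow the Anthropic Messages API:

```python
import json

sample = json.loads("""
{
  "id": "msg_local_001",
  "type": "message",
  "role": "assistant",
  "model": "qwen3.6-27b",
  "content": [{"type": "text", "text": "Hello! How can I help?"}],
  "stop_reason": "end_turn"
}
""")

def first_text(msg: dict) -> str:
    # A well-formed reply carries a list of content blocks; text blocks
    # hold the generated prose (tool-use blocks have type "tool_use").
    blocks = [b for b in msg["content"] if b.get("type") == "text"]
    return blocks[0]["text"] if blocks else ""

print(first_text(sample))
```

If your curl response is valid JSON but `content` is missing or empty, suspect the client request (bad `model` string, zero `max_tokens`) before suspecting the server.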
Step 5: Connect a Coding Agent
Two options. Pick one.
Option A: Pi (recommended for this stack)
Pi is a minimal terminal coding agent. No subscription, no telemetry, MIT-licensed. It speaks both the OpenAI-compatible and Anthropic-compatible APIs.
Pi ships on npm. With a recent Node.js installed:
```shell
npm install -g @mariozechner/pi-coding-agent
```
Configure it to talk to llama-server:
```shell
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
export ANTHROPIC_AUTH_TOKEN="local"
export ANTHROPIC_API_KEY=""
cd ~/my-project
pi --model qwen3.6-27b
```
Add the three export lines to ~/.zshrc to persist them.
Pi has fewer guardrails than Claude Code. It will run shell commands without prompting. Use it inside a project directory you are comfortable letting an agent edit. For sharper isolation, see Docker sandboxes for Claude Code; the same container pattern works for Pi.
Option B: Claude Code
If you already have Claude Code installed from the previous post, the configuration is almost identical to the Ollama setup; only the port changes:
```shell
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
export ANTHROPIC_AUTH_TOKEN="local"
export ANTHROPIC_API_KEY=""
claude --model qwen3.6-27b
```
There is one more setting worth knowing about. Claude Code injects a Claude Code Attribution header into every request. llama.cpp treats this header as part of the prompt prefix, which invalidates its KV cache (the saved intermediate state that lets the model skip recomputing tokens it has already seen) on every turn and slows inference by up to 90%. The fix lives in ~/.claude/settings.json:
```json
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8080",
    "ANTHROPIC_AUTH_TOKEN": "local",
    "ANTHROPIC_API_KEY": ""
  }
}
```
Skip this and you will conclude (wrongly) that local inference is unusable. Set it and the cache stays warm across turns.
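The mechanism is easy to see in miniature. The server can only reuse cached state for the longest prefix shared with the previous request, so anything that varies at the front of the prompt zeroes out the reuse. Token IDs are simulated with words here; the mechanism is the same:

```python
def reusable_prefix(prev: list[str], curr: list[str]) -> int:
    # Length of the shared leading run: the only part of the KV cache
    # the server can reuse between two requests.
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

history = "system-prompt tool-schemas turn-1 turn-2".split()

# Normal next turn: everything before the new turn is reusable.
assert reusable_prefix(history, history + ["turn-3"]) == len(history)

# A header that differs between requests sits at the very front of the
# prompt, so the shared prefix -- and the reusable cache -- drops to zero.
prev = ["attribution-header-v1"] + history
curr = ["attribution-header-v2"] + history + ["turn-3"]
assert reusable_prefix(prev, curr) == 0
```

The same reasoning explains why Pi's small, stable system prompt plays well with the cache: a prefix that never changes is a prefix that never has to be recomputed.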
Step 6: Use It
```shell
cd ~/my-project
pi --model qwen3.6-27b
```
A first prompt that exercises the full loop:
Read the project structure, summarize what this codebase does in three sentences, then add a CLI flag --dry-run to the main entry point that prints what would happen without making changes.
This forces filesystem reads, multi-turn reasoning, and at least one edit. If it returns coherent prose and a usable diff, the stack is healthy.
Offline check
Disconnect from the network and run another prompt. A successful response confirms there is no fallback path quietly reaching the cloud.
What the Performance Actually Feels Like
Numbers from real sessions on a 2021 M1 Max with 32 GB:
| Metric | Value | Notes |
|---|---|---|
| Decode speed (generation only) | ~11 tok/s | The number people usually quote |
| Blended throughput | ~41.5 tok/s | Includes prompt ingestion, which is much faster |
| First-token latency | 1–3 s | At 32K context, after warm-up |
| RAM at idle (model loaded) | ~18 GB | |
| RAM under load | ~22 GB | With 32K context filled |
Decode speed alone reads as mediocre. Blended throughput, what you actually feel during a coding session, sits in a range where multi-file edits resolve quickly enough to stay in flow. I notice the slowness compared to Opus 4.7 on the API. I do not resent it.
The other surprise was the Pi harness itself. Less ceremony, less permission theater, less wrapper overhead between request and response. On local models, where every layer of indirection costs visible time, the lighter harness is not just an aesthetic preference. It is a perceptible speedup.
Caveats and Limitations
These are real, not footnotes.
Model quality. Qwen 3.6 27B is excellent for an open-weight 27B model and it benchmarks above some much larger predecessors. It is not Opus 4.7. For deep multi-step reasoning, novel architecture decisions, or reasoning about a brand-new framework, the gap is still visible. Use it for ordinary coding work; reach for the cloud for the hard 10%.
Build complexity vs Ollama. Building llama.cpp from source is a real step up in friction. If you hit an OpenSSL header issue, a CMake mismatch, or a Metal compile failure, you will spend an hour. The setup sticks once it works, but the cost is paid up front.
Tool calling fragility. Without --jinja, tool calls silently degrade. With --jinja but a client that injects unexpected headers (the Claude Code attribution issue above), inference slows dramatically. Both failure modes look like "the model is dumb today" until you find the cause.
16 GB machines. Qwen 3.6 27B is the wrong model on 16 GB. Stay on the Ollama path with gpt-oss:20b or qwen3-coder:14b until you have more memory.
Maturity window. llama.cpp's Anthropic Messages API was merged recently and the Qwen 3.6 release is two days old at the time of writing. Expect rough edges in streaming, image inputs, and edge cases of the tool-calling schema. Pin a known-good commit if you find one that works for your workflow.
Privacy is a property of this stack, not a magic word. No data leaves the machine while inference runs. That guarantee disappears the moment you opt into a cloud model from the same harness, paste a snippet into a chat app, or share a screen. The setup is private; usage habits decide whether the privacy survives.
Quick Reference
```shell
# --- ONE-TIME SETUP ---
xcode-select --install
brew install cmake aria2 git
git clone https://github.com/ggml-org/llama.cpp ~/src/llama.cpp
cd ~/src/llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j
mkdir -p ~/models/qwen3.6-27b
aria2c -x 8 -s 8 -d ~/models/qwen3.6-27b \
  https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf
npm install -g @mariozechner/pi-coding-agent

# --- DAILY USE ---
# Terminal 1: inference server
~/src/llama.cpp/build/bin/llama-server \
  --model ~/models/qwen3.6-27b/Qwen3.6-27B-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 32768 --n-gpu-layers 999 \
  --jinja --alias qwen3.6-27b

# Terminal 2: coding agent
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
export ANTHROPIC_AUTH_TOKEN="local"
export ANTHROPIC_API_KEY=""
cd ~/my-project
pi --model qwen3.6-27b
```
Switching Back to Anthropic
```shell
unset ANTHROPIC_BASE_URL
unset ANTHROPIC_AUTH_TOKEN
unset ANTHROPIC_API_KEY
```
Then run claude (or pi) without the --model override to use the default Anthropic backend. A practical workflow on this hardware: Pi + Qwen 3.6 for routine work, the cloud for the hardest 10%. The split lands somewhere around 90/10 for me.
Where This Sits in the Bigger Picture
The previous post made the case that local inference had reached the point where you could plausibly use it for free. This post is the second iteration of the same idea: the model under the hood has improved enough that the experience feels normal, not novel. A 2021 laptop is not supposed to be the right tool for this. In April 2026 it is.
Both posts point at the same conclusion. Local AI is no longer a science project. It is an option, and on hardware most readers already own.
References
- Running Claude Code locally with Ollama (the previous post)
- llama.cpp: GitHub
- llama.cpp: Anthropic Messages API support (PR #17570)
- llama.cpp: Anthropic Messages API blog post
- Qwen3.6: GitHub
- Unsloth: Qwen 3.6 GGUFs and run guide
- Pi coding agent: write-up by Mario Zechner
- Offline agentic coding with llama-server: discussion
April 2026. Models and tooling evolve rapidly; verify versions against official documentation before following these steps.