Running Claude Code locally with Ollama and open-source models as a free alternative to the Anthropic API
Claude Code's API costs add up fast for heavy users, often $50 to $200+/month on Opus 4.5/4.6. Ollama (v0.14.0+) now supports the Anthropic Messages API natively, which means Claude Code can run against local open-source models at zero cost, with no data leaving the machine.
This guide covers the full setup: installing Ollama and Claude Code, choosing a model that fits 16 GB of RAM, connecting the pieces, and understanding the real tradeoffs.
The Cost Problem
Claude Code is Anthropic's terminal-based coding agent. It reads codebases, edits files, runs shell commands, calls tools, and handles multi-step workflows, all from the command line via natural language.
Under the hood, it communicates with Anthropic's API, typically using Claude Sonnet 4.5 or Opus 4.5/4.6. Opus 4.5 charges roughly $15 per million input tokens and $75 per million output tokens. Daily use during active development routinely reaches $50 to $200/month. The Claude Max subscription ($100 to $200/month) flattens that cost but remains significant for independent developers, students, and hobbyists.
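To make the token math concrete, here is a quick sketch at those list prices. The monthly token volumes are assumed figures for illustration, not measurements:

```python
# Rough monthly bill at Opus 4.5 list prices ($15 / $75 per million
# input/output tokens, as quoted above). Token volumes are assumptions.
INPUT_PER_M = 15.0   # USD per million input tokens
OUTPUT_PER_M = 75.0  # USD per million output tokens

def monthly_cost(input_tokens_m: float, output_tokens_m: float) -> float:
    """Estimated monthly API cost, token counts given in millions."""
    return input_tokens_m * INPUT_PER_M + output_tokens_m * OUTPUT_PER_M

# e.g. a month with 8M input tokens and 1M output tokens
print(f"${monthly_cost(8, 1):.2f}")
```

At that assumed volume the bill lands near the top of the $50 to $200 range quoted above.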
How Ollama Changes the Equation
Ollama runs large language models locally. It handles model downloads, quantization, memory management, and API serving. One command pulls a model, another runs it. It supports macOS, Windows, and Linux across both Apple Silicon and NVIDIA GPUs.
Since v0.14.0 (January 2026), Ollama exposes an Anthropic-compatible Messages API on localhost:11434. This is the same protocol Claude Code uses to reach Anthropic's servers. By redirecting Claude Code to Ollama's local endpoint, the agent continues to function (file editing, tool calling, multi-turn reasoning) but inference runs on a local open-source model instead of Anthropic's cloud.
No API key. No usage bill. No data transmitted externally.
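For the curious, the redirected traffic is ordinary Messages API JSON. This sketch only constructs the request body (it does not contact a server); the exact endpoint path Ollama exposes under localhost:11434 should be verified against its documentation:

```python
import json

# An Anthropic-style Messages API request body. Claude Code sends payloads
# shaped like this; with the base URL redirected, they reach Ollama instead.
# The model field carries an Ollama model tag, not a Claude model ID.
payload = {
    "model": "gpt-oss:20b",
    "max_tokens": 512,
    "messages": [
        {"role": "user", "content": "Explain what this repo does."}
    ],
}

body = json.dumps(payload)
print(body)
```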
How Claude Code Works (Brief Overview)
Claude Code is not an autocomplete tool or a chat wrapper. It is an agent that operates inside the terminal with the following capabilities:
- Codebase awareness: reads project structure, files, and dependencies
- Direct file editing: writes and modifies code across multiple files
- Shell execution: runs commands, tests, and package installations
- Tool calling: invokes external tools and chains multi-step operations
- Git integration: handles commits, branches, and diffs
- Multi-turn reasoning: plans, iterates, and refines across conversation turns
The agent itself is free to install. The cost comes from the model it talks to. This guide replaces the paid model with a free, locally hosted one.
What Is Ollama
Ollama is an open-source tool for downloading, managing, and serving LLMs (large language models) on local hardware. It abstracts away model format handling (GGUF quantization, memory allocation, GPU offloading) behind a CLI and HTTP API.
Key details:
- Runs on macOS, Windows 11, and Linux
- Supports Apple Silicon (unified memory) and NVIDIA GPUs (CUDA)
- Falls back to CPU inference when no GPU is available (much slower)
- Serves models via a local HTTP API on port 11434
- Since v0.14.0, that API includes Anthropic Messages API compatibility
- Since v0.15.0, the ollama launch command automates Claude Code configuration
Cost Comparison
| Setup | Monthly Cost | Notes |
|---|---|---|
| Claude Code + Opus 4.5 API | ~$50 to $200+ | Scales with token usage |
| Claude Max subscription | $100 to $200 | Flat rate |
| Claude Code + Ollama (local) | $0 | Electricity only |
| Claude Code + Ollama Cloud | Free tier available | Paid plans start at ~$3/month |
Annual savings range from $600 to $2,400 depending on prior usage. The tradeoff is model capability, covered in the Caveats section below.
Requirements
| Component | Specification |
|---|---|
| OS | macOS 13.0+ (Apple Silicon recommended) or Windows 11 |
| RAM | 16 GB minimum |
| Disk | 15 to 25 GB free for model files |
| Ollama | v0.15+ (ollama.com/download) |
| Claude Code | Current release (code.claude.com) |
| Internet | Required for initial downloads only |
16 GB of RAM limits model selection to the 14B to 20B parameter range. Expect noticeably slower inference than on 32 GB+ machines, and lower output quality than larger models deliver. Specific model recommendations for this constraint follow in Step 2.
Step 1: Install Ollama
Download the installer from ollama.com/download.
macOS: Open the .dmg, drag Ollama to Applications. It runs as a background service.
Windows 11: Run OllamaSetup.exe and follow the prompts. It installs as a system service.
Verify the installation:
ollama --version
Expected output: ollama version is 0.15.x (or newer). If the command fails, the Ollama service may not be running. Start it manually with ollama serve in a separate terminal window.
Step 2: Choose and Pull a Model
Model selection determines the quality/speed/memory tradeoff. These are the current recommendations for 16 GB RAM systems, ordered by general coding effectiveness:
Local Models (16 GB RAM)
| Model | Download Size | Strengths | Command |
|---|---|---|---|
| gpt-oss:20b | ~13 GB | Strong coding, reliable tool calling. Top pick at this memory tier. | ollama pull gpt-oss:20b |
| glm-4.7-flash | ~12 GB | MoE architecture (30B total, 3B active per token). Fast inference, native tool calling, 128K context. | ollama pull glm-4.7-flash |
| qwen3-coder:14b | ~9 GB | Coding-specialized. Lower memory footprint, reasonable quality. | ollama pull qwen3-coder:14b |
| devstral-small | ~14 GB | Mistral's coding model. Competent at general development tasks. | ollama pull devstral-small |
Start with gpt-oss:20b. If memory pressure causes slowdowns or heavy swapping, drop to qwen3-coder:14b.
Cloud Models (No Local Hardware Constraint)
Ollama also hosts cloud-served models accessible through the same CLI. These run at full context length on remote infrastructure with a free tier:
ollama pull glm-4.7:cloud
ollama pull gpt-oss:120b-cloud
ollama pull minimax-m2.1:cloud
Cloud models are significantly more capable than anything that fits in 16 GB locally. They serve as a practical fallback when local inference is too slow or too limited for a given task. Note: data does leave the machine when using cloud models.
Pull the Model
For this guide, using gpt-oss:20b:
ollama pull gpt-oss:20b
This downloads approximately 13 GB of model weights. After completion, verify:
ollama list
The model should appear with its name, ID, and size.
Step 3: Install Claude Code
Claude Code installs as a standalone binary. The npm installation method is deprecated; use the native installer.
macOS / Linux / WSL
curl -fsSL https://claude.ai/install.sh | bash
Reload the shell configuration:
source ~/.bashrc # or: source ~/.zshrc
Windows (PowerShell)
irm https://claude.ai/install.ps1 | iex
Windows (CMD)
curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd
Verify
claude --version
If the command is not found, confirm that ~/.local/bin or ~/.claude/bin is in the system PATH. On macOS/Linux:
echo 'export PATH="$HOME/.local/bin:$HOME/.claude/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Step 4: Connect Claude Code to Ollama
Three configuration methods, from simplest to most flexible:
Option A: ollama launch (Recommended, Ollama v0.15+)
ollama launch claude
This walks through model selection and starts Claude Code with the correct environment variables. No manual configuration required.
Option B: Environment Variables (Manual)
macOS / Linux:
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
claude --model gpt-oss:20b
Add the three export lines to ~/.bashrc or ~/.zshrc to persist across sessions.
Windows (PowerShell):
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_API_KEY = ""
claude --model gpt-oss:20b
Option C: Settings File (Persistent)
Create or edit ~/.claude/settings.json:
{
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "ollama",
"ANTHROPIC_API_KEY": ""
}
}
Then run Claude Code with the --model flag:
claude --model gpt-oss:20b
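If you prefer to script Option C, a small helper along these lines can merge the redirect variables into an existing settings file without clobbering other keys. The file location comes from this guide; the helper itself is a convenience sketch:

```python
import json
from pathlib import Path

OLLAMA_ENV = {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
}

def point_claude_at_ollama(settings_path: Path) -> dict:
    """Merge the Ollama redirect variables into a Claude Code settings
    file (e.g. ~/.claude/settings.json), preserving existing keys."""
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    settings.setdefault("env", {}).update(OLLAMA_ENV)
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

Run it once against ~/.claude/settings.json and the redirect persists across sessions.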
Step 5: Run It
Navigate to a project directory and start Claude Code:
cd ~/my-project
claude --model gpt-oss:20b
Test with a prompt:
What files are in this project and what does it do?
or:
Write a Python function that reads a CSV and returns the top 5 rows sorted by a given column.
Claude Code reads the project files, reasons about the request, and writes or edits code, all powered by the local model.
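For reference, a reasonable answer to the CSV prompt above looks something like this stdlib-only sketch (function name and signature are illustrative):

```python
import csv

def top_rows(path, column, n=5, numeric=True):
    """Return the top-n rows of a CSV, sorted descending by `column`.

    Assumes the CSV has a header row; set numeric=False to sort
    the column lexicographically instead of as floats.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    key = (lambda r: float(r[column])) if numeric else (lambda r: r[column])
    return sorted(rows, key=key, reverse=True)[:n]
```

A local 14B to 20B model handles prompts of this shape reliably; it is the multi-file, multi-step tasks where the quality gap shows.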
Offline Verification
Disconnect from the internet and run a prompt. A successful response confirms fully local operation with no external data transmission.
Context Length Configuration
Claude Code performs better with large context windows. Ollama recommends at least 64K tokens for coding tools. On a 16 GB RAM machine, 16K to 32K is more realistic to avoid excessive memory pressure.
Set context length via environment variable before starting Ollama:
export OLLAMA_CONTEXT_LENGTH=32000
ollama serve
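Why the 64K recommendation strains 16 GB: the attention KV cache grows linearly with context length, on top of the model weights themselves. The estimate below uses the standard formula; the layer and head counts are illustrative placeholders, not the actual gpt-oss:20b configuration:

```python
# Back-of-envelope KV-cache size: 2 tensors (K and V) x layers x KV heads
# x head dim x bytes per element, per token of context. The architecture
# numbers here are assumed for illustration only.
def kv_cache_gib(context_tokens, layers=24, kv_heads=8, head_dim=128,
                 bytes_per_elem=2):  # 2 bytes = fp16
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token / 2**30

for ctx in (16_000, 32_000, 64_000):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.2f} GiB")
```

Under these assumptions a 64K context roughly doubles the cache cost of 32K, which is why 16K to 32K is the realistic ceiling when a ~13 GB model already occupies most of a 16 GB machine.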
Cloud models do not have this constraint; they run at their full context length on remote infrastructure.
Caveats and Limitations
These are real constraints, not footnotes.
Inference speed. Local inference on 16 GB hardware is slow. Expect 10 to 60 seconds per response depending on complexity. Multi-file refactors can take several minutes. This is a fundamental hardware limitation, not a software bug.
Model quality. Open-source models in the 14B to 20B parameter range are competent for common coding patterns, code explanation, test generation, and standard refactoring. They fall short on complex multi-step reasoning, novel architecture decisions, and tasks requiring deep domain knowledge. They are not comparable to Opus 4.5/4.6 in capability.
Tool calling reliability. Claude Code depends on the model's ability to produce correctly formatted tool calls. gpt-oss:20b and glm-4.7-flash handle this consistently. Other models may fail intermittently. If tool calling breaks repeatedly with a given model, switch to one of these two.
Memory pressure at 16 GB. Running a 13 GB model leaves roughly 3 GB for the OS, context window, and other applications. Close unnecessary programs. Expect swapping if running memory-heavy applications alongside inference. If the system becomes unresponsive, switch to a smaller model or use Ollama Cloud.
Maturity. Ollama's Anthropic API compatibility shipped in January 2026. Edge cases in streaming and tool calling are still being patched. Check Ollama's release notes for fixes relevant to Claude Code workflows.
Quick Reference
# --- ONE-TIME SETUP ---
# Install Ollama (or download from ollama.com/download)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull gpt-oss:20b
# Install Claude Code
curl -fsSL https://claude.ai/install.sh | bash
source ~/.bashrc
# --- DAILY USE ---
# Easiest method (Ollama v0.15+):
ollama launch claude
# Manual method:
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
cd ~/my-project
claude --model gpt-oss:20b
# --- MANAGEMENT ---
ollama list # List installed models
ollama pull <model> # Download a model
ollama rm <model> # Remove a model
claude --version # Check Claude Code version
Switching Back to Anthropic
To return to Anthropic's API for tasks that require a more capable model, unset the environment variables:
unset ANTHROPIC_BASE_URL
unset ANTHROPIC_AUTH_TOKEN
unset ANTHROPIC_API_KEY
Run claude without --model to use the default Anthropic backend. A practical workflow: use local models for routine development, switch to Anthropic for tasks where model quality is the bottleneck.
References
- Ollama - Download
- Ollama Blog - Claude Code Compatibility
- Ollama Blog - ollama launch
- Ollama Docs - Claude Code Integration
- Claude Code - Setup Documentation
- Ollama Docs - Anthropic API Compatibility
February 2026. Models and tooling evolve rapidly; verify versions against official documentation before following these steps.