Running Claude Code locally with Ollama and open-source models as a free alternative to the Anthropic API
Claude Code's API costs add up fast for heavy users, often $50 to $200+/month on Opus 4.5/4.6. Ollama (v0.14.0+) now supports the Anthropic Messages API natively, which means Claude Code can run against local open-source models at zero cost, with no data leaving the machine.
This guide covers the full setup: installing Ollama and Claude Code, choosing a model that fits 16 GB of RAM, connecting the pieces, and understanding the real tradeoffs.
The Cost Problem
Claude Code is Anthropic's terminal-based coding agent. It reads codebases, edits files, runs shell commands, calls tools, and handles multi-step workflows, all from the command line via natural language.
Under the hood, it communicates with Anthropic's API, typically using Claude Sonnet 4.5 or Opus 4.5/4.6. Opus 4.5 charges roughly $15 per million input tokens and $75 per million output tokens. Daily use during active development routinely reaches $50 to $200/month. The Claude Max subscription ($100 to $200/month) flattens that cost but remains significant for independent developers, students, and hobbyists.
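To make the token math concrete, here is a quick sketch at those list prices. The monthly token volumes are assumed figures for illustration, not measurements:

```python
# Rough monthly bill at Opus 4.5 list prices ($15 / $75 per million
# input/output tokens, as quoted above). Token volumes are assumptions.
INPUT_PER_M = 15.0   # USD per million input tokens
OUTPUT_PER_M = 75.0  # USD per million output tokens

def monthly_cost(input_tokens_m: float, output_tokens_m: float) -> float:
    """Estimated monthly API cost, token counts given in millions."""
    return input_tokens_m * INPUT_PER_M + output_tokens_m * OUTPUT_PER_M

# e.g. a month with 8M input tokens and 1M output tokens
print(f"${monthly_cost(8, 1):.2f}")
```

At that assumed volume the bill lands near the top of the $50 to $200 range quoted above.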
How Ollama Changes the Equation
Ollama runs large language models locally. It handles model downloads, quantization, memory management, and API serving. One command pulls a model, another runs it. It supports macOS, Windows, and Linux across both Apple Silicon and NVIDIA GPUs.
Since v0.14.0 (January 2026), Ollama exposes an Anthropic-compatible Messages API on localhost:11434. This is the same protocol Claude Code uses to reach Anthropic's servers. By redirecting Claude Code to Ollama's local endpoint, the agent continues to function (file editing, tool calling, multi-turn reasoning) but inference runs on a local open-source model instead of Anthropic's cloud.
No API key. No usage bill. No data transmitted externally.
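For the curious, the redirected traffic is ordinary Messages API JSON. This sketch only constructs the request body (it does not contact a server); the exact endpoint path Ollama exposes under localhost:11434 should be verified against its documentation:

```python
import json

# An Anthropic-style Messages API request body. Claude Code sends payloads
# shaped like this; with the base URL redirected, they reach Ollama instead.
# The model field carries an Ollama model tag, not a Claude model ID.
payload = {
    "model": "gpt-oss:20b",
    "max_tokens": 512,
    "messages": [
        {"role": "user", "content": "Explain what this repo does."}
    ],
}

body = json.dumps(payload)
print(body)
```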
How Claude Code Works (Brief Overview)
Claude Code is not an autocomplete tool or a chat wrapper. It is an agent that operates inside the terminal with the following capabilities:
- Codebase awareness: reads project structure, files, and dependencies
- Direct file editing: writes and modifies code across multiple files
- Shell execution: runs commands, tests, and package installations
- Tool calling: invokes external tools and chains multi-step operations
- Git integration: handles commits, branches, and diffs
- Multi-turn reasoning: plans, iterates, and refines across conversation turns
The agent itself is free to install. The cost comes from the model it talks to. This guide replaces the paid model with a free, locally hosted one.
What Is Ollama
Ollama is an open-source tool for downloading, managing, and serving LLMs (large language models) on local hardware. It abstracts away model format handling (GGUF quantization, memory allocation, GPU offloading) behind a CLI and HTTP API.
Key details:
- Runs on macOS, Windows 11, and Linux
- Supports Apple Silicon (unified memory) and NVIDIA GPUs (CUDA)
- Falls back to CPU inference when no GPU is available (much slower)
- Serves models via a local HTTP API on port 11434
- Since v0.14.0, that API includes Anthropic Messages API compatibility
- Since v0.15.0, the ollama launch command automates Claude Code configuration
Cost Comparison
| Setup | Monthly Cost | Notes |
|---|---|---|
| Claude Code + Opus 4.5 API | ~$50 to $200+ | Scales with token usage |
| Claude Max subscription | $100 to $200 | Flat rate |
| Claude Code + Ollama (local) | $0 | Electricity only |
| Claude Code + Ollama Cloud | Free tier available | Paid plans start at ~$3/month |
Annual savings range from $600 to $2,400 depending on prior usage. The tradeoff is model capability, covered in the Caveats section below.
Requirements
| Component | Specification |
|---|---|
| OS | macOS 13.0+ (Apple Silicon recommended) or Windows 11 |
| RAM | 16 GB minimum |
| Disk | 15 to 25 GB free for model files |
| Ollama | v0.15+ (ollama.com/download) |
| Claude Code | Current release (code.claude.com) |
| Internet | Required for initial downloads only |
16 GB of RAM limits model selection to the 14B to 20B parameter range. Expect noticeably slower inference than on 32 GB+ machines, and lower output quality than larger models deliver. Specific model recommendations for this constraint follow in Step 2.
Step 1: Install Ollama
Download the installer from ollama.com/download.
macOS: Open the .dmg, drag Ollama to Applications. It runs as a background service.
Windows 11: Run OllamaSetup.exe and follow the prompts. It installs as a system service.
Verify the installation:
ollama --version
Expected output: ollama version is 0.15.x (or newer). If the command fails, the Ollama service may not be running. Start it manually with ollama serve in a separate terminal window.
Step 2: Choose and Pull a Model
Model selection determines the quality/speed/memory tradeoff. These are the current recommendations for 16 GB RAM systems, ordered by general coding effectiveness:
Local Models (16 GB RAM)
| Model | Download Size | Strengths | Command |
|---|---|---|---|
| gpt-oss:20b | ~13 GB | Strong coding, reliable tool calling. Top pick at this memory tier. | ollama pull gpt-oss:20b |
| glm-4.7-flash | ~12 GB | MoE architecture (30B total, 3B active per token). Fast inference, native tool calling, 128K context. | ollama pull glm-4.7-flash |
| qwen3-coder:14b | ~9 GB | Coding-specialized. Lower memory footprint, reasonable quality. | ollama pull qwen3-coder:14b |
| devstral-small | ~14 GB | Mistral's coding model. Competent at general development tasks. | ollama pull devstral-small |
Start with gpt-oss:20b. If memory pressure causes slowdowns or heavy swapping, drop to qwen3-coder:14b.
Cloud Models (No Local Hardware Constraint)
Ollama also hosts cloud-served models accessible through the same CLI. These run at full context length on remote infrastructure with a free tier:
ollama pull glm-4.7:cloud
ollama pull gpt-oss:120b-cloud
ollama pull minimax-m2.1:cloud
Cloud models are significantly more capable than anything that fits in 16 GB locally. They serve as a practical fallback when local inference is too slow or too limited for a given task. Note: data does leave the machine when using cloud models.
Pull the Model
For this guide, using gpt-oss:20b:
ollama pull gpt-oss:20b
This downloads approximately 13 GB of model weights. After completion, verify:
ollama list
The model should appear with its name, ID, and size.
Step 3: Install Claude Code
Claude Code installs as a standalone binary. The npm installation method is deprecated; use the native installer.
macOS / Linux / WSL
curl -fsSL https://claude.ai/install.sh | bash
Reload the shell configuration:
source ~/.bashrc # or: source ~/.zshrc
Windows (PowerShell)
irm https://claude.ai/install.ps1 | iex
Windows (CMD)
curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd
Verify
claude --version
If the command is not found, confirm that ~/.local/bin or ~/.claude/bin is in the system PATH. On macOS/Linux:
echo 'export PATH="$HOME/.local/bin:$HOME/.claude/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Step 4: Connect Claude Code to Ollama
Three configuration methods, from simplest to most flexible:
Option A: ollama launch (Recommended, Ollama v0.15+)
ollama launch claude
This walks through model selection and starts Claude Code with the correct environment variables. No manual configuration required.
Option B: Environment Variables (Manual)
macOS / Linux:
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
claude --model gpt-oss:20b
Add the three export lines to ~/.bashrc or ~/.zshrc to persist across sessions.
Windows (PowerShell):
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_API_KEY = ""
claude --model gpt-oss:20b
Option C: Settings File (Persistent)
Create or edit ~/.claude/settings.json:
{
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "ollama",
"ANTHROPIC_API_KEY": ""
}
}
Then run Claude Code with the --model flag:
claude --model gpt-oss:20b
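If you prefer to script Option C, a small helper along these lines can merge the redirect variables into an existing settings file without clobbering other keys. The file location comes from this guide; the helper itself is a convenience sketch:

```python
import json
from pathlib import Path

OLLAMA_ENV = {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
}

def point_claude_at_ollama(settings_path: Path) -> dict:
    """Merge the Ollama redirect variables into a Claude Code settings
    file (e.g. ~/.claude/settings.json), preserving existing keys."""
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    settings.setdefault("env", {}).update(OLLAMA_ENV)
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

Run it once against ~/.claude/settings.json and the redirect persists across sessions.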
Step 5: Run It
Navigate to a project directory and start Claude Code:
cd ~/my-project
claude --model gpt-oss:20b
Test with a prompt:
What files are in this project and what does it do?
or:
Write a Python function that reads a CSV and returns the top 5 rows sorted by a given column.
Claude Code reads the project files, reasons about the request, and writes or edits code, all powered by the local model.
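For reference, a reasonable answer to the CSV prompt above looks something like this stdlib-only sketch (function name and signature are illustrative):

```python
import csv

def top_rows(path, column, n=5, numeric=True):
    """Return the top-n rows of a CSV, sorted descending by `column`.

    Assumes the CSV has a header row; set numeric=False to sort
    the column lexicographically instead of as floats.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    key = (lambda r: float(r[column])) if numeric else (lambda r: r[column])
    return sorted(rows, key=key, reverse=True)[:n]
```

A local 14B to 20B model handles prompts of this shape reliably; it is the multi-file, multi-step tasks where the quality gap shows.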
Offline Verification
Disconnect from the internet and run a prompt. A successful response confirms fully local operation with no external data transmission.
Context Length Configuration
Claude Code performs better with large context windows. Ollama recommends at least 64K tokens for coding tools. On a 16 GB RAM machine, 16K to 32K is more realistic to avoid excessive memory pressure.
Set context length via environment variable before starting Ollama:
export OLLAMA_CONTEXT_LENGTH=32000
ollama serve
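Why the 64K recommendation strains 16 GB: the attention KV cache grows linearly with context length, on top of the model weights themselves. The estimate below uses the standard formula; the layer and head counts are illustrative placeholders, not the actual gpt-oss:20b configuration:

```python
# Back-of-envelope KV-cache size: 2 tensors (K and V) x layers x KV heads
# x head dim x bytes per element, per token of context. The architecture
# numbers here are assumed for illustration only.
def kv_cache_gib(context_tokens, layers=24, kv_heads=8, head_dim=128,
                 bytes_per_elem=2):  # 2 bytes = fp16
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token / 2**30

for ctx in (16_000, 32_000, 64_000):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.2f} GiB")
```

Under these assumptions a 64K context roughly doubles the cache cost of 32K, which is why 16K to 32K is the realistic ceiling when a ~13 GB model already occupies most of a 16 GB machine.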
Cloud models do not have this constraint; they run at their full context length on remote infrastructure.
Caveats and Limitations
These are real constraints, not footnotes.
Inference speed. Local inference on 16 GB hardware is slow. Expect 10 to 60 seconds per response depending on complexity. Multi-file refactors can take several minutes. This is a fundamental hardware limitation, not a software bug.
Model quality. Open-source models in the 14B to 20B parameter range are competent for common coding patterns, code explanation, test generation, and standard refactoring. They fall short on complex multi-step reasoning, novel architecture decisions, and tasks requiring deep domain knowledge. They are not comparable to Opus 4.5/4.6 in capability.
Tool calling reliability. Claude Code depends on the model's ability to produce correctly formatted tool calls. gpt-oss:20b and glm-4.7-flash handle this consistently. Other models may fail intermittently. If tool calling breaks repeatedly with a given model, switch to one of these two.
Memory pressure at 16 GB. Running a 13 GB model leaves roughly 3 GB for the OS, context window, and other applications. Close unnecessary programs. Expect swapping if running memory-heavy applications alongside inference. If the system becomes unresponsive, switch to a smaller model or use Ollama Cloud.
Maturity. Ollama's Anthropic API compatibility shipped in January 2026. Edge cases in streaming and tool calling are still being patched. Check Ollama's release notes for fixes relevant to Claude Code workflows.
Quick Reference
# --- ONE-TIME SETUP ---
# Install Ollama (or download from ollama.com/download)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull gpt-oss:20b
# Install Claude Code
curl -fsSL https://claude.ai/install.sh | bash
source ~/.bashrc
# --- DAILY USE ---
# Easiest method (Ollama v0.15+):
ollama launch claude
# Manual method:
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
cd ~/my-project
claude --model gpt-oss:20b
# --- MANAGEMENT ---
ollama list # List installed models
ollama pull <model> # Download a model
ollama rm <model> # Remove a model
claude --version # Check Claude Code version
Switching Back to Anthropic
To return to Anthropic's API for tasks that require a more capable model, unset the environment variables:
unset ANTHROPIC_BASE_URL
unset ANTHROPIC_AUTH_TOKEN
unset ANTHROPIC_API_KEY
Run claude without --model to use the default Anthropic backend. A practical workflow: use local models for routine development, switch to Anthropic for tasks where model quality is the bottleneck.
References
- Ollama - Download
- Ollama Blog - Claude Code Compatibility
- Ollama Blog - ollama launch
- Ollama Docs - Claude Code Integration
- Claude Code - Setup Documentation
- Ollama Docs - Anthropic API Compatibility
February 2026. Models and tooling evolve rapidly; verify versions against official documentation before following these steps.