Local AI workflow

How to Run Claude Code with Local AI Models Without Breaking llama.cpp KV Cache

Local backends like llama.cpp (llama-server) and LM Studio are fastest when the beginning of the prompt stays identical between requests, because the cached KV prefix can be reused. If Claude Code keeps injecting changing metadata or git context near the top of the prompt, the backend may miss the cached prefix and end up prefilling the large system prompt on every turn. The setup below keeps the prompt prefix more stable.
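
Once the server from step 1 below is running, you can see the effect directly: send two requests that share a prefix and compare the prompt-processing timings. This is a rough sketch, not part of the setup itself; the /completion endpoint is part of llama-server, but the exact timings fields can differ between builds, and it assumes jq is installed.

PROMPT='You are a helpful assistant. Answer briefly.'

# First call: the whole prompt has to be prefilled.
curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --arg p "$PROMPT What is 2+2?" '{prompt: $p, n_predict: 8, cache_prompt: true}')" \
  | jq '.timings | {prompt_n, prompt_ms}'

# Second call shares the prefix: prompt_n should drop to only the changed suffix.
curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --arg p "$PROMPT What is 3+3?" '{prompt: $p, n_predict: 8, cache_prompt: true}')" \
  | jq '.timings | {prompt_n, prompt_ms}'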

TL;DR: use the two commands below.

1. Start llama-server

/path/to/llama-server \
  --model /path/to/your-model.gguf \
  --jinja \
  --reasoning-format auto \
  --threads 12 \
  --n-gpu-layers 99 \
  --flash-attn on \
  --mlock \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-ram 24576 \
  --ctx-checkpoints 128 \
  --checkpoint-every-n-tokens 1024 \
  --slot-prompt-similarity 0.01 \
  --host 127.0.0.1 \
  --port 8080 \
  --parallel 1 \
  --cont-batching \
  --metrics \
  --slots

If you are running a TurboQuant build, swap in its V-cache type:

--cache-type-v turbo4
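
Before pointing Claude Code at it, confirm the server is up and that --slots is exposing slot state. Both endpoints exist in current llama-server builds:

# Should report status "ok" once the model has finished loading.
curl -s http://127.0.0.1:8080/health

# Per-slot state (enabled by --slots above).
curl -s http://127.0.0.1:8080/slots | jq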

2. Start Claude Code

ANTHROPIC_BASE_URL=http://127.0.0.1:8080 \
ANTHROPIC_API_KEY=no-key \
ANTHROPIC_MODEL=your-model.gguf \
CLAUDE_CODE_ATTRIBUTION_HEADER=0 \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
DISABLE_TELEMETRY=1 \
DISABLE_ERROR_REPORTING=1 \
claude --bare \
  --model your-model.gguf \
  --dangerously-skip-permissions \
  --exclude-dynamic-system-prompt-sections \
  --settings '{"includeGitInstructions":false}' \
  --allowedTools "WebFetch,Read,Edit,Write,Bash(curl:*)"
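
If you run this often, a small wrapper script keeps the environment and flags in one place. This is just a convenience sketch (the claude-local.sh name is arbitrary); it repeats the exact command from above:

#!/usr/bin/env bash
# claude-local.sh - launch Claude Code against the local llama-server from step 1
set -euo pipefail

export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_API_KEY=no-key
export ANTHROPIC_MODEL=your-model.gguf
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export DISABLE_TELEMETRY=1
export DISABLE_ERROR_REPORTING=1

exec claude --bare \
  --model your-model.gguf \
  --dangerously-skip-permissions \
  --exclude-dynamic-system-prompt-sections \
  --settings '{"includeGitInstructions":false}' \
  --allowedTools "WebFetch,Read,Edit,Write,Bash(curl:*)" \
  "$@"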

Core idea: fewer dynamic prompt changes mean better prefix reuse, less repeated prefill, and faster local Claude Code experiments.
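
To check that this actually reduces repeated prefill, watch the counters exposed by --metrics while a Claude Code session runs. Metric names vary a little between llama.cpp versions, so grep loosely:

# If the prompt-token counter barely grows between turns, the prefix is being reused.
watch -n 2 'curl -s http://127.0.0.1:8080/metrics | grep -E "prompt_tokens|kv_cache"'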

I will keep updating this note as the setup changes.