
Pinning a Local LLM to an RTX 5090: Five Hours, Several Faceplants, One Solid Setup

may 5, 2026  |  afternoon, at my desk  |  ~local-llm

I am a 42-year-old engineer, and every time my internet drops, my Claude Code drops with it. I got tired of that dependency. One afternoon I sat down and built out a local AI stack. Not Anthropic-grade, but genuinely useful. Here is the run.

>> The Hardware

The RTX 5090 Laptop GPU's 24 GB of VRAM is the gravity well that shapes every decision. A 30B-class model fits cleanly. A 70B starts shoving itself out into system RAM.
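
Back-of-envelope, assuming ~4.5 effective bits per weight for a q4 GGUF and ~3.5 for q3 (real file sizes vary with the quant mix):

weights ≈ params × bits_per_weight / 8
30B @ q4:  30e9 × 4.5 / 8  ≈ 17 GB   -> fits in 24 GB, with room for the KV cache
70B @ q3:  70e9 × 3.5 / 8  ≈ 31 GB   -> over budget before the first token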

>> Initial Pulls

Ollama was already on the box (v0.15.4). Three models:

qwen3:32b              # general chat
qwen3-coder:30b        # code, agentic work
llama3.3:70b-q3_K_M    # max quality, slow

Roughly 70 GB and 80 minutes later, models on disk. At this stage I had power but no harness.
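
If you are reproducing this, each is a single pull, then verify what landed:

ollama pull qwen3:32b
ollama pull qwen3-coder:30b
ollama pull llama3.3:70b-q3_K_M
ollama list              # tags and on-disk sizes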

>> First Faceplant: Qwen-Code CLI

The Qwen team ships their own qwen-code, meant to be the Claude Code analogue. Installed it, pointed it at local Ollama, asked it to "analyze the src directory and write a bug report." It promptly hallucinated a "vim mode implementation" task and added it to a todo list. Then it started fetching GitHub URLs about Vim's source code. Zero connection to my actual request.

The technical why: qwen-code is a fork of Gemini CLI. Its tool registry uses names (todo, AskUserQuestion) that the Qwen models were never trained to call. At every step the model reaches for the wrong tool name, the call fails, and it panics. Compounded by Ollama's flaky translation of qwen3-coder's XML-style tool calls into OpenAI's JSON format. Domino collapse.
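
Schematically, this is what that translation layer has to get right on every single turn (a simplified sketch, not the exact wire formats):

# qwen3-coder emits something XML-ish:
<tool_call>
<function=todo>
<parameter=items>["analyze src", "write bug report"]</parameter>
</function>
</tool_call>

# the client expects OpenAI-style JSON:
{"tool_calls": [{"function": {"name": "todo", "arguments": "{\"items\": [...]}"}}]}

One parse miss and the call comes back malformed, the harness rejects it, and the model starts improvising.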

Lesson: "built for our model" does not mean "works on your stack." Format adapter maturity matters as much as the model itself.

>> Recovery: OpenCode

Switched to OpenCode. It talks to providers through Vercel's @ai-sdk/openai-compatible adapter, battle-tested across 50+ providers. One command to install:

curl -fsSL https://opencode.ai/install | bash

Configured ~/.config/opencode/opencode.json with the Ollama provider and set qwen3-coder:30b as the default model. The first prompt did exactly what I asked: no phantom todos, no fake URLs. Same model, same Ollama, dramatically different outcome, purely because the harness was built provider-agnostic from day one.
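
For reference, the config looked roughly like this (a sketch following OpenCode's custom-provider pattern; field names may differ in your version, so check the docs):

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": { "qwen3-coder:30b": { "name": "Qwen3 Coder 30B" } }
    }
  },
  "model": "ollama/qwen3-coder:30b"
}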

>> The Real Work: Memory Tuning

First try with 128K context. ollama ps told me:

SIZE: 32 GB    PROCESSOR: 24%/76% CPU/GPU

A 32 GB workload jammed into 24 GB VRAM. Ollama dutifully spilled 24% of the model into system RAM. Result: GPU sat at 15% utilization while my 24 CPU threads pegged at 100%. Throughput: ~10 tok/s.
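
The culprit is the KV cache, and its size is simple arithmetic. The ballpark below uses the 30B-A3B architecture's published attention config (48 layers, 4 KV heads, head dim 128); double-check against your model card:

kv_bytes ≈ 2 × n_layers × n_kv_heads × head_dim × n_ctx × bytes_per_elem
FP16 @ 128K:  2 × 48 × 4 × 128 × 131072 × 2  ≈ 12.9 GB of cache alone
+ ~18-19 GB of quantized weights + runtime overhead ≈ the 32 GB Ollama reported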

Two fixes:

1. Drop context to 64K. Built a custom variant via Modelfile:

FROM qwen3-coder:30b
PARAMETER num_ctx 65536
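
Register the variant and run it (the -64k tag is my naming, not an official one):

ollama create qwen3-coder:30b-64k -f Modelfile
ollama run qwen3-coder:30b-64k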

Total 25 GB, 96% GPU split, 45 tok/s.

2. Enable Flash Attention + Q8 KV cache. systemctl edit ollama:

[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

The KV cache (the model's "running memory" of the conversation) gets quantized from FP16 to 8-bit integers. That halves the cache footprint, with a perplexity hit measured in fractions of a percent. Bonus: long-context inference runs ~30% faster.
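
Both settings take effect only after a restart; ollama ps confirms the new split:

sudo systemctl restart ollama
ollama ps                # SIZE shrinks, PROCESSOR climbs toward 100% GPU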

>> Final Config

I pushed context to 100K and it just barely fit:

qwen3-coder:30b-100k    24 GB    100% GPU    102400 ctx

That is the physical edge of 24 GB VRAM. 800 MB headroom, model fully GPU-resident, KV cache at Q8, Flash Attention on. Throughput: 50-60 tok/s. Practical capacity: ~8000 lines of code in a single context window.
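
For completeness, the 100K variant is built the same way as the 64K one, just with a bigger num_ctx:

FROM qwen3-coder:30b
PARAMETER num_ctx 102400

then ollama create qwen3-coder:30b-100k -f Modelfile.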

>> Lessons Banked

1. Agent CLIs matured inside the Anthropic ecosystem. Outside it, most are marketing. OpenCode is the exception I have found, designed provider-agnostic from the ground up.

2. Advertised context size lies. "256K supported" does not mean "256K on your GPU." The real limit is hardware × KV cache format.

3. Q8 KV cache should be default. It is a free lunch. I have no idea why it is still opt-in.

4. Local LLMs are not Anthropic-grade. SWE-bench Verified: Opus 4.5 = 80%, Qwen3-Coder = 70%. But for the everyday 80% of my work it genuinely holds up, and it is free and offline.

5. 5090 mobile + 30B-class models is the sweet spot. A 70B at q3 runs but crawls; 30B class is what this hardware is actually built for.

>> Postscript

Total time invested: 5 hours, mostly waiting for downloads.
Monthly savings: maybe $50 in cloud tokens, but the real win is offline work and data privacy.
Regrets: none.

When I have internet, claude (cloud). When I do not, opencode (local). Both have their place. We used to fight settings.json in VS Code; today we fight Modelfile. The engineering is the same engineering.