If you want agentic coding without shipping prompts and repo paths to a hosted API, r/LocalLLaMA has become one of the loudest places to compare notes. In May 2026 two threads are pulling traffic together: a field report on an RTX 3090 that measures Qwen2.5-Coder, DeepSeek-Coder-V3, and a 70B quant through llama.cpp and Ollama, plus April’s “Best Local LLMs” megathread, where the Agentic / coding / tool use branch turns into a running inventory of Qwen3.5 MoE sizes, Gemma 4, MiniMax M2.7, OpenCode, and function-calling quirks. Both threads reward the same reader: someone trying to keep tool use reliable on a 24GB card. (Sources: r/LocalLLaMA 3090 thread, r/LocalLLaMA Apr 2026 megathread)
The point is not that a local run equals a frontier API on every task. It is that open-weight stacks are now good enough to stress the harness for real: long context, parallel tool calls, and VRAM math that shows up in tok/s logs instead of slide decks. People stick around these threads because they want receipts: tokens per second, quant labels that fit VRAM, and whether the loop survives twenty tool calls without wandering. (Sources: r/LocalLLaMA 3090 thread, r/LocalLLaMA Apr 2026 megathread)
Primary sources: the 3090 write-up for timings and stack choices, April’s megathread for what locals actually run under agentic workloads, and llama.cpp’s function-calling doc because posters cite it when wiring parallel_tool_calls. (Sources: r/LocalLLaMA 3090 thread, r/LocalLLaMA Apr 2026 megathread, llama.cpp function-calling doc)
What shipped
From the 3090 thread author and neighbors:
- After months of iteration the OP describes a local setup aimed at daily coding loops instead of defaulting to cloud APIs. Hardware is RTX 3090 (24GB VRAM); models in scope include Qwen2.5-Coder 32B Q4_K_M, DeepSeek-Coder-V3 Q4, and Llama 3.3 70B Q3_K_M. (Source: r/LocalLLaMA 3090 thread)
- Inference is llama.cpp plus Ollama. For orchestration the OP names Kosuke ai as a model-agnostic layer so local checkpoints slot into an agentic workflow without rewriting glue each time you swap weights. (Source: r/LocalLLaMA 3090 thread)
- Posted speeds are roughly 18 tok/s for Qwen2.5-Coder 32B Q4_K_M versus about 11 tok/s for DeepSeek-Coder-V3 Q4 on that GPU; 70B at Q3 reads as too slow for tight interactive loops unless you move to dual GPUs. A timing sketch for reproducing numbers like these follows this list. (Source: r/LocalLLaMA 3090 thread)
- The OP benchmarks throughput at 8k versus 32k context, self-correction loops, and whether memory holds across more than twenty tool calls, then argues the hard part is context management across agent steps, not only picking the biggest checkpoint. (Source: r/LocalLLaMA 3090 thread)
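Those tok/s figures are easy to sanity-check on your own rig. A minimal sketch, assuming a llama.cpp llama-server (or any OpenAI-compatible endpoint) on localhost; the port, the model name, and the presence of a usage block in the response are assumptions about your build:

```python
# Rough tok/s check against a local OpenAI-compatible endpoint.
import time
import requests

BASE_URL = "http://localhost:8080/v1"   # llama-server default; Ollama uses 11434
MODEL = "qwen2.5-coder-32b-q4_k_m"      # placeholder: whatever name your server registers

def measure_tok_per_s(prompt: str) -> float:
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.time() - start
    # Wall-clock time includes prefill, so this understates pure decode speed.
    return resp.json()["usage"]["completion_tokens"] / elapsed

if __name__ == "__main__":
    print(f"{measure_tok_per_s('Write a Python quicksort.'):.1f} tok/s (incl. prefill)")
```

Run it at 8k and 32k prompt lengths to see the context penalty the OP describes; the prefill share grows with context, which is exactly why single-number tok/s claims mislead.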
From April’s megathread and its agentic subtree:
- The megathread is moderator-led, open weights only, with hundreds of replies sorted into buckets including Agentic / Agentic Coding / Tool Use / Coding and a VRAM ladder from under 8GB to very large setups. (Source: r/LocalLLaMA Apr 2026 megathread)
- The opener highlights Qwen3.5 and Gemma 4, calls out GLM-5.1 and MiniMax-M2.7 among notable releases, and treats local agentic coding as a normal category rather than a sideshow. (Source: r/LocalLLaMA Apr 2026 megathread)
- Under agent workloads posters keep returning to Qwen3.5: 27B class weights as a daily driver, 35B A3B MoE quants when you want expert routing without pretending 235B fits a single consumer GPU, and llama.cpp function-calling templates when people discuss stable tool JSON. (Sources: r/LocalLLaMA Apr 2026 megathread, llama.cpp function-calling doc)
- Several contributors say they turn thinking off for tool-heavy stretches so they do not pay extra latency on reasoning tokens when the tool schema already spells out callable actions; MiniMax M2.7 shows up in long-context recipes, and OpenCode pairs with qwen3-coder-next when VRAM allows. A request-level sketch of the thinking toggle follows this list. (Source: r/LocalLLaMA Apr 2026 megathread)
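What “turning thinking off” looks like at the request level. A minimal sketch, assuming a recent llama.cpp llama-server fronting a Qwen3-family model; the chat_template_kwargs field, the enable_thinking flag, and the model name are assumptions about your stack, and other servers expose a different toggle (or none):

```python
import requests

# Hypothetical tool schema: one no-argument tool for illustration.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project test suite and return the output.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "qwen3.5-27b",  # placeholder name
        "messages": [{"role": "user", "content": "Run the failing test and report."}],
        "tools": TOOLS,
        # Qwen3-style templates on recent llama.cpp honor this kwarg; treat it
        # as an assumption and check your server's docs before relying on it.
        "chat_template_kwargs": {"enable_thinking": False},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"])
```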
Why these threads matter if you run tools locally
If your goal is coding agents at home, you care about three noisy facts at once: monthly API bills, data leaving your machine, and whether the model still follows instructions after the tenth grep. The 3090 post reads less like marketing and more like a bench log: tok/s, quant names, and explicit comparisons between coder-specialized weights. The megathread adds breadth so you can sanity-check whether your VRAM tier matches someone else’s recipe before you spend a weekend downloading the wrong GGUF. (Sources: r/LocalLLaMA 3090 thread, r/LocalLLaMA Apr 2026 megathread)
Operator note (first-hand): I verified both threads through Reddit’s public JSON feeds (old.reddit.com/.../.json), including the ~18 tok/s figure for Qwen2.5-Coder 32B Q4_K_M on the 3090 setup and the megathread’s Agentic subtree structure. Plain www.reddit.com HTML returned a block page in this environment, so JSON was the reproducible route. (Sources: r/LocalLLaMA 3090 thread, r/LocalLLaMA Apr 2026 megathread)
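For anyone repeating that verification, the JSON route is a few lines. A minimal sketch; the thread ID below is a placeholder (the source elides the full paths), and the descriptive User-Agent matters because Reddit tends to reject default client UAs:

```python
import requests

# Placeholder thread id: append .json to any old.reddit.com thread URL.
url = "https://old.reddit.com/r/LocalLLaMA/comments/<thread_id>.json"
headers = {"User-Agent": "research-script/0.1 (thread verification)"}

data = requests.get(url, headers=headers, timeout=30).json()
# Thread JSON is a two-element list: [0] is the post listing, [1] the comments.
post = data[0]["data"]["children"][0]["data"]
print(post["title"], post["score"], sep=" | ")
```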
Hardware snapshot: what “3090 class” really means in those posts
A 24GB 3090 is still the shorthand card because it is common and tight enough that quant choices bite. The OP’s ~18 tok/s versus ~11 tok/s gap between two coder models is the difference between a loop that feels interactive and one that feels like watching batch jobs. Treat those numbers as tied to context length, batching, and offload settings, not as universal scores. (Source: r/LocalLLaMA 3090 thread)
People running two 3090s or a 3090 beside a newer GPU trade notes on tensor split with llama.cpp. Some see faster generation but worse prompt eval, which can hurt agents that spend large chunks of time in prefill. The takeaway: tensor split is not a guaranteed win for every harness. (Source: r/LocalLLaMA Apr 2026 megathread)
Decision rule for teams: when two people quote wildly different tok/s on “the same” GGUF, check layer offload first. One thread discussion traces exactly that gap to -ngl set too low, which left most of the model on CPU; tok/s jumped once offload was fixed. (Source: r/LocalLLaMA Apr 2026 megathread)
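A launch sketch for that check, assuming llama.cpp’s llama-server; the model path and context size are placeholders, -ngl 99 is the common “offload everything that fits” idiom, and --tensor-split only applies to multi-GPU rigs:

```python
import subprocess

# Runs llama-server in the foreground; watch the startup log for how many
# layers actually landed on GPU versus CPU.
subprocess.run([
    "llama-server",
    "-m", "models/qwen2.5-coder-32b-q4_k_m.gguf",  # placeholder path
    "-ngl", "99",              # offload all layers; too-low values leave the model on CPU
    "--tensor-split", "1,1",   # dual-GPU example; omit on a single card
    "--ctx-size", "32768",
], check=True)
```

If tok/s jumps after forcing full offload, the mystery was never the quant.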
Model patterns that keep showing up for agentic work
Qwen3.5 is the workhorse name in April’s agentic replies: 27B for single-GPU workflows where you still want room to breathe, 35B A3B MoE when you want expert routing without 235B VRAM fantasy, and coder-next style builds when the session looks like an IDE marathon. None of that replaces your own eval on your repo, but it explains why the same family appears beside llama.cpp flags and OpenCode configs. (Source: r/LocalLLaMA Apr 2026 megathread)
Gemma 4 appears across categories; some builders like it for coding sessions while others warn about stability under agent harnesses when a crash mid-run is costly. MiniMax M2.7 tends to appear when people describe very long contexts or heavy tool automation; VRAM usually pushes those builds toward bigger tiers than a lone 24GB card. The megathread intro even nicknames MiniMax-M2.7 an “accessible Sonnet at home,” which is informal language but captures how people benchmark local stacks against hosted assistants. (Sources: r/LocalLLaMA 3090 thread, r/LocalLLaMA Apr 2026 megathread)
Tool calls, parallel tools, and memory outside the KV cache
Most “it broke” stories are not glamorous. They are bad JSON, partial arguments, wrong chat templates, and context that balloons step by step. That is why posters point to llama.cpp’s function-calling documentation instead of copying random OpenAI snippets that do not match your local server. (Sources: r/LocalLLaMA Apr 2026 megathread, llama.cpp function-calling doc)
When contributors emphasize parallel_tool_calls=true, they are describing batching tool work that would otherwise serialize into a slow staircase of round trips. Parallel calls only help if your harness can issue them safely; otherwise you still wait on shell and git speed, not decode speed. (Sources: r/LocalLLaMA Apr 2026 megathread, llama.cpp function-calling doc)
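The harness-side half of that batching is easy to sketch. A minimal example, assuming tool calls arrive in the OpenAI message shape; the dispatcher is a stub, and fanning calls out in threads is only safe when the tools do not mutate shared state out from under each other:

```python
from concurrent.futures import ThreadPoolExecutor
import json

def run_tool(call: dict) -> dict:
    # Stub dispatcher: a real harness maps tool names to sandboxed handlers.
    args = json.loads(call["function"]["arguments"])
    result = f"ran {call['function']['name']} with {args}"
    return {"role": "tool", "tool_call_id": call["id"], "content": result}

def dispatch(tool_calls: list[dict]) -> list[dict]:
    # Batch independent calls instead of serializing a staircase of round trips.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_tool, tool_calls))

calls = [
    {"id": "1", "function": {"name": "grep", "arguments": '{"pattern": "TODO"}'}},
    {"id": "2", "function": {"name": "git_status", "arguments": "{}"}},
]
print(dispatch(calls))
```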
The 3090 author’s stress on context management lines up with comments asking for project memory beyond raw KV: once traces get long, summarization and external memory become the product problem. Hermes and similar harness names surface in that gap. (Sources: r/LocalLLaMA 3090 thread, r/LocalLLaMA Apr 2026 megathread)
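One shape that gap takes in practice is a compaction pass between agent steps. A minimal sketch, not anyone’s published harness: summarize() is a stand-in for a cheap model call or heuristic digest, and the character budget approximates a token budget:

```python
def summarize(turns: list[dict]) -> str:
    # Stub: a real harness would call a small model here.
    return " / ".join((m.get("content") or "")[:80] for m in turns)

def compact(messages: list[dict], budget_chars: int = 24_000) -> list[dict]:
    size = sum(len(m.get("content") or "") for m in messages)
    if size <= budget_chars or len(messages) <= 8:
        return messages
    # Keep the system message and the six most recent turns verbatim; fold
    # everything in between into a summary so tool output does not balloon.
    system, head, tail = messages[0], messages[1:-6], messages[-6:]
    return [system,
            {"role": "system", "content": f"Earlier steps, summarized: {summarize(head)}"},
            *tail]
```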
The bottleneck named in the 3090 write-up is not “pick the biggest GGUF you can download.” It is keeping agent state honest across steps when tools mutate the tree and the model still needs to choose the next call without drifting into theater. (Source: r/LocalLLaMA 3090 thread)
Context: how this sits next to MCP and vendor agent stacks
MCP shows up across AgenticWire as both the integration fabric teams want and a trust problem when STDIO servers blur the line between config and command execution. Framework vendors keep shipping graph workflows and tool contracts for teams that outgrow one-off scripts. Running local weights does not remove those concerns; it shifts where you spend review time. If you want parallel reading on harness versus sandbox language, see https://www.agenticwire.news/article/agents-sdk-harness-native-sandboxes . For MCP packaging from another major stack, see https://www.agenticwire.news/article/microsoft-agent-framework-1-0-workflows-mcp . For STDIO risk when tools become commands, see https://www.agenticwire.news/article/mcp-stdio-config-command-execution-risk . (AgenticWire read: same tooling debates apply whether weights are local or remote.)
Adoption notes: what to try before you blame the model
Decision rules for teams:
- If your failure mode is messy tool IO, fix templates and JSON validation before you chase a bigger parameter count; llama.cpp’s function-calling path is the boring reference that saves hours. See the validation sketch after this list. (Sources: r/LocalLLaMA Apr 2026 megathread, llama.cpp function-calling doc)
- On 24GB, compare MoE A3B routing against dense coder models at Q4: the 3090 author’s Qwen2.5-Coder versus DeepSeek-Coder-V3 spread is a reminder that speed and quality do not move in lockstep. (Source: r/LocalLLaMA 3090 thread)
- If your UI exposes thinking toggles, treat them as a latency knob for tool-heavy runs: multiple 3090-class reports turn thinking off when the schema already defines tools, because extra reasoning tokens do not fix a bad harness. (Source: r/LocalLLaMA Apr 2026 megathread)
- Log tool traces before you chase speculative decoding. The 3090 OP asks about speculative decoding for agents, but long threads keep returning to state and context long before raw decode tricks. (Sources: r/LocalLLaMA 3090 thread, r/LocalLLaMA Apr 2026 megathread)
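The validation half of the first rule above, as a minimal sketch; the function name and required-field convention are illustrative, not from any cited harness:

```python
import json

def parse_args(raw: str, required: set[str]) -> tuple[dict | None, str | None]:
    # Reject malformed or incomplete tool arguments and return the error so the
    # loop can hand it back as a tool message and retry, instead of crashing.
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"arguments were not valid JSON: {e}"
    missing = required - args.keys()
    if missing:
        return None, f"missing required fields: {sorted(missing)}"
    return args, None

args, err = parse_args('{"path": "src/"}', required={"path", "pattern"})
print(args, err)  # -> None missing required fields: ['pattern']
```

Logging these rejections per step is also the cheap tool trace the last rule asks for; many “it broke” stories in these threads reduce to exactly this failure.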
Related coverage
- https://www.agenticwire.news/article/agents-sdk-harness-native-sandboxes - Harness versus sandbox vocabulary for long-horizon agents, useful when your local stack is doing double duty as execution environment.
- https://www.agenticwire.news/article/microsoft-agent-framework-1-0-workflows-mcp - Graph workflows and MCP from Microsoft’s agent framing, a contrast point for self-hosted loops.
- https://www.agenticwire.news/article/mcp-stdio-config-command-execution-risk - MCP STDIO risks when tool wiring turns into command execution.
- https://www.agenticwire.news/article/mcp-adoption-accelerates-practical-implementation-guides - Why practical MCP guides took off as teams moved from demos to wiring.