Use Local Models as a Compression Layer to Cut Your Claude Bill


You’re paying Claude $3/MTok input to read a 528 KB GitHub issue with 400 comments.

You’re paying Claude image-tokens to look at a screenshot when you really just need to know “what does the error say.”

For a huge class of work, you don’t need Claude’s reasoning to ingest the content — you need a smart-enough model to compress it into something Claude actually has to think about. Local (or cheap cloud) models do this for a fraction of the cost. Route the big content through the cheap model first, hand the summary to Claude, and save the expensive tokens for the actual work.

This post walks through two concrete patterns with real code and the token math.

The pattern

Big content (text, image)
        ↓
Cheap model (Ollama, LM Studio, vLLM, OpenRouter…)
        ↓
Compressed summary (1–5% of original size)
        ↓
Claude does the work that needs Claude

The cheap model handles compression. Claude handles reasoning. They’re doing different jobs.

The economics work because there’s usually a large gap between “raw content size” and “what you actually need.” A GitHub issue with 400 comments is 528 KB of discussion, but the relevant signal — current state, recent activity, maintainer responses — fits in 2–3 KB. A screenshot is thousands of image tokens; “there’s a red TypeError: undefined is not a function in the top-left panel” is 20 tokens.

The bigger that gap, the more you save.

Pattern 1: Summarize large text

Use case: an external document arrives that’s mostly noise with a few important facts buried inside.

  • A GitHub issue with 100+ comments where you need recent activity and maintainer responses
  • A 50-page PDF where you need the conclusions and specific data points
  • A multi-megabyte log file where you need the errors and surrounding context
  • A long article where you need the gist plus any specific claims with numbers

The mechanic:

gh issue view 42796 --repo anthropics/claude-code --comments \
  | python3 summarize.py --focus="recent activity, maintainer responses, referenced PRs"

Here’s the full script — no dependencies beyond the standard library:

#!/usr/bin/env python3
"""
Summarize large text via Ollama Cloud API (direct, no local proxy).

Default model: deepseek-v3.2:cloud (top reasoning, handles long context).

Requires: OLLAMA_API_KEY env var  (https://ollama.com/settings/keys)
          OR ~/.ollama/api-key file

Usage:
  summarize.py <file>                        # summarize file
  summarize.py <file> --focus="X"            # focused summary
  summarize.py --model=glm-5:cloud <file>    # override model
  cat huge.txt | summarize.py                # stdin
"""

import argparse, json, os, sys, urllib.request, urllib.error
from pathlib import Path

CLOUD_ENDPOINT = "https://ollama.com/api/chat"

def get_api_key() -> str:
    key = os.environ.get("OLLAMA_API_KEY")
    if key:
        return key.strip()
    key_file = Path.home() / ".ollama" / "api-key"
    if key_file.exists():
        return key_file.read_text().strip()
    sys.stderr.write("OLLAMA_API_KEY not set. Get one at https://ollama.com/settings/keys\n")
    sys.exit(1)

def summarize(content: str, focus: str | None = None, model: str = "deepseek-v3.2:cloud") -> str:
    if focus:
        instruction = (
            f"Summarize the following content, focused on: {focus}.\n\n"
            "Be concise but preserve specific facts (names, numbers, dates, URLs, "
            "quotes from key people). Preserve identifiable structure (comments, sections)."
        )
    else:
        instruction = (
            "Summarize the following content concisely. Preserve specific facts "
            "(names, numbers, dates, URLs, quotes from key people). "
            "Preserve identifiable structure (comments, sections)."
        )
    req_body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": f"{instruction}\n\nContent:\n{content}"}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        CLOUD_ENDPOINT, data=req_body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {get_api_key()}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            result = json.loads(resp.read())
    except urllib.error.HTTPError as e:
        sys.stderr.write(f"HTTP {e.code}: {e.read().decode(errors='replace')}\n")
        sys.exit(1)
    return result.get("message", {}).get("content", "").strip()

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("file", nargs="?")
    parser.add_argument("--focus")
    parser.add_argument("--model", default="deepseek-v3.2:cloud")
    args = parser.parse_args()
    content = open(args.file, encoding="utf-8", errors="replace").read() if args.file else sys.stdin.read()
    if not content.strip():
        sys.stderr.write("No content provided.\n"); sys.exit(1)
    print(summarize(content, args.focus, args.model))

if __name__ == "__main__":
    main()

The script wraps the content in a focused summarization prompt and routes to Ollama Cloud. The default model is deepseek-v3.2:cloud — strong reasoning, handles long context well, good at preserving specific facts (names, numbers, dates, URLs) rather than just vibes-summarizing everything into mush.

If you’re not using Ollama Cloud, the same pattern works with any OpenAI-compatible endpoint. Swap CLOUD_ENDPOINT and the auth header for your provider — the prompt logic stays identical. LM Studio, vLLM, and OpenRouter all work.

The token math

Real example from building this technique:

  • Input: GitHub issue with 407 comments → 528 KB → ~130,000 tokens
  • Compressed output: ~2,500 tokens (current state, recent comments, maintainer responses, referenced issues)
  • Compression ratio: ~98%

Cost comparison (as of 2026-04-14):

Approach                       Tokens      Cost at $3/MTok input (Claude Sonnet 3.7)
Feed raw to Claude             ~130,000    $0.39 uncached / $0.039 cached
Compress first, feed summary   ~2,500      $0.0075
Savings                                    ~98% uncached / ~81% vs a cached read

The Ollama Cloud subscription cost is fixed — you’re not paying per token there. Once you’re on a plan, every compression call is effectively free relative to what you’d pay Claude.

If you’re running local models (llama.cpp, LM Studio, Ollama locally), your marginal cost is electricity.

At 100 “read this big thing” operations per month, you’re looking at $30–40/month in savings at Sonnet rates. More if you’re on Opus.
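The arithmetic behind those numbers, if you want to sanity-check it against your own volumes (rates and token counts from the table above):

```python
RATE_PER_MTOK = 3.00  # Claude Sonnet 3.7 input rate, $/MTok (from the table above)

def claude_input_cost(tokens: int) -> float:
    return tokens / 1_000_000 * RATE_PER_MTOK

raw = claude_input_cost(130_000)        # feed the 407-comment issue directly
compressed = claude_input_cost(2_500)   # feed the summary instead
print(f"raw ${raw:.4f} vs compressed ${compressed:.4f}")       # raw $0.3900 vs compressed $0.0075
print(f"100 ops/month saves ${(raw - compressed) * 100:.2f}")  # 100 ops/month saves $38.25
```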

Pattern 2: Describe images before feeding them to Claude

Use case: you have an image you want to act on but don’t strictly need Claude to see it.

  • Screenshots in a batch QA workflow — 50 screenshots means 50 × image token costs
  • A log dashboard screenshot: you just need to know if anything is red
  • A diagram you want summarized into text
  • A PDF converted to images for processing

The mechanic:

python3 describe.py screenshot.png --focus="any error messages or red UI elements"

Here’s the full script:

#!/usr/bin/env python3
"""
Describe an image via Ollama Cloud API (direct, no local proxy).

Default model: gemma4:cloud (fast, multimodal, good general image understanding).

Requires: OLLAMA_API_KEY env var  (https://ollama.com/settings/keys)
          OR ~/.ollama/api-key file

Usage:
  describe.py <image-path>
  describe.py <image-path> --focus="any errors or red UI elements"
  describe.py <image-path> --model=kimi-k2.5:cloud   # harder cases
"""

import argparse, base64, json, os, sys, urllib.request, urllib.error
from pathlib import Path

CLOUD_ENDPOINT = "https://ollama.com/api/chat"

def get_api_key() -> str:
    key = os.environ.get("OLLAMA_API_KEY")
    if key:
        return key.strip()
    key_file = Path.home() / ".ollama" / "api-key"
    if key_file.exists():
        return key_file.read_text().strip()
    sys.stderr.write("OLLAMA_API_KEY not set. Get one at https://ollama.com/settings/keys\n")
    sys.exit(1)

def describe(image_path: str, focus: str | None = None, model: str = "gemma4:cloud") -> str:
    if not os.path.exists(image_path):
        sys.stderr.write(f"Image not found: {image_path}\n"); sys.exit(1)
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    if focus:
        instruction = (
            f"Describe this image, focused on: {focus}.\n\n"
            "Be specific. Include text content verbatim if visible. "
            "Note layout, colors, UI elements, errors. "
            "Use plain prose — this will be read by another AI."
        )
    else:
        instruction = (
            "Describe this image in detail.\n\n"
            "Include: visible text (verbatim), layout structure, UI elements, "
            "people/objects/scene, chart data. Be specific. "
            "Use plain prose — this will be read by another AI."
        )
    req_body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": instruction, "images": [image_b64]}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        CLOUD_ENDPOINT, data=req_body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {get_api_key()}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            result = json.loads(resp.read())
    except urllib.error.HTTPError as e:
        sys.stderr.write(f"HTTP {e.code}: {e.read().decode(errors='replace')}\n")
        sys.exit(1)
    return result.get("message", {}).get("content", "").strip()

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("image")
    parser.add_argument("--focus")
    parser.add_argument("--model", default="gemma4:cloud")
    args = parser.parse_args()
    print(describe(args.image, args.focus, args.model))

if __name__ == "__main__":
    main()

Default model is gemma4:cloud — fast, multimodal, handles general image understanding well. For harder cases (complex diagrams, ambiguous content, charts that need interpretation), override to kimi-k2.5:cloud.

Again, if you’re not on Ollama Cloud: LM Studio with LLaVA, a local ollama run llava, or any vision-capable model through OpenRouter will work. The pattern is the same — get text out of the image before it touches Claude’s context.
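For the OpenAI-compatible route, the image travels as a base64 data URL inside the message content. A minimal payload-building sketch, with the model name as a placeholder:

```python
import base64, json

def build_vision_request(image_path: str, instruction: str, model: str) -> dict:
    # OpenAI-compatible vision format: the image rides alongside the text
    # prompt as a base64 data URL, in a single user message.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "stream": False,
    }
```

POST this to the same /v1/chat/completions endpoint as the text version; the response shape is identical.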

The critical caveat

This only helps if you invoke it before the image enters Claude’s context.

If you’ve already pasted a screenshot into a Claude chat window, the image tokens are already spent. This technique is for workflows where you control when content enters context:

  1. Batch processing scripts — loop over a directory of screenshots before any Claude call
  2. Subagent workflows — a UX review agent runs describe.py on each screenshot, builds a text report, then asks Claude to synthesize
  3. Pre-emptive description — before you paste an image into chat, run the script and paste its text output instead

Model selection heuristic

Most “compression” use cases map to a short list of model types (as of 2026-04-14):

Task                           Default                  Override when
Long-form text summarization   deepseek-v3.2:cloud      Very long or reasoning-heavy docs → glm-5:cloud
Image description              gemma4:cloud             Complex diagrams, ambiguous visuals → kimi-k2.5:cloud
Code summarization             qwen3-coder-next:cloud   Purpose-built for code
Quick yes/no classification    nemotron-3-nano:cloud or ministral-3:cloud   Fast small models

One useful heuristic: with a flat-rate plan (Ollama Cloud, or your own hardware), there’s little reason to reach for a weaker model to save money. You’re already paying for the tier. Use the most capable model for the task category.
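If you script this routing, a small lookup table keeps the choice explicit. This is a hypothetical helper mirroring the table above; the model tags are Ollama Cloud names as of this writing and may change, so check the catalog before relying on them:

```python
# Hypothetical routing table mirroring the model-selection table above.
MODEL_FOR_TASK = {
    "summarize": "deepseek-v3.2:cloud",
    "summarize-long": "glm-5:cloud",           # very long / reasoning-heavy docs
    "describe-image": "gemma4:cloud",
    "describe-image-hard": "kimi-k2.5:cloud",  # complex diagrams, ambiguous visuals
    "summarize-code": "qwen3-coder-next:cloud",
    "classify": "nemotron-3-nano:cloud",       # or ministral-3:cloud
}

def pick_model(task: str) -> str:
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; known: {sorted(MODEL_FOR_TASK)}")
```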

How to plug this in

Ad-hoc, before a Claude call:

# Summarize a big document, then use the output
SUMMARY=$(gh issue view 42796 --comments | python3 summarize.py --focus="state of the issue, blocking factors")
# $SUMMARY is now ~2,500 tokens instead of ~130,000
# Feed $SUMMARY into your Claude call

As a Claude Code skill:

Drop summarize.py into .claude/skills/summarize-large/scripts/ with a skill definition file:

---
name: summarize-large
description: Summarize large text via Ollama Cloud before feeding to Claude. Use for GitHub issues, PDFs, logs, articles.
allowed-tools: Bash
---

When you're about to read large external content into context, run it through
`python3 .claude/skills/summarize-large/scripts/summarize.py` first.
Pass `--focus` to target what matters for the current task.

Claude Code will invoke it automatically when the description matches. Instead of reading a 400-comment issue directly, it summarizes first and hands the compressed result back into context.

In a batch processing script:

#!/usr/bin/env bash
# QA screenshot review — describe all screenshots, then ask Claude to summarize
for img in qa-screenshots/*.png; do
  echo "=== $(basename "$img") ==="
  python3 describe.py "$img" --focus="any UI errors, broken layouts, accessibility issues"
done > all-descriptions.txt

# all-descriptions.txt is now text — fast, cheap to feed to Claude

When NOT to use this

  • Small content — files under ~10 KB text or ~100 KB images. The token cost is already low; adding a round-trip to another model just adds latency.
  • Code where exact syntax matters — a summary of code is rarely useful. You need the actual source.
  • Real-time interactive use — if latency matters more than cost, the compression round-trip is the wrong trade.
  • Anything where Claude’s specific judgment is the point — subtle code review, nuanced analysis, creative work. Don’t compress; just let Claude do the work.

The rule: if “what you need from the content” is much smaller than the content itself, compress. If you need everything, don’t.
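In batch scripts, that rule can be a size check. A small sketch using the thresholds from the list above (the thresholds are rules of thumb, not hard limits):

```python
import os

# Rules of thumb from the list above: ~10 KB for text, ~100 KB for images.
TEXT_THRESHOLD = 10 * 1024
IMAGE_THRESHOLD = 100 * 1024
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}

def should_compress(path: str) -> bool:
    """True when the file is large enough that a compression round-trip pays off."""
    ext = os.path.splitext(path)[1].lower()
    threshold = IMAGE_THRESHOLD if ext in IMAGE_EXTS else TEXT_THRESHOLD
    return os.path.getsize(path) > threshold
```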

One more thing: this and caching aren’t competing

Prompt caching and compression are complementary. Caching saves you money when you repeatedly send the same prefix to Claude. Compression reduces the token count before it ever gets there. Use both:

  1. Run summarize.py on big inputs before they touch Claude
  2. Structure your prompts so the stable parts (system prompt, shared rules) cache across calls

Together they address different parts of your token bill — compression handles ingestion cost, caching handles repetition cost.
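Concretely, a Claude Messages API request body that combines both: the stable system prompt carries a cache breakpoint (Anthropic's documented cache_control marker of type "ephemeral"), while the user turn carries only the compressed summary. A sketch; the model id is illustrative:

```python
def build_claude_request(summary: str, system_rules: str,
                         model: str = "claude-3-7-sonnet-latest") -> dict:  # model id illustrative
    # The stable prefix (system rules) gets a cache breakpoint so repeat calls
    # hit the prompt cache; the compressed summary rides in the user turn at
    # roughly 2% of the raw content's size.
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_rules,
             "cache_control": {"type": "ephemeral"}},  # Anthropic prompt-caching marker
        ],
        "messages": [
            {"role": "user",
             "content": f"Compressed summary of the source material:\n\n{summary}"},
        ],
    }
```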

Getting the scripts

Both scripts are pure stdlib Python — no pip installs, no virtual environments. Copy them directly. The only requirement is an OLLAMA_API_KEY environment variable (or a key file at ~/.ollama/api-key). Get a key at ollama.com/settings/keys.

If you’re using a different provider, the scripts are straightforward to adapt — swap the endpoint URL and auth header, keep the prompt logic.


Sources

  1. Ollama Cloud models catalog — model availability and descriptions referenced for the selection table (as of 2026-04-14)
  2. Ollama Cloud models announcement — official launch, confirms cloud API endpoint and subscription model
  3. Anthropic API pricing — Claude Sonnet input rates used in the cost comparison ($3/MTok as of 2026-04-14)
  4. Anthropic prompt caching docs — context for the “caching and compression are complementary” closing section