Production Patterns Intermediate 7 min read

865,000 Tokens: Building NomFeed with Opus 4.6

March 3, 2026 · Ameno Osman

[ai][cli][architecture][prompt-engineering][opus-4.6]

865,000 Tokens: Building NomFeed with Opus 4.6

TL;DR

I built NomFeed, which handles URL conversion, YouTube ingestion, LLM extraction, MCP server, and Chrome extension, using 865,000 tokens with Opus 4.6. The 20x efficiency gain came from architectural discipline: separating human reasoning from model generation, with every specification decision made before the prompt was sent.

I burned through 6 billion tokens last month.

A single orchestrated prompt session consumed 20 million. I was operating at a scale I had never measured, let alone optimized. The environmental cost was real and growing, and I had treated it as external to my responsibility.

I set a constraint: build production software with genuine architectural discipline around token economics. Not less AI. Better AI.

I built NomFeed, which handles URL conversion, YouTube ingestion, LLM extraction, MCP server, and Chrome extension, in 865K tokens with Opus 4.6. 0.014% of baseline consumption.

865,000 tokens. 5.5 hours of active work. 1,850 lines of production code.

The Architectural Decision: Separate Reasoning from Generation

The efficiency gain came from one architectural decision: I owned the specification, Opus owned synthesis.

Large language models predict tokens that resemble reasoning. This is useful for exploration, when you don’t know the answer and need the model to fill gaps. It is expensive, slow, and wasteful when you already know what you want.

These models excel at code synthesis at scale. Humans cannot compete here. Given a complete specification, Opus 4.6 generates hundreds of lines of correct, working code in seconds.

I stopped conflating exploration with implementation. For NomFeed, I did the thinking myself: constraints, interfaces, failure modes, the full decision tree. Every architectural choice was made before the prompt was sent. Then I let Opus do what it does well: produce the artifact.

The 20x efficiency gain came from this division of labor, not from using less AI.

The Architecture Decisions

The constraint of minimal token overhead shaped every decision. Every choice served the requirement that each token deliver value.

Flat Files Over Database

Database access requires schema knowledge, SQL generation, migration handling, ORM context. Hundreds of tokens per interaction to establish baseline understanding.

Flat files require none of this:

readFileSync(): established API
JSON.parse(): established API

The storage layer (store.ts) is 200 lines. No dependencies. No migrations. When I needed extraction storage, the specification was minimal: “save a second file with .extraction.md suffix.” No ORM ceremony, no token overhead.

The 3-Strategy Cascade

URL conversion tries three strategies in order. Cloudflare first (fastest, returns text/markdown). Jina Reader second (handles JavaScript SPAs). Readability plus Turndown last (universal fallback). Each strategy is ~25 lines. First success wins. No agent coordination. No fallback discovery. The cascade is explicit because I specified it completely before implementation began.

Dumb Pipes Over Agents

LLM extraction runs 6 patterns in parallel via Promise.all. Each pattern is a single API call with a focused system prompt. No agent loop, no tool use, no state management, no chain-of-thought. The extraction layer is 70 lines.

The phrase “dumb pipe” encodes the architectural constraint: single input, single output, no side effects. This rules out agents, tool use, and the thousands of tokens per extraction that orchestration requires.

💡 Tip

Clear architectural constraints in the initial prompt prevent iterative waste. The specification is complete before the prompt is sent; the model only implements.

What I Built

NomFeed is a local-first CLI tool:

NomFeed Architecture: CLI entry point spawns HTTP Server and MCP Server; both use Core Functions (store, convert, youtube, extract, llm); extract uses Opus 4.6 for patterns; all write to ~/.nomfeed/

Core capabilities:

URL conversion via 3-strategy cascade (Cloudflare → Jina Reader → Readability)
YouTube ingestion via yt-dlp (no API key, full transcripts)
LLM extraction with 6 Fabric-inspired patterns running in parallel
Flat-file storage: JSON index + markdown files
MCP server for agent integration
Chrome extension via HTTP API

Nine source files. Five dependencies. No build step.

The Environmental Reality

865,000 tokens is roughly 260–865 watt-hours. 13–43 laptop charges. 0.5–2 miles of driving.

At 6 billion tokens per month, the environmental impact of architectural approach choice is substantial. A single 20-million-token sub-agent session consumes more electricity than this entire project.

The difference is not cost. It is the difference between intentional, directed computation and exploratory, continuous burn. Both have their place. Knowing which to use when is the skill.

I used to run sub-agent swarms without considering the token cost. The agents felt like free labor. At 6 billion tokens a month, they are not. Recognizing that changed how I architect systems.

The Prompts

Each prompt contains the complete specification. No agent discovers requirements, explores alternatives, or coordinates with peers.

Prompt 1: Storage Layer

“Flat-file storage: ~/.nomfeed/items.json for the index, ~/.nomfeed/content/{id}.md for markdown files. Each .md file should have YAML frontmatter with source, title, savedAt. Use nanoid for IDs. No database, no ORM.”

Result: 200 lines, zero dependencies, zero iteration.

Prompt 2: URL Conversion

“Three-strategy cascade for URL → markdown:

Try Cloudflare: fetch with Accept: text/markdown, check Content-Type header

Try Jina Reader: https://r.jina.ai/\{url\}, 30s timeout

Fallback to Readability: fetch HTML, parse with @mozilla/readability, convert with turndown Each strategy returns {title, markdown} or null. First success wins.”

Result: 80 lines, handles 95%+ of URLs, no iteration.

Prompt 3: LLM Extraction

“Dumb pipe extraction: each pattern is a system prompt that takes content and returns markdown. Run patterns in parallel with Promise.all. Patterns: extract_wisdom, video_chapters, analyze_claims, extract_references, summarize, rate_content. Save results to {id}.extraction.md.”

Result: 70 lines, 6 patterns, parallel execution.

Prompt 4: YouTube Ingestion

“YouTube via yt-dlp: —dump-json for metadata, —write-subs —write-auto-subs for VTT. Parse VTT to deduplicate overlapping cues. No API key needed. Return {title, channel, duration, transcript, transcriptPlain, markdown}.”

Result: 150 lines, no API credentials, works on any video with subtitles.

Note the specificity: exact yt-dlp flags, exact return structure, explicit constraints. This eliminates the exploration phase.

When Sub-Agents Still Make Sense

I did not eliminate sub-agents entirely. I used one for the skill file:

“I want to create a skill for NomFeed… Please generate the complete SKILL.md content following pi’s skill format with proper frontmatter…”

Appropriate because:

The output format (pi skill structure) was external knowledge I did not want to load into context
Single, self-contained document
No coordination with other components
Task was “generate content to spec” not “explore and implement”

Sub-agents excel at:

Exploration when you do not know the solution space
Research across large codebases or documentation
Content generation following external formats
Tasks that can fail independently without blocking progress

Direct prompting excels at:

Implementation when you know exactly what you want
Tight integration between components
Rapid iteration on architectural decisions
Constraints you want to enforce across the entire system

What I Learned

1. Own the Specification

Thinking models fill gaps with their own reasoning. This is powerful when you do not know the answer. It is waste when you do. By using Opus 4.6 with explicit constraints instead of extended thinking, I replaced model exploration with clear direction. The 20x token difference comes from who owns the specification.

2. Direct Prompting Burns No Idle Tokens

Sub-agent swarms run for hours. A 20-million-token session spans 4-6 hours of continuous computation. Direct prompting only consumes tokens when you hit enter. The gaps between prompts, where you review, decide, and live, are genuinely free.

3. Constraints Are Features

Flat files do not scale to millions of items. Our search is brute-force. These limitations saved thousands of tokens in complexity. Building for actual needs, not hypothetical scale, is an efficiency choice.

4. The Real Partnership

This is not about using less AI. It is about using AI for what it is genuinely good at. Humans provide judgment, direction, constraints. Models provide code at scale. Both operate at peak capability. That is the partnership.

The Activation Loop

NomFeed is a practical tool:

nomfeed add https://youtube.com/watch?v=xyz --extract
nomfeed read abc123 --extract

The environmental cost of that extraction (~2,000–4,000 tokens) is negligible compared to the value of structured, searchable, agent-accessible knowledge.

The Numbers

Metric	Value
Total tokens	~865,000
Active work time	5 hours 26 minutes
Tool calls	606
Files created/modified	41
Lines of production code	~1,850
Sub-agent calls	1 (for the skill file only)
Monthly token consumption	~6,000,000,000
This project as % of monthly	0.014%

5.5 hours of focused work. No multi-hour agent runs. No coordination overhead. No tokens burned while I slept.

Try It

git clone https://github.com/ameno-/nomfeed
cd nomfeed
bun install
bun link

# Save your first URL
nomfeed add https://example.com/article

# Search your library
nomfeed search "transformers"

GitHub → | Architecture docs →

865,000 tokens. 5.5 hours of active work. 1,850 lines of code. I am still a heavy consumer of AI. But I no longer treat the underlying constraint as someone else’s problem.

Key Takeaways

865,000 tokens vs 20M for comparable sub-agent workflows; the efficiency gain comes from who owns the specification
Opus 4.6 excels at code synthesis, not reasoning; optimal division is human judgment directing machine output
Flat files, 3-strategy cascades, and dumb pipes emerged from the constraint that every token must deliver value
At 6B tokens/month scale, architectural approach choice has substantial environmental impact
Direct prompting burns no idle tokens; sub-agent swarms run continuously for hours