How to Manage Context When You Keep Hitting Token Limits


You open Claude Code to fix a small bug. The first 15 minutes are perfect: the model quickly reads code, offers precise solutions, writes tests. But an hour in, something breaks. Each response now takes 30 seconds. Instead of fixing one line, the model rewrites the entire file. It forgets the project rules you explained at the start.

Why? Because the model gets overwhelmed by the volume of information. Every request sends your entire conversation history, all system rules, tool schemas, and command results. This "context tax" grows with every step.

When context balloons to 150,000 tokens, three things happen: the model loses focus, you pay for garbage, and speed tanks.

This guide isn't about terminal commands. It's about managing the model's attention—giving it exactly what it needs for the current task, and filtering out everything else.

Result: The model becomes smart again, responds noticeably faster, and your API bills can drop by 60–80%.

Part 1: Measure First, Then Cut

Why You Need to Measure Context

You can't optimize what you can't see. Before deleting files or changing settings, understand exactly where your tokens leak. Often the problem isn't long code—it's a hidden plugin sending megabytes of logs with every request.

How to Fix It
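
One rough way to see where tokens go is to estimate the size of each context source and rank them. The sketch below is only an illustration: the four-characters-per-token ratio is a rule of thumb for English text rather than a real tokenizer, and the source names and sizes in `sample` are hypothetical stand-ins for whatever your session actually sends.

```python
# Rough per-source token estimate. The ~4 characters per token ratio is a
# rule of thumb for English text, not an exact tokenizer count.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def context_budget(sources: dict[str, str]) -> list[tuple[str, int, float]]:
    """Return (name, tokens, share_of_total) rows, largest source first."""
    counts = {name: estimate_tokens(text) for name, text in sources.items()}
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    rows = [(name, n, n / total) for name, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical sample data standing in for real session contents.
sample = {
    "claude_md": "x" * 8_000,
    "tool_schemas": "x" * 20_000,
    "plugin_logs": "x" * 400_000,
    "conversation": "x" * 60_000,
}

for name, tokens, share in context_budget(sample):
    print(f"{name:<14}{tokens:>8} tokens  {share:6.1%}")
```

In this made-up sample the plugin logs dominate everything else, which is the kind of finding that tells you what to cut first.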

What changes: You'll know your "attention budget." You'll see that disabling one unnecessary integration saves more than deleting a paragraph of rules. Optimization becomes surgical, not random.

Part 2: Stop Paying for "Always-On" Rules

Why CLAUDE.md Gets Bloated

When starting a project, you want the model to understand everything. You create CLAUDE.md with architecture, tech stack, naming rules, test instructions, deployment guides, and ownership info. Result: a 500-line document. It seems right—now the model knows everything.

Problem: CLAUDE.md is "always-on" context. The model reads it with every request. If you ask it to rename a variable, it first reads 500 lines about architecture and deployment. If you make 50 requests in a session, the model reads those 500 lines 50 times. You pay for this 50 times. Worse—because there are too many rules, the model starts ignoring them.

How to Fix It

Split information into two categories: rules and reference.

CLAUDE.md should contain only rules the model should apply to every line of code. This is your "constitutional minimum."

Everything else (reference info, architectural decisions, deployment guides) should live in separate files, for example under docs/. When the model needs to know how to deploy, it can find and read docs/deployment.md on its own. It doesn't need to keep this in its head constantly.

What should stay in CLAUDE.md:

What should be removed:

What changes: Your CLAUDE.md shrinks from 500 to 100 lines. The model stops getting distracted by irrelevant information. You save thousands of tokens on every request.
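
The arithmetic behind the 500-line versus 100-line comparison can be sketched as follows. The figure of 12 tokens per line is a hypothetical average chosen for illustration; measure your own file before trusting the totals.

```python
# Assumption: ~12 tokens per line is a hypothetical average, not a measured value.
TOKENS_PER_LINE = 12


def always_on_tokens(lines: int, requests: int,
                     tokens_per_line: int = TOKENS_PER_LINE) -> int:
    """Tokens spent re-sending an always-on file on every request of a session."""
    return lines * tokens_per_line * requests


before = always_on_tokens(lines=500, requests=50)
after = always_on_tokens(lines=100, requests=50)
print(before, after, before - after)  # 300000 60000 240000
```

Under that assumption, trimming 400 lines saves 4,800 tokens per request and 240,000 tokens over a 50-request session.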

Part 3: Don't Drag Around Your Life's History

Why Conversation History Matters

When you work with the model, it matters that it remembers what you discussed 10 minutes ago. You asked it to write a function, then add error handling, then write a test. The model remembers each step, which makes the dialogue feel continuous.

Problem: tasks change. The first hour you debugged a complex auth bug together. You tried five approaches, read dozens of files, checked logs. You fixed it.

Now you move to the next task—fix button styling. If you just continue the dialogue, the model drags all that auth garbage with it. When fixing the button, it still has tokens of old auth logic, logs, and failed attempts in its "head." It's like solving geometry while someone yells spelling rules in your ear. The model gets confused, suggests strange solutions, and speed tanks.

How to Fix It

Build a habit of "closing phases." Once one logical task is complete, clean the context. Use the /compact command. It doesn't just delete history—it asks the model itself to write a brief summary of what was done and current state.

Instead of 50 messages with logs and errors, context keeps one paragraph: "We fixed the auth bug by updating JWT token in auth.ts. Everything works, tests are green."
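
The idea of swapping a long history for one summary can be sketched in a few lines. This is not how `/compact` is implemented; it is a minimal illustration with a hypothetical `Message` type, and it assumes the summary text has already been written (in practice, by the model itself).

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str


def compact_history(messages: list[Message], summary: str,
                    keep_last: int = 2) -> list[Message]:
    """Replace all but the last `keep_last` messages with one summary message."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    if len(messages) <= keep_last:
        return list(messages)  # nothing old enough to compress
    # messages[-0:] would return the whole list, so handle zero explicitly.
    tail = messages[-keep_last:] if keep_last else []
    header = Message(role="user", content=f"Summary of earlier work: {summary}")
    return [header, *tail]


# Hypothetical 50-message debugging session.
history = [Message("user", f"message {i}") for i in range(50)]
compacted = compact_history(
    history,
    "We fixed the auth bug by updating JWT token in auth.ts. Tests are green.",
)
print(len(history), "->", len(compacted))  # 50 -> 3
```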

When to compress:

What changes: You start each new task with a "clean slate," but the model doesn't forget global context. Its answers become instant and precise again.

Part 4: Stop Feeding the Model Mountains of Logs

Why Tool Logs Are Toxic

For the model to fix errors, it needs to see what went wrong. It runs tests (npm test), compiles code, or searches files. Tools produce output, the model reads it, understands the problem.

Problem: most tools were built for humans, not neural networks. When you run npm test on a large project, the tool might output 5,000 lines. There are pretty progress bars, lists of all successful tests, warnings about old library versions, and somewhere at the very end—those 10 lines with the actual error.

Humans scroll down and look at the error. Models can't. They dutifully read all 5,000 lines, spending your money, their "attention," and time. Worst of all, the error might get lost in the volume, and the model draws the wrong conclusions.

How to Fix It

Never let the model read raw tool output. Always filter information before it enters context.
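
A minimal sketch of such a filter, assuming the test runner marks failures with words like FAIL, Error, Expected, or Received. The pattern, the function name `filter_log`, and the sample log are illustrative assumptions; tune the pattern to whatever your own tools print.

```python
import re

# Assumption: these markers are illustrative; adjust them for your test runner.
ERROR_PATTERN = re.compile(r"FAIL|Error|Expected|Received")


def filter_log(raw: str, context: int = 2, max_lines: int = 40) -> str:
    """Keep lines matching ERROR_PATTERN plus `context` lines around each match."""
    lines = raw.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    selected = [lines[i] for i in sorted(keep)]
    if len(selected) > max_lines:
        selected = selected[-max_lines:]  # keep the tail, where errors usually sit
    return "\n".join(selected)


# Hypothetical noisy log: 5,000 passing tests followed by one failure.
raw_log = "\n".join(
    [f"PASS test_{i}" for i in range(5000)]
    + ["FAIL cart.test.ts", "Expected: 200", "Received: 500"]
)
filtered = filter_log(raw_log)
print(len(raw_log.splitlines()), "->", len(filtered.splitlines()))  # 5003 -> 5
```

The same function could sit between the tool and the model, so that only the filtered text ever enters the context.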

What changes: Tool output entering the context can shrink by 90–95%. The model gets only the concentrate: "One test failed on line 145, expected 200, got 500." It reaches the actual problem much sooner.

Part 5: Delegate Grunt Work to the "Cheap Intern"

When You Need All Project Files

Sometimes a task requires broad context. For example, write documentation for all API endpoints. The model needs to read dozens of files to build the full picture.

Problem: you ask Claude Code to analyze 50 files. It loads them all into context—this could take 100,000 tokens. Claude Code (especially with powerful models like Sonnet) is expensive. Making it read 50 files to find endpoint lists is like hiring a $200/hour architect to copy-paste data between Excel sheets.

How to Fix It

Use the "cheap intern" approach. You have access to faster, cheaper models (like Claude Haiku). Such a model costs a fraction of the price and responds quickly. It can't architect systems, but it's perfect for "read these 50 files and extract all API endpoints."

Instead of loading all files into the main session, open a separate terminal. Run the "intern" (Haiku) there with a simple instruction: "Read the src/api folder and summarize all routes into api_summary.md."

The intern does the dirty work, reads 100,000 tokens (costing pennies), and creates a compact file with needed data. Then you return to your main "senior" (Sonnet) and say: "Read api_summary.md and write documentation based on it."
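
To see why the split pays off, here is a small cost comparison. The two prices are hypothetical placeholders in dollars per million input tokens, not published rates, and the sketch ignores output tokens; substitute the current numbers for the models you actually use.

```python
# Assumption: hypothetical placeholder prices, dollars per million input tokens.
SENIOR_PRICE = 3.00
INTERN_PRICE = 0.30


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost of reading `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


def direct_cost(raw_tokens: int) -> float:
    """The main model reads all raw files itself."""
    return input_cost(raw_tokens, SENIOR_PRICE)


def delegated_cost(raw_tokens: int, summary_tokens: int) -> float:
    """The cheap model reads the raw files; the main model reads only the summary."""
    return (input_cost(raw_tokens, INTERN_PRICE)
            + input_cost(summary_tokens, SENIOR_PRICE))


direct = direct_cost(100_000)
delegated = delegated_cost(100_000, summary_tokens=2_000)
print(f"direct: ${direct:.3f}  delegated: ${delegated:.3f}")
```

The larger saving is not visible in a single request: the 100,000 raw tokens never enter the main session, so they are not re-sent with every later message.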

What changes: You isolate "noisy" work from the main session. Your main context stays clean and focused. You get the same result but can pay up to 10 times less, depending on the price gap between the two models.

Part 6: When Tools Hurt More Than They Help

Why Tool Overload Damages Quality

Modern AI assistants can connect to external systems. They can read GitHub, check Jira, watch Sentry, or call internal APIs. This is usually done through MCP (Model Context Protocol). It seems logical to connect everything. More tools = smarter model, right?

Wrong. Every connected tool is an instruction the model must keep in mind. "You have a Jira tool. To use it, send a task ID. It returns status and description." With 10 tools, the model constantly spends part of its attention remembering how to use them. Even if you're just asking it to fix CSS styling.

Worse—sometimes tools work "in the background" (hooks). Every request might automatically attach git branch info or latest file changes. This adds hidden garbage to every request. The model gets confused.

How to Fix It

Treat the model's tools like phone apps: delete everything you don't use daily.

If you're not working with Jira this session—disable it. If you don't need the model making its own commits—disable git integration. Keep only basics: file reading, terminal execution, code editing. Enable complex tools only when the task directly requires them.
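
The "keep only basics" rule can be sketched as a filter over a tool registry. Everything here is hypothetical: the `Tool` type, the tool names, and the schema sizes are invented for illustration and do not describe any real integration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    schema_tokens: int   # hypothetical size of the tool's description and schema
    core: bool = False   # core tools stay enabled for every task


# Hypothetical registry; names and token sizes are made up for illustration.
REGISTRY = [
    Tool("read_file", 150, core=True),
    Tool("run_terminal", 200, core=True),
    Tool("edit_code", 250, core=True),
    Tool("jira", 900),
    Tool("sentry", 1100),
    Tool("github", 1400),
]


def enabled_tools(needed: set[str], registry: list[Tool] = REGISTRY) -> list[Tool]:
    """Keep core tools plus any tools the current task explicitly needs."""
    return [t for t in registry if t.core or t.name in needed]


def schema_overhead(tools: list[Tool]) -> int:
    """Total tokens the enabled tool schemas add to every request."""
    return sum(t.schema_tokens for t in tools)


print(schema_overhead(REGISTRY))                 # everything connected: 4000
print(schema_overhead(enabled_tools(set())))     # CSS fix, basics only: 600
print(schema_overhead(enabled_tools({"jira"})))  # task that needs Jira: 1500
```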

What changes: The model stops reaching for tools where they're not needed. Instructions get shorter. It focuses better on actual code. You stop paying for "tool usage schemas" you don't need.

Part 7: Make the Model "Think" Only As Much As Needed

When Thinking Helps (and When It Hurts)

Sometimes a task looks simple but needs deep analysis. For example, "why does this component render twice?" The model needs to trace call chains, understand the React lifecycle, and find non-obvious dependencies. For such cases, models can "think" before answering (often exposed as a reasoning effort setting). They generate hidden thinking tokens, analyze options, then deliver the result.

Problem: if you enable maximum thinking for all tasks, the model philosophizes where it should just act. You ask: "Rename function getUser to fetchUser." The model with high-thinking mode starts: "So the user wants to rename. Why? Maybe it's about network patterns. If I rename here, will it break elsewhere? Let me analyze all files..." Instead of 2 seconds and 1 cent, you wait 20 seconds and pay 10 cents.

How to Fix It

Control thinking level based on task type. For 80% of daily tasks (write test, add button, fix typo), the model doesn't need to "think" deeply. It already knows patterns.

Enable "deep thinking" only for:

What changes: You find balance between speed, cost, and quality. Simple tasks solve instantly and cheaply. Complex tasks solve thoughtfully and correctly.
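
A sketch of routing effort by task type, using the figures from the rename example (1 cent versus 10 cents) and the 80% share of routine tasks. The task labels, the `Effort` enum, and the mapping are assumptions made for illustration, not settings of any particular tool.

```python
from enum import Enum


class Effort(Enum):
    LOW = "low"
    HIGH = "high"


# Assumption: only tasks listed here get deep thinking; the label is hypothetical.
DEEP_TASKS = {"debug_double_render"}

# Per-task cost in cents, taken from the rename example in this part.
COST_CENTS = {Effort.LOW: 1, Effort.HIGH: 10}


def pick_effort(task_type: str) -> Effort:
    """Default to low effort; enable deep thinking only for listed tasks."""
    return Effort.HIGH if task_type in DEEP_TASKS else Effort.LOW


def session_cost(task_types: list[str], always_high: bool = False) -> int:
    """Total cost in cents, either routed per task or with thinking always on."""
    return sum(
        COST_CENTS[Effort.HIGH if always_high else pick_effort(t)]
        for t in task_types
    )


# 100 hypothetical tasks: 80 routine, 20 that need analysis.
tasks = ["fix_typo"] * 80 + ["debug_double_render"] * 20
print(session_cost(tasks, always_high=True), session_cost(tasks))  # 1000 280
```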

Part 8: Make the Model Remember, Not Reread

Why Caching Matters

Every time you send a request, the model rereads your system rules, tool schemas, and basic context, and you pay each time. If your basic context weighs 10,000 tokens, then over 50 requests a day the model rereads 500,000 tokens of basic info alone. That is not only expensive but slow: processing 10,000 tokens on every request takes server time.

How to Fix It

Modern models support prompt caching. If the beginning of your request doesn't change, the server keeps it in fast memory. Next time it doesn't reprocess that part and just pulls it from the cache. Cached input is billed at a steep discount (around 90% less on Anthropic's API) and is processed faster.

But for this to work, you need proper structure. Caching works on the prefix of the request: the cached part must sit at the very start and be exactly unchanged. From the first changed character onward, the server reprocesses everything.

How to break caching: You put the current task ("fix bug in cart") at the start and system rules at the end. Since the task changes every time, the server sees changes at the very start and rereads everything. Cache fails.

How to fix caching: Always put the most stable information first:

  1. System rules (never change)
  2. Tool schemas (never change)
  3. Project architecture (rarely changes)
  4. Current task, error logs, conversation history (constantly change)
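
The effect of ordering can be sketched by comparing how much of two consecutive requests is identical from the start. This is a simplification: real caches work on tokens and on provider-specific rules (some APIs also require explicit cache markers), while the sketch compares characters. The function names and the placeholder strings are hypothetical.

```python
def build_request(parts: list[str]) -> str:
    """Join request sections in the given order."""
    return "\n\n".join(parts)


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Placeholder strings standing in for the real, much longer sections.
stable = ["RULES", "TOOLS", "ARCH"]

# Stable parts first, changing task last.
good_1 = build_request(stable + ["fix bug in cart"])
good_2 = build_request(stable + ["add button"])

# Changing task first, stable parts after it.
bad_1 = build_request(["fix bug in cart"] + stable)
bad_2 = build_request(["add button"] + stable)

print(shared_prefix_len(good_1, good_2))  # 20
print(shared_prefix_len(bad_1, bad_2))    # 0
```

With stable parts first, the whole stable block is shared between requests and can be reused; with the task first, nothing is shared. A timestamp at the top of the request has the same effect as the second ordering.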

What changes: Your API bills drop because you pay about 10% of the normal input price for the stable parts. Responses start noticeably faster.

Part 9: When You Don't Need to Optimize Anything

Optimization Isn't Always the Answer

This guide tells you to trim, filter, and compress information, so it may seem you should always do this. But chasing token savings can cost you what matters most: solution quality.

Say your production server crashes. Customers can't pay. Every minute costs thousands. You open Claude Code to find the cause. But strict optimization rules mean the model doesn't see full server logs, only filtered errors. It doesn't see database history. It can't find non-obvious links between recent commits and the crash. You saved 50 cents on tokens but lost hours of debugging.

When to Stop Optimizing

Context optimization is a tool for everyday routine: the roughly 80% of tasks that are clear, predictable, and low-risk.

But there are times to "turn dials to max" and give the model absolutely everything.

Scenarios where you should disable filters:

What changes: You stop treating optimization like religion. It's a tool for budget and focus management. You switch modes: from "frugal intern for routine" to "expensive architect with full access" for critical tasks.

Part 10: How to Tell When Something Goes Wrong

Symptom 1: Model Writing Bad Code

How it looks: Yesterday the model wrote perfect components; today it suggests outdated approaches and forgets type annotations.

Real problem: You over-compressed. The model forgot important nuances because compression made them seem "minor."

Treatment: Add the conclusions the model keeps forgetting to CLAUDE.md. If they are task-specific, just restate them in the current request.

Symptom 2: Responses Take a Minute

How it looks: You ask to add a console log, the model "thinks" for 60 seconds.

Real problem: Your caching broke, or you accidentally fed the model a huge file.

Treatment: Check if dynamic text snuck into the request start (like a timestamp). Check if some tool output a giant log the model is now digesting.

Symptom 3: Model Goes in Circles

How it looks: Model suggests a solution. Doesn't work. Apologizes, suggests another. Also doesn't work. Suggests the first one again.

Real problem: Context is flooded with failed attempts. The model sees so many errors it can't build clean logic anymore.

Treatment: Stop immediately. Do hard compression or start fresh. Describe the problem from scratch with only current code state and the latest error.

Symptom 4: Model Tries Non-Existent Tools

How it looks: Model writes "now I'll check the database" even though it has no database access.

Real problem: You left tool usage instructions in the system prompt or CLAUDE.md but disabled the tools themselves.

Treatment: Sync CLAUDE.md with the real set of enabled tools. If a tool is off—remove all mentions.

Symptom 5: You Forgot to Say Something Important

How it looks: You press Enter, the model starts generating, then you realize "Wait, I didn't mention the new API!"

Why it's dangerous: If you wait for the answer, the wrong code stays in the history along with your correction, so the context grows for nothing.

Treatment: Stop the model immediately with Esc. Better—use `/rewind` to roll back one step, as if the bad request never happened. For quick questions that shouldn't be in history, use `/btw`.

Conclusion

Working with an AI assistant isn't the same as coding or Googling solutions. It's like managing a very smart but easily-distracted employee.

Your main job as a developer isn't remembering JavaScript syntax anymore. Your job is managing context.

Whoever learns to clearly state tasks, cut noise, filter logs, and clean history on time gets a tireless partner who writes code at the speed of thought.

Whoever just dumps mountains of unreadable logs into the chat and hopes "AI figures it out" will burn company budgets and get slow, confused, broken code in return.

The choice is yours. Start cleaning your CLAUDE.md right now.