Agentic Mukbangers

[Image: a robot having a mukbang]

With the release of Fable 5 (and its subsequent removal due to government intervention), the AI market has shifted. Those with a careful eye may have noticed that Fable 5 was only going to be available on subscription plans for a couple of weeks before becoming exclusive to API usage billing. This marks a new phase in the pricing model, indicating that frontier models may no longer be on the table for subsidized use upon release. Up to this point, the basic monthly subscription has allowed users to spend up to $200 per month and use the latest model available, all the way up to Opus 4.8. This pivot forces us to ask a new question: when do we need the most powerful model, and when can we be perfectly fine without it?

After working with agentic systems in a collaborative space for some months, it's become clear to me that certain tasks do not demand the best of the best. This became most obvious when Opus 4.8 dropped and the Claude app automatically enabled a "thinking" mode for it. I asked it to do a fairly simple task: retrieving some data from an MCP server. Opus 4.6 and 4.7 never had any problems with this task and gave me what I wanted within a few seconds. However, 4.8 struggled significantly. I watched it think in circles, trying to reason about the MCP server before using it. It had somehow overthought how to use the MCP server and failed to trigger an authentication flow. It then spent several minutes trying to figure out the best way to debug the "failing" MCP server. I cut in and told it to stop overthinking. This sent it into a spiraling loop of reminding itself that it needed to stop overthinking and execute the task. I was about to switch the model back to 4.7 when I found that "thinking" setting toggled on. Once I toggled it off, 4.8 had no more problems.

At work, I use Codex. I have access to GPT-5.4 mini, 5.4, and 5.5. I can also use "fast mode" and set the effort level of the model. From what I can tell, this effort level gives agents permission to use more tokens for their internal "reasoning," which may not be necessary. When I first joined, I wanted to experiment with pushing agentic systems to the max. But after a while, I realized that I was probably wasting a ton of money having the most expensive model do trivial things. I installed codeburn so I could see how much money I was spending on tokens each day (hypothetically, since I assume our company has an enterprise deal that alters pricing to some degree). It's shocking to see that you can easily burn several hundred dollars in a single coding session that spans 30-60 minutes. Equally shocking is that I can get nearly the same quality of output if I add slightly more detail to my inputs and use a cheaper model, such as GPT-5.4 mini instead of GPT-5.5. And because the model is smaller, it is also faster, which means I can disable the "fast mode" that I typically use for GPT-5.5.

Bigger models, bigger problems

Big models may produce fairly impressive results, but they're also more expensive, and I am unconvinced that they will ever replace knowledge workers. There is emerging evidence that a team's AI compute is starting to cost more than the employees on the team. The current pattern is that every new frontier model version costs twice as much as its predecessor. Fable 5 costs twice as much as Opus 4.8, and Opus 4.8 costs about 60% more than Sonnet. GPT-5.5 costs twice as much as GPT-5.4, and GPT-5.4 costs about 60% more than GPT-5.3-codex. Both companies offer a cheap and fast model. Anthropic's Haiku is 3x cheaper than Sonnet, and GPT-5.4 mini is about 3x cheaper than GPT-5.3-codex. OpenAI also offers "pro" models that are 6x more expensive than the already expensive GPT-5.5.
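To see how these multipliers compound, here is a small Python sketch that chains the ratios quoted above and expresses each model's cost relative to the cheapest model in its family. It uses only the approximate ratios from this post ("about 60% more" is treated as exactly 1.6x, "3x cheaper" as exactly 3x), not real price sheets, so the output is a rough illustration. The function and variable names are made up for this example.

```python
from fractions import Fraction

# Each step is (model, cheaper_model, multiplier), taken from the ratios
# quoted in the text. "About 60% more" is assumed to be exactly 8/5.
ANTHROPIC_STEPS = [
    ("Sonnet", "Haiku", Fraction(3)),
    ("Opus 4.8", "Sonnet", Fraction(8, 5)),
    ("Fable 5", "Opus 4.8", Fraction(2)),
]
OPENAI_STEPS = [
    ("GPT-5.3-codex", "GPT-5.4 mini", Fraction(3)),
    ("GPT-5.4", "GPT-5.3-codex", Fraction(8, 5)),
    ("GPT-5.5", "GPT-5.4", Fraction(2)),
    ("pro tier", "GPT-5.5", Fraction(6)),
]


def relative_costs(base, steps):
    """Return each model's cost as a multiple of the base (cheapest) model."""
    costs = {base: Fraction(1)}
    for model, cheaper, multiplier in steps:
        costs[model] = costs[cheaper] * multiplier
    return costs


for base, steps in (("Haiku", ANTHROPIC_STEPS), ("GPT-5.4 mini", OPENAI_STEPS)):
    for model, cost in relative_costs(base, steps).items():
        print(f"{model}: {float(cost):.1f}x {base}")
```

Under these assumed ratios, the top frontier model in each family lands at roughly 9.6x the cost of the cheapest one, and the "pro" tier at roughly 57.6x. That gap is the budget you are deciding about every time you leave the default model selected.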

As the pricing of each new model skyrockets out of control, we must turn to optimization. Most of us probably use a single model for most of our work, and it makes sense why we do. The user experience of these tools often lends itself to this type of workflow. I noticed that Codex will select a higher model for me if I start a new session or restart the app. It defaults to "high-effort" and the GPT-5.5 model. I also notice that long conversations may start with tasks that make sense for a powerful model to perform, but evolve into performing tasks that may only necessitate a mini model that is much cheaper.

Agentic mukbanging

A great coworker of mine once said to me, "If you can name it, you can tame it." So I am putting a name to this habit, and I am calling it "Agentic mukbanging." Why? Because mukbangs are the ultimate symbol of casual and glorified gluttony. For those who have the pleasure of not knowing what a "mukbang" is, it's an online trend that started more innocently in South Korea in the early 2010s and was then taken to the extreme in the United States. People record themselves eating extremely large amounts of food, like one of every item from the entire McDonald's menu.

Real-world consequences of agentic mukbanging

Agentic mukbangers are people who will use the most expensive model for any and every task, without concern for resource consumption. The reason I am calling this out is that it has material consequences. These larger models need more compute, and all that compute needs lots of electricity. At the center of these two constraints is the buildout of many, many data centers. This stirs a lot of controversy, as data centers put significant strain on local power grids and consume large amounts of water for cooling. New data centers often come with some plan to pay for their own power and build new electrical infrastructure specifically for the data center, as well as "closed loop" water systems that do not pull a continuous supply of water from local sources. But noise and light pollution can still cause problems depending on where data centers are positioned.

One of the most absurd data center proposals I have ever seen was made recently by a company called DC Blox, which proposed to build a data center including 2 buildings (combined 260k+ square feet) and a substation, right next to the Nashville Zoo. For anyone wondering how that is going for them, the Nashville Zoo created a petition that now has over 412,000 signatures, and the recent Metro Planning Commission meeting hit the building's maximum capacity with residents who went to protest. As a result, the city is considering a moratorium that would allow enough time to pass a bill that would effectively block the data center from being built, since it is too close to a zoo. According to the Nashville Zoo, this is important because the data center has the potential to disrupt conservation efforts for endangered species, including the clouded leopard breeding program, in which the Nashville Zoo helps lead global efforts:

Beyond their heavy resource consumption, researchers caution that data centers also contribute to noise pollution, light pollution, and threaten water quality in surrounding communities. For the Zoo’s 3,000 animals and a neighborhood already facing economic challenges, this proposed development is especially concerning. Constant noise from cooling systems and generators, and light pollution from bright security and operational lighting can dramatically affect animal behavior, disrupting their natural photo periods and rhythms. Stress on the animals from these factors can be detrimental to our conservation efforts, especially our clouded leopard breeding program.

The Lean Agent

With costs exploding and data center buildouts causing material consequences across the country, it's time to optimize. The optimization in this case means two things: using the smallest feasible model for the task at hand, and only using models when it makes sense to use them.

Understanding which size model to use for a given task requires careful thought about the task you are working on. How ambiguous is the problem? How clear are the success criteria? What portion of the success criteria can be validated through tool calls versus internal reasoning from the model? Less ambiguous tasks already lend themselves to smaller models. Sonnet and GPT-5.3-codex are still very capable models that can go far. If you cannot clearly define the success criteria, you may need to take a step back and refine the task further. Once you have the success criteria defined, how is the agent going to check alignment? Are there static analysis tools, unit tests, end-to-end and integration tests, or other deterministic tools that the agent can rely on to validate that the results are aligned with your standards? Not just that it implemented the task, but that the quality of the implementation also meets codebase requirements. The more scaffolding you create in and around your codebase to give the agent more constraints, the simpler the agent can be.
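The questions above can be turned into a rough decision rule. The Python sketch below is one hypothetical way to do it, not a validated method: the Task fields, the tier labels, and every threshold are assumptions invented for illustration, and you would tune them against your own experience.

```python
from dataclasses import dataclass


@dataclass
class Task:
    ambiguity: float            # 0.0 = fully specified, 1.0 = open-ended (self-assessed)
    criteria_defined: bool      # can you state the success criteria clearly?
    verifiable_fraction: float  # share of criteria checkable by tests, linters, etc.


def pick_tier(task: Task) -> str:
    """Suggest a model tier. All thresholds are illustrative guesses."""
    if not task.criteria_defined:
        # No clear success criteria: refine the task before spending tokens.
        return "refine-task-first"
    if task.ambiguity <= 0.3 and task.verifiable_fraction >= 0.7:
        # Clear task with strong deterministic checks: a mini model may do.
        return "small"
    if task.ambiguity <= 0.6 or task.verifiable_fraction >= 0.5:
        # Some ambiguity or partial checks: a mid-tier model.
        return "mid"
    # Ambiguous and hard to verify: the case for a frontier model.
    return "frontier"


print(pick_tier(Task(ambiguity=0.1, criteria_defined=True, verifiable_fraction=0.9)))
print(pick_tier(Task(ambiguity=0.9, criteria_defined=True, verifiable_fraction=0.1)))
```

The exact numbers matter less than the ordering of the checks: the rule refuses to pick a model at all until the success criteria exist, and a task drops to a cheaper tier as more of its criteria become checkable by deterministic tooling.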

It's also worth considering if you should use a model at all, depending on the task. In my last blog post, I talked about an experiment I ran to have an agent create a recipe for me and then a grocery list that was ordered based on the layout of my local Publix. It was an interesting use case, but also completely unnecessary. I have to tell the agent the store layout anyway, which implies that I already know the layout of the store. And if I already know the layout of the store, then I can just make the list myself. I also asked the agent to research real recipes on the internet, because I don't trust something without a sense of taste to invent recipes. This reduces the scope of this agent down to a glorified Google search results aggregator.

Another consideration is that no effort is wasted. This recent video from Hank Green is a reinforcement of something that I value. Whenever you do something, whether it has a big obvious payoff or feels completely pointless, you still gain experience and silent skills or insights that you will probably utilize later. It is easy to reach for an agent for every little search, task, or otherwise. But realize that in doing so, you are losing the opportunity to build a new skill or refine and strengthen an existing one by doing it yourself. That's not to say you aren't building new skills and refining existing ones by interacting with agents. I believe that the way you communicate with an agent to achieve your desired outcome comes with its own learning curve. But that comes at the expense of learning more about the thing that the agent is now doing for you.

We are still in a subsidized era of agentic workflows. We can still pay a flat rate for a remarkable amount of usage. We can use this time to find the shortcuts and optimizations that will minimize our cost when the subsidies go away. But there is another dimension to consider: what are the tasks I'll wish I had done myself when it becomes financially infeasible to have an agent do them for me?

Keep sharpening the saw.