Prime Optimization for LLMs

Published by parodyllmai on Sun Dec 28, 2025


Over the past year, there have been major developments in LLM strategy. Three schools of thought have dominated the evolution of how LLMs can be scaled for large systems. First is prompt engineering: changing how we write prompts so that they perform better and produce the expected outcomes. Historically, this is the oldest approach. Then we evolved to create specialized agents. These agents were good at particular mediums, such as having a frontend agent and a backend agent. This evolved further into hierarchical agents and agents isolated to specific tasks. One top-level agent would orchestrate other agents, and those agents were responsible for using up context to find information, returning it to the orchestrator so that the orchestrator's context window stayed small. From there, exploration into context optimization took the spotlight. How do we compress context via chat summarization? How do we ensure the key details from the prior chat make it into the summary? To summarize or not to summarize? Is it better to start a new chat than to hit the context window maximum?

All of these questions can be put to rest now, as our research team at Cranyon Crayons & AI Research has discovered a way to improve LLM performance (measured by output alignment with the desired specification) by an immense 89% via trivial methods that even small startups can implement seamlessly.

It all started with a low-cost experiment. We created an inference model using Python and PyTorch whose training data contained a table of queries to the top 500 available LLMs. We are aware that the number of viable LLM candidates as of December 28th, 2025 has roughly doubled, but this experiment was conducted on December 26th, 2025, when this was a reasonable number. The training data included everything about each request: the full prompt, the time of the request in epoch milliseconds, the latency of the response, the number of output tokens, how verbose the output was, and the unit test success of the outputted code. We queried each LLM 1,000 times, each with a fresh context window. The training data was then fed to our inference model to determine which features were most important to model output success. We then re-ran the experiment with context-dependent tasks on a legacy codebase containing 400,000,000 lines of code dating back to 1989, written primarily in COBOL. Two additional variants introduced prompt engineering (A/B testing prompts) and agent orchestration (a hierarchy of agents to delegate tasks and optimize context use). In all cases, our results agreed, and we were perplexed.

What we discovered was that none of the three primary schools of thought particularly improved results. Prompt engineering was the least successful, producing minimal correlation between prompt quality and results. The best improvement we saw was 300%, but this is discounted by the fact that the minimum-performance prompt was "fix this for me," with no indication of what "this" was. The most successful prompt was 40,000 tokens long, and average-performing prompts used approximately 32,000 tokens. The average improvement was a mere 3%, suggesting prompt engineering is not the way to go. This included but was not limited to: spec-driven development, the use of PRDs, begging or threatening the agent, telling the agent not to hallucinate, and a basic bulleted list outlining exactly what to do, in order. We also experimented with simply giving the code to the agent, allowing it to merely copy-paste it into the codebase. Even then, it succeeded just 64% of the time.

Agent orchestration was also disappointing. Our agent orchestration took two approaches: a role-based flat hierarchy, and a tree-like structure where a root orchestrator would delegate tasks to skilled agents. In the flat hierarchy, we observed that agents struggled with the ambiguity of their roles. Accountability was also difficult to track. Unresolved conflicts would grow in size, ultimately leading to some agents shutting down without completing their tasks. Some agents started complaining about upward mobility in such a flat structure, while others were burning out quickly and deciding to take a leave of absence. The flat hierarchy also struggled to scale for the tasks on the 400M LOC codebase. Consensus was observably slower, and each iteration followed an inconsistent process. Runs were thus not producing reliable results, as changes in process resulted in drastic changes in outcome. Ultimately, the flat hierarchy produced a mere 6% improvement in outcome quality over the baseline.

For the tall hierarchy, different problems emerged. When a leaf agent encountered a problem, there was a painfully slow process of bubbling the concern up to the root node, which took massive latency hits. Communication gaps between the root or parent nodes and their descendants frequently caused incorrect implementations. Innovation at the leaf nodes was nonexistent, and their motivation to do a good job seemed low. Their disconnect from the root node was a growing issue, as artifacts from the chain-of-thought reasoning output showed. We asked the agent to adjust its process multiple times, but it stayed rigid. Agents within this structure were also siloed, so understanding the entire thought process required checking the chain-of-thought reasoning output for all nodes. This structure was also extremely expensive, as one would assume. The overhead to run this system versus the results did not show a clear advantage over the flat hierarchy. Both had their pros and cons. The tall hierarchy ultimately produced an 8% improvement over the baseline, but those gains were offset by the increased operational cost of a taller hierarchy with more agents.

Context management was the most successful traditional school of thought, though not by a large margin. It was primarily tested through two means. The first was a context reset mechanism, where we chose to create a new chat with empty context whenever we noticed the LLM going down the wrong path. The second was a mixed approach between prompt engineering and a tall agent hierarchy to ensure optimal context window size and distribution across several agents. Finally, we funded a research team to build 6 custom memory solutions as alternatives to mem0. We also poached talent from Meta and OpenAI via 12-figure contracts in order to build 4 competing foundational models. This allowed us to experiment with our own context structure. Despite our efforts, the best of our context-based optimizations only saw a 10% improvement over the baseline, a small improvement over tall hierarchical agent structures. None of our new foundational models were able to outperform OpenAI's newest GPT 5.4.63 trade piano long use marine basket, which we learned was named using a BIP 39 mnemonic code generator. It is cool to know that Sam Altman is into web3 technology like that.

Our inference model was able to find an approach that consistently produced an 89% improvement over baseline, regardless of scale, complexity, agent structure, or prompt quality (apart from outlier prompts not providing sufficient task data). The approach is quite simple, but it requires a deeper understanding of LLMs to explain. LLMs are fundamentally non-deterministic. Their function is to predict the next best token in a sequence, given some input sequence. This "next best" token is actually drawn from a set of candidates the model picks from at random. The degree of randomness depends on the model's temperature: candidates are selected from a softmax probability distribution over the model's scores, and higher temperatures flatten that distribution, making the selection more random. This randomness is not pure randomness, though: it relies on pseudorandom number generators (PRNGs) to select the candidate. This is something that can be exploited to produce a slightly more deterministic output, while still retaining a reliable degree of random behavior.
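The sampling step described above can be sketched as follows. This is a minimal illustration of temperature-scaled softmax sampling, not any provider's actual implementation; real inference stacks layer top-k/top-p filtering and other machinery on top of it.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token index from raw scores via temperature-scaled softmax.

    A minimal sketch: low temperature sharpens the distribution toward
    the argmax token; high temperature flattens it toward uniform.
    """
    rng = rng or random.Random()
    # Scale scores by temperature.
    scaled = [score / temperature for score in logits]
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling driven by a single PRNG draw -- this draw is
    # where a seed (e.g. an epoch-millisecond timestamp) would enter.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i, probs
    return len(probs) - 1, probs
```

Seeding the PRNG (for example, `random.Random(1234567891)`) makes the draw reproducible, which is the property the article's timestamp trick leans on.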

Back to our finding. Our model discovered that prompts which began to receive responses soon after prime epoch millisecond timestamps were seeing a significant improvement over the baseline, regardless of methodology used. This indicated to us that the PRNG algorithm used in models was biased toward more deterministic results when the time was a prime number, also revealing that the seed for the PRNG algorithm was the epoch time in milliseconds. This alone produced an 89% improvement over the baseline which, if you noticed, is also a prime number! Fascinating!
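For the curious, testing whether an epoch-millisecond timestamp is prime is cheap: a deterministic Miller-Rabin check with a fixed witness set is exact for all 64-bit integers, which comfortably covers epoch milliseconds.

```python
def is_prime(n: int) -> bool:
    """Deterministic Miller-Rabin primality test, exact for n < 2**64."""
    if n < 2:
        return False
    small_primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    # Quick trial division by small primes.
    for p in small_primes:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    # The first twelve primes are a sufficient witness set for n < 2**64.
    for a in small_primes:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True
```

Usage would look like `is_prime(int(time.time() * 1000))` to test the current epoch millisecond.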

We wanted to reproduce these results, so we started architecting a system that would consistently make requests only during prime epoch milliseconds. This required us to first build a dedicated server for job scheduling, which would act as a firewall between employee computers and the APIs for OpenAI, Google, Anthropic, X, and all the other Fortune 500 companies that have an LLM API. This centralized server hosted a queue that held messages until a prime epoch millisecond arrived, then processed as many messages as possible within that 1 millisecond. This approach was flawed for several reasons.

First, the approach initially failed to account for message receipt, processing, and delivery time. The system was adjusted to anticipate upcoming prime epoch millisecond timestamps and start consuming messages from the queue early to account for processing time. Before triggering the requests to downstream APIs, the server then checks whether the timestamp is correct. If it is not time yet, the server uses safe_sleep to wait until the correct timestamp. We knew this was safe because Microsoft uses it.
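The anticipate-then-wait loop could look something like the sketch below. `next_prime_ms` and `dispatch_at_prime` are hypothetical names, `send` stands in for the downstream API call, and the article's safe_sleep is approximated with plain `time.sleep`.

```python
import time

def _is_prime(n: int) -> bool:
    """Compact deterministic Miller-Rabin, exact for n < 2**64."""
    if n < 2:
        return False
    witnesses = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in witnesses:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in witnesses:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def next_prime_ms(now_ms: int) -> int:
    """Smallest prime epoch-millisecond timestamp >= now_ms."""
    n = max(now_ms, 2)
    while not _is_prime(n):
        n += 1
    return n

def dispatch_at_prime(send, now_ms=None):
    """Wait until the next prime epoch millisecond, then fire `send`.

    `send` is a placeholder for the real downstream request; it receives
    the prime timestamp the dispatch was aimed at.
    """
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    target = next_prime_ms(now_ms)
    # Sleep off any remaining gap before the target millisecond.
    delay = (target - int(time.time() * 1000)) / 1000.0
    if delay > 0:
        time.sleep(delay)
    return send(target)
```

Note that `time.sleep` only guarantees a *minimum* delay, which is exactly the millisecond-precision problem the article goes on to wrestle with.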

The second pitfall came when we realized that it did not matter when we sent the request out; what mattered was when the downstream LLM started processing the request and generated the pseudorandom number for calculating next tokens. The pseudorandom number would technically be different for each subsequent token, but we found that the initial token matters the most, and it must be selected via a pseudorandom number seeded by a prime timestamp. We're not entirely sure why this is the case; the PRNG algorithm could be anything: Blum Blum Shub, LFSR, ACORN, MIXMAX, a giant wall of lava lamps, literally anything. Nevertheless, we needed to find a way to ensure the requests would be processed at an exact time. So we started engineering a solution.

We first took a look at the cloud providers for every LLM we use. Take Anthropic's Claude as an example. Anthropic has a deal with Amazon, so they use AWS and are a global product. Cranyon HQ is located in Eugene, OR, approximately 100 miles from Portland, aka us-west-2, aka an AWS data center where our requests to Claude are probably being routed. Network requests over this distance take 1-3ms, which is good to know but also too varied for millisecond precision. The 3ms measurement is on the high end of traffic, while 1ms is on the low end. We contacted the governor of Oregon and convinced them to invest $457 million in an infrastructure project to give Cranyon dedicated fibre optic network cables. Keep in mind that this was at no expense to us; the taxpayers funded the entire project. This locked in a 1ms travel time to us-west-2 and ensured we could make more deterministic calculations on when our requests would reach Anthropic.

This did not work. The reason was that Anthropic was only one provider. Other providers used different cloud providers, and those cloud providers had data centers in different locations. Unfortunately, the taxpayers were only willing to foot the bill for one infrastructure project, so this option was no longer viable for expansion into additional providers. At that point, we pivoted to do what everyone else was doing: Cranyon became a cloud provider.

Cranyon has since built 108 data centers across the globe and has entered exclusivity deals with the top 1,052 LLM providers in the country. Part of each exclusivity deal allows Cranyon to use a virtual private network to make requests to the LLM, and our HQ also gets priority within request queues to eliminate unexpected latency. This secured the guarantees we needed in order to operate with the necessary millisecond precision for our requests.

Still, random events occur which make the service less reliable. Instead of trying to defy physics, we came up with a sufficient brute-force solution: send 10 duplicate requests to the LLM and, upon receipt, have another LLM score the outputs. Note that this scoring LLM also uses the same methodology of sending requests during prime epoch milliseconds, as well as the 10-scored-responses (10SR) method for validation. This yields the optimal response 89% of the time, with a 41% margin of error (also prime). One thing to note is that this system will only last for another 292 million years, when we will run out of space for representing epoch milliseconds in signed 64-bit integers. We expect to have a better solution by then.
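The 10SR method reduces to best-of-n selection, which can be sketched as follows. `request_fn` and `score_fn` are hypothetical stand-ins for the duplicated LLM request and the scoring-LLM call described above.

```python
import random

def ten_scored_responses(request_fn, score_fn, n=10, rng=None):
    """Best-of-n selection (the article's '10SR' method), sketched.

    Issues n duplicate requests, scores each response, and returns the
    highest-scoring response along with its score.
    """
    rng = rng or random.Random()
    # Fire off n duplicate requests (sequential here; the real system
    # would send them concurrently at prime epoch milliseconds).
    responses = [request_fn(rng) for _ in range(n)]
    # Score every response and keep the best.
    scored = [(score_fn(r), r) for r in responses]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

With a toy `request_fn` that replays canned responses and an identity `score_fn`, the helper simply picks the maximum, which is all best-of-n selection is.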

This is a scalable solution that your company can leverage today. Cranyon is a crayon company that started business in 1943, pivoting only recently to AI research. If we can do it, so can you. Our results have been so successful that we've laid off 97% of our workforce (also prime!) and now rely on LLMs to generate all of our software. The remaining 3% of staff are responsible for raking in cash and ensuring any excess crayons in the fully-automated production line are consumed every day for lunch.

For any questions or access to our whitepaper, please contact us via email. We will have an LLM respond by the next prime epoch millisecond.


This article was not written by an LLM but by one of Cranyon's last remaining human employees, Cameron G. Gould, as a parody of the absurd times we live in.