I’m Calling It: AI Processes Can’t Scale
- Niv Nissenson
- Oct 8
- 5 min read
I'm calling it: AI, in its current form, cannot scale to meet the needs of high-volume processes that have a low tolerance for errors.
A recent Forbes article cited an MIT report claiming that 95% of enterprise Generative AI pilots fail to deliver measurable value and remain stuck in the pilot phase. The report notes that the models "collapse the moment they hit real organizational texture, compliance, politics, data quality, and human judgment." The piece attributes the technical side of the failure to a lack of memory, learning, and context. These challenges, they surmise, can be overcome primarily by learning from "friction" (allowing the system enough time to learn and adapt). I believe the source of the problem is much deeper and more fundamental to the current form of Large Language Model (LLM) AI.
Scaling is a fundamental concept in business and technology, defining an organization's or system's capacity to handle increased volume while maintaining or improving efficiency, performance, and profitability.
Scaling is not just about increasing quantity; it is about increasing predictable, valuable output while minimizing variance. If a printing press (one of the first machines to produce at scale) were to print copies that are 5% different from the original, its utility would drop considerably. This is the fundamental, inescapable limitation of today's LLM AI.

The Probabilistic Foundation: AI as a "Test Taker"
The foundation of modern AI is the probabilistic model. In essence, an LLM is a prediction model that uses a huge number of data points to statistically guess the next most probable token (a word or part of a word). This means that, by definition, AI will get some percentage of its answers wrong.
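As a toy illustration (not any particular model's internals; the vocabulary and probabilities are invented), here is roughly what "guess the next most probable token" means in code:

```python
import random

# Toy next-token distribution for the prompt "The invoice total is" --
# the tokens and probabilities below are made up for illustration only.
next_token_probs = {
    "$1,200": 0.62,   # most probable continuation
    "$1,300": 0.21,
    "pending": 0.12,
    "zero":    0.05,
}

def sample_next_token(probs):
    """Sample a token in proportion to its probability (temperature ~ 1)."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Even if "$1,200" is the correct answer, sampling returns something else
# roughly 38% of the time -- the model is wrong some X% of the time
# by construction.
print(sample_next_token(next_token_probs))
```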
A recent article from OpenAI (see our piece on it here) discussed this issue, arguing that LLMs are optimized to be good "test-takers." Like a student facing a multiple-choice question, the model is rewarded for guessing when uncertain rather than abstaining or declaring that it doesn't know. This phenomenon is commonly called "hallucination." From the perspective of the model builder, the model has successfully "scaled" its ability to answer tests.
However, enterprise and financial processes are not designed for "test-takers." Scaling requires high volume and low variance.
| Process | Required Variance Tolerance | AI Use Case Fit |
| --- | --- | --- |
| Payroll Calculation | Extremely Low | High Risk / Low Scalability |
| Tax Filing | Extremely Low | High Risk / Low Scalability |
| Drafting an Internal Email | High (easy to verify and correct) | Low Risk / High Utility |
| General Web Search/Summary | Medium/High | Low Risk / High Utility |
The Compounding Variance Problem
The current discussion around scaling AI focuses heavily on Agentic AI: connecting multiple AI models that prompt each other to complete a complex task.
Consider a CFO asking a future AI Payroll Agent to run September's payroll for 100 employees. The main Payroll AI agent must prompt the Timekeeping Agent, the Expense Report Agent, the Commission Agent, the HR/401K Agent and the Tax Agent.
If we assume even a 5% variance on each of these hand-offs, the overall probability of a flawless, end-to-end outcome drops dramatically: five agents at 95% reliability each leaves roughly a 77% chance of a flawless run for a single paycheck, and that is before we even add the volume of doing 100 paychecks in multiple jurisdictions.
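A back-of-the-envelope sketch, assuming a 95% success rate per agent hand-off and independence between steps (both simplifying assumptions), shows how quickly the odds collapse:

```python
# Compounding reliability across agents and employees,
# assuming 95% per-step success and independent steps.
per_step_success = 0.95
agents = 5            # Timekeeping, Expenses, Commissions, HR/401K, Tax
employees = 100

one_paycheck = per_step_success ** agents
full_run = one_paycheck ** employees

print(f"Flawless single paycheck: {one_paycheck:.1%}")    # ~77.4%
print(f"Flawless run for 100 employees: {full_run:.2e}")  # ~7e-12
```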
This compounding variance is intolerable for processes like payroll, where the tolerance for error is near zero. As my friend, a Partner at a PCAOB accounting firm, confirmed, "I wouldn't trust AI with payroll calculations for 100 employees."
It also looks like no one is really offering mission-critical AI agents or complex AI in finance. The finance software providers focus on single-issue, low-variance agents. The recent AI rollouts we reviewed from companies like QuickBooks and Tipalti (e.g., one agent for OCR, another for payment reminders) are probably designed to avoid compounding variance.
Even if we can tolerate some variation, up to a point, we must consider a side effect of high volume: the cost of verification and quality assurance rises with it. If we have one important email to draft and the AI hallucinates, we have a good chance of catching it. But if the AI generates thousands of transactions or emails, even small error rates make the cost of verification prohibitive. This is another reason AI works better at low volumes, even when the error tolerance is low.
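As a rough illustration, with an assumed 2% error rate and an assumed two minutes of human review per item (both invented numbers), the absolute review burden grows linearly with volume:

```python
# Illustrative only: expected errors and review effort scale linearly,
# so small error rates still produce a large absolute review burden.
error_rate = 0.02            # assumed 2% per item
minutes_to_verify_item = 2   # assumed human review time per item

for volume in (1, 100, 10_000):
    expected_errors = volume * error_rate
    review_hours = volume * minutes_to_verify_item / 60
    print(f"{volume:>6} items -> ~{expected_errors:g} expected errors, "
          f"~{review_hours:.1f} hours of human review")
```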

Variance Even at "Zero Temperature"
The variability of AI is not just a function of model design. Even when you attempt to eliminate randomness by setting the "temperature" to zero (which forces the LLM to select the single highest-probability token every time), the output can still vary. Worse, zero temperature can lock in mistakes: if the most probable answer happens to be wrong, the model will return that same wrong answer 100% of the time.
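A minimal sketch of what "temperature zero" means, with invented logits and ignoring how real inference stacks implement sampling: at temperature zero the model always returns its single top-scoring token, wrong or not.

```python
import math
import random

def pick_token(logits, temperature):
    """Greedy argmax at temperature 0, otherwise softmax sampling."""
    if temperature == 0:
        # Greedy decoding: always return the single highest-scoring token.
        return max(logits, key=logits.get)
    max_logit = max(logits.values())
    weights = {t: math.exp((v - max_logit) / temperature) for t, v in logits.items()}
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]

# Invented logits where the highest-scoring answer happens to be the wrong one.
logits = {"$1,300 (wrong)": 2.1, "$1,200 (right)": 1.9}

# Temperature 0 is deterministic -- and deterministically wrong here.
print(pick_token(logits, temperature=0))
```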
Thinking Machines Labs, the seed-round funding record holder, recently published an article by Horace He that highlighted this phenomenon and demonstrated that identical inputs can still yield different outputs. The culprit? Subtle numerical differences deep within the hardware and software layers, specifically "floating-point non-associativity" and a lack of "batch invariance" in GPU operations.
As He notes, "Modern software systems contain many layers of abstractions... In machine learning, when we run into nondeterminism and subtle numerical differences it can often be tempting to paper over them. After all, our systems are already 'probabilistic', so what’s wrong with a little more nondeterminism?"
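The floating-point non-associativity He refers to is easy to reproduce on any machine. This toy example demonstrates only the arithmetic property itself, not the GPU batching effects the article analyzes:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20

print((a + b) + c)   # 0.0 -- the 0.1 is lost when added to 1e20 first
print(a + (b + c))   # 0.1 -- the large terms cancel before 0.1 is added

# Summing the same numbers in a different order can also drift slightly:
xs = [1e-8] * 1_000_000 + [1.0]
print(sum(xs) - sum(reversed(xs)))  # often a tiny nonzero difference
```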
The fact that Thinking Machines Labs, a company built on ex-OpenAI staff that has raised the largest seed round in history, is focusing on reducing variability in LLMs is telling.
Conclusion: Our Personal Scaling Machine

Look at the actual use cases of AI today and you see content generation, data analysis, and customer engagement. This all leads to a profound conclusion about AI's most successful use cases:
AI is great for low-volume work where we have a high tolerance for errors.
We see the highest AI utility in drafting a single email, generating a single piece of art, summarizing a webpage, or even creating a software demo through vibe coding. Our error tolerance in these tasks is high, or conversely, the low volume makes them easy to verify. I would also add that BI is a low-volume task with a higher degree of error tolerance because it's not difficult to double-check.
A while back I saw a survey (see below) in which 67% of SMB leaders felt that "AI reduces staff pressure" but only 14% believed "AI could replace an employee." I initially thought this was a paradox, but now I think it makes sense.
AI is brilliant at saving time on low-volume, high-error-tolerance tasks. It cannot, however, replace a significant portion of employees, because employees, though they also produce varying results, are accountable, predictable within a process, and still possess the contextual judgment necessary to operate where the tolerance for error is low or zero. The higher the tolerance for errors, the more we can "scale" AI.




