
The AI Trial Balance Trial: low effort, refusals and hallucinations



As part of my ongoing exploration into how AI can assist finance executives, I’ve tested models on consolidations, data entry, and Excel AI add-ons. This time, I wanted to see if AI can handle another basic accounting task: turning a trial balance into a properly formatted P&L and balance sheet.


On the surface, this should be simple for a computer, as it's almost pure logic. Trial balances summarize debits and credits; all that’s left is classifying accounts into the right sections of the financial statements (the classifications are also given and don't need interpretation). However, for three of the major AIs it wasn't just difficult, it was impossible.
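To show just how mechanical the task is, here's a minimal sketch in Python. The accounts, amounts, and section names are invented for illustration; the point is that once each account carries a classification, building the statements is deterministic arithmetic.

```python
# Minimal sketch: turning a classified trial balance into a P&L and
# balance sheet is pure bookkeeping logic. Accounts and amounts are
# invented for illustration; debits are positive, credits negative.
from collections import defaultdict

trial_balance = [
    # (account, classification, debit(+)/credit(-))
    ("Cash",                "Assets",       50_000),
    ("Accounts Receivable", "Assets",       20_000),
    ("Accounts Payable",    "Liabilities", -15_000),
    ("Common Stock",        "Equity",      -40_000),
    ("Revenue",             "Revenue",     -80_000),
    ("Salaries Expense",    "Expenses",     65_000),
]

# Group accounts by their given classification.
sections = defaultdict(float)
for account, section, amount in trial_balance:
    sections[section] += amount

# P&L: revenue is a credit balance (negative), expenses are debits.
net_income = -(sections["Revenue"] + sections["Expenses"])

# Balance sheet: net income must flow into equity (retained earnings).
assets = sections["Assets"]
liabilities = -sections["Liabilities"]
equity = -sections["Equity"] + net_income

assert assets == liabilities + equity  # the statement must tie out
```

No interpretation, no judgment calls: classify, sum, close income to equity, and the sheet balances.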


The Test Setup

I started with a clean, short, real-life trial balance: 52 classified accounts in debit/credit format, nothing fancy. I asked each major AI model to produce a full P&L and balance sheet.


ChatGPT: Confident, Wrong, and Low Effort

ChatGPT began promisingly, asking smart questions about my account numbering system. Then it produced its very low-effort “reports”:

  • A P&L with three lines — revenue, expenses, and net income.

  • A balance sheet with three lines — assets, liabilities, and equity.

It didn’t flow the net income into equity, and instead offered this reassuring note:

“The small rounding difference matches the P&L net income, which means your trial balance ties out correctly.”

The “rounding difference” was $1,000. Maybe that's not big, but it's no rounding difference.
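For intuition, here's a toy example (all numbers invented) of why a gap that exactly matches net income is a structural error, not rounding: the income was simply never closed into equity.

```python
# Toy numbers (invented): if net income hasn't been closed to equity,
# the balance sheet is off by exactly that amount. That's a structural
# error, not a rounding difference.
assets = 100_000
liabilities = 60_000
equity_before_closing = 39_000  # retained earnings not yet updated
net_income = 1_000

gap = assets - (liabilities + equity_before_closing)
assert gap == net_income  # off by exactly net income, not by pennies

# Closing net income into equity makes the sheet balance.
equity = equity_before_closing + net_income
assert assets == liabilities + equity
```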


When I asked for a properly formatted report, ChatGPT re-generated the same data — this time splitting it into a “P&L trial balance” and a “balance-sheet trial balance.” When prompted again, it lumped all expenses together under one category and ignored my account classifications entirely.


ChatGPT asked to create a P&L from a simple trial balance and produces very low quality work


Verdict: Looked confident, delivered little. Failed the task.



Gemini: Won't Even Try

Gemini was more transparent. It immediately explained that it wouldn’t attempt to build full reports but could walk me through how to do it. Technically honest, but not helpful if you’re testing automation.


Gemini won't even try to make a basic trial balance into financial reports.

Verdict: Next!



Claude: Try-Hard but Failed Harder

Claude, on the other hand, tried to actually do the job — and at first glance, it seemed to understand accounting flow better than the others. It even recognized that net income should flow to equity.


Claude understands the assignment on how to turn a basic trial balance into a P&L and Balance Sheet.

Then things went off the rails with major hallucinations and errors.

Claude invented new accounts and changed existing ones:

  • Added a mysterious COGS account with $644,000 that never existed.

  • Re-labeled my AR account (11050) as 21050 — converting it from an asset to a liability.

  • Fabricated new account numbers like 21150 for inventory, out of thin air.

  • Departed from GAAP structure entirely.

Not surprisingly, the resulting balance sheet didn’t balance.

To its credit, Claude noted this itself:

“The balance sheet doesn’t balance — you should investigate further.”

Claude generated an incorrect balance sheet that didn't balance and had hallucinated accounts.

Proofing Claude's work on P&L creation. Many errors and wholesale hallucinations.


When I called out the wrong accounts, Claude apologized and produced a new report… which was still wrong. It even “verified” its own corrections while leaving the same fake inventory account untouched.


Claude apologizes for $644,318 fabricated account.

At the end, Claude needed me to tell it whether a specific account was actually in the trial balance. It wasn't, but this underscores how poorly AI performs on these detail-oriented tasks: it doesn't know whether it fabricated something or not!


Verdict: Earnest effort, catastrophic execution.


The Results: A Pattern Emerges

Model | Approach | Result
ChatGPT - the low effort | Confidently simplified everything | Failed — incomplete and inaccurate
Gemini - won't even try | Refused to execute, explained theory | Not useful for automation
Claude - try-hard failure | Tried to perform end-to-end accounting | Failed with fabricated data and logic errors



What This Means for Finance Teams

This test aligns with what we’ve seen in earlier experiments on consolidations and data integrity: general-purpose AI is not ready for real accounting.


Even when AI “knows” the logic of a task, it often applies it inconsistently, misreads numeric context, or confidently fills gaps with fabricated data.


In accounting, that’s not a small problem; it’s an existential one. Numbers must tie. Trial balances must balance. You can’t “hallucinate” your way to GAAP.


AI models today are powerful text generators, not accountants. They can help explain, summarize, or assist with logic, but when it comes to building accurate financial statements, they’re unreliable.


If you want to automate financial reporting, you need specialized finance tools built for accounting, working from structured data, preferably from your ERP.


I plan to keep testing; maybe Elkar or domain-specific models will fare better. But for now, it’s clear: general AI chatbots may seem to speak accounting, but they don't yet understand it.

