AI in construction cost review: what to trust and what to verify

AI is now a real tool in construction cost review, and it is also the single most over-promised technology in the industry. The honest version is that large language models are very good at the parts of cost review that look like reading and summarizing, and quite bad at the parts that look like arithmetic and judgement. A practical deployment puts deterministic rules in front of the LLM, uses the LLM for the residue, and ties both to a human review step that owns the final disposition. The trust-but-verify cadence is what separates a useful AI workflow from an expensive one that quietly produces wrong numbers.

What LLMs are actually good at

The strengths of a Claude-class model in cost review are narrow and consistent. They are worth listing precisely, because most of the marketing around AI in construction conflates them with capabilities the models do not have.

LLMs read narrative scope language well. Given a draw package with two hundred line items written by twenty different subs in twenty different styles, the model identifies the items that describe the same physical work in different words. The bath fan duct that the HVAC sub described as “exhaust ductwork” and the plumber described as “vent stack labor” show up to the model as overlapping scope, even though the cost codes, vendors, and amounts are all different. This is the kind of cross-trade duplicate that deterministic rules cannot catch unless someone has already coded the specific overlap.

LLMs flag quantity outliers when given a benchmark. Given the typical range of drywall sheets per 1,000 SF and a draw line that bills 300 sheets on a 1,784 SF build, the model produces a clean note that explains the expected range, the billed quantity, and the percentage deviation. The note is the kind of thing a senior estimator would write and a junior reviewer would not, and the model produces it for every flagged line in the budget in roughly thirty seconds.
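The arithmetic behind that kind of note is simple enough to compute outside the model and hand to it as a fact. A minimal Python sketch follows. The helper name `quantity_deviation` is hypothetical, and the 100 to 130 sheets per 1,000 SF band is a placeholder rather than a published benchmark; substitute the range your estimators actually use.

```python
def quantity_deviation(billed_qty, area_sf, low_per_1000, high_per_1000):
    """Return (rate per 1,000 SF, percent deviation from the nearest band edge).

    The deviation is 0.0 when the rate falls inside the band.
    """
    rate = billed_qty / area_sf * 1000
    if rate > high_per_1000:
        deviation = (rate - high_per_1000) / high_per_1000 * 100
    elif rate < low_per_1000:
        deviation = (rate - low_per_1000) / low_per_1000 * 100
    else:
        deviation = 0.0
    return rate, deviation


# Placeholder band: 100-130 sheets per 1,000 SF is an assumption for illustration.
rate, deviation = quantity_deviation(300, 1784, 100, 130)
print(f"{rate:.1f} sheets per 1,000 SF, {deviation:+.1f}% vs. nearest band edge")
```

Passing the computed rate and deviation into the prompt lets the model write the explanation without doing the division itself.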

LLMs summarize variance reports. Given a budget-to-actual report with forty cost codes and three months of activity, the model produces a short narrative that calls out the three or four cost codes driving most of the variance, with the dollar amount and the apparent cause. That summary is a faster read than the underlying spreadsheet, and it is the right artifact to share with a lender or a buyer.

LLMs draft review notes. The note that goes into the audit trail when a flagged line is approved or rejected is structured prose, and the model writes a passable first draft of every one of those notes. The reviewer edits the draft and signs off; the time saved is the time the reviewer would have spent staring at a blank text box.

What LLMs are bad at

The weaknesses are where most of the AI risk in cost review lives.

LLMs cannot compute exact totals reliably. Asked to sum forty line items, a current model will produce an answer that is correct most of the time and wrong some of the time, with no warning when it is wrong. Any cost review that depends on the model's arithmetic is a cost review with a hidden error rate. The fix is to never ask the model for a total. The model identifies the lines that need attention, and the deterministic system computes the totals.
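One way to enforce that split is to accept only line IDs from the model and do the summation in exact decimal arithmetic. The sketch below is illustrative: the function name, the line IDs, and the amounts are all made up for the example.

```python
from decimal import Decimal


def total_flagged(lines, flagged_ids):
    """Sum amounts for the flagged line IDs using exact decimal arithmetic.

    lines: dict of line ID -> Decimal amount (the system of record).
    flagged_ids: IDs returned by the model; duplicates are counted once.
    """
    ids = set(flagged_ids)
    unknown = ids - set(lines)
    if unknown:
        raise KeyError(f"flagged IDs not in budget: {sorted(unknown)}")
    return sum((lines[i] for i in ids), Decimal("0"))


# Hypothetical line items; amounts are built from strings to avoid float error.
lines = {
    "L-101": Decimal("1250.10"),
    "L-102": Decimal("830.20"),
    "L-103": Decimal("412.05"),
}
# Pretend the model returned these IDs. It never returns a dollar figure.
model_flagged = ["L-101", "L-103", "L-101"]
total = total_flagged(lines, model_flagged)
print(total)
```

The `KeyError` on an unknown ID doubles as a cheap check that the model has not invented a line.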

LLMs struggle to distinguish a legitimate scope change from an error. A framing line that comes in 18% above the budget might be a real change order (the buyer added a bonus room), or it might be a vendor billing for work that was never authorized. The model can describe the variance and ask the question. It cannot answer the question, because the answer depends on information it does not have: the change-order log, the buyer correspondence, and the PM’s memory of what was approved on site. The human reviewer owns that judgement.

LLMs cannot judge whether a price is fair for the local market. A plumbing rough-in priced at $7 per SF in Sweetwater, Tennessee, is cheap. The same price in San Francisco is impossibly low and almost certainly indicates a missing scope. The model can compare the price to a published benchmark, but it does not know the local market the way a senior PM does, and it will accept a number that the PM would immediately question.

LLMs hallucinate when the prompt leaves gaps for them to fill. Given a vague prompt and an incomplete document, a model will produce a confident narrative about line items that do not exist, vendors that did not bill, and totals that do not match. The hallucination risk is real and is the single most cited reason that AI cost review is treated with suspicion. The mitigation is to constrain the model to the actual budget data and to require citations for every claim, but the risk does not go to zero, which is why the human review step is non-negotiable.

The rule-engine + LLM hybrid

The right architecture is deterministic rules first, LLM for the residue. The deterministic engine encodes everything that has a clean rule (vendor name normalization, cost-code overlap, quantity ranges per project size, duplicate transactions, retainage math, draw-percent math). The engine produces a list of flagged items with an exact reason for each flag, and the engine cannot hallucinate.

The LLM runs after the deterministic pass, on the items the engine could not classify. The model reads the line scopes for narrative duplicates, checks the cost-code allocation against the project description, looks for missing trades the rule engine did not know to require, and writes a short note for each finding. The output is a list of additional flags with prose explanations, which join the deterministic flags on the same validation queue.

Layer	Strength	Failure mode	What it produces
Deterministic rules	Exact, fast, repeatable	Misses anything not pre-encoded	Flag with rule reference and expected value
LLM cross-check	Reads narrative, finds novel patterns	Hallucinates, misses arithmetic	Flag with prose explanation and cited line
Human review	Judges market context and authorization	Slow, expensive, fatigues at scale	Final disposition with audit trail
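The ordering of the first two layers can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `Flag` record, the two sample rules, the shape of a line item, and the `stub_llm` function that stands in for a real model call. The sketch also reads "residue" as the lines the rules did not flag, which is one interpretation among several.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    line_id: str
    source: str  # "rule" or "llm"
    reason: str


def rule_pass(lines, qty_band):
    """Deterministic checks: duplicate transactions and quantity out of band."""
    flags, seen = [], set()
    low, high = qty_band
    for line in lines:
        key = (line["vendor"], line["invoice"], line["amount"])
        if key in seen:
            flags.append(Flag(line["id"], "rule", "duplicate transaction"))
        seen.add(key)
        if not low <= line["qty"] <= high:
            flags.append(Flag(line["id"], "rule", f"qty outside {low}-{high}"))
    return flags


def review_queue(lines, qty_band, llm_check):
    """Run the rules first, then hand only the unflagged residue to the LLM layer."""
    flags = rule_pass(lines, qty_band)
    flagged = {f.line_id for f in flags}
    residue = [line for line in lines if line["id"] not in flagged]
    return flags + llm_check(residue)


def stub_llm(residue):
    # Stand-in for a model call: flags any scope text that mentions "vent".
    return [Flag(line["id"], "llm", "possible cross-trade overlap")
            for line in residue if "vent" in line["scope"]]


lines = [
    {"id": "A1", "vendor": "Acme", "invoice": "17", "amount": "500.00",
     "qty": 10, "scope": "exhaust ductwork"},
    {"id": "A2", "vendor": "Acme", "invoice": "17", "amount": "500.00",
     "qty": 10, "scope": "exhaust ductwork"},
    {"id": "A3", "vendor": "Pipe Co", "invoice": "9", "amount": "300.00",
     "qty": 10, "scope": "vent stack labor"},
]
queue = review_queue(lines, (1, 50), stub_llm)
```

Both kinds of flag land in one list, so the reviewer works a single queue regardless of which layer raised the item.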

The human review step

The human review is the part that ties the system together, and it is the part that gets shortchanged in most AI deployments. The reviewer (project accountant, office manager, or PM) sees a unified queue of flagged items from both layers, with the rule reference or prose note attached, and takes one of three actions per item. Approve means the line is correct as billed and the reasoning gets logged. Edit means the line is wrong, the correction gets posted, and the original error is captured for trend tracking. Reject means the line is fraudulent or misallocated and gets sent back to the vendor with a written note.
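The three actions map naturally onto a small enumerated type plus an immutable audit record. The following Python sketch uses hypothetical field names, and it assumes that every disposition carries a written note; a team could reasonably relax that for simple approvals.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Disposition(Enum):
    APPROVE = "approve"  # correct as billed, reasoning logged
    EDIT = "edit"        # wrong, correction posted, original error kept for trends
    REJECT = "reject"    # sent back to the vendor with a written note


@dataclass(frozen=True)
class AuditEntry:
    line_id: str
    disposition: Disposition
    reviewer: str
    note: str
    signed_at: datetime


def sign_off(line_id, disposition, reviewer, note):
    """Create an audit entry; this sketch requires a non-empty note every time."""
    if not note.strip():
        raise ValueError("a written note is required for every disposition")
    return AuditEntry(line_id, disposition, reviewer, note.strip(),
                      datetime.now(timezone.utc))


entry = sign_off("L-204", Disposition.REJECT, "project accountant",
                 "Work not authorized; returned to vendor.")
```

Freezing the dataclass means an entry cannot be altered after sign-off, which suits an audit trail.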

The dispositions feed back into both layers. An approved item with a documented reason updates the deterministic tolerance band for that cost code on that project type. A rejected item generates a vendor-level note that informs future bills from that sub. The LLM sees the disposition history when it runs its next pass and adjusts its findings accordingly. The system gets more accurate as the history accumulates, in the same way a senior estimator gets more accurate over a career.

Hallucination risk and the citation rule

The single most important guardrail on the LLM layer is the requirement that every claim cite a specific line item by ID. The model cannot say “the framing budget looks high” without pointing to the exact framing line, the exact billed amount, and the exact expected range. The citation requirement is a constraint that reduces but does not eliminate hallucination, because a model can still cite a real line item and produce a wrong narrative about it.

The second guardrail is to compare the model output to the deterministic engine. If the model flags a line that the engine did not flag, the human reviewer reads both. If the model flags a line that the engine flagged for a different reason, the disagreement itself is a useful signal. If the model produces a finding that cites a line item that does not exist in the budget, the finding is discarded and the model interaction is logged for review.
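The first two guardrails reduce to a lookup against the budget and a comparison against the engine's flags. A minimal Python sketch, assuming a hypothetical finding shape (a dict with `line_id` and `reason`) and assuming both layers emit comparable reason codes:

```python
def triage_findings(findings, budget_ids, engine_reasons):
    """Split model findings into kept and discarded lists by cited line ID.

    findings: list of dicts with "line_id" and "reason" (hypothetical shape).
    budget_ids: set of line IDs that actually exist in the budget.
    engine_reasons: dict of line ID -> reason code from the deterministic engine.
    """
    keep, discarded = [], []
    for finding in findings:
        cited = finding.get("line_id")
        if cited not in budget_ids:
            discarded.append(finding)  # cites a nonexistent line: drop and log
            continue
        engine_reason = engine_reasons.get(cited)
        if engine_reason is None:
            status = "model_only"      # the engine did not flag this line
        elif engine_reason == finding["reason"]:
            status = "agree"
        else:
            status = "disagree"        # different reasons: a signal in itself
        keep.append({**finding, "status": status})
    return keep, discarded


budget_ids = {"L-1", "L-2", "L-3"}
engine_reasons = {"L-1": "qty_out_of_band", "L-2": "duplicate"}
findings = [
    {"line_id": "L-1", "reason": "qty_out_of_band"},
    {"line_id": "L-2", "reason": "scope_overlap"},
    {"line_id": "L-3", "reason": "scope_overlap"},
    {"line_id": "L-9", "reason": "scope_overlap"},
]
keep, discarded = triage_findings(findings, budget_ids, engine_reasons)
```

This check only proves the cited line exists. Whether the narrative about that line is correct remains the reviewer's call.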

The third guardrail is the reviewer’s own judgement. Every flag is a draft for human review, never a final answer. The reviewer’s sign-off is what moves the line from flagged to resolved, and the sign-off is the artifact that goes into the audit trail.

The trust-but-verify cadence

Treat the LLM output as a draft for human review. The reviewer reads the model finding, checks the cited line in the budget, applies local knowledge, and accepts or rejects the finding with a written note. The cadence is fast on the items the reviewer agrees with and slow on the items the reviewer questions, which is the right distribution of attention.

Two anti-patterns are worth naming. The first is to skip the human review step on the basis that the model is usually right. The model is usually right; the times it is wrong are the times that matter, and skipping review is how a hallucinated finding ends up in a lender draw package. The second is to demand human review on every deterministic flag, which fatigues the reviewer and reduces attention on the cases that need it. Deterministic flags with a rule reference do not require prose review; they require a yes or no.

What changes in 2026 with longer-context models

The long-context models available in 2026 (Claude Opus at one million tokens of context being the relevant example) change one thing about cost review: the entire project history fits in a single prompt. A small residential build with three years of transactions, forty draws, two hundred change orders, and the full chart of accounts is well under a million tokens, which means the model can review the entire project at once.
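Whether a given project actually fits is worth checking before the call rather than assuming. The Python sketch below is a rough estimate only: the four-characters-per-token ratio is a crude heuristic for English text and not a real tokenizer, and the 20% reserve for instructions and output is an arbitrary assumption.

```python
def fits_in_context(documents, context_tokens=1_000_000,
                    chars_per_token=4, reserve=0.2):
    """Roughly check that the combined documents fit in the context window.

    chars_per_token is a heuristic, not a tokenizer; reserve holds back a
    fraction of the window for the instructions and the model's output.
    Returns (fits, estimated_tokens).
    """
    estimated = sum(len(doc) for doc in documents) / chars_per_token
    budget = context_tokens * (1 - reserve)
    return estimated <= budget, int(estimated)


# Synthetic stand-ins for a transaction export and a change-order log.
documents = ["x" * 400_000, "y" * 800_000]
fits, estimated_tokens = fits_in_context(documents)
```

For a real deployment, count tokens with the provider's own tooling instead of this estimate.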

That capability surfaces patterns that a draw-by-draw review cannot. The model sees that a particular vendor billed an unusual mix of cost codes across the project history, that a particular cost code ran consistently above benchmark across multiple draws, or that a change order from month four was never reflected in the schedule of values used for month nine’s draw. The findings are still flags for human review, but the lookback is wider than any human reviewer would do under normal time pressure.

A second capability that follows from the larger context is cross-project review. Given the history of five closed projects from the same builder, the model identifies the cost codes that consistently come in over budget, the vendors that consistently bill near the high end of the scope, and the change-order patterns that repeat across projects. That output is a useful input to estimating on the next bid.
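The "consistently over budget" part of that review is countable, so it can be computed deterministically and handed to the model as context. A minimal Python sketch, with a hypothetical data shape and an arbitrary 80% threshold for "consistently":

```python
from collections import defaultdict


def consistently_over(projects, min_share=0.8):
    """Return cost codes over budget in at least min_share of projects using them.

    projects: list of dicts mapping cost code -> (budget, actual).
    This shape and the default threshold are assumptions for illustration.
    """
    over, seen = defaultdict(int), defaultdict(int)
    for project in projects:
        for code, (budget, actual) in project.items():
            seen[code] += 1
            if actual > budget:
                over[code] += 1
    return sorted(code for code in seen if over[code] / seen[code] >= min_share)


# Made-up figures for two closed projects.
projects = [
    {"framing": (100, 120), "paint": (50, 45)},
    {"framing": (100, 110), "paint": (50, 55)},
]
repeat_offenders = consistently_over(projects)
```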

Worked example: 926 Stratford

926 Stratford is a 1,784 SF spec build in Sweetwater, Tennessee, at $430,250. The deterministic engine flags two items on draw four. The first is an HVAC line for a 5-ton system, which is outside the 2.4–3.6 ton band for a 1,784 SF single-story TN build. The second is a duplicate cost-code allocation between drywall finish and paint prep on the same accent walls.
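The HVAC flag is a plain band check. The Python sketch below uses the 2.4 to 3.6 ton band quoted above. The function name, the line ID, and the rule reference are hypothetical, and how the band itself is derived from square footage is outside the sketch.

```python
def check_hvac_tonnage(line_id, billed_tons, band=(2.4, 3.6)):
    """Return a flag dict if the billed tonnage is outside the band, else None.

    The default band is the one quoted for 926 Stratford; the rule ID is made up.
    """
    low, high = band
    if low <= billed_tons <= high:
        return None
    return {
        "line_id": line_id,
        "rule": "HVAC-TONNAGE-BAND",  # hypothetical rule reference
        "billed": billed_tons,
        "expected": band,
    }


flag = check_hvac_tonnage("draw4-hvac", 5.0)       # the line as billed
corrected = check_hvac_tonnage("draw4-hvac", 3.0)  # a 3-ton system passes
```

The flag carries the rule reference and the expected value, which is what the reviewer needs for a yes-or-no decision.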

The LLM cross-check, running overnight, produces three additional findings. The first observes that the duct schedule on the approved plans calls for a 14-inch supply trunk consistent with a 3-ton system, which corroborates the deterministic flag on the HVAC line. The second notes that the framing line includes an entry for a bonus room above the garage that is not on the approved plans. The third flags that the trim carpentry scope language overlaps with the framer’s window-set labor in a way that suggests double billing.

The reviewer works through the queue. The HVAC line is corrected to 3 tons (a vendor typo). The cost-code overlap is consolidated into a single line under drywall finish. The bonus-room framing line is a real change order from two weeks ago that was never logged in the change-order system; the reviewer logs it retroactively and approves the line. The trim-carpentry overlap is rejected and sent back to the trim sub with a note. The full pass takes the reviewer roughly forty minutes, against an estimated four hours for the same work without the AI assist.

How BuilderGrid implements this

The deterministic engine runs on every transaction post and every draw line. The LLM cross-check runs nightly across the full project and writes its findings to the same validation queue. Every finding from the LLM cites a specific line item ID. The human reviewer works one queue with both kinds of findings, takes a disposition, and signs off. The dispositions feed back into the deterministic rule store and the LLM’s prompt context for the next run. The long-context model reviews the entire project history when an anomaly cluster is detected, not on every nightly pass. The system is auditable end to end, and no flag becomes a resolution without a human signature.