Where do the benchmark ranges come from?

A combination of NAHB cost-per-square-foot data, RSMeans construction cost data, and BuilderGrid customer benchmarks aggregated by region and project size. The benchmark ranges are surfaced inside the validation engine and refined as more projects flow through.

How is this different from a budget-versus-actual variance check?

Budget-versus-actual flags drift between what was planned and what was billed. Quantity anomaly detection flags budgets that look wrong before any billing happens, by comparing the planned quantities to historical norms.

Does the LLM cross-check replace the rule engine?

No. The rule engine catches the deterministic patterns reliably and cheaply. The LLM catches narrative anomalies that no rule was written for yet, including patterns that emerge across line items rather than within a single line. Both layers stay on: the rules fire as transactions post, and the LLM review runs nightly.

Quantity anomaly detection in construction budgets

A quantity anomaly is a billed quantity that falls outside the historical range for that trade and project size. It is the kind of error that survives a price review because the price per unit looks normal, the vendor is legitimate, and the math on the invoice is correct. The only thing wrong is that the work could not possibly have required that much material. These are the errors that cause draw rejections, tax-time surprises, and the uncomfortable end-of-project conversation about where the money went.

Why quantity errors are different from price errors

Price variance is normal. A 2x4 stud might cost $3.10 today and $3.85 next month, and a 20% swing on a single line item is inside the noise of a year of lumber prices. Reviewing every price line on a budget produces a long list of small variances, almost all of which are fine, and the eye glazes over by line forty.

Quantity is different. The number of drywall sheets a 1,784 SF house needs is bounded by physics. Whether the sheets cost $14 or $18 is a market question; whether the project required 180 sheets or 320 sheets is an arithmetic question with one correct answer. A quantity billed at nearly twice what the house needs should not be possible. When it appears on a draw, the cause is almost always a clerical error, a measurement error, or a duplicated line that nobody caught. Catching these before the lender sees them is the entire game.

Where the benchmark numbers come from

Anomaly detection requires a benchmark, and there are three reasonable sources for residential construction:

NAHB cost-per-SF data. Annual surveys of single-family construction costs broken down by trade and region. Useful for the high-level sanity check on whether a trade’s share of the budget looks normal.
RSMeans. Detailed unit-cost data with quantity factors for typical residential assemblies. The reference most cost estimators already trust, and the source for the typical drywall, concrete, and framing ranges that anchor the rules below.
In-house historical data. The most accurate benchmark for any builder with three or more closed projects. The range of drywall sheets billed on the last six 1,500–2,000 SF builds is tighter than any published source can produce, because it reflects the builder’s specific framing methodology, ceiling heights, and finish standards.

New builders rely on the first two. Established builders rely mostly on the third, with the published sources as a backstop. The detection engine should support all three.

Three pattern categories

Anomalies show up at three different scopes, and the detection logic is different for each.

Within-project anomalies

A quantity that is outside the typical range for that trade on this specific project. The framing budget on a 1,784 SF house is set during estimating; if the framing sub bills 30% more board feet than estimated without a documented change order, the line is anomalous against the project’s own plan. This is the cheapest check to run and the one that catches the most errors in practice.
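
As a rough illustration, the within-project check reduces to comparing billed quantity against the estimate plus any documented change orders. The sketch below is hypothetical: the `BudgetLine` fields, the function names, and the 10% default tolerance are assumptions for illustration, not BuilderGrid's actual schema or settings.

```python
from dataclasses import dataclass


@dataclass
class BudgetLine:
    cost_code: str
    estimated_qty: float        # quantity set during estimating
    approved_change_qty: float  # quantity added by documented change orders
    billed_qty: float           # cumulative quantity billed to date


def within_project_overage(line: BudgetLine) -> float:
    """Fraction by which billing exceeds the estimate plus approved changes."""
    allowed = line.estimated_qty + line.approved_change_qty
    if allowed <= 0:
        raise ValueError("line has no planned quantity to compare against")
    return (line.billed_qty - allowed) / allowed


def is_within_project_anomaly(line: BudgetLine, tolerance: float = 0.10) -> bool:
    # The 10% default is an illustrative assumption, not a published figure.
    return within_project_overage(line) > tolerance


# Framing billed 30% over the estimate with no change order on file.
framing = BudgetLine("framing", estimated_qty=10_000, approved_change_qty=0, billed_qty=13_000)
print(is_within_project_anomaly(framing))
```

The same 13,000 board feet would pass if a change order for the extra 3,000 were on file, which is why the check needs the change-order quantity and not just the original estimate.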

Cross-project anomalies

A quantity that is outside the typical range relative to the builder’s own portfolio. The cross-project check compares the billed quantity to the same trade on similar projects (same SF range, same general layout, same regional conditions). A drywall bill that looks reasonable in isolation may be 25% above the builder’s average for that house size, and the cross-project check is the only one that surfaces it.
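
One way to sketch the cross-project comparison is to normalize each closed job to a quantity per square foot and measure how far the current bill sits from the builder's own average. The function name, the SF band default, and the data shape below are assumptions; the three-project minimum echoes the threshold mentioned for in-house data earlier in the article.

```python
from statistics import mean


def cross_project_deviation(billed_qty, project_sf, history, sf_band=(1500, 2000)):
    """Return the fractional deviation from the builder's average rate.

    history: list of (project_sf, billed_qty) pairs for the same trade
    on closed jobs. Returns None when fewer than three comparable jobs exist.
    """
    lo, hi = sf_band
    rates = [qty / sf for sf, qty in history if lo <= sf <= hi]
    if len(rates) < 3:
        return None  # not enough history to form a baseline
    baseline = mean(rates)
    return (billed_qty / project_sf - baseline) / baseline


# Hypothetical drywall history: three closed builds at 0.11 sheets per SF.
drywall_history = [(1600, 176), (1800, 198), (2000, 220)]
deviation = cross_project_deviation(245, 1784, drywall_history)
print(f"{deviation:.0%} above the builder's average")
```

Layout and regional conditions are left out of this sketch; a fuller version would filter the history on those as well before averaging.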

Cross-builder anomalies

A quantity that is outside industry norms for that trade and project size. This is the most useful category for new builders without their own history, and the one that benefits most from a shared dataset. A first-time builder running a 1,784 SF spec build does not know that 320 drywall sheets is anomalous; the cross-builder check tells them, with a reference to the published or aggregated range.

Trade-specific ranges

These are the ranges we use as a starting point for residential builds between 1,400 and 2,400 SF. Every builder should adjust to their own history, but these are the order-of-magnitude figures the rule engine ships with.

Trade	Unit	Typical range per 1,000 SF	Flag if outside
Drywall	4×8 sheets	90–130	<75 or >150
Concrete (slab + footing)	cubic yards	12–16	<10 or >20
Roofing (6/12 gable)	squares	11–14	<9 or >17
Electrical wire	linear feet	1,500–2,000	<1,200 or >2,400
HVAC cooling	tons	1.6–2.0	<1.3 or >2.5
Insulation (R-13 walls)	square feet	950–1,150	<850 or >1,250

Translated to a 1,784 SF house, the typical ranges scale to roughly 160 to 230 drywall sheets, 21 to 29 yards of concrete, 20 to 25 roofing squares on a simple gable, and about 3 tons of cooling (2.9 to 3.6). A draw that bills 320 drywall sheets, 35 yards of concrete, or 5 tons of cooling is well outside the typical range and deserves a second look before anyone signs the check.
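
The translation from the per-1,000 SF table to a specific house is plain multiplication. The sketch below copies the table values and scales them linearly; the dictionary keys and function names are made up for illustration, and linear scaling is itself a simplifying assumption.

```python
# Typical and flag ranges per 1,000 SF, copied from the table above.
RANGES_PER_1000_SF = {
    "drywall_sheets": {"typical": (90, 130), "flag": (75, 150)},
    "concrete_cy": {"typical": (12, 16), "flag": (10, 20)},
    "roofing_squares": {"typical": (11, 14), "flag": (9, 17)},
    "wire_lf": {"typical": (1500, 2000), "flag": (1200, 2400)},
    "hvac_tons": {"typical": (1.6, 2.0), "flag": (1.3, 2.5)},
    "insulation_sf": {"typical": (950, 1150), "flag": (850, 1250)},
}


def scaled_range(trade, project_sf, kind="typical"):
    """Scale a per-1,000 SF range to the project's square footage."""
    lo, hi = RANGES_PER_1000_SF[trade][kind]
    factor = project_sf / 1000
    return lo * factor, hi * factor


def classify(trade, project_sf, qty):
    """Return 'typical', 'outside typical', or 'flag' for a quantity."""
    t_lo, t_hi = scaled_range(trade, project_sf, "typical")
    f_lo, f_hi = scaled_range(trade, project_sf, "flag")
    if qty < f_lo or qty > f_hi:
        return "flag"
    if qty < t_lo or qty > t_hi:
        return "outside typical"
    return "typical"


for trade, qty in [("drywall_sheets", 320), ("concrete_cy", 35), ("hvac_tons", 5)]:
    print(trade, qty, classify(trade, 1784, qty))
```

Under the table's thresholds, 320 sheets and 5 tons both cross the flag limit at 1,784 SF, while 35 yards of concrete lands above the typical range but just under the scaled flag limit of about 35.7 yards.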

The rule-engine layer

Deterministic rules with project-specific tolerances are the first line of defence. A rule says: for project type X, between square footage Y and Z, expect quantity Q in unit U, with tolerance T%. The rule fires on every transaction post and every draw line. The output is a list of flagged items with the rule reference and the expected range.
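
A rule of that shape could be modelled roughly as follows. This is a sketch under stated assumptions: the `Rule` fields, the `evaluate` function, and the rule ID are hypothetical, and the example values (Q = 3 tons, T = 20%) are simply one combination that reproduces the 2.4–3.6 ton band quoted in the worked example later in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    project_type: str     # X
    cost_code: str
    sf_min: int           # Y
    sf_max: int           # Z
    expected_qty: float   # Q
    unit: str             # U
    tolerance_pct: float  # T

    def applies_to(self, project_type, cost_code, project_sf):
        return (
            self.project_type == project_type
            and self.cost_code == cost_code
            and self.sf_min <= project_sf <= self.sf_max
        )

    def expected_range(self):
        delta = self.expected_qty * self.tolerance_pct / 100
        return self.expected_qty - delta, self.expected_qty + delta


def evaluate(rules, project_type, project_sf, lines):
    """lines: iterable of (cost_code, qty). Returns one flag per violated rule."""
    flags = []
    for cost_code, qty in lines:
        for rule in rules:
            if not rule.applies_to(project_type, cost_code, project_sf):
                continue
            lo, hi = rule.expected_range()
            if not lo <= qty <= hi:
                flags.append({
                    "cost_code": cost_code,
                    "qty": qty,
                    "unit": rule.unit,
                    "rule_id": rule.rule_id,
                    "expected": (lo, hi),
                })
    return flags


rules = [Rule("HVAC-01", "single_family", "hvac_cooling", 1400, 2400, 3.0, "tons", 20.0)]
for flag in evaluate(rules, "single_family", 1784, [("hvac_cooling", 5.0)]):
    print(flag)
```

Each flag carries the rule reference and the expected range, which is what the reviewer needs to judge the line without opening the rule configuration.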

Rules are good at exactly one thing, which is enforcing a range a human wrote down. They miss anything the human did not anticipate, and they produce false positives when the project has a legitimate reason to fall outside the band (a vaulted ceiling on the great room pushes drywall sheets up by 15%). The rule engine has to be configurable per project and per cost code, or the false-positive rate makes the office stop looking at the alerts.
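
A per-project adjustment might look something like the sketch below, which shifts a band by documented allowances such as the vaulted ceiling. The function name and the choice to shift both ends of the band (rather than only widening the top) are assumptions made for illustration.

```python
def adjusted_band(base_band, adjustments):
    """Shift a (low, high) band by documented per-project allowances.

    adjustments: list of (reason, pct) tuples, e.g. ("vaulted great room", 15).
    Allowances compound multiplicatively when more than one applies.
    """
    lo, hi = base_band
    factor = 1.0
    for _reason, pct in adjustments:
        factor *= 1 + pct / 100
    return lo * factor, hi * factor


# Drywall flag band for a 1,784 SF house, scaled from 75-150 sheets per 1,000 SF.
base = (75 * 1.784, 150 * 1.784)
lo, hi = adjusted_band(base, [("vaulted great room", 15)])
print(f"base upper limit {base[1]:.0f}, adjusted upper limit {hi:.0f}")
```

With the allowance recorded, a 290-sheet drywall bill that would trip the base band passes the adjusted one, and the reason string stays attached to the project for the next reviewer.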

The LLM cross-check layer

A Claude-style review reads the budget the way a senior estimator would. It compares line items to each other, checks that the trade scopes line up with the project description, and flags narrative anomalies the rule engine cannot encode. The LLM catches things like a framing bill that is consistent with the floor area but inconsistent with the porch and garage scope on the plans, or an electrical bill that is consistent with the conditioned space but missing the panel upgrade implied by the appliance schedule.

The LLM layer runs nightly against the full budget and outputs a short narrative note for each finding. The project manager reviews the output the next morning, and each accepted finding generates either a request for clarification to the trade or a new rule for the deterministic engine.
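
The nightly job can be pictured as two steps: render the budget into a prompt, then split the model's reply into one note per finding. The sketch below is deliberately vendor-neutral. The prompt wording, the function names, and the `ask_model` callable are all hypothetical, and no particular model SDK is assumed.

```python
from typing import Callable, Iterable


def build_review_prompt(project_description: str,
                        lines: Iterable[tuple[str, float, str]]) -> str:
    """Render the budget lines as plain text for a model to review."""
    rows = "\n".join(f"- {code}: {qty:g} {unit}" for code, qty, unit in lines)
    return (
        "You are reviewing a residential construction budget.\n"
        f"Project: {project_description}\n"
        "Budget lines:\n"
        f"{rows}\n"
        "List each quantity that looks inconsistent with the project scope, "
        "one finding per line, or reply NONE."
    )


def nightly_review(project_description, lines,
                   ask_model: Callable[[str], str]) -> list[str]:
    """ask_model is injected by the caller; it takes a prompt and returns text."""
    reply = ask_model(build_review_prompt(project_description, lines)).strip()
    if not reply or reply.upper() == "NONE":
        return []
    # One narrative note per non-empty line, with any leading bullet removed.
    return [ln.lstrip("- ").strip() for ln in reply.splitlines() if ln.strip()]
```

Passing the model call in as a function keeps the job testable with a stub and makes it straightforward to swap providers without touching the parsing.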

Why both layers matter

The rules catch the obvious; the LLM catches the patterns nobody coded a rule for yet. Running only the rules misses every novel error pattern because every novel pattern is one nobody has written a rule for. Running only the LLM is expensive and slow, and it produces a small number of well-thought-out findings instead of the immediate flag that prevents an anomalous line from posting in the first place. The combination gives you immediate feedback at transaction entry and a slower, deeper review running in the background.

Human review and the feedback loop

Flagged items end up on a validation queue reviewed by the project accountant or office manager. The reviewer has three actions: approve (the line is correct, the project has a legitimate reason to fall outside the band), edit (the line is wrong, the quantity gets corrected and the transaction reposted), or reject (the line is fraudulent or misallocated and gets sent back to the vendor).

The disposition feeds back into the rule engine. An approved item with a written reason updates the project’s tolerance band for that cost code. An edited item is logged as a near-miss for trend tracking. A rejected item triggers a vendor-level note that informs future bills from that sub. Over six months, the rule engine becomes a much more accurate model of the builder’s specific operation than any published benchmark.
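
The three dispositions map naturally onto three different writes. The sketch below is an assumption-heavy illustration: the `RuleStore` shape, the flag dictionary keys, and the policy of widening a band just far enough to include an approved quantity are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RuleStore:
    bands: dict = field(default_factory=dict)         # (project_id, cost_code) -> (lo, hi)
    near_misses: list = field(default_factory=list)   # edited lines, kept for trend tracking
    vendor_notes: dict = field(default_factory=dict)  # vendor -> list of notes


def apply_disposition(store, flag, action, reason=""):
    """Record a reviewer's decision on a flagged line."""
    key = (flag["project_id"], flag["cost_code"])
    if action == "approve":
        if not reason:
            raise ValueError("an approval needs a written reason")
        lo, hi = store.bands[key]
        # Assumed policy: widen the band just enough to include the approved quantity.
        store.bands[key] = (min(lo, flag["qty"]), max(hi, flag["qty"]))
    elif action == "edit":
        store.near_misses.append(flag)
    elif action == "reject":
        store.vendor_notes.setdefault(flag["vendor"], []).append(reason or "rejected line")
    else:
        raise ValueError(f"unknown action: {action}")
```

Requiring a reason on approval is the piece that keeps the bands honest: a tolerance only moves when someone has written down why the project legitimately falls outside it.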

Worked example: 926 Stratford

926 Stratford is a 1,784 SF single-story spec build in Sweetwater, Tennessee, contracted at $430,250. The HVAC sub bills draw four with a line for a 5-ton cooling system. The price per ton is in the normal range and the vendor is the regular HVAC partner. The rule engine flags the line because 5 tons is outside the 2.4–3.6 ton band for a single-story TN build at that square footage.

The validation queue surfaces the flag with the expected range and the rule reference. The PM looks at the plans and sees the system specified is a 3-ton heat pump, not a 5-ton. The vendor invoice was wrong: equipment had been ordered for two separate jobs, and the larger 5-ton unit got billed to the wrong project. A correction memo goes out the same day, the line is edited to 3 tons, and the project saves roughly $3,400 that would otherwise have been buried in a draw the lender approved.

The follow-up: the LLM cross-check, running overnight, also flagged the line, with the additional observation that the duct schedule on the approved plans called for a 14-inch supply trunk consistent with a 3-ton system. Two independent layers caught the same error from different angles, which is how the system is meant to work.

How BuilderGrid wires this in

The validation page surfaces flagged items as they post. The AI cross-check runs nightly across the full budget and writes its findings to the same queue. The rule engine is configurable in admin: every cost code has a typical-range record by project size and project type, and the office manager can update tolerances based on the builder’s history. Approvals, edits, and rejections all feed back into the rule store, and the longer the system runs the more accurate the bands become for that specific builder.