AEC-Bench is the first open-source benchmark designed to test whether AI agents can actually work with construction documents. Not summarise them. Not chat about them. Work with them: navigate drawing sets, trace cross-references, verify submittals against specifications, catch coordination failures across hundreds of pages.
The benchmark was published in March 2026, and it is freely available for any company to run. 196 real task instances, nine task families, three complexity levels, scored against ground truth defects placed by domain experts. It tests the work your team does every day.
We ran it. As of today, our Document Agents hold the highest score in every complexity category on AEC-Bench, and on several task types the gap is not close. Here are the full results.
Why we ran it
If you have sat through a demo of an AI tool for construction and thought "that looks great, but does it actually work on a real drawing set?" then you understand why AEC-Bench matters.
The AEC industry does not have a shortage of AI claims. What it has is a shortage of evidence. Every vendor can cherry-pick a result, define their own evaluation criteria, and report whatever makes the pitch look good. AEC-Bench removes that option. The tasks are defined. The ground truth is fixed. The scoring is automated. You either get it right or you do not.
We ran it because we wanted to know, with certainty, where our Document Agents stand.
What AEC-Bench tests
The benchmark covers nine task types across three complexity scopes: Intra-Sheet, Intra-Drawing, and Intra-Project.
| Scope | Task Type | Tasks | What It Tests |
|---|---|---|---|
| Intra-Sheet | Detail Technical Review | 14 | Localised technical questions about a single detail view |
| Intra-Sheet | Detail Title Accuracy | 15 | Whether detail titles match what is actually drawn |
| Intra-Sheet | Note Callout Accuracy | 14 | Whether callout text corresponds to referenced elements |
| Intra-Drawing | Cross-Reference Resolution | 51 | Finding cross-references that point to non-existent targets |
| Intra-Drawing | Cross-Reference Tracing | 24 | Tracing all locations that reference a given detail |
| Intra-Drawing | Sheet Index Consistency | 14 | Checking sheet index entries match actual sheets |
| Intra-Project | Drawing Navigation | 12 | Locating the correct file, sheet, and detail for a query |
| Intra-Project | Spec-Drawing Sync | 16 | Finding conflicts between specifications and drawings |
| Intra-Project | Submittal Review | 36 | Evaluating submittals for spec and drawing compliance |
If you have ever spent a Thursday evening tracing a detail reference across a 300-page drawing set, you will recognise these tasks.
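The task counts in the table roll up cleanly into the three scopes and the 196-task total. Here is a minimal sketch of that arithmetic; the grouping simply mirrors the table above, and the variable names are ours, not the benchmark's:

```python
# Task counts per family, grouped by scope, as listed in the table above.
AEC_BENCH_TASKS = {
    "Intra-Sheet": {
        "Detail Technical Review": 14,
        "Detail Title Accuracy": 15,
        "Note Callout Accuracy": 14,
    },
    "Intra-Drawing": {
        "Cross-Reference Resolution": 51,
        "Cross-Reference Tracing": 24,
        "Sheet Index Consistency": 14,
    },
    "Intra-Project": {
        "Drawing Navigation": 12,
        "Spec-Drawing Sync": 16,
        "Submittal Review": 36,
    },
}

# Per-scope subtotals and the overall task count.
for scope, families in AEC_BENCH_TASKS.items():
    print(f"{scope}: {sum(families.values())} tasks")

total = sum(sum(f.values()) for f in AEC_BENCH_TASKS.values())
print(f"Total: {total} tasks")  # 43 + 89 + 64 = 196
```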
Our results
Overall: 84.7% across all 196 tasks
Best configuration shown per agent family. Nomic Agent published aggregate scope scores only (no per-task breakdown). "Next Best" is the highest score from any configuration in the paper. Results as of 12 April 2026.
By complexity scope
Best configuration shown per agent family. Results as of 12 April 2026.
The widest gap is on Intra-Project tasks: cross-document coordination, submittal review, spec-drawing sync. These require holding multiple documents in view simultaneously (a drawing set, a specification, a submittal package) and catching where they diverge. That is the work that takes a coordinator days, and where errors carry the most consequence.
Submittal review: the standout result
With 36 task instances, submittal review is the largest task family in the benchmark and the most commercially relevant. Evaluating submitted product data against project specifications and drawings is high-stakes coordination work: a false positive wastes your team's time, and a false negative creates real risk.
The highest score in the AEC-Bench paper across all models and configurations was 23.1%. Our Document Agents scored 75.0%.
The paper explains why this task is so difficult: agents "tend to over-generate findings, leading to a high rate of false positives." The goal is not coverage. The goal is accuracy you can trust.
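AEC-Bench's exact scoring formula is not reproduced in this post, but the failure mode the paper describes is easy to illustrate. The sketch below assumes a simple precision/recall comparison of an agent's reported findings against expert-placed ground truth; the function and the defect names are illustrative, not the benchmark's code:

```python
def score_findings(reported: set[str], ground_truth: set[str]) -> dict:
    """Compare an agent's reported defects against expert-placed ground truth.

    Illustrative only: AEC-Bench's actual scoring may differ.
    """
    true_positives = reported & ground_truth
    precision = len(true_positives) / len(reported) if reported else 0.0
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    return {"precision": precision, "recall": recall}


ground_truth = {"anchor-bolt-spacing", "fire-rating-mismatch"}

# An agent that over-generates: both real defects found, but buried in noise.
noisy = ground_truth | {f"spurious-{i}" for i in range(8)}
print(score_findings(noisy, ground_truth))    # precision 0.2, recall 1.0

# An agent that reports only what it can verify against the documents.
careful = {"anchor-bolt-spacing", "fire-rating-mismatch"}
print(score_findings(careful, ground_truth))  # precision 1.0, recall 1.0
```

Over-generating findings maximises recall but collapses precision, which is exactly the trap the paper describes: a reviewer still has to chase down every spurious finding by hand.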
What this means if you are evaluating AI for your practice
AEC-Bench gives the industry something it has not had before: a common, reproducible test. If you are comparing AI tools for your firm, you now have sharper questions to ask:
- "What is your AEC-Bench score?" If a vendor has not run it, or will not publish the results, that tells you something.
- "How do you handle cross-document tasks?" Single-page text extraction is largely solved. The hard problems are the ones that require holding a full drawing set, a specification, and a submittal package in view simultaneously.
- "Can you show me the agent's reasoning?" A number on a benchmark is a starting point. What matters is whether you can see how the tool arrived at its answer and verify it yourself.
If you want to see how our Document Agents handle your specific document types, book a demo.
Source: AEC-Bench: A Benchmark for AI Agents in Architecture, Engineering, and Construction (arXiv:2603.29199, March 2026)
Want to see how Structured AI can work for your team?
Book a Demo