
We Ran AEC-Bench. Here Are the Full Results.

News 2026-04-13

AEC-Bench is the first open-source benchmark designed to test whether AI agents can actually work with construction documents. Not summarise them. Not chat about them. Work with them: navigate drawing sets, trace cross-references, verify submittals against specifications, catch coordination failures across hundreds of pages.

The benchmark was published in March 2026 and is freely available for any company to run: 196 real task instances, nine task families, three complexity levels, scored against ground-truth defects placed by domain experts. It tests the work your team does every day.

We ran it. As of today, our Document Agents hold the highest score in every complexity category on AEC-Bench, and on several task types the gap is not close. Here are the full results.

Why we ran it

If you have sat through a demo of an AI tool for construction and thought "that looks great, but does it actually work on a real drawing set?" then you understand why AEC-Bench matters.

The AEC industry does not have a shortage of AI claims. What it has is a shortage of evidence. Every vendor can cherry-pick a result, define their own evaluation criteria, and report whatever makes the pitch look good. AEC-Bench removes that option. The tasks are defined. The ground truth is fixed. The scoring is automated. You either get it right or you do not.
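To make "you either get it right or you do not" concrete, here is a minimal sketch of what scoring against fixed ground truth can look like. The task IDs, answer format, and function name are hypothetical illustrations, not taken from the AEC-Bench paper; the point is only that with a fixed answer key, the score is mechanical.

```python
# Hypothetical sketch: each task has one fixed expected answer, and an
# agent's response either matches it exactly or it does not.
def score_benchmark(responses, ground_truth):
    """Return the fraction of tasks answered exactly as ground truth specifies."""
    correct = sum(
        1 for task_id, expected in ground_truth.items()
        if responses.get(task_id) == expected
    )
    return correct / len(ground_truth)

# Illustrative cross-reference tasks: the second reference is a planted
# defect pointing at a detail that does not exist.
ground_truth = {"xref-001": "A-501/3", "xref-002": "MISSING"}
responses = {"xref-001": "A-501/3", "xref-002": "A-502/1"}
print(score_benchmark(responses, ground_truth))  # 0.5
```

With a fixed key like this, no vendor-chosen rubric can move the number.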

We ran it because we wanted to know, with certainty, where our Document Agents stand.

What AEC-Bench tests

The benchmark covers nine task types across three complexity levels.

Scope           Task Type                    Tasks   What It Tests
Intra-Sheet     Detail Technical Review         14   Localised technical questions about a single detail view
                Detail Title Accuracy           15   Whether detail titles match what is actually drawn
                Note Callout Accuracy           14   Whether callout text corresponds to referenced elements
Intra-Drawing   Cross-Reference Resolution      51   Finding cross-references that point to non-existent targets
                Cross-Reference Tracing         24   Tracing all locations that reference a given detail
                Sheet Index Consistency         14   Checking sheet index entries match actual sheets
Intra-Project   Drawing Navigation              12   Locating the correct file, sheet, and detail for a query
                Spec-Drawing Sync               16   Finding conflicts between specifications and drawings
                Submittal Review                36   Evaluating submittals for spec and drawing compliance

If you have ever spent a Thursday evening tracing a detail reference across a 300-page drawing set, you will recognise these tasks.

Our results

Overall: 84.7% across all 196 tasks

Best configuration shown per agent family. Nomic Agent published aggregate scope scores only (no per-task breakdown). "Next Best" is the highest score from any configuration in the paper. Results as of 12 April 2026.

By complexity scope

Best configuration shown per agent family. Results as of 12 April 2026.

The widest gap is on Intra-Project tasks: cross-document coordination, submittal review, spec-drawing sync. These require holding multiple documents in view at once (a drawing set, a specification, a submittal package) and catching where they diverge. That is the work that takes a coordinator days, and where errors carry the most consequence.

Submittal review: the standout result

With 36 task instances, submittal review is the largest task family in the benchmark and the most commercially relevant. Evaluating submitted product data against project specifications and drawings is high-stakes coordination work: a false positive wastes your team's time; a false negative creates real risk.

The highest score in the AEC-Bench paper across all models and configurations was 23.1%. Our Document Agents scored 75.0%.

The paper explains why this task is so difficult. Agents "tend to over-generate findings, leading to a high rate of false positives." The goal is not coverage. The goal is accuracy you can trust.
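The over-generation problem comes down to precision versus recall. The sketch below is a generic illustration of that trade-off, not the benchmark's actual scoring code, and the finding labels are invented: an agent that reports every plausible issue can catch all the planted defects while burying them in false alarms.

```python
def precision_recall(reported, true_defects):
    """Precision and recall of a set of reported findings against ground truth."""
    reported, true_defects = set(reported), set(true_defects)
    true_positives = len(reported & true_defects)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(true_defects) if true_defects else 0.0
    return precision, recall

# Two defects planted by the benchmark authors (labels invented here).
truth = {"wrong-gauge", "missing-fire-rating"}

# A cautious agent: one correct finding, nothing spurious.
print(precision_recall({"wrong-gauge"}, truth))  # (1.0, 0.5)

# An over-generating agent: both defects found, buried in six false alarms.
noisy = {"wrong-gauge", "missing-fire-rating",
         "fa-1", "fa-2", "fa-3", "fa-4", "fa-5", "fa-6"}
print(precision_recall(noisy, truth))  # (0.25, 1.0)
```

The second agent has perfect coverage, but a reviewer must still check all eight findings by hand, which is exactly the time the tool was supposed to save.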

What this means if you are evaluating AI for your practice

AEC-Bench gives the industry something it has not had before: a common, reproducible test. If you are comparing AI tools for your firm, you now have sharper questions to ask:

  • "What is your AEC-Bench score?" If a vendor has not run it, or will not publish the results, that tells you something.
  • "How do you handle cross-document tasks?" Single-page text extraction is largely solved. The hard problems are the ones that require holding a full drawing set, a specification, and a submittal package in view simultaneously.
  • "Can you show me the agent's reasoning?" A number on a benchmark is a starting point. What matters is whether you can see how the tool arrived at its answer and verify it yourself.

If you want to see how our Document Agents handle your specific document types, book a demo.

Want to see how Structured AI can work for your team?

Book a Demo