AI document parameter extractor
Artificial Intelligence
Client company: Nelissen
Floris van dun
Gracia Mamgani
Joep de Kock
Project description
The main challenge is to develop an AI system capable of accurately extracting structured parameters from complex PDF documents used in building construction. These documents contain detailed technical information on architecture, heating, electricity, and water systems, presented in various formats such as text, tables, and diagrams. The key questions are: How can we design a system that can effectively parse and interpret this unstructured data? What methods can ensure high accuracy and adaptability to variations in document layouts? Additionally, the system must transform the extracted data into a structured format suitable for use in datasets, while minimizing manual intervention.
Context
The project operates within the domain of building construction and design, specifically focusing on the creation and analysis of "bestek" documents. These documents are essential in the construction industry, serving as comprehensive technical blueprints that detail the architectural design, heating and cooling systems, electrical installations, and water management for buildings.
The industry is characterized by diverse stakeholders, including architects, engineers, contractors, and regulators, who rely on these documents for project execution and compliance. A "bestek" typically combines unstructured and semi-structured data formats, such as textual descriptions, tables, diagrams, and annotations. This complexity poses significant challenges for data extraction and standardization, especially when manual interpretation is labor-intensive and prone to error.
Results
The pipeline begins with the input of a zip file containing all project documents, such as architectural plans, specifications, and schematics. These files are unpacked, and relevant documents are automatically identified using classification techniques based on filenames.
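The unpacking and filename-based selection step could look roughly like the sketch below. The category names and keyword patterns in `CATEGORY_PATTERNS` are hypothetical placeholders, as are the function names; the rules used in the project are not documented here.

```python
import io
import re
import zipfile
from pathlib import PurePosixPath
from typing import Dict, Optional

# Hypothetical keyword rules: the real categories and patterns are assumptions.
CATEGORY_PATTERNS = {
    "specification": re.compile(r"bestek|specificat", re.IGNORECASE),
    "heating": re.compile(r"hvac|verwarming|heating", re.IGNORECASE),
    "electrical": re.compile(r"elektr|electr", re.IGNORECASE),
}


def classify_filename(name: str) -> Optional[str]:
    """Return the first category whose pattern matches a PDF file name, else None."""
    path = PurePosixPath(name)
    if path.suffix.lower() != ".pdf":
        return None
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(path.stem):
            return category
    return None


def relevant_documents(archive) -> Dict[str, str]:
    """Map each relevant member of a zip archive (path or file object) to its category."""
    selected: Dict[str, str] = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            category = classify_filename(info.filename)
            if category is not None:
                selected[info.filename] = category
    return selected
```

Keyword rules are easy to inspect and adjust, but they depend on consistent file naming; files that match no pattern are simply skipped in this sketch.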
Relevant documents are divided into smaller, logical chunks to isolate meaningful sections while preserving context. These chunks are compared to predefined parameters, such as dimensions, insulation values, or electrical specifications. Using semantic similarity techniques powered by embeddings, the system retains only the chunks that closely match the parameters, filtering out irrelevant content.
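As an illustration of the chunking and similarity filter described above, here is a minimal sketch. It assumes paragraph-based chunking and uses a toy word-count vector (`toy_embed`) as a stand-in for a real embedding model; the threshold value and all function names are assumptions rather than the project's actual choices.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def split_into_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Split on blank lines, then merge paragraphs while they fit in max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        if current and len(current) + len(paragraph) + 1 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (word counts); a real system would call an embedding model."""
    return {word: float(n) for word, n in Counter(re.findall(r"[a-z0-9]+", text.lower())).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_chunks(
    chunks: List[str],
    parameters: List[str],
    threshold: float = 0.2,
    embed: Callable[[str], Vector] = toy_embed,
) -> List[str]:
    """Keep chunks whose best similarity to any parameter description meets the threshold."""
    parameter_vectors = [embed(p) for p in parameters]
    kept = []
    for chunk in chunks:
        vector = embed(chunk)
        best = max((cosine(vector, pv) for pv in parameter_vectors), default=0.0)
        if best >= threshold:
            kept.append(chunk)
    return kept
```

Word counts only capture exact word overlap, so they miss synonyms and translations. Swapping `embed` for a function backed by a real embedding model keeps the rest of the filter unchanged.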
The filtered chunks are then processed by a large language model (LLM). The LLM extracts specific parameters from the text, such as thermal resistance values or cabling specifications.
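The extraction step might be wired up as in the sketch below. The prompt wording, the function names, and the convention that the model answers with a JSON object are assumptions; `llm` is any callable that takes a prompt and returns the model's text reply, so no specific provider API is implied.

```python
import json
from typing import Any, Callable, Dict, List


def build_prompt(chunk: str, parameters: List[str]) -> str:
    """Build a hypothetical extraction prompt for one chunk of text."""
    names = ", ".join(parameters)
    return (
        "Extract the following parameters from the text below. "
        "Answer with a single JSON object whose keys are the parameter names; "
        "use null for any parameter that is not stated.\n"
        f"Parameters: {names}\n"
        f"Text:\n{chunk}"
    )


def extract_parameters(
    chunk: str,
    parameters: List[str],
    llm: Callable[[str], str],
) -> Dict[str, Any]:
    """Ask the model for the parameters and return one value (or None) per name."""
    raw = llm(build_prompt(chunk, parameters))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}  # Treat an unparseable reply as "nothing extracted".
    if not isinstance(data, dict):
        data = {}
    # Keep only the requested keys so stray keys from the model are dropped.
    return {name: data.get(name) for name in parameters}


def fake_llm(prompt: str) -> str:
    """Canned reply used in place of a real model for demonstration."""
    return '{"thermal_resistance": "4.5 m2K/W", "unrequested": 1}'


result = extract_parameters(
    "The wall insulation has a thermal resistance of 4.5 m2K/W.",
    ["thermal_resistance", "cable_type"],
    fake_llm,
)
```

Restricting the result to the requested names and falling back to `None` keeps the output shape predictable even when the model replies with extra keys or malformed text.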
Finally, the extracted parameters are output in a structured JSON format, ready for integration into databases or further analysis. The pipeline automates parameter extraction end to end, which reduces the manual effort of reading each document and the transcription errors that come with it.
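A structured output of this kind could be produced as follows. The field names (`name`, `value`, `unit`, `source_file`) and the top-level layout are a hypothetical schema chosen for illustration, not the format the project delivers.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ExtractedParameter:
    """One extracted value; the field names are an assumed schema."""
    name: str
    value: Optional[float]
    unit: Optional[str]
    source_file: str


def to_json(project: str, parameters: List[ExtractedParameter]) -> str:
    """Serialize the extracted parameters of one project to a JSON string."""
    payload = {
        "project": project,
        "parameters": [asdict(p) for p in parameters],
    }
    # ensure_ascii=False keeps non-ASCII characters in names and units readable.
    return json.dumps(payload, indent=2, ensure_ascii=False)
```

Recording the source file next to each value makes it possible to trace a parameter back to the document it came from when a result needs checking.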
About the project group
Our group consists of three students with a background in media and software. We spent the past five months working on this project in collaboration with Nelissen.