Data Analysis Tools

The worker containers come pre-installed with a comprehensive set of tools for navigating and analyzing civil engineering project data.

Python Libraries

A Python 3 virtual environment (/opt/venv) is available with the following packages:

IFC Models

Package       Purpose
IfcOpenShell  Read and query IFC models (supports IFC2X3 and IFC4 schemas)

Example agent usage:

python
import ifcopenshell
import ifcopenshell.util.element  # the util submodule must be imported explicitly

model = ifcopenshell.open("/data/PID_Karavanke/.../model.ifc")
elements = model.by_type("IfcBuildingElementProxy")

for el in elements:
    # get_psets returns {pset_name: {property_name: value}}
    psets = ifcopenshell.util.element.get_psets(el)
    klasifikacija = psets.get("KAR_Klasifikacija", {})
    print(klasifikacija.get("ElementTip"), klasifikacija.get("Funkcija"))

IFC Schema Versions

The Karavanke dataset contains two IFC schemas: IFC2X3 (209 files) and IFC4 (64 files). IfcOpenShell handles both transparently.
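
When an agent only needs to know which schema a file declares, loading the whole model is unnecessary: IFC files in STEP format state the schema in a `FILE_SCHEMA` header line. The following is a minimal sketch using only the standard library. The helper names `ifc_schema` and `count_schemas` and the 64 KB read limit are illustrative assumptions, not part of the installed toolchain.

```python
import re
from collections import Counter
from pathlib import Path

# Matches header lines such as FILE_SCHEMA(('IFC2X3')); or FILE_SCHEMA (('IFC4'));
_SCHEMA_RE = re.compile(r"FILE_SCHEMA\s*\(\s*\(\s*'([^']+)'", re.IGNORECASE)


def ifc_schema(path, max_bytes=65536):
    """Return the schema name from the STEP header, or None if none is found."""
    with open(path, "rb") as fh:
        # Only the start of the file is read; the header precedes the data section.
        head = fh.read(max_bytes).decode("latin-1")
    match = _SCHEMA_RE.search(head)
    return match.group(1).upper() if match else None


def count_schemas(root):
    """Tally schema names for every *.ifc file below root (lowercase extension only)."""
    return Counter(ifc_schema(p) for p in Path(root).rglob("*.ifc"))
```

Once a model has been opened with IfcOpenShell, the same value is available as `model.schema`.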

Excel Tables

Package   Purpose
openpyxl  Read and write Excel files (.xlsx)
pandas    Tabular data analysis, filtering, aggregation

Example agent usage:

python
import pandas as pd

df = pd.read_excel("/data/.../ListaKampad_Elea.xlsx", sheet_name="Blockbuch")
kpp = df[df["Tip kampade"] == "KPP"]
print(f"Found {len(kpp)} KPP kampadas")

PDF Documents

Package         Purpose
pdfplumber      Extract text and tables from PDF files
PyMuPDF (fitz)  Fast PDF rendering, text extraction, and OCR support
pytesseract     Python wrapper for the Tesseract OCR engine (scanned documents)

The system includes Tesseract OCR for handling scanned PDFs that don't contain selectable text.

Example agent usage:

python
import pdfplumber

with pdfplumber.open("/data/.../technical_report.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()      # text layer; empty or None for scanned pages
        tables = page.extract_tables()  # list of tables, each a list of rows
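
For scanned PDFs without selectable text, one possible pattern is to try the text layer first and fall back to OCR only when a page yields almost nothing. The sketch below assumes pdfplumber page objects and pytesseract as listed above. The helper names `needs_ocr` and `page_text`, the 20-character threshold, and the 300 dpi resolution are illustrative choices, not settings taken from the system.

```python
def needs_ocr(text, min_chars=20):
    """Treat a page as scanned when it yields almost no selectable text."""
    return text is None or len(text.strip()) < min_chars


def page_text(page, min_chars=20, resolution=300):
    """Return the page's text layer, falling back to Tesseract OCR if it is empty."""
    text = page.extract_text()
    if not needs_ocr(text, min_chars):
        return text
    # Imported lazily so that text-only PDFs never touch the OCR stack.
    import pytesseract
    image = page.to_image(resolution=resolution).original  # PIL image of the page
    return pytesseract.image_to_string(image)


# Usage (path elided as in the example above):
# import pdfplumber
# with pdfplumber.open("/data/.../technical_report.pdf") as pdf:
#     texts = [page_text(p) for p in pdf.pages]
```

`pytesseract.image_to_string` also accepts a `lang` argument; whether language data beyond the default is installed on the workers is not stated here, so it is left unset.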

System Tools

The following command-line tools are available for fast data navigation:

Tool          Purpose                                Example
ripgrep (rg)  Fast text search across files          rg "kampada A271" /data/ --type pdf
jq            JSON processing and filtering          jq '.elements[] | .name' output.json
sqlite3       SQLite database queries                sqlite3 /artefacts/state/session_registry.db '.tables'
git           Version control (for agent workspace)  git log --oneline
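
The same table listing can be obtained from Python with the standard-library `sqlite3` module, which is convenient when an agent is already working inside a script. This is a sketch only: the helper name `list_tables` is hypothetical, and the database is opened read-only so that inspecting it cannot modify it.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return user table names, roughly what the sqlite3 shell's .tables shows."""
    # mode=ro opens the file read-only and fails if it does not exist.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


# Usage: list_tables("/artefacts/state/session_registry.db")
```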

ripgrep for Fast File Discovery

ripgrep is particularly useful for quickly finding relevant files across the 1,672-file dataset:

bash
# Find all files mentioning a specific kampada
rg -l "A271" /data/

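# Search for a term in PDF-extracted text (.txt files only)
# --type-add only defines a file type; --type applies it as a filter
rg -l "stropna plosca" /data/ --type-add 'txt:*.txt' --type txt

# Count matching lines per file, restricted to .xlsx files
# Note: .xlsx files are ZIP archives, so raw matches are unreliable;
# use pandas or openpyxl to search cell contents
rg -c "KPP" /data/ --type-add 'xlsx:*.xlsx' --type xlsx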

Tool Availability by Container

Tool                     Controller  Worker
Node.js 22               Yes         Yes
Python 3 + venv          No          Yes
IfcOpenShell             No          Yes
pandas, openpyxl         No          Yes
pdfplumber, PyMuPDF      No          Yes
pytesseract + Tesseract  No          Yes
ripgrep, jq, sqlite3     Yes         Yes
git                      Yes         Yes

INFO

All AI query execution happens on workers, which have the full toolchain. The controller only serves the web UI and routes requests.
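
To confirm which of these tools are present in the current container, a script can probe for importable modules and for commands on `PATH`. The sketch below uses only the standard library. The function name `probe` is hypothetical, and the executable names `node` and `tesseract` are assumed from the tools' usual command names rather than stated on this page.

```python
import importlib.util
import shutil

# Import names for the Python packages listed above (PyMuPDF imports as "fitz").
PYTHON_MODULES = ["ifcopenshell", "pandas", "openpyxl", "pdfplumber", "fitz", "pytesseract"]
# Assumed executable names for the command-line tools.
CLI_TOOLS = ["node", "rg", "jq", "sqlite3", "git", "tesseract"]


def probe(modules=PYTHON_MODULES, commands=CLI_TOOLS):
    """Report which modules are importable and which commands are on PATH."""
    return {
        "modules": {m: importlib.util.find_spec(m) is not None for m in modules},
        "commands": {c: shutil.which(c) is not None for c in commands},
    }


if __name__ == "__main__":
    for kind, items in probe().items():
        for name, ok in items.items():
            print(f"{kind:9} {name:14} {'ok' if ok else 'missing'}")
```

On a worker, every entry would be expected to report `ok` when the script runs under `/opt/venv`; on the controller, the Python modules would not be available at all, since it has no Python environment.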