Extractor AI

Private markets documents.
Structured data, automatically.

Extractor AI turns quarterly portco reports, GP performance reports, capital account statements, financial documents, and LP letters into structured, validated, analytics-ready data. No re-keying. No spreadsheet bridges. Full audit trail back to the source document.

How it works
99%
Extraction accuracy, human-checked on exceptions
99% less
Time to process a fund’s report pack, from days to under an hour
Any cloud
Run in our cloud, your cloud, or on your own infrastructure
5
Week proof of concept to go-live
Computer Vision AI, Generative AI, search, and programming combined
Parallel LLM validation catches hallucinations before human review
Three tiers: extraction only, managed validation, or full BI
Works with your existing schemas -- outputs to Excel, CSV, or DataHub
The problem

Private markets reporting resists standardisation

Every GP structures their quarterly report differently. Every financial statement uses different table formats, calculation conventions, and terminology. Extraction alone cannot solve this -- data must be structured, reconciled, and calculated consistently before it can be used.

Inconsistent reporting formats
Quarterly reports, financial statements, and investment schedules all differ by manager
Narrative-heavy documents
Most reports contain large volumes of commentary with limited structured data
Mixed fund and asset level data
Cash, bridging loans, and GP carry often appear at fund level but affect portfolio exposure
Different calculation logic
Valuation metrics and multiples vary across managers and need normalisation
Reconciliation requirements
Portfolio data must reconcile to fund NAV before it can feed analytics
The implication for any extraction tool
Extraction alone is not sufficient. Data must be structured, reconciled, and calculated consistently before portfolio analytics can be applied. This is why Extractor AI separates extraction, DataHub calculations, and analytics into three distinct controlled layers -- rather than treating them as a single black box.
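As an illustration of the reconciliation requirement, here is a minimal Python sketch that checks whether portfolio-level values plus fund-level items add up to the reported fund NAV. The function name, field names, tolerance, and figures are hypothetical; this is not Extractor AI's actual schema or DataHub logic.

```python
from decimal import Decimal

def reconcile_to_nav(portfolio_values, fund_level_items, reported_nav,
                     tolerance=Decimal("0.01")):
    """Return (reconciles, difference) for one fund's extracted figures.

    portfolio_values: unrealised value per portfolio company
    fund_level_items: cash, bridging loans, carry, etc. (liabilities negative)
    """
    total = (sum(portfolio_values.values(), Decimal("0"))
             + sum(fund_level_items.values(), Decimal("0")))
    difference = total - reported_nav
    return abs(difference) <= tolerance, difference

# Hypothetical figures in millions, for illustration only.
portfolio = {"Company A": Decimal("120.0"), "Company B": Decimal("80.5")}
fund_items = {
    "cash": Decimal("10.0"),
    "bridging_loan": Decimal("-15.0"),
    "gp_carry": Decimal("-5.5"),
}
ok, diff = reconcile_to_nav(portfolio, fund_items, reported_nav=Decimal("190.0"))
```

A dataset that fails this kind of check would be held back as an exception rather than passed on to analytics.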
Technology approach

Why single-model extraction doesn't work for private markets

Most AI extraction tools rely on a single large language model. That works for simple, consistent documents. It doesn't work for the messy, inconsistent, narrative-heavy documents that private markets produce.

Traditional: manual or semi-automated extraction
Partial solution: single extraction model (OCR, Generative AI LLMs)
Extractor AI: Computer Vision AI + Generative AI + Search + Programming + Human-in-the-loop
Why the combination matters
Computer Vision AI handles document layout, table detection, and structure recognition. Generative AI extracts meaning from context. Search and programming handle normalisation and standardisation. Multiple LLMs run in parallel to validate each other's outputs and flag potential errors before any human sees the data. The result is accuracy that single-model approaches cannot achieve on private markets documents.
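One way to picture parallel validation is a simple agreement check across independent extractors. The Python sketch below is a simplified illustration, not the product's actual method: the function name, the model labels, and the rule "flag anything the models disagree on" are all assumptions.

```python
from collections import Counter

def cross_validate(field, candidates):
    """Compare the values several independent models returned for one field.

    candidates: mapping of (hypothetical) model name -> extracted value.
    Returns the most common value and whether a human should review it.
    """
    if not candidates:
        raise ValueError("at least one candidate value is required")
    counts = Counter(candidates.values())
    value, votes = counts.most_common(1)[0]
    return {
        "field": field,
        "value": value,
        "agreement": votes / len(candidates),
        # Any disagreement between models routes the field to human review.
        "needs_review": len(counts) > 1,
    }

agreed = cross_validate("ebitda", {"model_a": "12.5", "model_b": "12.5", "model_c": "12.5"})
disputed = cross_validate("ebitda", {"model_a": "12.5", "model_b": "125", "model_c": "12.5"})
```

In this sketch the disputed field still carries the majority value, but it is marked for review instead of being accepted silently.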
Processing workflow

Ingestion. Extraction. Validation. Enrichment.

Four stages. Every document moves through the same controlled pipeline.

1
Document ingestion
Email, folder upload, or API
Input formats
  • Scanned PDFs and native PDFs
  • Multi-industry and multi-language documents
  • XLS and CSV files via API
  • Documents loaded via the daappa portal, email, or folder sync
What happens
  • Document type is automatically detected and classified
  • Pack structure is identified (single document vs multi-document pack)
  • Document is queued for extraction engine processing
  • Ingestion is timestamped and logged in the audit trail
2
Extraction engine
Asset level and fund level extraction in parallel
Document analysis
  • Layout analysis and page structure mapping
  • Table detection and table recognition
  • Section analysis and classification
  • Entity labelling across the document
  • Document decomposition into extractable units
Extraction and processing
  • Table cell, key-value pair, and in-text data extraction
  • Metrics and KPI standardisation against your schemas
  • Data normalisation across inconsistent formats
  • Search and extract across document sections
  • Parallel LLM validation to identify and correct potential errors
3
Human-in-the-loop validation
Review, override, and approve
Validation interface
  • Every extracted value linked back to its exact location in the source document
  • Preview in context -- see the extracted value alongside the original table or text
  • Edit or override individual values with commentary
  • Flag exceptions for escalation to a second reviewer
  • Accept or reject the full extracted file
What gets recorded
  • All user validation decisions and timestamps
  • Every edit and override with before/after values
  • Escalation to second reviewer with reason
  • Inline comments and validation notes
  • Full user activity log throughout the workflow
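The recorded items listed above map naturally onto an append-only log. The Python sketch below shows one possible shape for an override record with before/after values, user, comment, and timestamp. The class and field names are hypothetical and do not describe the platform's internal data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class OverrideEvent:
    """One immutable audit entry for an edited value (hypothetical fields)."""
    field_name: str
    before: str
    after: str
    user: str
    comment: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class AuditLog:
    """Append-only list of override events."""

    def __init__(self):
        self._events = []

    def record_override(self, field_name, before, after, user, comment):
        event = OverrideEvent(field_name, before, after, user, comment)
        self._events.append(event)
        return event

    def history(self, field_name):
        """Return every recorded change for one field, oldest first."""
        return [e for e in self._events if e.field_name == field_name]

log = AuditLog()
event = log.record_override("fund_nav", "190.0", "191.0", "analyst_1", "Typo in source table")
```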
4
Enrichment and output
Approved data flows to DataHub and downstream systems
What gets produced
  • Structured data exactly to your schema in Excel or CSV
  • JSON output for API integration
  • Direct DataHub import for Studio+ analytics
  • Ready to load into your existing models and templates
What happens in DataHub
  • Reconciled dataset: fund and portfolio data consolidated
  • Core calculations performed: hold period, realised and unrealised values
  • Derived metrics calculated: EV/EBITDA, multiples, IRR
  • Analytics layer unlocked for dashboards and portfolio reporting
Controlled architecture

Extraction, calculations, and analytics are separated by design

To ensure consistency and auditability, Extractor AI separates three distinct layers. Each layer has its own controlled logic. Nothing bleeds between them.

Layer 01
Extraction
Extracts structured inputs from source documents. Raw data exactly as it appears in the document, validated against source location.
What it handles
  • Table cells and key-value pairs
  • Data in narrative text sections
  • Document-level metadata and classifications
  • Multi-language and multi-format documents
Layer 02
DataHub
Performs core portfolio calculations on validated extracted data. Reconciles fund and portfolio company datasets into a single source of truth.
Calculations performed
  • Hold period calculation
  • Realised and unrealised investment values
  • Other net assets reconciliation
  • Fund-level and portfolio-level consolidation
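As a rough illustration of one calculation in this layer, the Python sketch below computes a hold period in years. It assumes the hold period runs from the investment date to the exit date, or to the reporting date for an unrealised investment, and uses a 365.25-day year. The definition, day-count convention, and function name are assumptions, not the DataHub's documented logic.

```python
from datetime import date

def hold_period_years(investment_date, exit_date=None, as_of=None):
    """Years held: to exit if realised, otherwise to the reporting date."""
    end = exit_date or as_of
    if end is None:
        raise ValueError("provide exit_date or as_of")
    if end < investment_date:
        raise ValueError("end date precedes investment date")
    return (end - investment_date).days / 365.25

# Hypothetical unrealised investment, measured at a reporting date.
years_held = hold_period_years(date(2020, 1, 1), as_of=date(2023, 1, 1))
```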
Layer 03
Analytics
Generates derived metrics and portfolio analytics from the consolidated DataHub dataset. Analytics calculations use controlled MDX logic maintained by the daappa support team.
Metrics produced
  • EV / EBITDA multiples
  • Total equity value by investment
  • Unrealised value / ownership percentage
  • Portfolio performance dashboards and IRR
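For readers unfamiliar with these metrics, the Python sketch below shows the arithmetic behind an EV / EBITDA multiple and a periodic IRR. It is an illustration only: the platform's analytics use controlled MDX logic, and the function names, the bisection solver, and the evenly spaced cash flows here are assumptions. Real fund cash flows are dated, so a date-weighted calculation would be needed in practice.

```python
def ev_to_ebitda(enterprise_value, ebitda):
    """EV / EBITDA multiple; None when EBITDA is zero or negative."""
    if ebitda <= 0:
        return None  # the multiple is not meaningful in this case
    return enterprise_value / ebitda

def npv(rate, cash_flows):
    """Net present value of evenly spaced cash flows, first one at t = 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def irr(cash_flows, low=-0.99, high=10.0, tol=1e-9):
    """Periodic IRR by bisection: the rate at which NPV equals zero."""
    f_low, f_high = npv(low, cash_flows), npv(high, cash_flows)
    if f_low * f_high > 0:
        raise ValueError("IRR is not bracketed by the search interval")
    while high - low > tol:
        mid = (low + high) / 2
        f_mid = npv(mid, cash_flows)
        if f_low * f_mid <= 0:
            high = mid          # root lies in the lower half
        else:
            low, f_low = mid, f_mid  # root lies in the upper half
    return (low + high) / 2

# Hypothetical figures: invest 100, receive 121 two periods later.
multiple = ev_to_ebitda(500.0, 50.0)
rate = irr([-100.0, 0.0, 121.0])
```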
Three ways to use Extractor AI

Start where you are. Scale as you need.

Extractor AI offers three service tiers. You can start with extraction only and upgrade as your confidence and requirements grow.

Tier 1
Extraction only
Automated extraction into your existing workflow
Extractor AI processes your documents and delivers structured data in your schemas. Your team handles the validation and your existing models receive the output. Minimal disruption to current processes on day one.
  • Automated extraction from PDFs into your schemas
  • Validation interface for your team to review and override
  • Output to Excel or CSV, ready for existing models
  • Powered by the Zanran extraction engine, operated by daappa
  • Full audit trail from value to source document location
Tier 3
Extraction plus business intelligence
Full Studio+ integration with analytics and dashboards
Everything in Tier 2, plus validated data flows directly into daappa Studio+ for analytics, dashboards, portfolio trends, and oversight reporting. The full Extract-Transform-Inform pipeline operating as a single managed workflow.
  • Everything in Tier 2
  • Data persisted into Studio+ DataHub
  • Portfolio calculations and derived metrics in DataHub
  • Analytics dashboards for performance, KPIs, and portfolio trends
  • Compliance monitoring and Look Through access
  • Integration feeds into your wider reporting stack if required
Data lineage and auditability

Every value traceable to its source

Extractor AI preserves four distinct stages of the data pipeline. At any point you can trace any value back to the exact location in the original document.

Document-level traceability

Every extracted value links back to its source location

Every data point extracted by Extractor AI is linked directly to the source location in the original document. Users can view the extracted value in context, trace the field back to the source table or text section, and verify the original document reference.

This is not a summary or a reference number. It is a direct visual link between the value in your dataset and the cell, row, or paragraph it came from.
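To make the idea of a source link concrete, the Python sketch below shows one possible record that stores a value together with its document, page, and position on the page, and serialises it to JSON. The class names, field names, file name, and coordinate convention are hypothetical and are not the actual Extractor AI output format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SourceLocation:
    document: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) on the page; units are an assumption

@dataclass
class ExtractedValue:
    field_name: str
    value: str
    source: SourceLocation

record = ExtractedValue(
    field_name="fund_nav",
    value="190.0",
    source=SourceLocation("q3_report.pdf", 14, (72.0, 310.5, 188.0, 324.0)),
)

# Nested dataclasses become nested JSON objects; the tuple becomes a list.
payload = json.dumps(asdict(record), indent=2)
```

Because the location travels with the value, a reviewer or auditor can open the original page and highlight the exact region the number came from.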

Dataset transparency

Four preserved stages from document to analytics

Raw extraction
Data captured directly from the document
Validated dataset
Confirmed and enriched data following review
Consolidated dataset
Fund and portfolio data consolidated in DataHub
Final dataset
Single source of truth in Analytics with calculated metrics and KPIs
Why this matters for audit and regulatory oversight
Investment committees, external auditors, and regulators increasingly require evidence of data lineage -- not just the number, but where it came from and who approved it. Extractor AI provides that evidence automatically, without any additional work from your team. The audit trail is built into the workflow, not added afterwards.
Tier 2 managed service

Operational support with agreed service levels

For firms that want daappa's operations team to handle first-level validation, enrichment, and exception handling, Tier 2 provides a managed service with agreed turnaround times and full client visibility throughout.

Extraction validation against source PDFs
Data enrichment where fields are missing from documents
Exception handling and reconciliation
Approval prior to DataHub ingestion
Validation workflow tracking, inline commentary, user activity logs, audit trail
Validation turnaround
Standard
Validation turnaround from document ingestion to approved dataset, agreed per engagement
Priority
Faster turnaround available for time-sensitive reporting cycles
Client oversight throughout
You retain full visibility through the platform at every stage: validation workflow tracking, inline commentary, user activity logs, and audit trail of all changes. The managed service operates transparently -- you see everything the daappa team does.
Getting started

Five weeks from first conversation to go-live

The Extractor AI proof of concept is designed to deliver results against your actual documents -- not a generic demo. You test on your real reports, with your real schemas, and measure against your current process.

Weeks 1 to 2
Discover and configure
Joint workshops to map your current document processing workflow and configure Extractor AI to your schemas.
Map how documents arrive, where they are stored, and who extracts what
Identify your key schemas, metrics, and quality checks
Configure extraction templates for each document type
Define validation rules and exception categories
Agree the target operating model and sign-off owners
Weeks 3 to 4
Live proof of concept
Extractor AI runs alongside your current manual process on a defined set of funds and documents.
A defined set of funds and document types
Mix of portco reports, financial statements, and capital account statements
Your schemas and data points as configured today
Compare time, effort, and accuracy against current process
Analyst feedback on the validation interface
Week 5
Review and rollout
Review results against agreed success measures and confirm the production rollout scope.
Percentage of schema fields populated correctly
Time and effort saved per document pack
Ease of review and sign-off in the validation interface
Agree production rollout for Tier 1 (or Tier 2 / 3 if required)
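The first success measure, the percentage of schema fields populated correctly, can be computed by comparing extracted output with a manually keyed reference set. The Python sketch below is one simple way to do that; the function name, the exact-match rule, and the sample values are assumptions rather than an agreed PoC methodology.

```python
def field_accuracy(extracted, reference):
    """Percentage of reference fields whose extracted value matches exactly."""
    if not reference:
        raise ValueError("reference set is empty")
    correct = sum(1 for name, value in reference.items()
                  if extracted.get(name) == value)
    return 100.0 * correct / len(reference)

# Hypothetical values for one document.
reference = {"fund_nav": "190.0", "cash": "10.0", "ebitda": "12.5", "currency": "EUR"}
extracted = {"fund_nav": "190.0", "cash": "10.0", "ebitda": "125", "currency": "EUR"}
accuracy = field_accuracy(extracted, reference)
```

Missing fields count as incorrect in this sketch, which keeps the measure conservative.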
Proof of concept

Test on your actual documents

The PoC uses your real quarterly reports, financial statements, and capital account statements -- not synthetic data. You measure against your current process and decide based on real results. If Extractor AI doesn't deliver, you haven't committed to anything.

Weeks 1-2
Onboarding and configuration
Workshops, schema configuration, validation rule setup
Weeks 3-4
Live PoC on your documents
Parallel run against current process, analyst feedback
Week 5
Review and agree rollout
Results review, production scope confirmation
"We use daappa to capture, store, consolidate, analyse and extract data. With daappa we are capable of calculating IRR on FoF, GP and portfolio company levels, and we are able to deliver high quality reports to our investors with more transparency compared to our competitors."
Director of Investor Relations  ·  Global VC
Common questions

Extractor AI -- frequently asked

What types of documents does Extractor AI process?
Extractor AI is designed for private markets documents: quarterly portco reports, GP financial statements, capital account statements, LP letters, investment schedules, audited financials, and banking statements in PDF or CSV format. It handles scanned PDFs, native PDFs, multi-language documents, and multi-industry formats.
How is Extractor AI different from other extraction tools?
Most extraction tools rely on a single large language model. Extractor AI combines Computer Vision AI, Generative AI, search, and programming, with multiple LLMs running in parallel to validate each other's outputs and flag potential errors before human review. It also separates extraction, DataHub calculations, and analytics into three distinct controlled layers -- ensuring consistency and auditability that single-model approaches cannot provide.
How accurate is Extractor AI in practice?
The 99% extraction accuracy figure means that the large majority of schema fields extracted from complex PE documents are populated correctly before the human validation step. The remainder are flagged for human review -- they are not silently wrong. The validation interface shows analysts exactly which values need attention and links each one back to its source location in the original document.
Does it work with our existing schemas and models?
Yes. Tier 1 extraction uses your existing schemas and data points as configured today. Outputs arrive in Excel or CSV ready to load into your existing models. You do not need to change your downstream workflow on day one. Schema configuration is part of the onboarding workshops in weeks 1 and 2 of the proof of concept.
What is the Tier 2 managed validation service?
Tier 2 means daappa's operations team performs first-level validation, data enrichment where fields are missing, exception handling, and approval prior to DataHub ingestion. Agreed service levels are 72-hour standard turnaround and 48-hour priority. You retain full visibility through validation workflow tracking, inline commentary, user activity logs, and a complete audit trail. You do not lose control -- the managed service operates transparently through the platform.
What happens after extraction -- does the data connect to analytics?
Tier 3 connects validated extracted data directly to daappa Studio+ DataHub for portfolio calculations and analytics. DataHub performs core calculations (hold period, realised and unrealised values), and the Analytics layer generates derived metrics (EV/EBITDA, multiples, IRR) using controlled MDX logic. This is the full Extract-Transform-Inform pipeline in a single managed workflow.

Test Extractor AI on your documents

Tell us about the documents you process, the schemas you use, and what the current process costs you in time. We will design a proof of concept around your actual quarterly cycle.

See it in action

Validation in action

Screenshot (daappa platform): Extractor AI validation screen with extracted values linked to the source document
Every extracted value linked back to its source document, with human review on exceptions.
Inside Extractor AI

From document to validated data

Screenshots: Extractor AI document processing; Extractor AI extraction output and review
Your data, your infrastructure

Multi-cloud, or fully on your own infrastructure

Run daappa in our cloud, your cloud, or entirely on-premise inside your own environment. Your data is never shared with external AI engines unless you choose to share it. Built for enterprises and regulated managers who run their own AI and data infrastructure, or who simply cannot let client data leave their control.