Extractor AI — Private Markets Document Data Extraction

Name: Extractor AI
Brand: daappa

The problem

Private markets reporting resists standardisation

Every GP structures their quarterly report differently. Every financial statement uses different table formats, calculation conventions, and terminology. Extraction alone cannot solve this -- data must be structured, reconciled, and calculated consistently before it can be used.

Inconsistent reporting formats

Quarterly reports, financial statements, and investment schedules all differ by manager

Narrative-heavy documents

Most reports contain large volumes of commentary with limited structured data

Mixed fund and asset level data

Cash, bridging loans, and GP carry often appear at fund level but affect portfolio exposure

Different calculation logic

Valuation metrics and multiples vary across managers and need normalisation

Reconciliation requirements

Portfolio data must reconcile to fund NAV before it can feed analytics

The implication for any extraction tool

Extraction alone is not sufficient. Data must be structured, reconciled, and calculated consistently before portfolio analytics can be applied. This is why Extractor AI separates extraction, DataHub calculations, and analytics into three distinct controlled layers -- rather than treating them as a single black box.

Technology approach

Why single-model extraction doesn't work for private markets

Most AI extraction tools rely on a single large language model. That works for simple, consistent documents. It doesn't work for the messy, inconsistent, narrative-heavy documents that private markets produce.

Traditional: manual or semi-automated extraction

Partial solution: a single extraction model (OCR or a Generative AI LLM)

Extractor AI: Computer Vision AI + Generative AI + Search + Programming + Human-in-the-loop

Why the combination matters

Computer Vision AI handles document layout, table detection, and structure recognition. Generative AI extracts meaning from context. Search and programming handle normalisation and standardisation. Multiple LLMs run in parallel to validate each other's outputs and flag potential errors before any human sees the data. The result is a level of accuracy that single-model approaches cannot achieve on private markets documents.
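
The cross-checking idea can be illustrated with a minimal sketch, assuming two independent extraction passes that each return a mapping of field name to value. The function name, field names, and tolerance here are hypothetical and are not Extractor AI's API; the sketch only shows how disagreement between parallel outputs can become a review flag.

```python
from math import isclose


def cross_check(pass_a: dict, pass_b: dict, rel_tol: float = 1e-6) -> dict:
    """Compare two independent extraction passes field by field.

    Returns a mapping of field name -> status, where status is
    "agreed", "mismatch", or "missing" (absent from one pass).
    """
    statuses = {}
    for name in sorted(set(pass_a) | set(pass_b)):
        if name not in pass_a or name not in pass_b:
            statuses[name] = "missing"
            continue
        a, b = pass_a[name], pass_b[name]
        both_numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
        if both_numeric:
            # Numeric values are compared with a tolerance, not exact equality.
            same = isclose(a, b, rel_tol=rel_tol)
        else:
            same = a == b
        statuses[name] = "agreed" if same else "mismatch"
    return statuses


# Hypothetical outputs from two models reading the same report page.
model_a = {"fund_nav": 125_400_000.0, "currency": "EUR", "ebitda": 18_200_000.0}
model_b = {"fund_nav": 125_400_000.0, "currency": "EUR", "ebitda": 12_800_000.0}

flags = cross_check(model_a, model_b)
needs_review = [name for name, status in flags.items() if status != "agreed"]
```

In this sketch only the EBITDA field would be routed to a reviewer; fields on which both passes agree proceed without a flag.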

Processing workflow

Ingestion. Extraction. Validation. Enrichment.

Four stages. Every document moves through the same controlled pipeline.

1

Document ingestion

Email, folder upload, or API

Input formats

Scanned PDFs and native PDFs
Multi-industry and multi-language documents
XLS and CSV files via API
Documents loaded via the daappa portal, email, or folder sync

What happens

Document type is automatically detected and classified
Pack structure is identified (single document vs multi-document pack)
Document is queued for extraction engine processing
Ingestion is timestamped and logged in the audit trail

2

Extraction engine

Asset level and fund level extraction in parallel

Document analysis

Layout analysis and page structure mapping
Table detection and table recognition
Section analysis and classification
Entity labelling across the document
Document decomposition into extractable units

Extraction and processing

Table cell, key-value pair, and in-text data extraction
Metrics and KPI standardisation against your schemas
Data normalisation across inconsistent formats
Search and extract across document sections
Parallel LLM validation to identify and correct potential errors

3

Human-in-the-loop validation

Review, override, and approve

Validation interface

Every extracted value linked back to its exact location in the source document
Preview in context -- see the extracted value alongside the original table or text
Edit or override individual values with commentary
Flag exceptions for escalation to a second reviewer
Accept or reject the full extracted file

What gets recorded

All user validation decisions and timestamps
Every edit and override with before/after values
Escalation to second reviewer with reason
Inline comments and validation notes
Full user activity log throughout the workflow
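
As an illustration of what such a record could look like, here is a minimal sketch of an append-only audit trail. The class names, field names, and action labels are assumptions made for this example and do not describe how daappa actually stores audit data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    """One immutable entry in a validation audit trail (hypothetical shape)."""
    user: str
    action: str                      # e.g. "override", "escalate", "approve"
    field_name: str
    before: Optional[str] = None     # value prior to an edit, if any
    after: Optional[str] = None      # value after an edit, if any
    comment: str = ""
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AuditTrail:
    """Append-only log: events can be added and read, never edited."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def record(self, event: AuditEvent) -> None:
        self._events.append(event)

    def history(self, field_name: str) -> list[AuditEvent]:
        """All events for one field, in the order they were recorded."""
        return [e for e in self._events if e.field_name == field_name]


trail = AuditTrail()
trail.record(AuditEvent("analyst_1", "override", "ebitda",
                        before="18.2", after="12.8",
                        comment="Report states LTM EBITDA of 12.8m"))
trail.record(AuditEvent("analyst_1", "escalate", "ebitda",
                        comment="Please confirm LTM vs FY basis"))
```

Freezing each event and exposing only `record` and `history` is one simple way to keep before/after values and timestamps from being altered after the fact.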

4

Enrichment and output

Approved data flows to DataHub and downstream systems

What gets produced

Structured data exactly to your schema in Excel or CSV
JSON output for API integration
Direct DataHub import for Studio+ analytics
Ready to load into your existing models and templates
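
To make the JSON output concrete, the sketch below shows one possible shape for a single approved value that carries its source location with it. Every key and value is an illustrative assumption; the actual output follows the schema configured for each client.

```python
import json

# Hypothetical record: every key below is an illustrative assumption,
# not the actual Extractor AI output schema.
record = {
    "document_id": "q3-report-example",
    "field": "fund_nav",
    "value": 125400000.0,
    "currency": "EUR",
    "status": "approved",
    "source": {"page": 14, "table": 2, "row": 7, "column": 3},
}

# Serialise for an API response or file export.
payload = json.dumps(record, indent=2, sort_keys=True)

# A downstream consumer can parse the payload and still reach the source location.
parsed = json.loads(payload)
location = parsed["source"]
```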

What happens in DataHub

Reconciled dataset: fund and portfolio data consolidated
Core calculations performed: hold period, realised and unrealised values
Derived metrics calculated: EV/EBITDA, multiples, IRR
Analytics layer unlocked for dashboards and portfolio reporting

Controlled architecture

Extraction, calculations, and analytics are separated by design

To ensure consistency and auditability, Extractor AI separates three distinct layers. Each layer has its own controlled logic. Nothing bleeds between them.

Layer 01

Extraction

Extracts structured inputs from source documents. Raw data exactly as it appears in the document, validated against source location.

What it handles

Table cells and key-value pairs
Data in narrative text sections
Document-level metadata and classifications
Multi-language and multi-format documents

Layer 02

DataHub

Performs core portfolio calculations on validated extracted data. Reconciles fund and portfolio company datasets into a single source of truth.

Calculations performed

Hold period calculation
Realised and unrealised investment values
Other net assets reconciliation
Fund-level and portfolio-level consolidation
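
Two of these calculations can be sketched in a few lines. The example assumes a 365.25-day year for hold period and a simple identity for reconciliation (investment values plus other net assets equals reported fund NAV). Both conventions, and all function names and figures, are assumptions for illustration; DataHub's own logic may differ.

```python
from datetime import date


def hold_period_years(invested: date, as_of: date) -> float:
    """Hold period in years, using a 365.25-day year (one common convention)."""
    return (as_of - invested).days / 365.25


def reconcile_to_nav(investment_values: list[float],
                     other_net_assets: float,
                     reported_nav: float,
                     tolerance: float = 0.01) -> tuple[bool, float]:
    """Check that portfolio values plus other net assets equal reported fund NAV.

    Returns (reconciled, difference), where difference is the
    computed total minus the reported NAV.
    """
    computed = sum(investment_values) + other_net_assets
    difference = computed - reported_nav
    return abs(difference) <= tolerance, difference


# Hypothetical fund: three unrealised investments, plus cash less a bridging loan.
values = [40_000_000.0, 55_500_000.0, 27_900_000.0]
other = 5_000_000.0 - 3_000_000.0   # cash minus bridging loan
ok, diff = reconcile_to_nav(values, other, reported_nav=125_400_000.0)

years = hold_period_years(date(2020, 1, 1), date(2024, 1, 1))
```

A non-zero difference outside the tolerance is the kind of exception that would be surfaced for review rather than passed on to analytics.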

Layer 03

Analytics

Generates derived metrics and portfolio analytics from the consolidated DataHub dataset. Analytics calculations use controlled MDX logic maintained by the daappa support team.

Metrics produced

EV / EBITDA multiples
Total equity value by investment
Unrealised value / ownership percentage
Portfolio performance dashboards and IRR
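
For readers unfamiliar with these metrics, the sketch below shows the underlying arithmetic in Python. It is not the MDX logic used by the Analytics layer: the function names are hypothetical, the IRR assumes equally spaced annual cash flows, and bisection is just one simple way to solve for the rate.

```python
from typing import Optional


def ev_to_ebitda(enterprise_value: float, ebitda: float) -> Optional[float]:
    """EV/EBITDA multiple; None when EBITDA is zero or negative (not meaningful)."""
    if ebitda <= 0:
        return None
    return enterprise_value / ebitda


def npv(rate: float, cash_flows: list[float]) -> float:
    """Net present value of equally spaced (annual) cash flows, first at t=0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def irr(cash_flows: list[float], low: float = -0.99, high: float = 10.0,
        tol: float = 1e-9) -> float:
    """IRR by bisection; assumes NPV changes sign exactly once on [low, high]."""
    f_low, f_high = npv(low, cash_flows), npv(high, cash_flows)
    if f_low * f_high > 0:
        raise ValueError("NPV does not change sign on the search interval")
    while high - low > tol:
        mid = (low + high) / 2
        f_mid = npv(mid, cash_flows)
        if f_low * f_mid <= 0:
            high = mid               # root lies in the lower half
        else:
            low, f_low = mid, f_mid  # root lies in the upper half
    return (low + high) / 2


multiple = ev_to_ebitda(enterprise_value=240.0, ebitda=20.0)
# Invest 100 today, receive 133.1 after three years: 100 * 1.1**3 = 133.1.
rate = irr([-100.0, 0.0, 0.0, 133.1])
```

Returning `None` for non-positive EBITDA reflects the normalisation problem described earlier: a multiple computed on negative earnings is not comparable across managers, so it is safer to leave it blank than to report a misleading number.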

Three ways to use Extractor AI

Start where you are. Scale as you need.

Extractor AI offers three service tiers. You can start with extraction only and upgrade as your confidence and requirements grow.

Tier 1

Extraction only

Automated extraction into your existing workflow

Extractor AI processes your documents and delivers structured data in your schemas. Your team handles the validation and your existing models receive the output. Minimal disruption to current processes on day one.

Automated extraction from PDFs into your schemas
Validation interface for your team to review and override
Output to Excel or CSV, ready for existing models
Powered by Zanran extraction engine, operated by daappa
Full audit trail from value to source document location

Tier 2

Extraction with managed validation

Human-in-the-loop service with agreed SLAs

Everything in Tier 1, plus daappa's operations team performs first-level validation, data enrichment, exception handling, and approval prior to DataHub ingestion. Agreed turnaround times ensure your quarterly cycle is met.

Everything in Tier 1
daappa team performs first-level review and quality checks
Data enrichment where fields are missing from documents
Exception handling and reconciliation before approval
Standard SLA: 72-hour turnaround. Priority SLA: 48-hour turnaround
Optional on-the-ground support for complex mandates
Client retains full visibility through validation workflow tracking

Tier 3

Extraction plus business intelligence

Full Studio+ integration with analytics and dashboards

Everything in Tier 2, plus validated data flows directly into daappa Studio+ for analytics, dashboards, portfolio trends, and oversight reporting. The full Extract-Transform-Inform pipeline operating as a single managed workflow.

Everything in Tier 2
Data persisted into Studio+ DataHub
Portfolio calculations and derived metrics in DataHub
Analytics dashboards for performance, KPIs, and portfolio trends
Compliance monitoring and Look Through access
Integration feeds into your wider reporting stack if required

Data lineage and auditability

Every value traceable to its source

Extractor AI preserves four distinct stages of the data pipeline. At any point you can trace any value back to the exact location in the original document.

Document-level traceability

Every extracted value links back to its source location

Every data point extracted by Extractor AI is linked directly to the source location in the original document. Users can view the extracted value in context, trace the field back to the source table or text section, and verify the original document reference.

This is not a summary or a reference number. It is a direct visual link between the value in your dataset and the cell, row, or paragraph it came from.

Dataset transparency

Four preserved stages from document to analytics

Raw extraction

Data captured directly from the document

Validated dataset

Confirmed and enriched data following review

Consolidated dataset

Fund and portfolio data consolidated in DataHub

Final dataset

Single source of truth in Analytics with calculated metrics and KPIs

Why this matters for audit and regulatory oversight

Investment committees, external auditors, and regulators increasingly require evidence of data lineage -- not just the number, but where it came from and who approved it. Extractor AI provides that evidence automatically, without any additional work from your team. The audit trail is built into the workflow, not added afterwards.

Tier 2 managed service

Operational support with agreed service levels

For firms that want daappa's operations team to handle first-level validation, enrichment, and exception handling, Tier 2 provides a managed service with agreed turnaround times and full client visibility throughout.

Extraction validation against source PDFs

Data enrichment where fields are missing from documents

Exception handling and reconciliation

Approval prior to DataHub ingestion

Validation workflow tracking, inline commentary, user activity logs, audit trail

Validation turnaround

Standard

Validation turnaround from document ingestion to approved dataset, agreed per engagement

Priority

Faster turnaround available for time-sensitive reporting cycles

Client oversight throughout

You retain full visibility through the platform at every stage: validation workflow tracking, inline commentary, user activity logs, and audit trail of all changes. The managed service operates transparently -- you see everything the daappa team does.

Getting started

Five weeks from first conversation to go-live

The Extractor AI proof of concept is designed to deliver results against your actual documents -- not a generic demo. You test on your real reports, with your real schemas, and measure against your current process.

Weeks 1 to 2

Discover and configure

Joint workshops to map your current document processing workflow and configure Extractor AI to your schemas.

Map how documents arrive, where they are stored, and who extracts what

Identify your key schemas, metrics, and quality checks

Configure extraction templates for each document type

Define validation rules and exception categories

Agree the target operating model and sign-off owners

Weeks 3 to 4

Live proof of concept

Extractor AI runs alongside your current manual process on a defined set of funds and documents.

A defined set of funds and document types

Mix of portco reports, financial statements, and capital account statements

Your schemas and data points as configured today

Compare time, effort, and accuracy against current process

Analyst feedback on the validation interface

Week 5

Review and rollout

Review results against agreed success measures and confirm the production rollout scope.

Percentage of schema fields populated correctly

Time and effort saved per document pack

Ease of review and sign-off in the validation interface

Agree production rollout for Tier 1 (or Tier 2 / 3 if required)
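
The first success measure is straightforward to compute once the parallel run has produced both datasets. The sketch below assumes the manually keyed values are treated as the reference and that a field counts as correct only on an exact match; the function name and sample figures are hypothetical.

```python
def field_accuracy(extracted: dict, expected: dict) -> float:
    """Share of expected schema fields whose extracted value matches exactly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    correct = sum(
        1 for name, value in expected.items()
        if name in extracted and extracted[name] == value
    )
    return correct / len(expected)


# Hypothetical comparison against values keyed manually by an analyst.
manual = {"fund_nav": 125.4, "commitment": 200.0, "called": 150.0, "distributed": 30.0}
auto = {"fund_nav": 125.4, "commitment": 200.0, "called": 155.0}

accuracy = field_accuracy(auto, manual)   # one wrong value, one missing field
```

Missing fields count against the score in this sketch, so the measure reflects both wrong values and gaps in population.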

Test on your actual documents

The PoC uses your real quarterly reports, financial statements, and capital account statements -- not synthetic data. You measure against your current process and decide based on real results. If Extractor AI doesn't deliver, you haven't committed to anything.

Weeks 1-2

Onboarding and configuration

Workshops, schema configuration, validation rule setup

Weeks 3-4

Live PoC on your documents

Parallel run against current process, analyst feedback

Week 5

Review and agree rollout

Results review, production scope confirmation

"We use daappa to capture, store, consolidate, analyse and extract data. With daappa we are capable of calculating IRR on FoF, GP and portfolio company levels, and we are able to deliver high quality reports to our investors with more transparency compared to our competitors."

Director of Investor Relations · Global VC

Common questions

Extractor AI -- frequently asked

What types of documents does Extractor AI process?

Extractor AI is designed for private markets documents: quarterly portco reports, GP financial statements, capital account statements, LP letters, investment schedules, audited financials, and banking statements in PDF or CSV format. It handles scanned PDFs, native PDFs, multi-language documents, and multi-industry formats.

How is Extractor AI different from other extraction tools?

Most extraction tools rely on a single large language model. Extractor AI combines Computer Vision AI, Generative AI, search, and programming, with multiple LLMs running in parallel to validate each other's outputs and flag potential errors before human review. It also separates extraction, DataHub calculations, and analytics into three distinct controlled layers -- ensuring consistency and auditability that single-model approaches cannot provide.

How accurate is Extractor AI in practice?

In practice, the large majority of schema fields extracted from complex PE documents are populated correctly before the human validation step. The remainder are flagged for human review -- they are not silently wrong. The validation interface shows analysts exactly which values need attention and links each one back to its source location in the original document.

Does it work with our existing schemas and models?

Yes. Tier 1 extraction uses your existing schemas and data points as configured today. Outputs arrive in Excel or CSV ready to load into your existing models. You do not need to change your downstream workflow on day one. Schema configuration is part of the onboarding workshops in weeks 1 and 2 of the proof of concept.

What is the Tier 2 managed validation service?

Tier 2 means daappa's operations team performs first-level validation, data enrichment where fields are missing, exception handling, and approval prior to DataHub ingestion. Agreed service levels are 72-hour standard turnaround and 48-hour priority. You retain full visibility through validation workflow tracking, inline commentary, user activity logs, and a complete audit trail. You do not lose control -- the managed service operates transparently through the platform.

What happens after extraction -- does the data connect to analytics?

Tier 3 connects validated extracted data directly to daappa Studio+ DataHub for portfolio calculations and analytics. DataHub performs core calculations (hold period, realised and unrealised values), and the Analytics layer generates derived metrics (EV/EBITDA, multiples, IRR) using controlled MDX logic. This is the full Extract-Transform-Inform pipeline in a single managed workflow.
