Document Extraction Workflow
Human-in-the-Loop AI Extraction with Compliance Checkpoints
Overview
RayRay implements a multi-stage extraction workflow that enforces human oversight at critical decision points. This architecture ensures M-25-21 compliance by requiring human verification before any AI-generated data is committed to the system of record.
Workflow Stages
Stage 1: Document Ingestion
PDF documents are uploaded via the REST API and stored with cryptographic integrity verification. The system accepts documents up to 50 MB.
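
The size limit and integrity check described above might look roughly like the following client-side sketch. It assumes SHA-256 as the digest (the text only says "cryptographic integrity verification") and reads "50 MB" as 50 MiB. The names `prepare_upload` and `verify_stored` are hypothetical and not part of the RayRay API.

```python
import hashlib

# Assumption: "50 MB" is read as 50 MiB; adjust if decimal megabytes are intended.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def prepare_upload(data: bytes) -> dict:
    """Check size and file signature, then compute a SHA-256 digest (assumed algorithm)."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError(f"document exceeds {MAX_UPLOAD_BYTES} bytes")
    if not data.startswith(b"%PDF-"):
        raise ValueError("payload does not look like a PDF")
    return {"size_bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def verify_stored(data: bytes, expected_sha256: str) -> bool:
    """Recompute the digest of stored bytes and compare it with the recorded one."""
    return hashlib.sha256(data).hexdigest() == expected_sha256
```

Recording the digest at upload time lets later stages detect whether the stored bytes have changed before extraction runs.
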
POST /api/documents/upload
Content-Type: multipart/form-data
Response:
{
"id": 123,
"filename": "contract.pdf",
"status": "uploaded",
"created_at": "2026-03-12T10:00:00Z"
}
Stage 2: Spatial Text Extraction
PyMuPDF extracts text blocks with bounding box coordinates. Each text block retains its spatial position on the page, enabling source evidence linking.
| Attribute | Type | Description |
|---|---|---|
| text | string | Extracted text content |
| x0, y0 | float | Top-left corner coordinates |
| x1, y1 | float | Bottom-right corner coordinates |
| page_number | int | Page index (1-based) |
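
As a rough illustration of how the attributes in this table could be populated, the sketch below converts the block tuples returned by PyMuPDF's `page.get_text("blocks")` into records. The `TextBlock` class, the `block_N` ID scheme, and both function names are assumptions made for this example, not RayRay's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    id: str           # hypothetical ID scheme, e.g. "block_45"
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    page_number: int  # 1-based, matching the table


def to_text_blocks(raw_blocks, page_number, start_index=0):
    """Convert PyMuPDF block tuples for one page into TextBlock records.

    Each tuple is (x0, y0, x1, y1, text, block_no, block_type);
    block_type 0 marks a text block, 1 an image block.
    """
    blocks = []
    for raw in raw_blocks:
        x0, y0, x1, y1, text, _block_no, block_type = raw[:7]
        if block_type != 0 or not text.strip():
            continue  # skip image blocks and empty text
        block_id = f"block_{start_index + len(blocks)}"
        blocks.append(TextBlock(block_id, text.strip(), x0, y0, x1, y1, page_number))
    return blocks


def extract_pdf(path):
    """Extract text blocks from every page of a PDF (requires PyMuPDF)."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    blocks = []
    with fitz.open(path) as doc:
        for page_index, page in enumerate(doc):
            raw = page.get_text("blocks")
            blocks.extend(to_text_blocks(raw, page_index + 1, len(blocks)))
    return blocks
```

Keeping the conversion separate from the PDF access makes the coordinate handling easy to test with hand-written tuples.
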
Stage 3: AI-Assisted Extraction
An LLM processes the document text and extracts structured data. Each extraction includes confidence scores and links to source text blocks.
{
"id": "ext_123_0",
"type": "field",
"label": "contract_value",
"value": "$2,500,000.00",
"confidence": 0.92,
"source_text_block_ids": ["block_45", "block_46"]
}
Stage 4: Redundancy Validation
Before human review, extractions undergo automated quality checks. Flagged items are highlighted for additional scrutiny.
| Check Type | Severity | Trigger Condition |
|---|---|---|
| low_confidence | High | Confidence score below 60% |
| medium_confidence | Medium | Confidence score between 60% and 85% |
| potential_duplicate | Medium | Similar values for same label |
| no_source | Medium | No linked text block evidence |
| empty_value | Medium | Placeholder or null values |
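
The checks in this table could be expressed along the following lines. This is a minimal sketch under stated assumptions: confidence is on a 0 to 1 scale as in the Stage 3 example, the 85% bound is treated as exclusive, the placeholder list is illustrative, and "similar values" is approximated by an exact match after normalization. The function name `validate` is hypothetical.

```python
from collections import defaultdict

LOW_THRESHOLD = 0.60
MEDIUM_THRESHOLD = 0.85  # assumption: scores at or above 0.85 are not flagged
PLACEHOLDERS = {"", "n/a", "null", "none", "tbd"}  # assumed placeholder values


def validate(extractions):
    """Return {extraction_id: [(check_type, severity), ...]} for flagged items only."""
    flags = defaultdict(list)
    seen = defaultdict(set)  # label -> normalized values already encountered
    for ext in extractions:
        ext_id = ext["id"]
        confidence = ext.get("confidence", 0.0)
        if confidence < LOW_THRESHOLD:
            flags[ext_id].append(("low_confidence", "High"))
        elif confidence < MEDIUM_THRESHOLD:
            flags[ext_id].append(("medium_confidence", "Medium"))

        if not ext.get("source_text_block_ids"):
            flags[ext_id].append(("no_source", "Medium"))

        value = ext.get("value")
        normalized = "" if value is None else str(value).strip().lower()
        if normalized in PLACEHOLDERS:
            flags[ext_id].append(("empty_value", "Medium"))
        else:
            # Exact match after normalization stands in for "similar values".
            if normalized in seen[ext["label"]]:
                flags[ext_id].append(("potential_duplicate", "Medium"))
            seen[ext["label"]].add(normalized)
    return dict(flags)
```

Only the later of two matching extractions is flagged here. A real implementation might flag both, or use a fuzzy comparison.
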
Stage 5: Human Review Checkpoint
M-25-21 Compliance Point: The workflow pauses at this stage. No data is committed to the database until a human reviewer approves, modifies, or rejects the extractions.
Available review actions:
- Approve — Accept all extractions as-is
- Modify — Edit values before approval (original values preserved)
- Reject — Decline extractions, no database write
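
The three review actions can be pictured as transitions on a pending extraction. The sketch below is illustrative only: the `ReviewedExtraction` class, its field names, and the status strings are assumptions, not the actual RayRay data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ReviewedExtraction:
    id: str
    value: str
    status: str = "pending"               # pending | approved | rejected (assumed names)
    original_value: Optional[str] = None  # set only when a reviewer edits the value
    reviewer_id: Optional[str] = None
    reviewed_at: Optional[str] = None


def review(ext, action, reviewer_id, new_value=None):
    """Apply one review action to a pending extraction and record who did it."""
    if ext.status != "pending":
        raise ValueError(f"{ext.id} has already been reviewed")
    if action == "approve":
        ext.status = "approved"
    elif action == "modify":
        if new_value is None:
            raise ValueError("modify requires new_value")
        # Preserve the AI-generated value alongside the reviewer's edit.
        ext.original_value, ext.value = ext.value, new_value
        ext.status = "approved"
    elif action == "reject":
        ext.status = "rejected"
    else:
        raise ValueError(f"unknown action: {action}")
    ext.reviewer_id = reviewer_id
    ext.reviewed_at = datetime.now(timezone.utc).isoformat()
    return ext


def committable(extractions):
    """Only human-approved extractions are eligible for a database write."""
    return [e for e in extractions if e.status == "approved"]
```

Because `committable` filters on status, pending and rejected items never reach the commit step in this sketch.
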
Stage 6: Database Commit
Upon human approval, extractions are written to PostgreSQL with full audit metadata including reviewer ID, timestamp, and any modification history.
Role-Based Workflow Permissions
| Role | Upload | Extract | Review (Draft) | Commit |
|---|---|---|---|---|
| Observer | — | — | — | — |
| Analyst | ✓ | ✓ | ✓ | — |
| Reviewer | ✓ | ✓ | ✓ | ✓ |
| Admin | ✓ | ✓ | ✓ | ✓ |
Note: Analysts may mark extractions as draft-approved. Only Reviewers and Admins can perform final commit operations.
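
One way to encode the permission matrix is a simple role-to-actions lookup, sketched below. The action identifiers (`upload`, `extract`, `review_draft`, `commit`) and the helper names are assumptions chosen to mirror the table columns.

```python
# Role -> permitted actions, transcribed from the permissions table.
_ALL = frozenset({"upload", "extract", "review_draft", "commit"})
PERMISSIONS = {
    "observer": frozenset(),
    "analyst": frozenset({"upload", "extract", "review_draft"}),
    "reviewer": _ALL,
    "admin": _ALL,
}


def can(role: str, action: str) -> bool:
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role.lower(), frozenset())


def require(role: str, action: str) -> None:
    """Raise PermissionError when the role lacks the action."""
    if not can(role, action):
        raise PermissionError(f"role '{role}' may not perform '{action}'")
```

For example, `require("Analyst", "commit")` raises `PermissionError`, while `require("Reviewer", "commit")` returns without error.
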
Audit Trail
Every workflow transition is logged to the immutable audit system with SHA-256 hash chain verification. See Compliance for details.
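
To illustrate the general idea of a SHA-256 hash chain, the sketch below links each log entry to the hash of the previous one, so altering any earlier entry invalidates every later hash. The entry layout, the all-zero genesis value, and the function names are assumptions. The actual audit format is defined in Compliance.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for an empty chain


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> dict:
    """Append an event, linking it to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"event": event, "prev_hash": prev_hash, "hash": entry_hash(prev_hash, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash in order; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Serializing with sorted keys keeps the hash stable regardless of the order in which event fields were set.
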
Related
- API Reference — Extraction endpoints
- Architecture — System design
- Compliance — Audit requirements