Document Extraction Workflow

Human-in-the-Loop AI Extraction with Compliance Checkpoints

Overview

RayRay implements a multi-stage extraction workflow that enforces human oversight at critical decision points. This architecture ensures M-25-21 compliance by requiring human verification before any AI-generated data is committed to the system of record.

Workflow Stages

Stage 1: Document Ingestion

PDF documents are uploaded via the REST API and stored with cryptographic integrity verification. The system accepts documents up to 50MB.

POST /api/documents/upload
Content-Type: multipart/form-data

Response:
{
  "id": 123,
  "filename": "contract.pdf",
  "status": "uploaded",
  "created_at": "2026-03-12T10:00:00Z"
}
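The size limit and integrity check described above could be sketched as follows. This is a minimal illustration, not RayRay's actual implementation: the function names are hypothetical, SHA-256 is assumed as the integrity digest, and "50MB" is assumed to mean 50 * 1024 * 1024 bytes.

```python
import hashlib
import hmac

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumption: "50MB" is read as 50 MiB


def fingerprint_upload(data: bytes) -> str:
    """Reject oversized payloads and return a SHA-256 hex digest of the rest."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError(f"document exceeds {MAX_UPLOAD_BYTES} bytes")
    return hashlib.sha256(data).hexdigest()


def verify_stored(data: bytes, expected_digest: str) -> bool:
    """Re-hash stored bytes and compare with the digest recorded at upload."""
    actual = hashlib.sha256(data).hexdigest()
    # Constant-time comparison; both arguments are ASCII hex strings.
    return hmac.compare_digest(actual, expected_digest)
```

Recording the digest at upload time and re-checking it before extraction would let later stages detect a file that changed after ingestion.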

Stage 2: Spatial Text Extraction

PyMuPDF extracts text blocks with bounding box coordinates. Each text block retains its spatial position on the page, enabling source evidence linking.

Attribute     Type    Description
text          string  Extracted text content
x0, y0        float   Top-left corner coordinates
x1, y1        float   Bottom-right corner coordinates
page_number   int     Page index (1-based)
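As a rough sketch of this stage, PyMuPDF's `page.get_text("blocks")` returns one tuple per block in the form `(x0, y0, x1, y1, text, block_no, block_type)`, which maps directly onto the attributes listed above. The function names and the dictionary layout are illustrative assumptions, not RayRay's actual code.

```python
from typing import Any


def block_to_record(block: tuple, page_number: int) -> dict[str, Any]:
    """Map one PyMuPDF block tuple to the attribute names in the table above."""
    x0, y0, x1, y1, text = block[:5]
    return {
        "text": text.strip(),
        "x0": float(x0), "y0": float(y0),
        "x1": float(x1), "y1": float(y1),
        "page_number": page_number,
    }


def extract_text_blocks(pdf_path: str) -> list[dict[str, Any]]:
    """Collect every text block in the PDF with a 1-based page number."""
    import fitz  # PyMuPDF; imported lazily so block_to_record works without it

    records = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for block in page.get_text("blocks"):
                if block[6] == 0:  # block_type 0 is text, 1 is an image
                    records.append(block_to_record(block, page_number))
    return records
```

PyMuPDF places the origin at the top-left of the page with y increasing downward, so `(x0, y0)` is the top-left corner as the table states.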

Stage 3: AI-Assisted Extraction

An LLM processes the document text and extracts structured data. Each extraction includes a confidence score and the IDs of the source text blocks that support it.

{
  "id": "ext_123_0",
  "type": "field",
  "label": "contract_value",
  "value": "$2,500,000.00",
  "confidence": 0.92,
  "source_text_block_ids": ["block_45", "block_46"]
}
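To show how such a record might be consumed, the sketch below parses the JSON into a typed object and resolves `source_text_block_ids` to the blocks from Stage 2. The `Extraction` class, `parse_extraction`, `resolve_evidence`, and the `blocks_by_id` index are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    id: str
    type: str
    label: str
    value: str
    confidence: float
    source_text_block_ids: list


def parse_extraction(raw: str) -> Extraction:
    """Build an Extraction from a JSON object with exactly the fields above."""
    return Extraction(**json.loads(raw))


def resolve_evidence(extraction: Extraction, blocks_by_id: dict) -> list:
    """Return the linked text blocks, skipping IDs missing from the index."""
    return [
        blocks_by_id[block_id]
        for block_id in extraction.source_text_block_ids
        if block_id in blocks_by_id
    ]
```

A reviewer interface could use the resolved blocks' bounding boxes to highlight the supporting text on the page.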

Stage 4: Redundancy Validation

Before human review, extractions undergo automated quality checks. Flagged items are highlighted for additional scrutiny.

Check Type           Severity  Trigger Condition
low_confidence       High      Confidence score below 60%
medium_confidence    Medium    Confidence score between 60% and 85%
potential_duplicate  Medium    Similar values for same label
no_source            Medium    No linked text block evidence
empty_value          Medium    Placeholder or null values
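The checks in the table could be implemented along these lines. Several details are assumptions made for the sketch: confidence is stored as a fraction (0.92 rather than 92%), the 85% boundary is treated as exclusive, "similar values" is reduced to a case-insensitive exact match, and the placeholder list is illustrative.

```python
from collections import Counter

PLACEHOLDERS = {"", "n/a", "null", "none", "tbd"}  # assumption: illustrative list


def _normalize(extraction: dict) -> tuple:
    """Key used for duplicate detection: label plus lowercased, trimmed value."""
    value = str(extraction.get("value") or "").strip().lower()
    return (extraction["label"], value)


def run_checks(extractions: list) -> dict:
    """Return {extraction id: [(check_type, severity), ...]} for flagged items."""
    seen = Counter(_normalize(e) for e in extractions)
    flags = {}
    for e in extractions:
        found = []
        confidence = e["confidence"]
        if confidence < 0.60:
            found.append(("low_confidence", "High"))
        elif confidence < 0.85:
            found.append(("medium_confidence", "Medium"))
        key = _normalize(e)
        if key[1] in PLACEHOLDERS:
            found.append(("empty_value", "Medium"))
        elif seen[key] > 1:
            found.append(("potential_duplicate", "Medium"))
        if not e.get("source_text_block_ids"):
            found.append(("no_source", "Medium"))
        if found:
            flags[e["id"]] = found
    return flags
```

Items absent from the returned mapping passed every check; they still go to human review, only without a highlight.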

Stage 5: Human Review Checkpoint

M-25-21 Compliance Point: The workflow pauses at this stage. No data is committed to the database until a human reviewer approves, modifies, or rejects the extractions.

Available review actions:

  • Approve — Accept all extractions as-is
  • Modify — Edit values before approval (original values preserved)
  • Reject — Decline extractions, no database write
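The three review actions can be modeled as a small state transition. The sketch below is one possible shape under stated assumptions: the status names, the `ReviewItem` fields, and the function names are hypothetical, and only approved items are eligible for a later database write.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewItem:
    extraction_id: str
    value: str
    status: str = "pending_review"        # hypothetical status names
    original_value: Optional[str] = None  # set only when a reviewer modifies
    reviewer_id: Optional[str] = None


def apply_review(item: ReviewItem, action: str, reviewer_id: str,
                 new_value: Optional[str] = None) -> ReviewItem:
    """Apply approve, modify, or reject to an item that is awaiting review."""
    if item.status != "pending_review":
        raise ValueError("item has already been reviewed")
    if action == "approve":
        item.status = "approved"
    elif action == "modify":
        if new_value is None:
            raise ValueError("modify requires new_value")
        item.original_value = item.value  # preserve the AI-generated value
        item.value = new_value
        item.status = "approved"
    elif action == "reject":
        item.status = "rejected"
    else:
        raise ValueError(f"unknown action: {action}")
    item.reviewer_id = reviewer_id
    return item


def committable(items: list) -> list:
    """Only approved items may proceed to the database commit stage."""
    return [item for item in items if item.status == "approved"]
```

Because pending and rejected items never appear in `committable`, nothing reaches the commit stage without an explicit human decision.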

Stage 6: Database Commit

Upon human approval, extractions are written to PostgreSQL with full audit metadata including reviewer ID, timestamp, and any modification history.
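A commit of one approved extraction might look roughly like the following. The table name, column names, and helper function are hypothetical; the `%s` placeholders follow the parameter style used by the psycopg PostgreSQL driver, and the actual schema is not described in this document.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical table and columns, shown only to illustrate the audit metadata.
INSERT_SQL = (
    "INSERT INTO extractions "
    "(extraction_id, label, value, original_value, reviewer_id, reviewed_at) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def build_commit_params(extraction: dict, reviewer_id: str,
                        reviewed_at: Optional[datetime] = None) -> tuple:
    """Assemble the parameter tuple for INSERT_SQL, in column order."""
    reviewed_at = reviewed_at or datetime.now(timezone.utc)
    return (
        extraction["id"],
        extraction["label"],
        extraction["value"],
        extraction.get("original_value"),  # None unless the reviewer modified
        reviewer_id,
        reviewed_at,
    )
```

With a driver such as psycopg, the pair would be passed as `cursor.execute(INSERT_SQL, params)` inside a transaction so that a failed write leaves no partial record.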

Role-Based Workflow Permissions

Role      Upload  Extract  Review (Draft)  Commit
Observer
Analyst
Reviewer
Admin

Note: Analysts may mark extractions as draft-approved. Only Reviewers and Admins can perform final commit operations.
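The rule in the note can be expressed as a simple role check. Only the permissions the note states are encoded here; that Reviewers and Admins may also draft-approve is an assumption, and the function names and lowercase role strings are hypothetical.

```python
FINAL_COMMIT_ROLES = {"reviewer", "admin"}
# Assumption: roles allowed to commit can also draft-approve.
DRAFT_APPROVE_ROLES = {"analyst"} | FINAL_COMMIT_ROLES


def can_draft_approve(role: str) -> bool:
    """True if the role may mark extractions as draft-approved."""
    return role.lower() in DRAFT_APPROVE_ROLES


def can_commit(role: str) -> bool:
    """True if the role may perform the final commit."""
    return role.lower() in FINAL_COMMIT_ROLES


def require_commit(role: str) -> None:
    """Raise PermissionError unless the role may perform the final commit."""
    if not can_commit(role):
        raise PermissionError(f"role '{role}' cannot commit extractions")
```

Calling `require_commit` at the start of the commit handler would keep the restriction in one place instead of scattering role comparisons through the code.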

Audit Trail

Every workflow transition is logged to the immutable audit system with SHA-256 hash chain verification. See Compliance for details.
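A SHA-256 hash chain works by including the previous entry's hash in each new entry's hash, so altering any earlier record invalidates every hash after it. The sketch below illustrates the idea with an in-memory list; the entry layout, the all-zero genesis value, and the function names are assumptions, not the audit system's actual format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumption: placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain: list, event: dict) -> None:
    """Append an event, linking it to the hash of the current last entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev, "hash": entry_hash(prev, event)})


def verify_chain(chain: list) -> bool:
    """Recompute every hash from the start; any edit or reorder fails the check."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

Serializing with sorted keys keeps the hash stable regardless of the order in which event fields were inserted.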
