System Architecture
Technical design, data flow, and infrastructure overview
Design Principles
RayRay is architected around four core principles that ensure compliance, auditability, and operational efficiency:
- Spatial Evidence Traceability — Every AI extraction maintains coordinate-based links to source text via bounding boxes
- Mandatory Human-in-the-Loop — LangGraph checkpoints enforce human review before any database write
- Immutable Audit Logs — Blockchain-style hash chains provide tamper-evident event history
- Labor Optimization Metrics — Automated time-savings calculation for agency ROI reporting
System Overview
┌─────────────────────────────────────────────────────────────────────┐
│                           RAYRAY PLATFORM                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐         │
│  │   Frontend   │────▶│   Backend    │────▶│  PostgreSQL  │         │
│  │  Next.js 14  │     │   FastAPI    │     │   Database   │         │
│  │   React 18   │◀────│ Python 3.11  │◀────│   (Async)    │         │
│  └──────────────┘     └──────────────┘     └──────────────┘         │
│         │                    │                   │                  │
│         │                    ▼                   │                  │
│         │             ┌──────────────┐           │                  │
│         │             │  LangGraph   │           │                  │
│         │             │   Workflow   │           │                  │
│         │             └──────────────┘           │                  │
│         │                    │                   │                  │
│         │                    ▼                   │                  │
│         │             ┌──────────────┐           │                  │
│         │             │   LLM APIs   │           │                  │
│         │             │  Anthropic/  │           │                  │
│         │             │    OpenAI    │           │                  │
│         │             └──────────────┘           │                  │
│         │                                        │                  │
│         └────────────▶┌──────────────┐◀──────────┘                  │
│                       │  Audit Log   │                              │
│                       │  (Append-Only│                              │
│                       │   SHA-256)   │                              │
│                       └──────────────┘                              │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Extraction Workflow
The extraction pipeline enforces M-25-21 compliance through a multi-stage workflow with mandatory human checkpoints:
Stage 1: UPLOAD
└── PDF received → Stored with metadata → Audit logged
Stage 2: PARSE
└── PyMuPDF extracts text blocks with bounding boxes
└── Each block: {id, text, x0, y0, x1, y1, page_number}
Stage 3: EXTRACT (AI-Assisted)
└── LLM processes full document text
└── Returns structured extractions with confidence scores
└── Each extraction linked to source text blocks
Stage 4: VALIDATE (Redundancy Check)
└── Confidence threshold check (< 60% = flagged)
└── Duplicate detection (similarity > 80%)
└── Required field validation
└── Source coverage check
Stage 5: CHECKPOINT (Human Review) ◀── M-25-21 MANDATORY
└── Workflow pauses
└── Human reviews flagged and unflagged items
└── Actions: APPROVE / MODIFY / REJECT
Stage 6: COMMIT
└── Only after human approval
└── Original values preserved if modified
└── Audit log updated with reviewer info
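The Stage 4 checks can be sketched with the standard library alone. This is a minimal illustration, not RayRay's implementation: the `validate` function, the `REQUIRED_LABELS` set, the `source_block_ids` key, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions. Only the 60% and 80% thresholds come from the stage description above.

```python
from difflib import SequenceMatcher

CONFIDENCE_THRESHOLD = 0.60  # Stage 4: confidence below 60% is flagged
DUPLICATE_THRESHOLD = 0.80   # Stage 4: similarity above 80% counts as a duplicate
REQUIRED_LABELS = {"title", "date"}  # hypothetical required fields


def validate(extractions):
    """Return (index, reason) flags for a list of extraction dicts.

    Each dict is assumed to carry: label, value, confidence, source_block_ids.
    """
    flags = []
    for i, item in enumerate(extractions):
        if item["confidence"] < CONFIDENCE_THRESHOLD:
            flags.append((i, "low_confidence"))
        # Source coverage: every extraction needs at least one linked text block.
        if not item.get("source_block_ids"):
            flags.append((i, "no_source"))
        # Duplicate detection: compare with earlier items that share the label.
        for j in range(i):
            other = extractions[j]
            if other["label"] != item["label"]:
                continue
            ratio = SequenceMatcher(None, other["value"], item["value"]).ratio()
            if ratio > DUPLICATE_THRESHOLD:
                flags.append((i, f"duplicate_of_{j}"))
                break
    # Required field validation applies to the document as a whole.
    missing = REQUIRED_LABELS - {e["label"] for e in extractions}
    flags.extend((None, f"missing_{label}") for label in sorted(missing))
    return flags
```

Flags produced this way would only annotate items for the reviewer. Per Stage 5, flagged and unflagged items alike still go through human review.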
Component Architecture
Frontend Stack
| Component | Technology | Purpose |
|---|---|---|
| Framework | Next.js 14 (App Router) | SSR, routing, API routes |
| UI Library | React 18 | Component model |
| Styling | Tailwind CSS | Utility-first CSS |
| PDF Viewer | react-pdf-viewer | Document display with overlays |
| State | React Context | Auth state, theme |
Backend Stack
| Component | Technology | Purpose |
|---|---|---|
| Framework | FastAPI | Async API server |
| ORM | SQLAlchemy 2.0 | Async database access |
| Workflow | LangGraph | HITL checkpoint enforcement |
| PDF Processing | PyMuPDF | Text extraction with coordinates |
| Auth | python-jose + passlib | JWT + bcrypt |
| MFA | pyotp | TOTP (RFC 6238) |
Data Model
Core Entities
Document
├── id: int (PK)
├── filename: str
├── file_path: str
├── status: enum [pending, processing, reviewed, committed]
├── uploaded_by: int (FK → User)
└── created_at: datetime
Page
├── id: int (PK)
├── document_id: int (FK)
├── page_number: int
└── dimensions: {width, height}
TextBlock
├── id: int (PK)
├── page_id: int (FK)
├── text: str
├── x0, y0, x1, y1: float ◀── Bounding box coordinates
└── confidence: float
Extraction
├── id: int (PK)
├── document_id: int (FK)
├── label: str
├── value: str
├── original_value: str ◀── AI output (preserved)
├── confidence: float
├── review_status: enum [pending, approved, rejected]
├── reviewed_by: int (FK) ◀── M-25-21: Human verifier
├── reviewed_at: datetime
├── workflow_run_id: str ◀── Traceability
└── workflow_checkpoint: str
ExtractionSource (link table)
├── extraction_id: int (FK)
├── text_block_id: int (FK) ◀── Spatial evidence link
└── relevance_score: float
AuditLog
├── _id: uuid
├── _timestamp: datetime
├── _prev_hash: str ◀── Chain integrity
├── _hash: str
├── event_type: str
├── user_id: int
├── resource_type: str
├── resource_id: int
├── outcome: enum [success, failure]
└── details: json
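The `_prev_hash` and `_hash` fields on AuditLog suggest a conventional hash chain. The sketch below shows one way such a chain could be built and verified with SHA-256. The function names, the all-zero genesis value, and the canonical-JSON serialization are assumptions for illustration, not the project's actual scheme.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # assumed _prev_hash for the very first entry


def _digest(entry):
    """SHA-256 over every field except _hash, serialized with sorted keys."""
    payload = {k: v for k, v in entry.items() if k != "_hash"}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(log, event_type, user_id, resource_type, resource_id,
                 outcome, details):
    """Append a new entry whose _prev_hash points at the current tail."""
    entry = {
        "_id": str(uuid.uuid4()),
        "_timestamp": datetime.now(timezone.utc).isoformat(),
        "_prev_hash": log[-1]["_hash"] if log else GENESIS_HASH,
        "event_type": event_type,
        "user_id": user_id,
        "resource_type": resource_type,
        "resource_id": resource_id,
        "outcome": outcome,
        "details": details,  # must be JSON-serializable
    }
    entry["_hash"] = _digest(entry)
    log.append(entry)
    return entry


def verify_chain(log):
    """Return True only if every link and every entry hash is intact."""
    prev = GENESIS_HASH
    for entry in log:
        if entry["_prev_hash"] != prev or entry["_hash"] != _digest(entry):
            return False
        prev = entry["_hash"]
    return True
```

Because each entry's hash covers the previous entry's hash, editing or deleting any earlier record changes every hash after it, which is what makes the log tamper-evident rather than tamper-proof.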
Relationship Diagram
Document 1──────* Page 1──────* TextBlock
    │                             │
    │                             │
    └──────1* Extraction *────────┘
                  │        (via ExtractionSource)
                  │
                  *1
            Reviewer (User)
User ──────* AuditLog
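The ExtractionSource link table is what lets a reviewer jump from an extracted value to the region of the page it came from. A minimal in-memory sketch of that lookup follows. The dataclasses mirror the entities above, but `evidence_for` is a hypothetical helper, and `page_number` is stored directly on the block here for brevity (the data model reaches it through `page_id`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    id: int
    page_number: int  # simplification: the data model links via page_id
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class ExtractionSource:
    extraction_id: int
    text_block_id: int
    relevance_score: float


def evidence_for(extraction_id, sources, blocks_by_id):
    """Return (page_number, bbox, text) per linked block, most relevant first."""
    links = sorted(
        (s for s in sources if s.extraction_id == extraction_id),
        key=lambda s: s.relevance_score,
        reverse=True,
    )
    result = []
    for link in links:
        block = blocks_by_id[link.text_block_id]
        bbox = (block.x0, block.y0, block.x1, block.y1)
        result.append((block.page_number, bbox, block.text))
    return result
```

A PDF viewer overlay could then draw one highlight rectangle per returned bounding box on the matching page.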
LangGraph Workflow Implementation
from typing import Dict, List, Optional, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, StateGraph

# The node functions (extract_from_pdf, redundancy_check_node,
# human_review_node, commit_to_database) and the should_commit router
# are assumed to be defined elsewhere in the application.

# Define state
class ExtractionState(TypedDict):
    document_id: int
    extractions: List[Dict]
    review_status: str  # pending, approved, rejected
    reviewer_id: Optional[str]
    ...  # remaining fields elided

# Build workflow
workflow = StateGraph(ExtractionState)
workflow.add_node("extract", extract_from_pdf)
workflow.add_node("redundancy_check", redundancy_check_node)
workflow.add_node("human_review", human_review_node)  # CHECKPOINT
workflow.add_node("commit", commit_to_database)

# Edges
workflow.set_entry_point("extract")  # a graph needs an entry point to compile
workflow.add_edge("extract", "redundancy_check")
workflow.add_edge("redundancy_check", "human_review")
workflow.add_conditional_edges("human_review", should_commit, {
    "commit": "commit",
    "reject": END,
    "pending": END,  # Pause for human input
})
workflow.add_edge("commit", END)

# Compile with checkpointing
checkpointer = MemorySaver()  # Use PostgresSaver in production
app = workflow.compile(checkpointer=checkpointer)
Key Implementation Details
- Checkpoint Enforcement: The workflow pauses at human_review and cannot proceed until external input is provided via the review API
- State Persistence: Production deployments should use PostgresSaver for durable checkpoint storage
- Redundancy Checks: Pre-review validation flags low-confidence items (thresholds configurable)
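The workflow code references a `should_commit` router without showing it. One plausible shape is sketched below. The body is an assumption, not the project's code. It returns one of the three routing keys used in `add_conditional_edges` and refuses to route to `commit` unless a reviewer is recorded, which is one way to keep the human-approval rule inside the graph itself.

```python
from typing import Dict, List, Optional, TypedDict


class ExtractionState(TypedDict):
    document_id: int
    extractions: List[Dict]
    review_status: str  # pending, approved, rejected
    reviewer_id: Optional[str]


def should_commit(state: ExtractionState) -> str:
    """Map the review outcome to a routing key: commit, reject, or pending."""
    status = state.get("review_status", "pending")
    # Approval only counts when a human reviewer is attached to it.
    if status == "approved" and state.get("reviewer_id"):
        return "commit"
    if status == "rejected":
        return "reject"
    # Anything else, including an approval with no reviewer, keeps waiting.
    return "pending"
```

Treating an unattributed approval as `pending` rather than raising an error keeps the run resumable once the review API supplies the missing reviewer.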
Frontend Route Structure
apps/web/src/app/
├── layout.tsx # Root layout (fonts, providers)
├── (marketing)/ # Public pages — no auth required
│ ├── page.tsx # Landing (/)
│ └── login/page.tsx # Sign in (/login)
├── (app)/ # Protected pages — auth required
│ ├── layout.tsx # App layout with nav + footer
│ ├── dashboard/ # Main dashboard
│ ├── documents/ # Document list + detail
│ ├── compare/ # Document comparison
│ ├── settings/ # User settings + MFA
│ └── profile/ # User profile
└── docs/ # Documentation (this site)
Security Architecture
Authentication Flow
1. User submits credentials (email + password)
2. Backend verifies against bcrypt hash
3. If MFA enabled:
a. Return partial JWT with mfa=false
b. User submits TOTP code
c. Verify against stored secret
d. Issue full JWT with mfa=true
4. JWT stored in httpOnly cookie + localStorage
5. Subsequent requests include Authorization header
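Steps 3b and 3c rely on TOTP as specified in RFC 6238. The stack delegates this to pyotp. The standard-library sketch below only illustrates what "verify against stored secret" involves. The function names, the 30-second step, and the one-step drift window are assumptions chosen for the example.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at=None, step=30, digits=6):
    """Compute an RFC 6238 TOTP code (HMAC-SHA-1) for a base32 secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    now = time.time() if at is None else at
    counter = int(now // step)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation, as in RFC 4226
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret_b32, submitted, at=None, step=30, window=1):
    """Accept the code for the current step or `window` steps either side."""
    now = time.time() if at is None else at
    return any(
        hmac.compare_digest(totp(secret_b32, now + drift * step, step), submitted)
        for drift in range(-window, window + 1)
    )
```

Only after this check passes would step 3d issue the full JWT with mfa=true. The constant-time comparison avoids leaking how many leading digits of a guess were correct.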
Authorization Model
| Role | Can Commit | Can Manage Users | NIST Mapping |
|---|---|---|---|
| Observer | No | No | AC-3 (Read-only) |
| Analyst | No | No | AC-3 (Drafts only) |
| Reviewer | Yes | No | AC-3 (Full CRUD) |
| Admin | Yes | Yes | AC-3 (Privileged) |
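The table above translates directly into a permission lookup. The sketch below is illustrative only: the `Role` enum values, the permission strings, and the `require` helper are hypothetical names, and only the Can Commit and Can Manage Users columns are modeled.

```python
from enum import Enum


class Role(str, Enum):
    OBSERVER = "observer"
    ANALYST = "analyst"
    REVIEWER = "reviewer"
    ADMIN = "admin"


# One entry per row of the authorization table.
PERMISSIONS = {
    Role.OBSERVER: frozenset(),
    Role.ANALYST: frozenset(),
    Role.REVIEWER: frozenset({"commit"}),
    Role.ADMIN: frozenset({"commit", "manage_users"}),
}


def can(role, permission):
    """Return True if the role grants the permission."""
    return permission in PERMISSIONS[role]


def require(role, permission):
    """Raise PermissionError unless the role grants the permission."""
    if not can(role, permission):
        raise PermissionError(f"{role.value} may not {permission}")
```

An API route for committing extractions could call `require(user_role, "commit")` before touching the database, so the check fails closed for any role not listed.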
Infrastructure Requirements
Development
# Minimum requirements
- Node.js 18+
- Python 3.11+
- PostgreSQL 15+
- 4GB RAM
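# Start services (run each command from the directory that contains apps/;
# the API and web servers each need their own terminal)
docker-compose up -d db
(cd apps/api && uvicorn app.main:app --reload)
(cd apps/web && npm run dev)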
Production
| Component | Recommendation | Notes |
|---|---|---|
| Database | Managed PostgreSQL (RDS, Cloud SQL) | Enable point-in-time recovery |
| API Hosting | Containerized (ECS, Cloud Run) | Auto-scaling recommended |
| Frontend | Vercel, Cloudflare Pages | Edge caching for static assets |
| File Storage | S3, GCS with encryption | Server-side encryption required |
| Secrets | Vault, AWS Secrets Manager | Never commit to code |