Spreadsheet Processing
Extract structured data from Excel and CSV files with full cell-level traceability
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| Excel 2007+ | .xlsx | Full support, multi-sheet extraction |
| Excel 97-2003 | .xls | Limited support via openpyxl compatibility layer |
| Comma-Separated Values | .csv | UTF-8 encoding, single sheet |
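For illustration, dispatching on these extensions might look like the following sketch. `validate_extension` is a hypothetical helper, not part of the actual SpreadsheetExtractor:

```python
from pathlib import Path

# Extensions accepted by the extractor, per the table above.
SUPPORTED_EXTENSIONS = {".xlsx", ".xls", ".csv"}

def validate_extension(path: str) -> str:
    """Return the lowercased extension, or raise ValueError if unsupported."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension: {ext}")
    return ext
```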
Extraction Architecture
The SpreadsheetExtractor class processes spreadsheet files and produces a standardized output format compatible with the RayRay extraction pipeline. Each cell is extracted with its row/column coordinates encoded as a bounding box for full traceability.
Cell Coordinate System
Cell positions are encoded using a 0-indexed coordinate system where bounding boxes represent cell boundaries:
```json
{
  "text": "450 km",
  "bbox": {
    "x0": 2,  // Column index (0-indexed)
    "y0": 5,  // Row index (0-indexed)
    "x1": 3,  // Column + 1
    "y1": 6   // Row + 1
  }
}
```

This coordinate system ensures every extracted value can be traced back to its exact source location, satisfying audit requirements for spatial evidence traceability.
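The 0-indexed coordinates map directly onto familiar A1-style cell references. A small sketch (hypothetical helper, not part of the extractor API):

```python
def bbox_to_a1(bbox: dict) -> str:
    """Convert a cell bounding box (0-indexed x0=column, y0=row) to an A1-style reference."""
    col, row = bbox["x0"], bbox["y0"]
    letters = ""
    col += 1  # A1 columns are 1-indexed
    while col:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return f"{letters}{row + 1}"
```

For the example above, column 2 / row 5 corresponds to cell C6.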
Output Structure
```json
{
  "page_count": 3,          // Number of sheets
  "file_hash": "sha256:abc...",
  "sheets": [
    {
      "sheet_number": 1,
      "sheet_name": "Radar Specs",
      "width": 12,          // Column count
      "height": 50,         // Row count
      "text": "Full text for LLM processing...",
      "text_blocks": [
        { "text": "System Name", "bbox": {...} },
        { "text": "Range (km)", "bbox": {...} }
      ]
    }
  ],
  "pages": [...]            // Alias for compatibility
}
```

Data Sanitization
All cell values are sanitized to remove null bytes and invalid UTF-8 sequences before extraction. This ensures clean data for LLM processing and prevents encoding-related failures.
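The sanitization steps listed below can be sketched in Python roughly as follows; `sanitize_cell` is illustrative, and the actual implementation may differ:

```python
import unicodedata

def sanitize_cell(value) -> str:
    """Sanitize a raw cell value for downstream LLM processing."""
    text = str(value)                # 1. convert to string representation
    text = text.replace("\x00", "")  # 2. remove null bytes
    text = "".join(                  # 3. strip control characters, preserve \n, \t, \r
        ch for ch in text
        if ch in "\n\t\r" or unicodedata.category(ch) != "Cc"
    )
    # 4. re-encode, replacing any invalid sequences, so the result is valid UTF-8
    return text.encode("utf-8", errors="replace").decode("utf-8")
```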
Sanitization process:

1. Convert to string representation
2. Remove null bytes (\x00)
3. Strip control characters (preserve \n, \t, \r)
4. Re-encode as valid UTF-8

Processing Examples
Single CSV File
```bash
# Upload CSV via API
curl -X POST "http://localhost:8000/api/documents/upload" \
  -H "Authorization: Bearer eyJ..." \
  -F "file=@radar-specs.csv"
```

The response includes the sheet structure:

```json
{
  "id": 101,
  "filename": "radar-specs.csv",
  "status": "uploaded",
  "page_count": 1
}
```

Multi-Sheet Excel Workbook
```bash
# Upload Excel file
curl -X POST "http://localhost:8000/api/documents/upload" \
  -H "Authorization: Bearer eyJ..." \
  -F "file=@comparison-matrix.xlsx"
```

Each sheet becomes a separate page for extraction:

```json
{
  "id": 102,
  "filename": "comparison-matrix.xlsx",
  "status": "uploaded",
  "page_count": 4,  // 4 sheets
  "sheets": ["Summary", "Technical Specs", "Pricing", "Notes"]
}
```

Text Representation for LLM
Each sheet produces a text representation optimized for LLM processing. Row values are joined with pipe delimiters, and rows are separated by newlines:
Original spreadsheet:

| System A | 450 km | $1.2M |
| System B | 380 km | $0.9M |

Text representation:

```
System A | 450 km | $1.2M
System B | 380 km | $0.9M
```

This format preserves tabular structure while remaining token-efficient for Claude and other LLM providers.
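A minimal sketch of producing this representation, assuming rows arrive as lists of cell values (`sheet_to_text` is a hypothetical name):

```python
def sheet_to_text(rows) -> str:
    """Join cell values with ' | ' within a row, and rows with newlines."""
    return "\n".join(" | ".join(str(cell) for cell in row) for row in rows)
```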
Integration with Extraction Workflow
Spreadsheets follow the same Human-in-the-Loop extraction workflow as PDFs:
- Upload — Document submitted, hash computed
- Extract — Cell values extracted with coordinates
- AI Processing — Claude extracts structured attributes
- Human Review — Analyst verifies extracted values
- Commit — Approved data written to database
Cell Click Navigation
During review, clicking an extraction highlights the source cell in the spreadsheet viewer. The bounding box coordinates enable precise source verification for audit compliance.
NIST Control Alignment
| Control | Implementation |
|---|---|
| AU-3 | Cell coordinates provide source location for all extracted values |
| AU-9 | File hash (SHA-256) ensures document integrity |
| AU-11 | Extraction records retained per audit policy |
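The SHA-256 file hash backing AU-9 can be computed incrementally, for example (a sketch; the service's actual hashing code may differ):

```python
import hashlib

def file_hash(path: str) -> str:
    """Compute the 'sha256:...' document hash in 64 KB chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return f"sha256:{h.hexdigest()}"
```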
API Reference
SpreadsheetExtractor Class
```python
from app.services.spreadsheet_extractor import SpreadsheetExtractor

# Initialize with file path
extractor = SpreadsheetExtractor("/path/to/file.xlsx")

# Extract all content
result = extractor.extract()

# Access sheets
for sheet in result["sheets"]:
    print(f"Sheet: {sheet['sheet_name']}")
    print(f"Dimensions: {sheet['width']} x {sheet['height']}")
    for block in sheet["text_blocks"]:
        print(f"  Cell [{block['bbox']['y0']},{block['bbox']['x0']}]: {block['text']}")

# Clean up
extractor.close()
```

Context Manager Support
```python
with SpreadsheetExtractor("/path/to/file.csv") as extractor:
    result = extractor.extract()
    # Process result...
# Automatically closed
```

Performance Considerations
| File Size | Processing Time | Memory |
|---|---|---|
| < 1 MB | < 1 second | Minimal |
| 1-10 MB | 1-5 seconds | Moderate |
| > 10 MB | 5+ seconds | Consider chunking |
Excel files use read_only=True mode to minimize memory consumption for large workbooks. Very large spreadsheets may require chunked processing.
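As one stdlib-only sketch of chunked processing for oversized CSV files (`iter_row_chunks` is hypothetical; the pipeline's actual chunking strategy is not specified here):

```python
import csv

def iter_row_chunks(path: str, chunk_size: int = 1000):
    """Yield rows in fixed-size chunks so large files never load fully into memory."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:  # emit any trailing partial chunk
            yield chunk
```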
Error Handling
| Error | Cause | Resolution |
|---|---|---|
| ValueError | Unsupported file extension | Convert to .xlsx, .xls, or .csv |
| FileNotFoundError | File deleted before processing | Re-upload document |
| Encoding errors | Non-UTF-8 CSV | Convert to UTF-8 encoding |
| Corrupted file | Invalid Excel structure | Repair or re-export source file |
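A caller might map these failures onto structured responses; `safe_extract` and its return shape are hypothetical, shown only to illustrate the table:

```python
def safe_extract(extract_fn, path: str) -> dict:
    """Call an extraction function and translate known failures into structured errors."""
    try:
        return extract_fn(path)
    except ValueError as exc:
        return {"status": "error", "reason": f"unsupported format: {exc}"}
    except FileNotFoundError:
        return {"status": "error", "reason": "file missing; re-upload the document"}
    except UnicodeDecodeError:
        return {"status": "error", "reason": "non-UTF-8 CSV; convert the encoding"}
```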
Related
- Extraction Workflow — Human-in-the-Loop process
- API Reference — Document upload endpoints
- Audit System — Traceability requirements