Spreadsheet Processing

Extract structured data from Excel and CSV files with full cell-level traceability

Supported Formats

Format	Extension	Notes
Excel 2007+	`.xlsx`	Full support, multi-sheet extraction
Excel 97-2003	`.xls`	Limited support via openpyxl compatibility layer
Comma-Separated Values	`.csv`	UTF-8 encoding, single sheet

Extraction Architecture

The SpreadsheetExtractor class processes spreadsheet files and produces a standardized output format compatible with the RayRay extraction pipeline. Each cell is extracted with its row/column coordinates encoded as a bounding box for full traceability.

Cell Coordinate System

Cell positions are encoded using a 0-indexed coordinate system where bounding boxes represent cell boundaries:

{
  "text": "450 km",
  "bbox": {
    "x0": 2,   // Column index (0-indexed)
    "y0": 5,   // Row index (0-indexed)
    "x1": 3,   // Column + 1
    "y1": 6    // Row + 1
  }
}

This coordinate system ensures every extracted value can be traced back to its exact source location, satisfying audit requirements for spatial evidence traceability.

Output Structure

{
  "page_count": 3,           // Number of sheets
  "file_hash": "sha256:abc...",
  "sheets": [
    {
      "sheet_number": 1,
      "sheet_name": "Radar Specs",
      "width": 12,            // Column count
      "height": 50,           // Row count
      "text": "Full text for LLM processing...",
      "text_blocks": [
        { "text": "System Name", "bbox": {...} },
        { "text": "Range (km)", "bbox": {...} }
      ]
    }
  ],
  "pages": [...]             // Alias for compatibility
}

Data Sanitization

All cell values are sanitized to remove null bytes and invalid UTF-8 sequences before extraction. This ensures clean data for LLM processing and prevents encoding-related failures.

# Sanitization process
1. Convert to string representation
2. Remove null bytes (\x00)
3. Strip control characters (preserve \n, \t, \r)
4. Re-encode as valid UTF-8

Processing Examples

Single CSV File

# Upload CSV via API
curl -X POST "http://localhost:8000/api/documents/upload" \
  -H "Authorization: Bearer eyJ..." \
  -F "file=@radar-specs.csv"

# Response includes sheet structure
{
  "id": 101,
  "filename": "radar-specs.csv",
  "status": "uploaded",
  "page_count": 1
}

Multi-Sheet Excel Workbook

# Upload Excel file
curl -X POST "http://localhost:8000/api/documents/upload" \
  -H "Authorization: Bearer eyJ..." \
  -F "file=@comparison-matrix.xlsx"

# Each sheet becomes a separate page for extraction
{
  "id": 102,
  "filename": "comparison-matrix.xlsx",
  "status": "uploaded",
  "page_count": 4,  // 4 sheets
  "sheets": ["Summary", "Technical Specs", "Pricing", "Notes"]
}

Text Representation for LLM

Each sheet produces a text representation optimized for LLM processing. Row values are joined with pipe delimiters, and rows are separated by newlines:

# Original spreadsheet:
| System A | 450 km | $1.2M |
| System B | 380 km | $0.9M |

# Text representation:
System A | 450 km | $1.2M
System B | 380 km | $0.9M

This format preserves tabular structure while remaining token-efficient for Claude and other LLM providers.

Integration with Extraction Workflow

Spreadsheets follow the same Human-in-the-Loop extraction workflow as PDFs:

Upload — Document submitted, hash computed
Extract — Cell values extracted with coordinates
AI Processing — Claude extracts structured attributes
Human Review — Analyst verifies extracted values
Commit — Approved data written to database

Cell Click Navigation

During review, clicking an extraction highlights the source cell in the spreadsheet viewer. The bounding box coordinates enable precise source verification for audit compliance.

NIST Control Alignment

Control	Implementation
`AU-3`	Cell coordinates provide source location for all extracted values
`AU-9`	File hash (SHA-256) ensures document integrity
`AU-11`	Extraction records retained per audit policy

API Reference

SpreadsheetExtractor Class

from app.services.spreadsheet_extractor import SpreadsheetExtractor

# Initialize with file path
extractor = SpreadsheetExtractor("/path/to/file.xlsx")

# Extract all content
result = extractor.extract()

# Access sheets
for sheet in result["sheets"]:
    print(f"Sheet: {sheet['sheet_name']}")
    print(f"Dimensions: {sheet['width']} x {sheet['height']}")
    for block in sheet["text_blocks"]:
        print(f"  Cell [{block['bbox']['y0']},{block['bbox']['x0']}]: {block['text']}")

# Clean up
extractor.close()

Context Manager Support

with SpreadsheetExtractor("/path/to/file.csv") as extractor:
    result = extractor.extract()
    # Process result...
# Automatically closed

Performance Considerations

File Size	Processing Time	Memory
< 1 MB	< 1 second	Minimal
1-10 MB	1-5 seconds	Moderate
> 10 MB	5+ seconds	Consider chunking

Excel files use read_only=True mode to minimize memory consumption for large workbooks. Very large spreadsheets may require chunked processing.

Error Handling

Error	Cause	Resolution
`ValueError`	Unsupported file extension	Convert to .xlsx, .xls, or .csv
`FileNotFoundError`	File deleted before processing	Re-upload document
Encoding errors	Non-UTF-8 CSV	Convert to UTF-8 encoding
Corrupted file	Invalid Excel structure	Repair or re-export source file

Extraction Workflow — Human-in-the-Loop process
API Reference — Document upload endpoints
Audit System — Traceability requirements