Spreadsheet Processing

Extract structured data from Excel and CSV files with full cell-level traceability

Supported Formats

FormatExtensionNotes
Excel 2007+.xlsxFull support, multi-sheet extraction
Excel 97-2003.xlsLimited support via openpyxl compatibility layer
Comma-Separated Values.csvUTF-8 encoding, single sheet

Extraction Architecture

The SpreadsheetExtractor class processes spreadsheet files and produces a standardized output format compatible with the RayRay extraction pipeline. Each cell is extracted with its row/column coordinates encoded as a bounding box for full traceability.

Cell Coordinate System

Cell positions are encoded using a 0-indexed coordinate system where bounding boxes represent cell boundaries:

{
  "text": "450 km",
  "bbox": {
    "x0": 2,   // Column index (0-indexed)
    "y0": 5,   // Row index (0-indexed)
    "x1": 3,   // Column + 1
    "y1": 6    // Row + 1
  }
}

This coordinate system ensures every extracted value can be traced back to its exact source location, satisfying audit requirements for spatial evidence traceability.

Output Structure

{
  "page_count": 3,           // Number of sheets
  "file_hash": "sha256:abc...",
  "sheets": [
    {
      "sheet_number": 1,
      "sheet_name": "Radar Specs",
      "width": 12,            // Column count
      "height": 50,           // Row count
      "text": "Full text for LLM processing...",
      "text_blocks": [
        { "text": "System Name", "bbox": {...} },
        { "text": "Range (km)", "bbox": {...} }
      ]
    }
  ],
  "pages": [...]             // Alias for compatibility
}

Data Sanitization

All cell values are sanitized to remove null bytes and invalid UTF-8 sequences before extraction. This ensures clean data for LLM processing and prevents encoding-related failures.

# Sanitization process
1. Convert to string representation
2. Remove null bytes (\x00)
3. Strip control characters (preserve \n, \t, \r)
4. Re-encode as valid UTF-8

Processing Examples

Single CSV File

# Upload CSV via API
curl -X POST "http://localhost:8000/api/documents/upload" \
  -H "Authorization: Bearer eyJ..." \
  -F "file=@radar-specs.csv"

# Response includes sheet structure
{
  "id": 101,
  "filename": "radar-specs.csv",
  "status": "uploaded",
  "page_count": 1
}

Multi-Sheet Excel Workbook

# Upload Excel file
curl -X POST "http://localhost:8000/api/documents/upload" \
  -H "Authorization: Bearer eyJ..." \
  -F "file=@comparison-matrix.xlsx"

# Each sheet becomes a separate page for extraction
{
  "id": 102,
  "filename": "comparison-matrix.xlsx",
  "status": "uploaded",
  "page_count": 4,  // 4 sheets
  "sheets": ["Summary", "Technical Specs", "Pricing", "Notes"]
}

Text Representation for LLM

Each sheet produces a text representation optimized for LLM processing. Row values are joined with pipe delimiters, and rows are separated by newlines:

# Original spreadsheet:
| System A | 450 km | $1.2M |
| System B | 380 km | $0.9M |

# Text representation:
System A | 450 km | $1.2M
System B | 380 km | $0.9M

This format preserves tabular structure while remaining token-efficient for Claude and other LLM providers.

Integration with Extraction Workflow

Spreadsheets follow the same Human-in-the-Loop extraction workflow as PDFs:

  1. Upload — Document submitted, hash computed
  2. Extract — Cell values extracted with coordinates
  3. AI Processing — Claude extracts structured attributes
  4. Human Review — Analyst verifies extracted values
  5. Commit — Approved data written to database

Cell Click Navigation

During review, clicking an extraction highlights the source cell in the spreadsheet viewer. The bounding box coordinates enable precise source verification for audit compliance.

NIST Control Alignment

ControlImplementation
AU-3Cell coordinates provide source location for all extracted values
AU-9File hash (SHA-256) ensures document integrity
AU-11Extraction records retained per audit policy

API Reference

SpreadsheetExtractor Class

from app.services.spreadsheet_extractor import SpreadsheetExtractor

# Initialize with file path
extractor = SpreadsheetExtractor("/path/to/file.xlsx")

# Extract all content
result = extractor.extract()

# Access sheets
for sheet in result["sheets"]:
    print(f"Sheet: {sheet['sheet_name']}")
    print(f"Dimensions: {sheet['width']} x {sheet['height']}")
    for block in sheet["text_blocks"]:
        print(f"  Cell [{block['bbox']['y0']},{block['bbox']['x0']}]: {block['text']}")

# Clean up
extractor.close()

Context Manager Support

with SpreadsheetExtractor("/path/to/file.csv") as extractor:
    result = extractor.extract()
    # Process result...
# Automatically closed

Performance Considerations

File SizeProcessing TimeMemory
< 1 MB< 1 secondMinimal
1-10 MB1-5 secondsModerate
> 10 MB5+ secondsConsider chunking

Excel files use read_only=True mode to minimize memory consumption for large workbooks. Very large spreadsheets may require chunked processing.

Error Handling

ErrorCauseResolution
ValueErrorUnsupported file extensionConvert to .xlsx, .xls, or .csv
FileNotFoundErrorFile deleted before processingRe-upload document
Encoding errorsNon-UTF-8 CSVConvert to UTF-8 encoding
Corrupted fileInvalid Excel structureRepair or re-export source file

Related