OCR Service (Port 8137)
Status: ✅ Implemented | Version: 1.0.0
Overview
The OCR service provides AI-enhanced optical character recognition for the CORTX platform, extracting structured data and text from scanned documents, PDFs, and images using a multi-tier processing pipeline.
Multi-Tier OCR Pipeline
Tier 1: Tesseract OCR (Free, Fast)
- Use Case: Modern, clear documents with printed text
- Speed: 100-300ms per page
- Accuracy: 85-95% on clean documents
- Cost: Free
- Confidence Threshold: 80%
Tier 2: Claude 3.5 Sonnet Vision (AI-Enhanced)
- Use Case: Historical documents, handwritten text, complex layouts
- Speed: 2-5 seconds per page
- Accuracy: 90-98% on complex documents
- Cost: ~$0.003 per image
- Confidence Threshold: 85%
- Requires: ANTHROPIC_API_KEY environment variable
Tier 3: Human Review (Manual, 100% Accuracy)
- Use Case: Critical documents requiring perfect accuracy
- Process: Documents below AI threshold flagged for human review
- Result: Corrections stored and used to improve future processing
Auto-Escalation Logic
Document → Cache Check → Tesseract
↓ (if confidence < 80%)
Claude Vision
↓ (if confidence < 85%)
Awaiting Human Review
↓
Completed
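The same flow as a minimal Python sketch. The thresholds mirror the defaults documented under Configuration, and the per-tier functions are passed in as callables because the service's internal function names are not part of this document.

```python
from typing import Callable, Tuple

TESSERACT_THRESHOLD = 80.0  # default OCR_TESSERACT_THRESHOLD
AI_THRESHOLD = 85.0         # default OCR_AI_THRESHOLD

def escalate(
    image_bytes: bytes,
    run_tesseract: Callable[[bytes], Tuple[str, float]],
    run_claude_vision: Callable[[bytes], Tuple[str, float]],
) -> dict:
    """Route a document through the tiers until a confidence threshold is met."""
    text, conf = run_tesseract(image_bytes)
    if conf >= TESSERACT_THRESHOLD:
        return {"tier": "tesseract", "text": text, "confidence": conf}

    text, conf = run_claude_vision(image_bytes)
    if conf >= AI_THRESHOLD:
        return {"tier": "claude_vision", "text": text, "confidence": conf}

    # Below both thresholds: park the job for a human reviewer
    return {"tier": "human_review", "text": text, "confidence": conf,
            "status": "awaiting_review"}
```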
Key Features
Smart Caching
- SHA-256 hashing prevents reprocessing identical documents
- Hit count tracking identifies frequently processed documents
- Automatic cache invalidation based on access patterns
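A minimal sketch of the dedup key and hit tracking, using an in-memory dict in place of the ocr_cache table:

```python
import hashlib

def document_hash(file_bytes: bytes) -> str:
    """SHA-256 of the raw upload; identical documents always map to the same key."""
    return hashlib.sha256(file_bytes).hexdigest()

def get_or_process(file_bytes: bytes, cache: dict, process) -> dict:
    key = document_hash(file_bytes)
    if key in cache:
        cache[key]["hit_count"] += 1  # hit counting flags frequently reprocessed documents
        return cache[key]
    result = process(file_bytes)
    result["hit_count"] = 0
    cache[key] = result
    return result
```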
Document Preprocessing
- Adaptive thresholding for low-quality scans
- Denoising with OpenCV
- Automatic deskewing
- Multi-page PDF support
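A rough OpenCV version of these steps. Parameters and the deskew handling are illustrative only (OpenCV's minAreaRect angle convention also changed in 4.5), and the service's actual pipeline may differ.

```python
import cv2
import numpy as np

def preprocess(image_bytes: bytes) -> np.ndarray:
    """Grayscale -> denoise -> adaptive threshold -> deskew, roughly as listed above."""
    img = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_GRAYSCALE)
    img = cv2.fastNlMeansDenoising(img, h=10)
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 31, 11)

    # Estimate skew from the text pixels and rotate the page upright
    # (assumes the pre-4.5 minAreaRect angle range of [-90, 0))
    coords = np.column_stack(np.where(img == 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = img.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
```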
Job Tracking
- Asynchronous background processing
- Real-time status updates
- Comprehensive audit trail
- Error logging and diagnostics
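A stripped-down sketch of the async job pattern using FastAPI's BackgroundTasks, with an in-memory dict standing in for the ocr_jobs table that the real service persists to PostgreSQL:

```python
from uuid import uuid4
from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()
JOBS: dict = {}  # stand-in for the ocr_jobs table

def run_ocr_job(job_id: str, data: bytes) -> None:
    JOBS[job_id]["status"] = "processing"
    # ... run the tiered pipeline, store text/fields/confidence, log errors ...
    JOBS[job_id]["status"] = "completed"

@app.post("/extract")
async def extract(file: UploadFile, background_tasks: BackgroundTasks):
    job_id = str(uuid4())
    JOBS[job_id] = {"status": "pending"}
    background_tasks.add_task(run_ocr_job, job_id, await file.read())
    return {"job_id": job_id, "status": "pending"}  # caller polls GET /jobs/{job_id}
```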
Structured Data Extraction
- Raw text extraction
- Field mapping to JSON schema
- Confidence scores per field
- Multi-language support (Tesseract)
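For the Tesseract tier, word-level confidences are available directly from the engine. A sketch of turning them into a text-plus-confidence pair (the service's own extraction and field-mapping code may differ):

```python
import pytesseract
from PIL import Image
from pytesseract import Output

def extract_with_confidence(path: str, lang: str = "eng") -> dict:
    """Raw text plus mean word-level confidence from Tesseract."""
    data = pytesseract.image_to_data(Image.open(path), lang=lang, output_type=Output.DICT)
    pairs = [(w, float(c)) for w, c in zip(data["text"], data["conf"])
             if w.strip() and float(c) >= 0]  # conf == -1 marks non-word boxes
    if not pairs:
        return {"raw_text": "", "confidence": 0.0}
    words, confs = zip(*pairs)
    return {"raw_text": " ".join(words), "confidence": sum(confs) / len(confs)}
```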
API Endpoints
Health & Status
GET / - Service information
GET /health - Detailed health check (Tesseract, Claude, DB)
GET /healthz - Kubernetes liveness probe
GET /readyz - Kubernetes readiness probe
GET /stats - Service statistics (jobs, cache hits, tier usage)
OCR Processing
POST /extract - Submit document for OCR
- Multipart file upload
- Optional tier selection
- Confidence threshold override
GET /jobs/{job_id} - Get job status and results
PUT /jobs/{job_id}/review - Submit human review corrections
Caching
GET /cache/{document_hash} - Check for cached results
Database Schema
OCR Jobs (ocr_jobs)
- Job tracking with status progression
- Multi-tenant isolation
- Processing metadata (time, tier used, confidence)
- Extracted text and structured fields
- Error logging
OCR Reviews (ocr_reviews)
- Human review corrections
- Reviewer tracking
- Review time and confidence
- Training data for ML improvement
OCR Cache (ocr_cache)
- Document hash-based lookup
- Tier and confidence tracking
- Hit count analytics
- Last accessed timestamp
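A hypothetical SQLAlchemy sketch of the ocr_jobs table based on the bullets above; the authoritative column names and types live in the service's models, not here.

```python
from sqlalchemy import Column, DateTime, Float, Integer, String, Text, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class OCRJob(Base):
    __tablename__ = "ocr_jobs"

    id = Column(String, primary_key=True)
    tenant_id = Column(String, nullable=False, index=True)  # multi-tenant isolation
    status = Column(String, nullable=False)                  # pending / processing / awaiting_review / completed
    tier_used = Column(String)                               # tesseract | claude_vision | human_review
    confidence = Column(Float)
    processing_time_ms = Column(Integer)
    extracted_text = Column(Text)
    extracted_fields = Column(Text)                          # JSON-serialized structured fields
    error = Column(Text)
    created_at = Column(DateTime, server_default=func.now())
```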
Configuration
Environment Variables
# Database (required)
POSTGRES_URL=postgresql://cortx:cortx_dev_password@localhost:5432/cortx
# Anthropic API (optional, enables Claude Vision tier)
ANTHROPIC_API_KEY=sk-ant-xxx
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
# Confidence Thresholds
OCR_TESSERACT_THRESHOLD=80.0
OCR_AI_THRESHOLD=85.0
# Service
PORT=8137
LOG_LEVEL=INFO
System Dependencies
# macOS
brew install tesseract poppler
# Ubuntu/Debian
apt-get install tesseract-ocr poppler-utils
# Python packages (see requirements.txt)
pip install -r requirements.txt
Usage Examples
Submit Document for OCR
curl -X POST http://localhost:8137/extract \
-H "X-Tenant-ID: tenant-123" \
-F "file=@invoice.pdf" \
-F "tier=tesseract" \
-F "confidence_threshold=75.0"
Check Job Status
curl http://localhost:8137/jobs/{job_id} \
-H "X-Tenant-ID: tenant-123"
Submit Human Review
curl -X PUT http://localhost:8137/jobs/{job_id}/review \
-H "X-Tenant-ID: tenant-123" \
-H "Content-Type: application/json" \
-d '{
"corrected_text": "Corrected invoice text...",
"corrected_fields": {"invoice_number": "INV-2024-001"},
"reviewer_id": "user-456"
}'
Check Cache
curl http://localhost:8137/cache/{sha256_hash} \
-H "X-Tenant-ID: tenant-123"
Integration
Gateway Routes
The Gateway service proxies OCR requests:
POST /api/v1/ocr/extract → http://ocr:8137/extract
GET /api/v1/ocr/jobs/:id → http://ocr:8137/jobs/:id
RAG Integration
OCR-extracted text can be ingested into the RAG service for document search and retrieval.
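A hypothetical hand-off, purely illustrative since the RAG service's ingest endpoint and payload shape are not documented here:

```python
import requests

def push_to_rag(job: dict, rag_url: str = "http://rag:8000") -> None:
    """Send OCR output to RAG for indexing; endpoint and payload are assumptions."""
    requests.post(
        f"{rag_url}/ingest",  # hypothetical endpoint
        headers={"X-Tenant-ID": job["tenant_id"]},
        json={"document_id": job["job_id"], "text": job["extracted_text"]},
        timeout=30,
    ).raise_for_status()
```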
Compliance Integration
OCR job metadata is logged to the Ledger service for audit compliance.
Performance
Throughput
- Tesseract: ~10-20 pages/second (parallel processing)
- Claude Vision: ~15-30 pages/minute (API rate limits)
- Cache Hit: <10ms response time
Scalability
- Horizontal scaling via multiple service instances
- Database connection pooling (10 connections, 20 overflow)
- Background async processing with task queues
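The pool figures above, expressed as a SQLAlchemy engine configuration (a sketch; the service's actual engine setup may add further options):

```python
import os
from sqlalchemy import create_engine

engine = create_engine(
    os.environ["POSTGRES_URL"],
    pool_size=10,        # persistent connections
    max_overflow=20,     # extra connections allowed under burst load
    pool_pre_ping=True,  # recycle dead connections before use
)
```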
Architecture Diagram
flowchart TB
Gateway[Gateway Service] -->|POST /api/v1/ocr/extract| OCR[OCR Service]
OCR -->|1. Check Cache| Cache[(OCR Cache)]
Cache -->|Cache Miss| Processor[OCR Processor]
Processor -->|2. Tesseract| Tesseract[Tesseract OCR]
Processor -->|3. If Low Confidence| Claude[Claude 3.5 Sonnet Vision]
Processor -->|4. If Still Low| Review[Human Review Queue]
OCR -->|Store Results| DB[(PostgreSQL)]
OCR -->|Log Events| Ledger[Ledger Service]
OCR -->|Index Text| RAG[RAG Service]
Monitoring & Logging
Health Checks
- Database connectivity
- Tesseract availability
- Anthropic API status
Metrics
- Jobs processed by tier
- Average confidence scores
- Cache hit rate
- Processing time per tier
- Error rates
Logging
- Structured JSON logs
- Request correlation IDs
- Error stack traces
- Performance metrics
Security
Multi-Tenancy
- Tenant-scoped job queries
- Row-level security on database
- Tenant ID validation on all endpoints
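Tenant validation sketched as a FastAPI dependency; the real implementation may differ, but the idea is that every endpoint rejects requests without a valid X-Tenant-ID and scopes queries to it.

```python
from typing import Optional
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

def require_tenant(x_tenant_id: Optional[str] = Header(default=None)) -> str:
    if not x_tenant_id:
        raise HTTPException(status_code=400, detail="X-Tenant-ID header is required")
    return x_tenant_id

@app.get("/jobs/{job_id}")
async def get_job(job_id: str, tenant_id: str = Depends(require_tenant)):
    # Queries are scoped to tenant_id so one tenant can never read another's jobs
    ...
```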
Data Protection
- Document hashing for deduplication
- PII/PHI redaction support
- Secure file handling (temp file cleanup)
Error Handling
Automatic Retry
- Network failures: 3 retries with exponential backoff
- API rate limits: Queue-based retry
- Tesseract errors: Auto-escalate to AI tier
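A generic retry helper illustrating the three-attempt exponential backoff; the service may use a library such as tenacity instead.

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn, retrying transient network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```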
Degraded Mode
- Tesseract unavailable: Skip to Claude Vision
- Claude API unavailable: Flag for human review
- Database down: Return 503 with retry-after header
Development
Local Setup
cd services/ocr
pip install -r requirements.txt
uvicorn app.main:app --host 0.0.0.0 --port 8137 --reload
Docker
docker build -f services/ocr/Dockerfile -t cortx-ocr:1.0.0 .
docker run -p 8137:8137 \
-e POSTGRES_URL="postgresql://..." \
-e ANTHROPIC_API_KEY="sk-ant-xxx" \
cortx-ocr:1.0.0
Documentation
- OpenAPI Spec: openapi.yaml
- Implementation Details: /services/ocr/IMPLEMENTATION_SUMMARY.md
- Quick Start: /services/ocr/QUICKSTART.md
- Source Code: /services/ocr/app/
Support
For issues or questions:
- GitHub Issues: sinergysolutionsllc/sinergysolutionsllc
- Internal Documentation: /docs/services/ocr/