text_processor/ARCHITECTURE.md

# Architecture Documentation

## Hexagonal Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                         INCOMING ADAPTERS                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  FastAPI Routes (HTTP)                                       │   │
│  │  - ProcessDocumentRequest → API Schemas                      │   │
│  │  - ExtractAndChunkRequest → API Schemas                      │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         CORE DOMAIN                                 │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  PORTS (Interfaces)                                          │   │
│  │  ┌────────────────────┐    ┌───────────────────────────┐    │   │
│  │  │  Incoming Ports    │    │  Outgoing Ports           │    │   │
│  │  │  - ITextProcessor  │    │  - IExtractor             │    │   │
│  │  │                    │    │  - IChunker               │    │   │
│  │  │                    │    │  - IDocumentRepository    │    │   │
│  │  └────────────────────┘    └───────────────────────────┘    │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  SERVICES (Business Logic)                                   │   │
│  │  - DocumentProcessorService                                  │   │
│  │    • Orchestrates Extract → Clean → Chunk → Save            │   │
│  │    • Depends ONLY on Port interfaces                         │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN MODELS (Rich Entities)                               │   │
│  │  - Document (with validation & business methods)             │   │
│  │  - Chunk (immutable value object)                            │   │
│  │  - ChunkingStrategy (configuration)                          │   │
│  │  - DocumentMetadata                                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN LOGIC (Pure Functions)                               │   │
│  │  - normalize_whitespace()                                    │   │
│  │  - clean_text()                                              │   │
│  │  - split_into_paragraphs()                                   │   │
│  │  - find_sentence_boundary_before()                           │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXCEPTIONS (Domain Errors)                                  │   │
│  │  - ExtractionError, ChunkingError, ProcessingError          │   │
│  │  - ValidationError, RepositoryError                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         OUTGOING ADAPTERS                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXTRACTORS (Implements IExtractor)                          │   │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐             │   │
│  │  │ PDFExtractor│  │DocxExtractor│ │TxtExtractor│             │   │
│  │  │  (PyPDF2)   │  │(python-docx)│ │ (built-in) │             │   │
│  │  └────────────┘  └────────────┘  └────────────┘             │   │
│  │  - Managed by ExtractorFactory (Factory Pattern)            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  CHUNKERS (Implements IChunker)                              │   │
│  │  ┌─────────────────┐  ┌──────────────────┐                  │   │
│  │  │ FixedSizeChunker│  │ParagraphChunker  │                  │   │
│  │  │  - Fixed chunks │  │ - Respect        │                  │   │
│  │  │  - With overlap │  │   paragraphs     │                  │   │
│  │  └─────────────────┘  └──────────────────┘                  │   │
│  │  - Managed by ChunkingContext (Strategy Pattern)            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  REPOSITORY (Implements IDocumentRepository)                 │   │
│  │  ┌──────────────────────────────────┐                        │   │
│  │  │  InMemoryDocumentRepository      │                        │   │
│  │  │  - Thread-safe Dict storage      │                        │   │
│  │  │  - Easy to swap for PostgreSQL   │                        │   │
│  │  └──────────────────────────────────┘                        │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                         BOOTSTRAP (Wiring)                          │
│  ApplicationContainer:                                              │
│    - Creates all adapters                                           │
│    - Injects dependencies into core                                 │
│    - ONLY place where adapters are instantiated                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Data Flow: Process Document

```
1. HTTP Request
   │
   ▼
2. FastAPI Route (Incoming Adapter)
   │ - Validates request schema
   ▼
3. DocumentProcessorService (Core)
   │ - Calls ExtractorFactory
   ▼
4. PDFExtractor (Outgoing Adapter)
   │ - Extracts text using PyPDF2
   │ - Maps PyPDF2 exceptions → Domain exceptions
   ▼
5. DocumentProcessorService
   │ - Cleans text using domain logic utils
   │ - Validates Document
   ▼
6. InMemoryRepository (Outgoing Adapter)
   │ - Saves Document
   ▼
7. DocumentProcessorService
   │ - Returns Document
   ▼
8. FastAPI Route
   │ - Converts Document → DocumentResponse
   ▼
9. HTTP Response
```
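
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the `extract` method name, the `Document` fields, and the regex-based `clean_text` are stand-ins, and the factory lookup from step 3 is collapsed into a single injected extractor for brevity.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class Document:
    # Simplified stand-in for the rich domain entity
    id: str
    text: str
    is_processed: bool = False


class IExtractor(Protocol):
    def extract(self, file_path: Path) -> str: ...  # method name is an assumption


class IDocumentRepository(Protocol):
    def save(self, document: Document) -> Document: ...


def clean_text(text: str) -> str:
    """Collapse whitespace runs; stand-in for the domain logic utils."""
    return re.sub(r"\s+", " ", text).strip()


class DocumentProcessorService:
    """Depends only on the port protocols, never on concrete adapters."""

    def __init__(self, extractor: IExtractor, repository: IDocumentRepository):
        self._extractor = extractor
        self._repository = repository

    def process_document(self, file_path: Path) -> Document:
        raw = self._extractor.extract(file_path)   # step 4: outgoing adapter
        text = clean_text(raw)                     # step 5: clean
        if not text:                               # step 5: validate
            raise ValueError("document has no content")
        document = Document(id=file_path.stem, text=text, is_processed=True)
        return self._repository.save(document)     # steps 6-7: save and return
```

Because the service only sees the two protocols, a test can pass in plain in-memory fakes without touching PyPDF2 or FastAPI.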

## Data Flow: Extract and Chunk

```
1. HTTP Request
   │
   ▼
2. FastAPI Route
   │ - Validates request
   ▼
3. DocumentProcessorService
   │ - Gets extractor from factory
   │ - Extracts text
   ▼
4. Extractor (PDF/DOCX/TXT)
   │ - Returns Document
   ▼
5. DocumentProcessorService
   │ - Cleans text
   │ - Calls ChunkingContext
   ▼
6. ChunkingContext (Strategy Pattern)
   │ - Selects appropriate chunker
   ▼
7. Chunker (FixedSize/Paragraph)
   │ - Splits text into segments
   │ - Creates Chunk entities
   ▼
8. DocumentProcessorService
   │ - Returns List[Chunk]
   ▼
9. FastAPI Route
   │ - Converts Chunks → ChunkResponse[]
   ▼
10. HTTP Response
```
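
Step 7 is where the text is actually split. As a rough sketch of what a fixed-size chunker with overlap might compute, the function below returns `(text, start, end)` tuples. The function name, the tuple layout, and the parameter names are assumptions chosen for illustration.

```python
def split_fixed_size(text: str, chunk_size: int, overlap: int = 0) -> list[tuple[str, int, int]]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each iteration
    segments = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        segments.append((text[start:end], start, end))
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return segments
```

For example, `split_fixed_size("abcdefghij", 4, 1)` yields three segments whose boundaries overlap by one character.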

## Dependency Rules

### ✅ ALLOWED Dependencies

```
Incoming Adapters → Core Ports (Incoming)
Outgoing Adapters → Core Ports (Outgoing), Domain Models, Exceptions
Core Services → Core Ports (Outgoing)
Core → Core (Domain Models, Logic Utils, Exceptions)
Bootstrap → Everything (Wiring only)
```

### ❌ FORBIDDEN Dependencies

```
Core → Adapters (NEVER!)
Core → External Libraries (allowed only in Adapters)
Domain Models → Services
Domain Models → Ports
```
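
Rules like these are easier to keep when a test enforces them. One possible approach, sketched below, is to parse each core module with the standard `ast` module and flag forbidden imports. The `adapters` package name is a hypothetical placeholder for wherever the adapters live; `fastapi`, `PyPDF2`, and `docx` are the import names of the libraries mentioned in this document.

```python
import ast

# Module prefixes that core code should never import ("adapters" is a placeholder name)
FORBIDDEN_PREFIXES = ("adapters", "fastapi", "PyPDF2", "docx")


def forbidden_imports(source: str, forbidden=FORBIDDEN_PREFIXES) -> list[str]:
    """Return the imported module names in `source` that match a forbidden prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Match the package itself or any of its submodules
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                found.append(name)
    return found
```

A test could read every `.py` file under the core package and assert that `forbidden_imports(...)` returns an empty list for each.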

## Key Design Patterns

### 1. Hexagonal Architecture (Ports & Adapters)
- **Purpose**: Isolate core business logic from external concerns
- **Implementation**:
  - Ports: Interface definitions (ITextProcessor, IExtractor, etc.)
  - Adapters: Concrete implementations (PDFExtractor, FastAPI routes)

### 2. Factory Pattern
- **Class**: `ExtractorFactory`
- **Purpose**: Create appropriate extractor based on file extension
- **Benefit**: Centralized extractor management, easy to add new types
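
A minimal sketch of how such a factory could work is shown below. Only `register_extractor` and `supported_extensions` appear elsewhere in this document; `get_extractor`, the internal dictionary, and the simplified `TxtExtractor` are assumptions made for illustration.

```python
from pathlib import Path


class UnsupportedFileTypeError(Exception):
    """Stand-in for the domain exception of the same name."""


class TxtExtractor:
    # Simplified adapter; the real one presumably derives from BaseExtractor
    supported_extensions = ("txt",)

    def extract(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")


class ExtractorFactory:
    def __init__(self) -> None:
        self._by_extension = {}

    def register_extractor(self, extractor) -> None:
        # One extractor may serve several extensions (e.g. html and htm)
        for ext in extractor.supported_extensions:
            self._by_extension[ext.lower()] = extractor

    def get_extractor(self, file_path: Path):
        # Hypothetical lookup method: pick the extractor by file extension
        ext = file_path.suffix.lstrip(".").lower()
        try:
            return self._by_extension[ext]
        except KeyError:
            raise UnsupportedFileTypeError(f"No extractor registered for '.{ext}'") from None
```

Adding a new file type then means registering one more extractor; the lookup code stays unchanged.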

### 3. Strategy Pattern
- **Class**: `ChunkingContext`
- **Purpose**: Switch between chunking strategies at runtime
- **Strategies**: FixedSizeChunker, ParagraphChunker
- **Benefit**: Easy to add new chunking algorithms
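
The sketch below shows one way a context object could select a strategy at runtime. `register_chunker` and `strategy_name` appear elsewhere in this document; the `chunk` method signatures and the simplified chunker bodies are assumptions, and the chunkers return plain strings rather than `Chunk` entities to keep the example short.

```python
class ChunkingError(Exception):
    """Stand-in for the domain exception of the same name."""


class FixedSizeChunker:
    strategy_name = "fixed_size"

    def chunk(self, text: str, chunk_size: int) -> list[str]:
        # Non-overlapping windows of chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


class ParagraphChunker:
    strategy_name = "paragraph"

    def chunk(self, text: str, chunk_size: int) -> list[str]:
        # chunk_size is ignored in this sketch; paragraphs are split on blank lines
        return [p.strip() for p in text.split("\n\n") if p.strip()]


class ChunkingContext:
    def __init__(self) -> None:
        self._chunkers = {}

    def register_chunker(self, chunker) -> None:
        self._chunkers[chunker.strategy_name] = chunker

    def chunk(self, text: str, strategy_name: str, chunk_size: int = 1000) -> list[str]:
        # Delegate to whichever strategy the caller names
        try:
            chunker = self._chunkers[strategy_name]
        except KeyError:
            raise ChunkingError(f"Unknown chunking strategy: {strategy_name!r}") from None
        return chunker.chunk(text, chunk_size)
```

The caller chooses a strategy by name, so the service never needs an `if`/`elif` chain over chunker types.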

### 4. Repository Pattern
- **Interface**: `IDocumentRepository`
- **Implementation**: `InMemoryDocumentRepository`
- **Purpose**: Abstract data persistence
- **Benefit**: Easy to swap storage (memory → PostgreSQL → MongoDB)
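
As an illustration, a thread-safe in-memory repository might look roughly like this. Only `save` is named in this document; `get`, `count`, and the two-field `Document` are assumptions added so the example is self-contained.

```python
import threading
from dataclasses import dataclass


class DocumentNotFoundError(Exception):
    """Stand-in for the domain exception of the same name."""


@dataclass
class Document:
    # Simplified stand-in for the domain entity
    id: str
    text: str


class InMemoryDocumentRepository:
    def __init__(self) -> None:
        self._documents: dict[str, Document] = {}
        self._lock = threading.Lock()  # guards every access to the dict

    def save(self, document: Document) -> Document:
        with self._lock:
            self._documents[document.id] = document
        return document

    def get(self, document_id: str) -> Document:
        with self._lock:
            try:
                return self._documents[document_id]
            except KeyError:
                raise DocumentNotFoundError(document_id) from None

    def count(self) -> int:
        with self._lock:
            return len(self._documents)
```

A PostgreSQL- or MongoDB-backed class exposing the same methods could replace it without changes to the service.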

### 5. Dependency Injection
- **Class**: `ApplicationContainer`
- **Purpose**: Wire all dependencies at startup
- **Benefit**: Loose coupling, easy testing
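
A stripped-down sketch of the wiring idea follows. The adapter classes are empty placeholders, and every attribute name other than `_create_repository` and the three constructor keywords is an assumption.

```python
class InMemoryDocumentRepository:
    """Placeholder for the real repository adapter."""


class ExtractorFactory:
    """Placeholder for the real extractor registry."""


class ChunkingContext:
    """Placeholder for the real chunking strategy selector."""


class DocumentProcessorService:
    def __init__(self, extractor_factory, chunking_context, repository):
        # The service receives its collaborators; it never constructs them
        self.extractor_factory = extractor_factory
        self.chunking_context = chunking_context
        self.repository = repository


class ApplicationContainer:
    """The single place where concrete adapters are instantiated."""

    def __init__(self) -> None:
        self.repository = self._create_repository()
        self.extractor_factory = ExtractorFactory()
        self.chunking_context = ChunkingContext()
        self.service = DocumentProcessorService(
            extractor_factory=self.extractor_factory,
            chunking_context=self.chunking_context,
            repository=self.repository,
        )

    def _create_repository(self):
        # Override or edit this method to swap the storage backend
        return InMemoryDocumentRepository()
```

Because construction is concentrated here, a test or an alternative deployment can subclass the container and override a single factory method.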

### 6. Template Method Pattern
- **Classes**: `BaseExtractor`, `BaseChunker`
- **Purpose**: Define algorithm skeleton, let subclasses fill in details
- **Benefit**: Code reuse, consistent behavior
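
To make the pattern concrete, here is a hedged sketch of what a base extractor could look like. `supported_extensions` and `_extract_text` match names used elsewhere in this document; the public `extract` method and the exact checks it performs are assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ExtractionError(Exception):
    """Stand-in for the domain exception of the same name."""


class UnsupportedFileTypeError(ExtractionError):
    pass


class EmptyContentError(ExtractionError):
    pass


class BaseExtractor(ABC):
    def __init__(self, supported_extensions):
        self.supported_extensions = [ext.lower() for ext in supported_extensions]

    def extract(self, file_path: Path) -> str:
        """Template method: a fixed skeleton around one subclass-specific step."""
        ext = file_path.suffix.lstrip(".").lower()
        if ext not in self.supported_extensions:
            raise UnsupportedFileTypeError(f"Unsupported extension: '.{ext}'")
        try:
            text = self._extract_text(file_path)  # the step subclasses fill in
        except OSError as e:
            raise ExtractionError(f"Could not read {file_path.name}") from e
        if not text.strip():
            raise EmptyContentError(f"No text found in {file_path.name}")
        return text

    @abstractmethod
    def _extract_text(self, file_path: Path) -> str:
        """Return the raw text of the file."""


class TxtExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=["txt"])

    def _extract_text(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")
```

Every subclass inherits the same extension check, error wrapping, and empty-content check, and only supplies the format-specific read.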

## SOLID Principles Application

### Single Responsibility Principle (SRP)
- Each extractor handles ONE file type
- Each chunker handles ONE strategy
- Each service method does ONE thing
- Functions are at most 15-20 lines long

### Open/Closed Principle (OCP)
- Add new extractors without modifying core
- Add new chunkers without modifying service
- Extend via interfaces, not modification

### Liskov Substitution Principle (LSP)
- All IExtractor implementations are interchangeable
- All IChunker implementations are interchangeable
- Polymorphism works correctly

### Interface Segregation Principle (ISP)
- Small, focused interfaces
- IExtractor: Only extraction concerns
- IChunker: Only chunking concerns
- No fat interfaces

### Dependency Inversion Principle (DIP)
- Core depends on IExtractor (abstraction)
- Core does NOT depend on PDFExtractor (concrete)
- High-level modules don't depend on low-level modules

## Error Handling Strategy

### Domain Exceptions
All external errors are caught and wrapped in domain exceptions:

```python
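import PyPDF2

try:
    reader = PyPDF2.PdfReader(file)  # External library
except PyPDF2.errors.PdfReadError as e:
    # Wrap the library error in a domain exception and keep the original cause
    raise ExtractionError(  # Domain exception
        message="Invalid PDF",
        details=str(e),
    ) from e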
```

### Exception Hierarchy
```
DomainException (Base)
├── ExtractionError
│   ├── UnsupportedFileTypeError
│   └── EmptyContentError
├── ChunkingError
├── ProcessingError
├── ValidationError
└── RepositoryError
    └── DocumentNotFoundError
```

### HTTP Error Mapping
FastAPI adapter maps domain exceptions to HTTP status codes:
- `UnsupportedFileTypeError` → 400 Bad Request
- `ExtractionError` → 422 Unprocessable Entity
- `DocumentNotFoundError` → 404 Not Found
- `ProcessingError` → 500 Internal Server Error
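
Since `UnsupportedFileTypeError` is a subclass of `ExtractionError`, the mapping has to prefer the most specific class. One way to get that behavior, sketched below in framework-free Python, is to walk the exception's method resolution order. The `status_code_for` helper and the fallback to 500 for unmapped errors are assumptions, not something this document specifies.

```python
class DomainException(Exception):
    pass


class ExtractionError(DomainException):
    pass


class UnsupportedFileTypeError(ExtractionError):
    pass


class EmptyContentError(ExtractionError):
    pass


class ProcessingError(DomainException):
    pass


class RepositoryError(DomainException):
    pass


class DocumentNotFoundError(RepositoryError):
    pass


STATUS_BY_EXCEPTION = {
    UnsupportedFileTypeError: 400,
    ExtractionError: 422,
    DocumentNotFoundError: 404,
    ProcessingError: 500,
}


def status_code_for(exc: Exception) -> int:
    """Return the status of the closest mapped ancestor class (assumed default: 500)."""
    for cls in type(exc).__mro__:  # most specific class first
        if cls in STATUS_BY_EXCEPTION:
            return STATUS_BY_EXCEPTION[cls]
    return 500
```

With this lookup, `EmptyContentError` inherits the 422 of its parent while `UnsupportedFileTypeError` keeps its own 400.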

## Testing Strategy

### Unit Tests (Core)
- Test domain models in isolation
- Test logic utils (pure functions)
- Test services with mock ports

### Integration Tests (Adapters)
- Test extractors with real files
- Test chunkers with real text
- Test repository operations

### API Tests (End-to-End)
- Test FastAPI routes
- Test complete workflows
- Test error scenarios

### Example Test Structure
```python
def test_document_processor_service():
    # Arrange: create mocks and inject them into the service
    mock_repository = MockRepository()
    mock_factory = MockExtractorFactory()
    mock_context = MockChunkingContext()
    service = DocumentProcessorService(
        extractor_factory=mock_factory,
        chunking_context=mock_context,
        repository=mock_repository,
    )

    # Act: exercise the behavior under test
    result = service.process_document(...)

    # Assert: verify the outcome
    assert result.is_processed
```

## Extensibility Examples

### Adding a New Extractor (HTML)
1. Create `html_extractor.py`:
```python
from pathlib import Path

from bs4 import BeautifulSoup  # third-party: beautifulsoup4


class HTMLExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['html', 'htm'])

    def _extract_text(self, file_path: Path) -> str:
        # Assumes UTF-8 input; 'html.parser' needs no extra parser dependency
        html = file_path.read_text(encoding='utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text()
```

2. Register in `bootstrap.py`:
```python
factory.register_extractor(HTMLExtractor())
```

### Adding a New Chunking Strategy (Sentence)
1. Create `sentence_chunker.py`:
```python
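import nltk  # third-party; sent_tokenize needs NLTK's Punkt tokenizer data


class SentenceChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="sentence")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        # Use NLTK to split into sentences
        sentences = nltk.sent_tokenize(text)
        # Group sentences to reach chunk_size; the grouping itself is not shown here
        grouped_segments: list[tuple[str, int, int]] = []
        ...
        return grouped_segments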
```

2. Register in `bootstrap.py`:
```python
context.register_chunker(SentenceChunker())
```
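
The grouping step of a sentence chunker can be sketched independently of NLTK. The helper below is hypothetical: it takes already-split sentences and packs them greedily into groups of at most `chunk_size` characters, returning plain strings rather than the `(text, start, end)` tuples a real `_split_text` would need.

```python
def group_sentences(sentences: list[str], chunk_size: int) -> list[str]:
    """Greedily pack sentences into groups no longer than chunk_size characters.

    A single sentence longer than chunk_size is kept whole as its own group.
    """
    groups = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow: close the current group
            groups.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups
```

Tracking character offsets alongside each group would be the remaining step to produce the tuple format used by the chunkers.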

### Adding Database Persistence
1. Create `postgres_repository.py`:
```python
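from sqlalchemy import create_engine  # assumes SQLAlchemy as the database layer


class PostgresDocumentRepository(IDocumentRepository):
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)

    def save(self, document: Document) -> Document:
        # Save to PostgreSQL (persistence logic not shown)
        raise NotImplementedError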
```

2. Swap in `bootstrap.py`:
```python
def _create_repository(self):
    return PostgresDocumentRepository("postgresql://...")
```

## Performance Considerations

### Current Implementation
- In-memory storage: O(1) lookups, limited by RAM
- Synchronous processing: Sequential file processing
- Thread-safe: Uses locks for concurrent access

### Future Optimizations
- **Async Processing**: Use `asyncio` for concurrent document processing, offloading blocking extraction calls to worker threads or processes
- **Caching**: Add Redis for frequently accessed documents
- **Streaming**: Read large files incrementally instead of loading them fully into memory
- **Database**: Use PostgreSQL with indexes for better queries
- **Message Queue**: Use Celery with a broker such as RabbitMQ for background processing
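
As a rough illustration of the async option, blocking extraction calls can be pushed to worker threads with `asyncio.to_thread` (Python 3.9+). The `extract_blocking` function is a stand-in that just sleeps; it is not part of the codebase.

```python
import asyncio
import time


def extract_blocking(name: str) -> str:
    """Stand-in for a synchronous extractor call."""
    time.sleep(0.05)  # simulates blocking file I/O
    return f"text of {name}"


async def process_many(names: list[str]) -> list[str]:
    # Each blocking call runs in a worker thread; gather preserves input order
    tasks = [asyncio.to_thread(extract_blocking, name) for name in names]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(process_many(["a.pdf", "b.docx", "c.txt"])))
```

Threads mainly help when extraction waits on I/O; CPU-heavy parsing would likely need a process pool or a task queue instead.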

## Deployment Considerations

### Configuration
- Use environment variables for settings
- Externalize file paths, database connections
- Use `pydantic-settings` for config management

### Monitoring
- Add structured logging (JSON format)
- Track metrics: processing time, error rates
- Use APM tools (DataDog, New Relic)
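
Structured JSON logging can be approximated with the standard library alone. The formatter below is a minimal sketch; the `document_id` and `duration_ms` fields are hypothetical examples of context a processing service might attach through `extra=`.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Optional context fields (hypothetical names) copied from the record when present
    EXTRA_FIELDS = ("document_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("text_processor")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("processed %s", "a.pdf", extra={"document_id": "doc-1", "duration_ms": 42})
```

Emitting one JSON object per line keeps the output easy to ingest for log aggregators and the APM tools listed above.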

### Scaling
- Horizontal: Run multiple FastAPI instances behind load balancer
- Vertical: Increase resources for compute-heavy extraction
- Database: Use connection pooling, read replicas