# Architecture Documentation

## Hexagonal Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                          INCOMING ADAPTERS                          │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  FastAPI Routes (HTTP)                                       │   │
│  │  - ProcessDocumentRequest → API Schemas                      │   │
│  │  - ExtractAndChunkRequest → API Schemas                      │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                             CORE DOMAIN                             │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  PORTS (Interfaces)                                          │   │
│  │  ┌────────────────────┐  ┌───────────────────────────┐       │   │
│  │  │ Incoming Ports     │  │ Outgoing Ports            │       │   │
│  │  │ - ITextProcessor   │  │ - IExtractor              │       │   │
│  │  │                    │  │ - IChunker                │       │   │
│  │  │                    │  │ - IDocumentRepository     │       │   │
│  │  └────────────────────┘  └───────────────────────────┘       │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  SERVICES (Business Logic)                                   │   │
│  │  - DocumentProcessorService                                  │   │
│  │    • Orchestrates Extract → Clean → Chunk → Save             │   │
│  │    • Depends ONLY on Port interfaces                         │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN MODELS (Rich Entities)                               │   │
│  │  - Document (with validation & business methods)             │   │
│  │  - Chunk (immutable value object)                            │   │
│  │  - ChunkingStrategy (configuration)                          │   │
│  │  - DocumentMetadata                                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN LOGIC (Pure Functions)                               │   │
│  │  - normalize_whitespace()                                    │   │
│  │  - clean_text()                                              │   │
│  │  - split_into_paragraphs()                                   │   │
│  │  - find_sentence_boundary_before()                           │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXCEPTIONS (Domain Errors)                                  │   │
│  │  - ExtractionError, ChunkingError, ProcessingError           │   │
│  │  - ValidationError, RepositoryError                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          OUTGOING ADAPTERS                          │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXTRACTORS (Implements IExtractor)                          │   │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐     │   │
│  │  │ PDFExtractor  │  │ DocxExtractor │  │ TxtExtractor  │     │   │
│  │  │ (PyPDF2)      │  │ (python-docx) │  │ (built-in)    │     │   │
│  │  └───────────────┘  └───────────────┘  └───────────────┘     │   │
│  │  - Managed by ExtractorFactory (Factory Pattern)             │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  CHUNKERS (Implements IChunker)                              │   │
│  │  ┌────────────────────┐  ┌────────────────────┐              │   │
│  │  │ FixedSizeChunker   │  │ ParagraphChunker   │              │   │
│  │  │ - Fixed chunks     │  │ - Respect          │              │   │
│  │  │ - With overlap     │  │   paragraphs       │              │   │
│  │  └────────────────────┘  └────────────────────┘              │   │
│  │  - Managed by ChunkingContext (Strategy Pattern)             │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  REPOSITORY (Implements IDocumentRepository)                 │   │
│  │  ┌──────────────────────────────────┐                        │   │
│  │  │ InMemoryDocumentRepository       │                        │   │
│  │  │ - Thread-safe Dict storage       │                        │   │
│  │  │ - Easy to swap for PostgreSQL    │                        │   │
│  │  └──────────────────────────────────┘                        │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                         BOOTSTRAP (Wiring)                          │
│  ApplicationContainer:                                              │
│  - Creates all adapters                                             │
│  - Injects dependencies into core                                   │
│  - ONLY place where adapters are instantiated                       │
└─────────────────────────────────────────────────────────────────────┘
```

## Data Flow: Process Document

```
1. HTTP Request
   │
   ▼
2. FastAPI Route (Incoming Adapter)
   │ - Validates request schema
   ▼
3. DocumentProcessorService (Core)
   │ - Calls ExtractorFactory
   ▼
4. PDFExtractor (Outgoing Adapter)
   │ - Extracts text using PyPDF2
   │ - Maps PyPDF2 exceptions → Domain exceptions
   ▼
5. DocumentProcessorService
   │ - Cleans text using domain logic utils
   │ - Validates Document
   ▼
6. InMemoryRepository (Outgoing Adapter)
   │ - Saves Document
   ▼
7. DocumentProcessorService
   │ - Returns Document
   ▼
8. FastAPI Route
   │ - Converts Document → DocumentResponse
   ▼
9. HTTP Response
```

## Data Flow: Extract and Chunk

```
1. HTTP Request
   │
   ▼
2. FastAPI Route
   │ - Validates request
   ▼
3. DocumentProcessorService
   │ - Gets extractor from factory
   │ - Extracts text
   ▼
4. Extractor (PDF/DOCX/TXT)
   │ - Returns Document
   ▼
5. DocumentProcessorService
   │ - Cleans text
   │ - Calls ChunkingContext
   ▼
6. ChunkingContext (Strategy Pattern)
   │ - Selects appropriate chunker
   ▼
7. Chunker (FixedSize/Paragraph)
   │ - Splits text into segments
   │ - Creates Chunk entities
   ▼
8. DocumentProcessorService
   │ - Returns List[Chunk]
   ▼
9. FastAPI Route
   │ - Converts Chunks → ChunkResponse[]
   ▼
10. HTTP Response
```

## Dependency Rules

### ✅ ALLOWED Dependencies

```
Incoming Adapters → Core Ports (Incoming)
Core Services → Core Ports (Outgoing)
Core → Core (Domain Models, Logic Utils, Exceptions)
Bootstrap → Everything (Wiring only)
```

### ❌ FORBIDDEN Dependencies

```
Core → Adapters (NEVER!)
Core → External Libraries (Only in Adapters)
Domain Models → Services
Domain Models → Ports
```

## Key Design Patterns

### 1. Hexagonal Architecture (Ports & Adapters)
- **Purpose**: Isolate core business logic from external concerns
- **Implementation**:
  - Ports: Interface definitions (ITextProcessor, IExtractor, etc.)
  - Adapters: Concrete implementations (PDFExtractor, FastAPI routes)

### 2. Factory Pattern
- **Class**: `ExtractorFactory`
- **Purpose**: Create the appropriate extractor based on file extension
- **Benefit**: Centralized extractor management, easy to add new types

### 3. Strategy Pattern
- **Class**: `ChunkingContext`
- **Purpose**: Switch between chunking strategies at runtime
- **Strategies**: FixedSizeChunker, ParagraphChunker
- **Benefit**: Easy to add new chunking algorithms

### 4. Repository Pattern
- **Interface**: `IDocumentRepository`
- **Implementation**: `InMemoryDocumentRepository`
- **Purpose**: Abstract data persistence
- **Benefit**: Easy to swap storage (memory → PostgreSQL → MongoDB)

### 5. Dependency Injection
- **Class**: `ApplicationContainer`
- **Purpose**: Wire all dependencies at startup
- **Benefit**: Loose coupling, easy testing

### 6. Template Method Pattern
- **Classes**: `BaseExtractor`, `BaseChunker`
- **Purpose**: Define the algorithm skeleton and let subclasses fill in the details
- **Benefit**: Code reuse, consistent behavior

## SOLID Principles Application

### Single Responsibility Principle (SRP)
- Each extractor handles ONE file type
- Each chunker handles ONE strategy
- Each service method does ONE thing
- Functions are kept to roughly 15-20 lines

### Open/Closed Principle (OCP)
- Add new extractors without modifying core
- Add new chunkers without modifying service
- Extend via interfaces, not modification

### Liskov Substitution Principle (LSP)
- All IExtractor implementations are interchangeable
- All IChunker implementations are interchangeable
- Callers work against the interface and need no special cases per implementation

### Interface Segregation Principle (ISP)
- Small, focused interfaces
- IExtractor: Only extraction concerns
- IChunker: Only chunking concerns
- No fat interfaces

### Dependency Inversion Principle (DIP)
- Core depends on IExtractor (abstraction)
- Core does NOT depend on PDFExtractor (concrete)
- High-level modules don't depend on low-level modules

## Error Handling Strategy

### Domain Exceptions
All external errors are caught and wrapped in domain exceptions:

```python
import PyPDF2

try:
    PyPDF2.PdfReader(file)  # External library
except PyPDF2.errors.PdfReadError as e:
    # Re-raise as a domain exception, keeping the original error as the cause
    raise ExtractionError(
        message="Invalid PDF",
        details=str(e),
    ) from e
```

### Exception Hierarchy

```
DomainException (Base)
├── ExtractionError
│   ├── UnsupportedFileTypeError
│   └── EmptyContentError
├── ChunkingError
├── ProcessingError
├── ValidationError
└── RepositoryError
    └── DocumentNotFoundError
```

### HTTP Error Mapping
The FastAPI adapter maps domain exceptions to HTTP status codes:
- `UnsupportedFileTypeError` → 400 Bad Request
- `ExtractionError` → 422 Unprocessable Entity
- `DocumentNotFoundError` → 404 Not Found
- `ProcessingError` → 500 Internal Server Error

Because `UnsupportedFileTypeError` is a subclass of `ExtractionError`, the more specific mapping must take precedence over the general one.

## Testing Strategy

### Unit Tests (Core)
- Test domain models in isolation
- Test logic utils (pure functions)
- Test services with mock ports

### Integration Tests (Adapters)
- Test extractors with real files
- Test chunkers with real text
- Test repository operations

### API Tests (End-to-End)
- Test FastAPI routes
- Test complete workflows
- Test error scenarios

### Example Test Structure

```python
def test_document_processor_service():
    # Arrange: Create mocks and inject them into the service
    mock_repository = MockRepository()
    mock_factory = MockExtractorFactory()
    mock_context = MockChunkingContext()
    service = DocumentProcessorService(
        extractor_factory=mock_factory,
        chunking_context=mock_context,
        repository=mock_repository,
    )

    # Act: Run the use case
    result = service.process_document(...)

    # Assert: Test behavior
    assert result.is_processed
```

## Extensibility Examples

### Adding a New Extractor (HTML)

1. Create `html_extractor.py`:
```python
from pathlib import Path

from bs4 import BeautifulSoup


class HTMLExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['html', 'htm'])

    def _extract_text(self, file_path: Path) -> str:
        # Parse the markup and return only its text content
        html = file_path.read_text()
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text()
```

2. Register in `bootstrap.py`:
```python
factory.register_extractor(HTMLExtractor())
```

### Adding a New Chunking Strategy (Sentence)

1. Create `sentence_chunker.py`:
```python
import nltk


class SentenceChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="sentence")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        # Use NLTK to split into sentences
        sentences = nltk.sent_tokenize(text)
        # Group sentences to reach chunk_size (the grouping that builds
        # grouped_segments is left out of this outline)
        return grouped_segments
```

2. Register in `bootstrap.py`:
```python
context.register_chunker(SentenceChunker())
```

### Adding Database Persistence

1. Create `postgres_repository.py`:
```python
from sqlalchemy import create_engine  # assumption: create_engine comes from SQLAlchemy


class PostgresDocumentRepository(IDocumentRepository):
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)

    def save(self, document: Document) -> Document:
        # Save to PostgreSQL
        pass
```

2. Swap in `bootstrap.py`:
```python
def _create_repository(self):
    return PostgresDocumentRepository("postgresql://...")
```

## Performance Considerations

### Current Implementation
- In-memory storage: O(1) lookups, limited by RAM
- Synchronous processing: Files are processed sequentially
- Thread-safe: Uses locks for concurrent access

### Future Optimizations
- **Async Processing**: Use `asyncio` for concurrent document processing
- **Caching**: Add Redis for frequently accessed documents
- **Streaming**: Process large files in chunks
- **Database**: Use PostgreSQL with indexes for better queries
- **Message Queue**: Use Celery/RabbitMQ for background processing

## Deployment Considerations

### Configuration
- Use environment variables for settings
- Externalize file paths, database connections
- Use `pydantic-settings` for config management

### Monitoring
- Add structured logging (JSON format)
- Track metrics: processing time, error rates
- Use APM tools (DataDog, New Relic)

### Scaling
- Horizontal: Run multiple FastAPI instances behind a load balancer
- Vertical: Increase resources for compute-heavy extraction
- Database: Use connection pooling, read replicas
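
### Example: Environment-Based Settings

The Configuration guidance above (environment variables, externalized paths and connections) could be sketched with the standard library alone. This is a minimal stand-in and not the `pydantic-settings` approach recommended there. The `Settings` fields, the `APP_*` variable names, and the default values are all assumptions made for illustration; none of them come from the project.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical application settings; field names are illustrative."""

    upload_dir: Path
    database_url: Optional[str]
    chunk_size: int
    chunk_overlap: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # Assumed variable names and defaults, chosen only for this sketch
    chunk_size = int(env.get("APP_CHUNK_SIZE", "1000"))
    chunk_overlap = int(env.get("APP_CHUNK_OVERLAP", "100"))

    # Fail fast at startup instead of in the middle of a request
    if chunk_size <= 0:
        raise ValueError("APP_CHUNK_SIZE must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("APP_CHUNK_OVERLAP must be in [0, APP_CHUNK_SIZE)")

    return Settings(
        upload_dir=Path(env.get("APP_UPLOAD_DIR", "./uploads")),
        database_url=env.get("APP_DATABASE_URL"),  # None when the variable is unset
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
```

Passing the environment in as a mapping keeps the loader testable without touching the real process environment. A bootstrap step could call `load_settings()` once and hand the resulting object to whatever builds the adapters.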