Architecture Documentation
Hexagonal Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ INCOMING ADAPTERS │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ FastAPI Routes (HTTP) │ │
│ │ - ProcessDocumentRequest → API Schemas │ │
│ │ - ExtractAndChunkRequest → API Schemas │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ CORE DOMAIN │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ PORTS (Interfaces) │ │
│ │ ┌────────────────────┐ ┌───────────────────────────┐ │ │
│ │ │ Incoming Ports │ │ Outgoing Ports │ │ │
│ │ │ - ITextProcessor │ │ - IExtractor │ │ │
│ │ │ │ │ - IChunker │ │ │
│ │ │ │ │ - IDocumentRepository │ │ │
│ │ └────────────────────┘ └───────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ SERVICES (Business Logic) │ │
│ │ - DocumentProcessorService │ │
│ │ • Orchestrates Extract → Clean → Chunk → Save │ │
│ │ • Depends ONLY on Port interfaces │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ DOMAIN MODELS (Rich Entities) │ │
│ │ - Document (with validation & business methods) │ │
│ │ - Chunk (immutable value object) │ │
│ │ - ChunkingStrategy (configuration) │ │
│ │ - DocumentMetadata │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ DOMAIN LOGIC (Pure Functions) │ │
│ │ - normalize_whitespace() │ │
│ │ - clean_text() │ │
│ │ - split_into_paragraphs() │ │
│ │ - find_sentence_boundary_before() │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ EXCEPTIONS (Domain Errors) │ │
│ │ - ExtractionError, ChunkingError, ProcessingError │ │
│ │ - ValidationError, RepositoryError │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ OUTGOING ADAPTERS │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ EXTRACTORS (Implements IExtractor) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ PDFExtractor│ │DocxExtractor│ │TxtExtractor│ │ │
│ │ │ (PyPDF2) │ │(python-docx)│ │ (built-in) │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ │ - Managed by ExtractorFactory (Factory Pattern) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ CHUNKERS (Implements IChunker) │ │
│ │ ┌─────────────────┐ ┌──────────────────┐ │ │
│ │ │ FixedSizeChunker│ │ParagraphChunker │ │ │
│ │ │ - Fixed chunks │ │ - Respect │ │ │
│ │ │ - With overlap │ │ paragraphs │ │ │
│ │ └─────────────────┘ └──────────────────┘ │ │
│ │ - Managed by ChunkingContext (Strategy Pattern) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ REPOSITORY (Implements IDocumentRepository) │ │
│ │ ┌──────────────────────────────────┐ │ │
│ │ │ InMemoryDocumentRepository │ │ │
│ │ │ - Thread-safe Dict storage │ │ │
│ │ │ - Easy to swap for PostgreSQL │ │ │
│ │ └──────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ BOOTSTRAP (Wiring) │
│ ApplicationContainer: │
│ - Creates all adapters │
│ - Injects dependencies into core │
│ - ONLY place where adapters are instantiated │
└─────────────────────────────────────────────────────────────────────┘
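The DOMAIN LOGIC and DOMAIN MODELS boxes above can be illustrated with a minimal sketch. Only the names normalize_whitespace, split_into_paragraphs, and Chunk come from the diagram; the function bodies and the Chunk fields (text, start, end) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs into one space (assumed behaviour)."""
    return re.sub(r"[ \t]+", " ", text).strip()


def split_into_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs (assumed behaviour)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


@dataclass(frozen=True)
class Chunk:
    """Immutable value object; the field names are hypothetical."""
    text: str
    start: int
    end: int
```

Because the functions are pure and Chunk is frozen, both can be tested without mocks or fixtures.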
Data Flow: Process Document
1. HTTP Request
│
▼
2. FastAPI Route (Incoming Adapter)
│ - Validates request schema
▼
3. DocumentProcessorService (Core)
│ - Calls ExtractorFactory
▼
4. PDFExtractor (Outgoing Adapter)
│ - Extracts text using PyPDF2
│ - Maps PyPDF2 exceptions → Domain exceptions
▼
5. DocumentProcessorService
│ - Cleans text using domain logic utils
│ - Validates Document
▼
6. InMemoryDocumentRepository (Outgoing Adapter)
│ - Saves Document
▼
7. DocumentProcessorService
│ - Returns Document
▼
8. FastAPI Route
│ - Converts Document → DocumentResponse
▼
9. HTTP Response
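Steps 3 to 7 of the flow above could look roughly like the sketch below. It is a simplification under stated assumptions: the service takes a single extractor instead of the ExtractorFactory, the Document fields are hypothetical, and FakeExtractor and DictRepository are stand-ins for real adapters.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Document:
    id: str        # hypothetical field
    content: str   # hypothetical field


class IExtractor(Protocol):
    def extract(self, path: str) -> str: ...


class IDocumentRepository(Protocol):
    def save(self, document: Document) -> Document: ...


class DocumentProcessorService:
    """Core service: it sees only port interfaces, never a concrete adapter."""

    def __init__(self, extractor: IExtractor, repository: IDocumentRepository):
        self._extractor = extractor
        self._repository = repository

    def process_document(self, path: str) -> Document:
        raw = self._extractor.extract(path)        # step 4: extract
        cleaned = " ".join(raw.split())            # step 5: clean
        if not cleaned:                            # step 5: validate
            raise ValueError("document is empty")
        document = Document(id=path, content=cleaned)
        return self._repository.save(document)     # steps 6 and 7


class FakeExtractor:
    """Stand-in adapter that returns fixed text."""
    def extract(self, path: str) -> str:
        return "  hello   world \n"


class DictRepository:
    """Stand-in adapter that stores documents in a dict."""
    def __init__(self):
        self.saved = {}

    def save(self, document: Document) -> Document:
        self.saved[document.id] = document
        return document
```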
Data Flow: Extract and Chunk
1. HTTP Request
│
▼
2. FastAPI Route
│ - Validates request
▼
3. DocumentProcessorService
│ - Gets extractor from factory
│ - Extracts text
▼
4. Extractor (PDF/DOCX/TXT)
│ - Returns Document
▼
5. DocumentProcessorService
│ - Cleans text
│ - Calls ChunkingContext
▼
6. ChunkingContext (Strategy Pattern)
│ - Selects appropriate chunker
▼
7. Chunker (FixedSize/Paragraph)
│ - Splits text into segments
│ - Creates Chunk entities
▼
8. DocumentProcessorService
│ - Returns List[Chunk]
▼
9. FastAPI Route
│ - Converts Chunks → ChunkResponse[]
▼
10. HTTP Response
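Step 7 of the flow above, for the fixed-size strategy, can be sketched as a pure function. The function name and parameters are hypothetical; the (text, start, end) tuple shape follows the _split_text signature shown later in this document.

```python
def fixed_size_segments(
    text: str, chunk_size: int, overlap: int
) -> list[tuple[str, int, int]]:
    """Split text into windows of chunk_size characters that overlap by
    `overlap` characters. Returns (segment, start, end) tuples."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    segments = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        segments.append((text[start:end], start, end))
        if end == len(text):
            break  # the last window already reached the end of the text
    return segments
```

Requiring overlap to be smaller than chunk_size keeps the step positive, so the loop always advances.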
Dependency Rules
✅ ALLOWED Dependencies
Incoming Adapters → Core Ports (Incoming)
Core Services → Core Ports (Outgoing)
Core → Core (Domain Models, Logic Utils, Exceptions)
Bootstrap → Everything (Wiring only)
❌ FORBIDDEN Dependencies
Core → Adapters (NEVER!)
Core → External Libraries (Only in Adapters)
Domain Models → Services
Domain Models → Ports
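One possible way to enforce these rules automatically is a small import check that runs in the test suite. The sketch below is an assumption, not part of the project: the function name and the package prefixes used in the usage note are hypothetical.

```python
import ast


def forbidden_imports(source: str, forbidden_prefixes: tuple[str, ...]) -> list[str]:
    """Return imported module names that start with a forbidden prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        found.extend(n for n in names if n.startswith(forbidden_prefixes))
    return found
```

A test could read every file under the core package and assert that forbidden_imports(source, ("app.adapters", "PyPDF2")) is empty, where "app.adapters" stands for whatever the adapter package is actually called.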
Key Design Patterns
1. Hexagonal Architecture (Ports & Adapters)
- Purpose: Isolate core business logic from external concerns
- Implementation:
- Ports: Interface definitions (ITextProcessor, IExtractor, etc.)
- Adapters: Concrete implementations (PDFExtractor, FastAPI routes)
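A minimal sketch of one port and one adapter, assuming the port is declared as an abstract base class. The extract method name and the count_words function are hypothetical; count_words stands in for core logic that is typed against the port only.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class IExtractor(ABC):
    """Outgoing port: the core calls this and knows nothing about file formats."""

    @abstractmethod
    def extract(self, file_path: Path) -> str: ...


class TxtExtractor(IExtractor):
    """Adapter: a concrete implementation that lives outside the core."""

    def extract(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")


def count_words(extractor: IExtractor, file_path: Path) -> int:
    """Stand-in for core logic; it depends on the port, not on TxtExtractor."""
    return len(extractor.extract(file_path).split())
```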
2. Factory Pattern
- Class: ExtractorFactory
- Purpose: Create the appropriate extractor based on file extension
- Benefit: Centralized extractor management, easy to add new types
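A factory along these lines might be sketched as follows. register_extractor appears in the bootstrap example later in this document; get_extractor, the supported_extensions attribute, and the StubExtractor class are assumptions.

```python
from pathlib import Path


class UnsupportedFileTypeError(Exception):
    """Raised when no extractor is registered for a file extension."""


class ExtractorFactory:
    def __init__(self):
        self._by_extension = {}

    def register_extractor(self, extractor) -> None:
        # Index the extractor under every extension it claims to support.
        for ext in extractor.supported_extensions:
            self._by_extension[ext.lower()] = extractor

    def get_extractor(self, file_path: Path):
        ext = file_path.suffix.lstrip(".").lower()
        try:
            return self._by_extension[ext]
        except KeyError:
            raise UnsupportedFileTypeError(f"no extractor for '.{ext}'") from None


class StubExtractor:
    """Hypothetical extractor used only to exercise the factory."""
    supported_extensions = ["html", "htm"]
```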
3. Strategy Pattern
- Class: ChunkingContext
- Purpose: Switch between chunking strategies at runtime
- Strategies: FixedSizeChunker, ParagraphChunker
- Benefit: Easy to add new chunking algorithms
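A minimal sketch of the context, assuming chunkers are looked up by a strategy_name attribute. register_chunker appears in the bootstrap example later in this document; the chunk method and the simplified chunker bodies are hypothetical.

```python
class ChunkingContext:
    def __init__(self):
        self._chunkers = {}

    def register_chunker(self, chunker) -> None:
        self._chunkers[chunker.strategy_name] = chunker

    def chunk(self, text: str, strategy_name: str) -> list[str]:
        # The strategy is chosen per call, so it can change at runtime.
        if strategy_name not in self._chunkers:
            raise ValueError(f"unknown chunking strategy: {strategy_name}")
        return self._chunkers[strategy_name].chunk(text)


class ParagraphChunker:
    strategy_name = "paragraph"

    def chunk(self, text: str) -> list[str]:
        return [p for p in text.split("\n\n") if p.strip()]


class FixedSizeChunker:
    strategy_name = "fixed_size"

    def __init__(self, size: int):
        self.size = size

    def chunk(self, text: str) -> list[str]:
        return [text[i:i + self.size] for i in range(0, len(text), self.size)]
```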
4. Repository Pattern
- Interface: IDocumentRepository
- Implementation: InMemoryDocumentRepository
- Purpose: Abstract data persistence
- Benefit: Easy to swap storage (memory → PostgreSQL → MongoDB)
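A thread-safe in-memory repository could be sketched like this. The save method matches the signature used in the PostgreSQL example later in this document; find_by_id and the Document fields are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass
class Document:
    id: str        # hypothetical field
    content: str   # hypothetical field


class DocumentNotFoundError(Exception):
    pass


class InMemoryDocumentRepository:
    def __init__(self):
        self._documents: dict[str, Document] = {}
        self._lock = threading.Lock()  # guards every access to the dict

    def save(self, document: Document) -> Document:
        with self._lock:
            self._documents[document.id] = document
        return document

    def find_by_id(self, document_id: str) -> Document:
        with self._lock:
            try:
                return self._documents[document_id]
            except KeyError:
                raise DocumentNotFoundError(document_id) from None
```

Any class that offers the same methods, for example one backed by PostgreSQL, can replace this one without changes to the core.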
5. Dependency Injection
- Class: ApplicationContainer
- Purpose: Wire all dependencies at startup
- Benefit: Loose coupling, easy testing
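The wiring idea can be sketched with placeholder classes. Only ApplicationContainer and _create_repository are names from this document; the class bodies, FakeRepository, and ContainerForTests are assumptions.

```python
class InMemoryDocumentRepository:
    """Placeholder adapter for this sketch."""
    def __init__(self):
        self.documents = {}


class DocumentProcessorService:
    """Placeholder core service; it receives its repository from outside."""
    def __init__(self, repository):
        self.repository = repository


class ApplicationContainer:
    """Composition root: the only place that names concrete adapter classes."""

    def __init__(self):
        self.repository = self._create_repository()
        self.service = DocumentProcessorService(repository=self.repository)

    def _create_repository(self):
        return InMemoryDocumentRepository()


class FakeRepository:
    """Hypothetical test double."""


class ContainerForTests(ApplicationContainer):
    # Overriding one factory method swaps an adapter without touching the core.
    def _create_repository(self):
        return FakeRepository()
```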
6. Template Method Pattern
- Classes: BaseExtractor, BaseChunker
- Purpose: Define algorithm skeleton, let subclasses fill in details
- Benefit: Code reuse, consistent behavior
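A sketch of how BaseExtractor might apply the pattern. The supported_extensions constructor argument and the _extract_text hook match the HTML extractor example later in this document; the public extract method and its validation steps are assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseExtractor(ABC):
    def __init__(self, supported_extensions):
        self.supported_extensions = [e.lower() for e in supported_extensions]

    def extract(self, file_path: Path) -> str:
        """Template method: fixed skeleton with one overridable step."""
        ext = file_path.suffix.lstrip(".").lower()
        if ext not in self.supported_extensions:
            raise ValueError(f"unsupported extension: {ext}")
        text = self._extract_text(file_path)  # the step subclasses provide
        if not text.strip():
            raise ValueError("no text could be extracted")
        return text

    @abstractmethod
    def _extract_text(self, file_path: Path) -> str: ...


class TxtExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=["txt"])

    def _extract_text(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")
```

The extension check and the empty-content check live in one place, so every subclass behaves the same way around its format-specific step.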
SOLID Principles Application
Single Responsibility Principle (SRP)
- Each extractor handles ONE file type
- Each chunker handles ONE strategy
- Each service method does ONE thing
- Functions are kept to at most 15-20 lines
Open/Closed Principle (OCP)
- Add new extractors without modifying core
- Add new chunkers without modifying service
- Extend via interfaces, not modification
Liskov Substitution Principle (LSP)
- All IExtractor implementations are interchangeable
- All IChunker implementations are interchangeable
- Polymorphism works correctly
Interface Segregation Principle (ISP)
- Small, focused interfaces
- IExtractor: Only extraction concerns
- IChunker: Only chunking concerns
- No fat interfaces
Dependency Inversion Principle (DIP)
- Core depends on IExtractor (abstraction)
- Core does NOT depend on PDFExtractor (concrete)
- High-level modules don't depend on low-level modules
Error Handling Strategy
Domain Exceptions
All external errors are caught and wrapped in domain exceptions:
    try:
        PyPDF2.PdfReader(file)  # External library
    except PyPDF2.errors.PdfReadError as e:
        # Wrap the library error in a domain exception and keep the cause
        raise ExtractionError(
            message="Invalid PDF",
            details=str(e),
        ) from e
Exception Hierarchy
DomainException (Base)
├── ExtractionError
│ ├── UnsupportedFileTypeError
│ └── EmptyContentError
├── ChunkingError
├── ProcessingError
├── ValidationError
└── RepositoryError
└── DocumentNotFoundError
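The hierarchy above maps directly onto Python classes. In this sketch the message and details constructor arguments follow the ExtractionError call shown earlier; making details optional is an assumption.

```python
from typing import Optional


class DomainException(Exception):
    def __init__(self, message: str, details: Optional[str] = None):
        super().__init__(message)
        self.message = message
        self.details = details


class ExtractionError(DomainException): ...
class UnsupportedFileTypeError(ExtractionError): ...
class EmptyContentError(ExtractionError): ...
class ChunkingError(DomainException): ...
class ProcessingError(DomainException): ...
class ValidationError(DomainException): ...
class RepositoryError(DomainException): ...
class DocumentNotFoundError(RepositoryError): ...
```

Callers can catch a broad class such as ExtractionError and still handle EmptyContentError, since subclasses are caught by their parent's handler.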
HTTP Error Mapping
FastAPI adapter maps domain exceptions to HTTP status codes:
- UnsupportedFileTypeError → 400 Bad Request
- ExtractionError → 422 Unprocessable Entity
- DocumentNotFoundError → 404 Not Found
- ProcessingError → 500 Internal Server Error
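The mapping can be expressed as a lookup that prefers the most specific exception class. This sketch is framework-independent and uses placeholder exception classes; the status_code_for helper and the 500 fallback for unmapped errors are assumptions, and how the result is wired into FastAPI is not shown.

```python
class DomainException(Exception): ...
class ExtractionError(DomainException): ...
class UnsupportedFileTypeError(ExtractionError): ...
class EmptyContentError(ExtractionError): ...
class ProcessingError(DomainException): ...
class RepositoryError(DomainException): ...
class DocumentNotFoundError(RepositoryError): ...


STATUS_BY_EXCEPTION = {
    UnsupportedFileTypeError: 400,
    ExtractionError: 422,
    DocumentNotFoundError: 404,
    ProcessingError: 500,
}


def status_code_for(error: Exception) -> int:
    """Walk the MRO so a subclass entry wins over its parent's entry."""
    for cls in type(error).__mro__:
        if cls in STATUS_BY_EXCEPTION:
            return STATUS_BY_EXCEPTION[cls]
    return 500  # assumed fallback for anything unmapped
```

Walking the method resolution order means UnsupportedFileTypeError resolves to 400 even though its parent ExtractionError maps to 422.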
Testing Strategy
Unit Tests (Core)
- Test domain models in isolation
- Test logic utils (pure functions)
- Test services with mock ports
Integration Tests (Adapters)
- Test extractors with real files
- Test chunkers with real text
- Test repository operations
API Tests (End-to-End)
- Test FastAPI routes
- Test complete workflows
- Test error scenarios
Example Test Structure
    def test_document_processor_service():
        # Arrange: create mocks and inject them into the service
        mock_repository = MockRepository()
        mock_factory = MockExtractorFactory()
        mock_context = MockChunkingContext()
        service = DocumentProcessorService(
            extractor_factory=mock_factory,
            chunking_context=mock_context,
            repository=mock_repository,
        )
        # Act: exercise the behavior under test
        result = service.process_document(...)
        # Assert: verify the outcome
        assert result.is_processed
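A runnable variant of the same idea can be written with unittest.mock from the standard library. The service below is a simplified stand-in, not the project's DocumentProcessorService: its two-argument constructor, the get_extractor and extract method names, and the dict it returns are all assumptions.

```python
from unittest.mock import Mock


class DocumentProcessorService:
    """Simplified stand-in for the real service."""

    def __init__(self, extractor_factory, repository):
        self._extractor_factory = extractor_factory
        self._repository = repository

    def process_document(self, path: str):
        extractor = self._extractor_factory.get_extractor(path)
        text = extractor.extract(path)
        return self._repository.save({"path": path, "content": text.strip()})


def test_process_document_saves_cleaned_text():
    # Arrange: mocks play the role of the outgoing ports
    factory = Mock()
    factory.get_extractor.return_value.extract.return_value = "  hello  "
    repository = Mock()
    repository.save.side_effect = lambda doc: doc

    # Act
    service = DocumentProcessorService(factory, repository)
    result = service.process_document("a.txt")

    # Assert: both the result and the interaction with the ports
    assert result["content"] == "hello"
    factory.get_extractor.assert_called_once_with("a.txt")
    repository.save.assert_called_once()
```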
Extensibility Examples
Adding a New Extractor (HTML)
- Create html_extractor.py:
    from pathlib import Path

    class HTMLExtractor(BaseExtractor):
        def __init__(self):
            super().__init__(supported_extensions=['html', 'htm'])

        def _extract_text(self, file_path: Path) -> str:
            # Imported lazily so bs4 is only needed when HTML is processed
            from bs4 import BeautifulSoup
            html = file_path.read_text()
            soup = BeautifulSoup(html, 'html.parser')
            return soup.get_text()
- Register in bootstrap.py:
    factory.register_extractor(HTMLExtractor())
Adding a New Chunking Strategy (Sentence)
- Create sentence_chunker.py:
    import nltk

    class SentenceChunker(BaseChunker):
        def __init__(self):
            super().__init__(strategy_name="sentence")

        def _split_text(self, text: str, strategy: ChunkingStrategy) -> List[tuple[str, int, int]]:
            # Use NLTK to split into sentences
            sentences = nltk.sent_tokenize(text)
            # Group sentences to reach chunk_size
            # (grouping logic omitted; it produces grouped_segments)
            return grouped_segments
- Register in bootstrap.py:
    context.register_chunker(SentenceChunker())
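The grouping step that the chunker example leaves out could be sketched as below. To stay self-contained this sketch swaps nltk.sent_tokenize for a simple regular expression, which will split abbreviations such as "e.g." incorrectly; the function name and the greedy packing policy are assumptions.

```python
import re

# A sentence starts at a non-space, non-terminator character and runs
# through the next run of terminators (., !, ?).
SENTENCE = re.compile(r"[^\s.!?][^.!?]*[.!?]*")


def group_sentences(text: str, chunk_size: int) -> list[tuple[str, int, int]]:
    """Greedily pack whole sentences into segments of at most chunk_size
    characters. A single sentence longer than chunk_size is kept whole."""
    segments = []
    start = end = None
    for match in SENTENCE.finditer(text):
        if start is not None and match.end() - start > chunk_size:
            # Adding this sentence would overflow: close the current segment.
            segments.append((text[start:end], start, end))
            start = None
        if start is None:
            start = match.start()
        end = match.end()
    if start is not None:
        segments.append((text[start:end], start, end))
    return segments
```

The returned tuples use the same (text, start, end) shape as the _split_text signature, so the offsets always point back into the original text.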
Adding Database Persistence
- Create postgres_repository.py:
    from sqlalchemy import create_engine  # assuming SQLAlchemy

    class PostgresDocumentRepository(IDocumentRepository):
        def __init__(self, connection_string: str):
            self.engine = create_engine(connection_string)

        def save(self, document: Document) -> Document:
            # Save to PostgreSQL (not implemented in this example)
            raise NotImplementedError
- Swap in bootstrap.py:
    def _create_repository(self):
        return PostgresDocumentRepository("postgresql://...")
Performance Considerations
Current Implementation
- In-memory storage: O(1) lookups, limited by RAM
- Synchronous processing: Sequential file processing
- Thread-safe: Uses locks for concurrent access
Future Optimizations
- Async Processing: Use asyncio for concurrent document processing
- Caching: Add Redis for frequently accessed documents
- Streaming: Process large files in chunks
- Database: Use PostgreSQL with indexes for better queries
- Message Queue: Use Celery/RabbitMQ for background processing
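As an illustration of the async option, the existing synchronous pipeline could be run in worker threads with a concurrency cap. This is a sketch only: process_file is a stand-in for the real extract-and-clean work, and process_many and max_concurrency are hypothetical names.

```python
import asyncio
import time


def process_file(path: str) -> str:
    """Stand-in for the synchronous extract/clean pipeline."""
    time.sleep(0.01)  # simulates blocking I/O or parsing
    return path.upper()


async def process_many(paths: list[str], max_concurrency: int = 4) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> str:
        async with semaphore:
            # Run the blocking function in a thread so the event loop stays free.
            return await asyncio.to_thread(process_file, path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))
```

Calling asyncio.run(process_many(["a.pdf", "b.docx"])) processes both files concurrently. Threads help most when extraction is I/O-bound; CPU-heavy parsing would more likely need a process pool or a task queue.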
Deployment Considerations
Configuration
- Use environment variables for settings
- Externalize file paths, database connections
- Use pydantic-settings for config management
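The idea of environment-driven settings can be shown with the standard library alone. This sketch deliberately uses a dataclass instead of pydantic-settings, which would add validation and .env support; every setting name, environment variable, and default below is hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    upload_dir: str
    database_url: str
    max_chunk_size: int

    @classmethod
    def from_env(cls, env=None) -> "Settings":
        # Accepting a mapping makes the loader easy to test without
        # touching the real process environment.
        env = os.environ if env is None else env
        return cls(
            upload_dir=env.get("APP_UPLOAD_DIR", "/tmp/uploads"),
            database_url=env.get("APP_DATABASE_URL", ""),
            max_chunk_size=int(env.get("APP_MAX_CHUNK_SIZE", "1000")),
        )
```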
Monitoring
- Add structured logging (JSON format)
- Track metrics: processing time, error rates
- Use APM tools (DataDog, New Relic)
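Structured logging can be sketched with a custom formatter from the standard logging module. The JsonFormatter class and the extra field names (document_id, duration_ms) are assumptions chosen to match the metrics listed above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    EXTRA_FIELDS = ("document_id", "duration_ms")  # hypothetical field names

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)
```

Attach it with handler.setFormatter(JsonFormatter()) and log with, for example, logger.info("processed %s", name, extra={"duration_ms": 12}).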
Scaling
- Horizontal: Run multiple FastAPI instances behind load balancer
- Vertical: Increase resources for compute-heavy extraction
- Database: Use connection pooling, read replicas