text_processor/ARCHITECTURE.md
m.dabbagh 70f5b1478c init
2026-01-07 19:15:46 +03:30

# Architecture Documentation
## Hexagonal Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│ INCOMING ADAPTERS │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ FastAPI Routes (HTTP) │ │
│ │ - ProcessDocumentRequest → API Schemas │ │
│ │ - ExtractAndChunkRequest → API Schemas │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ CORE DOMAIN │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ PORTS (Interfaces) │ │
│ │ ┌────────────────────┐ ┌───────────────────────────┐ │ │
│ │ │ Incoming Ports │ │ Outgoing Ports │ │ │
│ │ │ - ITextProcessor │ │ - IExtractor │ │ │
│ │ │ │ │ - IChunker │ │ │
│ │ │ │ │ - IDocumentRepository │ │ │
│ │ └────────────────────┘ └───────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ SERVICES (Business Logic) │ │
│ │ - DocumentProcessorService │ │
│ │ • Orchestrates Extract → Clean → Chunk → Save │ │
│ │ • Depends ONLY on Port interfaces │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ DOMAIN MODELS (Rich Entities) │ │
│ │ - Document (with validation & business methods) │ │
│ │ - Chunk (immutable value object) │ │
│ │ - ChunkingStrategy (configuration) │ │
│ │ - DocumentMetadata │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ DOMAIN LOGIC (Pure Functions) │ │
│ │ - normalize_whitespace() │ │
│ │ - clean_text() │ │
│ │ - split_into_paragraphs() │ │
│ │ - find_sentence_boundary_before() │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ EXCEPTIONS (Domain Errors) │ │
│ │ - ExtractionError, ChunkingError, ProcessingError │ │
│ │ - ValidationError, RepositoryError │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ OUTGOING ADAPTERS │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ EXTRACTORS (Implements IExtractor) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ PDFExtractor│ │DocxExtractor│ │TxtExtractor│ │ │
│ │ │ (PyPDF2) │ │(python-docx)│ │ (built-in) │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ │ - Managed by ExtractorFactory (Factory Pattern) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ CHUNKERS (Implements IChunker) │ │
│ │ ┌─────────────────┐ ┌──────────────────┐ │ │
│ │ │ FixedSizeChunker│ │ParagraphChunker │ │ │
│ │ │ - Fixed chunks │ │ - Respect │ │ │
│ │ │ - With overlap │ │ paragraphs │ │ │
│ │ └─────────────────┘ └──────────────────┘ │ │
│ │ - Managed by ChunkingContext (Strategy Pattern) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ REPOSITORY (Implements IDocumentRepository) │ │
│ │ ┌──────────────────────────────────┐ │ │
│ │ │ InMemoryDocumentRepository │ │ │
│ │ │ - Thread-safe Dict storage │ │ │
│ │ │ - Easy to swap for PostgreSQL │ │ │
│ │ └──────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ BOOTSTRAP (Wiring) │
│ ApplicationContainer: │
│ - Creates all adapters │
│ - Injects dependencies into core │
│ - ONLY place where adapters are instantiated │
└─────────────────────────────────────────────────────────────────────┘
```
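
The DOMAIN LOGIC box lists pure text helpers. As a rough illustration of what "pure" means here (no I/O, no state, plain strings in and out), the sketch below gives one possible shape for three of them. The signatures and exact cleaning rules are assumptions; the diagram only names the functions.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and cap consecutive blank lines at one."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def split_into_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def find_sentence_boundary_before(text: str, position: int) -> int:
    """Return the index just after the last sentence end before `position`.

    Falls back to `position` itself when no terminator is found
    (an assumed behavior, chosen so callers can always slice).
    """
    window = text[:position]
    cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
    return cut + 1 if cut != -1 else position
```

Because these functions depend on nothing outside their arguments, they can be unit tested without mocks.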
## Data Flow: Process Document
```
1. HTTP Request
2. FastAPI Route (Incoming Adapter)
│ - Validates request schema
3. DocumentProcessorService (Core)
│ - Calls ExtractorFactory
4. PDFExtractor (Outgoing Adapter)
│ - Extracts text using PyPDF2
│ - Maps PyPDF2 exceptions → Domain exceptions
5. DocumentProcessorService
│ - Cleans text using domain logic utils
│ - Validates Document
6. InMemoryRepository (Outgoing Adapter)
│ - Saves Document
7. DocumentProcessorService
│ - Returns Document
8. FastAPI Route
│ - Converts Document → DocumentResponse
9. HTTP Response
```
## Data Flow: Extract and Chunk
```
1. HTTP Request
2. FastAPI Route
│ - Validates request
3. DocumentProcessorService
│ - Gets extractor from factory
│ - Extracts text
4. Extractor (PDF/DOCX/TXT)
│ - Returns Document
5. DocumentProcessorService
│ - Cleans text
│ - Calls ChunkingContext
6. ChunkingContext (Strategy Pattern)
│ - Selects appropriate chunker
7. Chunker (FixedSize/Paragraph)
│ - Splits text into segments
│ - Creates Chunk entities
8. DocumentProcessorService
│ - Returns List[Chunk]
9. FastAPI Route
│ - Converts Chunks → ChunkResponse[]
10. HTTP Response
```
## Dependency Rules
### ✅ ALLOWED Dependencies
```
Incoming Adapters → Core Ports (Incoming)
Core Services → Core Ports (Outgoing)
Core → Core (Domain Models, Logic Utils, Exceptions)
Bootstrap → Everything (Wiring only)
```
### ❌ FORBIDDEN Dependencies
```
Core → Adapters (NEVER!)
Core → External Libraries (Only in Adapters)
Domain Models → Services
Domain Models → Ports
```
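
These rules can be checked mechanically, for example in CI. The following is a minimal sketch that scans a module's source for forbidden imports using the standard `ast` module. The package names in `FORBIDDEN_PREFIXES` are hypothetical, since the project layout is not spelled out here.

```python
import ast

# Hypothetical names: adjust to the real adapter package and third-party libraries.
FORBIDDEN_PREFIXES = ("adapters", "fastapi", "PyPDF2", "docx")


def forbidden_imports(source: str, forbidden=FORBIDDEN_PREFIXES) -> list[str]:
    """Return imported module names in `source` that core code must not use."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Match the package itself or any of its submodules.
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                found.append(name)
    return found
```

Running this over every file in the core package and failing the build on a non-empty result would enforce the "Core → Adapters (NEVER!)" rule automatically.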
## Key Design Patterns
### 1. Hexagonal Architecture (Ports & Adapters)
- **Purpose**: Isolate core business logic from external concerns
- **Implementation**:
  - Ports: Interface definitions (`ITextProcessor`, `IExtractor`, etc.)
  - Adapters: Concrete implementations (`PDFExtractor`, FastAPI routes)
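
A minimal sketch of the port/adapter split, under stated assumptions: the `extract` method signature is hypothetical, and `StaticTextExtractor` and `WordCounter` are illustrative stand-ins that do not exist in the project.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class IExtractor(ABC):
    """Outgoing port: the only thing core code knows about extraction."""

    @abstractmethod
    def extract(self, file_path: Path) -> str:  # hypothetical signature
        ...


class StaticTextExtractor(IExtractor):
    """Stand-in adapter that returns fixed text instead of reading a file."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract(self, file_path: Path) -> str:
        return self._text


class WordCounter:
    """Illustrative core-side class: depends on the port, not on an adapter."""

    def __init__(self, extractor: IExtractor) -> None:
        self._extractor = extractor

    def count(self, file_path: Path) -> int:
        return len(self._extractor.extract(file_path).split())
```

Any adapter that subclasses `IExtractor` can be passed to `WordCounter` without changing it, which is the property the real services rely on.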
### 2. Factory Pattern
- **Class**: `ExtractorFactory`
- **Purpose**: Create appropriate extractor based on file extension
- **Benefit**: Centralized extractor management, easy to add new types
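
One way such a factory could work is a lookup table keyed by file extension. This is a sketch, not the project's actual class: `register_extractor` mirrors the call used later in this document, while `get_extractor` and the `StubExtractor` helper are assumed names.

```python
from pathlib import Path


class UnsupportedFileTypeError(Exception):
    """Stand-in for the domain exception of the same name."""


class StubExtractor:
    """Placeholder extractor that only declares which extensions it handles."""

    def __init__(self, supported_extensions):
        self.supported_extensions = supported_extensions


class ExtractorFactory:
    def __init__(self) -> None:
        self._by_extension = {}

    def register_extractor(self, extractor) -> None:
        # One extractor may serve several extensions (e.g. html and htm).
        for ext in extractor.supported_extensions:
            self._by_extension[ext.lower()] = extractor

    def get_extractor(self, file_path: Path):  # hypothetical method name
        ext = file_path.suffix.lstrip(".").lower()
        try:
            return self._by_extension[ext]
        except KeyError:
            raise UnsupportedFileTypeError(f"No extractor for '.{ext}' files") from None
```

Adding a new file type then only requires one more `register_extractor` call at startup.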
### 3. Strategy Pattern
- **Class**: `ChunkingContext`
- **Purpose**: Switch between chunking strategies at runtime
- **Strategies**: FixedSizeChunker, ParagraphChunker
- **Benefit**: Easy to add new chunking algorithms
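
A simplified sketch of the strategy selection follows. It assumes each chunker exposes a `strategy_name` and a `chunk(text)` method and is configured through its constructor; the real classes use `ChunkingStrategy` objects and return `Chunk` entities, which are omitted here.

```python
import re


class FixedSizeChunker:
    strategy_name = "fixed_size"

    def __init__(self, chunk_size: int, overlap: int = 0) -> None:
        if chunk_size <= 0 or not 0 <= overlap < chunk_size:
            raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
        self._size = chunk_size
        self._step = chunk_size - overlap

    def chunk(self, text: str) -> list[str]:
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + self._size])
            if start + self._size >= len(text):
                break  # the last window already reached the end of the text
            start += self._step
        return chunks


class ParagraphChunker:
    strategy_name = "paragraph"

    def chunk(self, text: str) -> list[str]:
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


class ChunkingContext:
    def __init__(self) -> None:
        self._chunkers = {}

    def register_chunker(self, chunker) -> None:
        self._chunkers[chunker.strategy_name] = chunker

    def chunk(self, strategy_name: str, text: str) -> list[str]:
        if strategy_name not in self._chunkers:
            raise ValueError(f"Unknown chunking strategy: {strategy_name}")
        return self._chunkers[strategy_name].chunk(text)
```

The caller picks a strategy by name at runtime, and the context never needs to know how any particular chunker works.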
### 4. Repository Pattern
- **Interface**: `IDocumentRepository`
- **Implementation**: `InMemoryDocumentRepository`
- **Purpose**: Abstract data persistence
- **Benefit**: Easy to swap storage (memory → PostgreSQL → MongoDB)
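
As an illustration of the thread-safe dict storage mentioned in the diagram, here is a minimal in-memory repository guarded by a lock. `Document` is reduced to two fields, and `find_by_id` and `count` are assumed method names; only `save` appears elsewhere in this document.

```python
import threading
from dataclasses import dataclass


@dataclass
class Document:
    """Simplified stand-in for the domain model."""
    id: str
    text: str


class DocumentNotFoundError(Exception):
    """Stand-in for the domain exception of the same name."""


class InMemoryDocumentRepository:
    def __init__(self) -> None:
        self._documents: dict[str, Document] = {}
        self._lock = threading.Lock()

    def save(self, document: Document) -> Document:
        with self._lock:
            self._documents[document.id] = document
        return document

    def find_by_id(self, document_id: str) -> Document:  # hypothetical name
        with self._lock:
            try:
                return self._documents[document_id]
            except KeyError:
                raise DocumentNotFoundError(document_id) from None

    def count(self) -> int:  # hypothetical helper
        with self._lock:
            return len(self._documents)
```

A database-backed repository would keep the same method signatures, so the service code would not change when the storage is swapped.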
### 5. Dependency Injection
- **Class**: `ApplicationContainer`
- **Purpose**: Wire all dependencies at startup
- **Benefit**: Loose coupling, easy testing
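
A rough sketch of how a container might do this wiring. The constructor keywords of `DocumentProcessorService` and the `_create_repository` hook follow the snippets elsewhere in this document; the other classes are empty stand-ins and the attribute names are assumptions.

```python
class InMemoryDocumentRepository:
    """Empty stand-in for the real adapter."""


class ExtractorFactory:
    """Empty stand-in for the real factory."""


class ChunkingContext:
    """Empty stand-in for the real strategy context."""


class DocumentProcessorService:
    def __init__(self, extractor_factory, chunking_context, repository) -> None:
        self.extractor_factory = extractor_factory
        self.chunking_context = chunking_context
        self.repository = repository


class ApplicationContainer:
    """Builds every adapter once and injects them into the core service."""

    def __init__(self) -> None:
        self.repository = self._create_repository()
        self.extractor_factory = ExtractorFactory()
        self.chunking_context = ChunkingContext()
        self.service = DocumentProcessorService(
            extractor_factory=self.extractor_factory,
            chunking_context=self.chunking_context,
            repository=self.repository,
        )

    def _create_repository(self):
        # Swapping storage means changing only this method.
        return InMemoryDocumentRepository()
```

Because the service receives its collaborators instead of constructing them, a test can subclass the container (or call the service constructor directly) to inject fakes.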
### 6. Template Method Pattern
- **Classes**: `BaseExtractor`, `BaseChunker`
- **Purpose**: Define algorithm skeleton, let subclasses fill in details
- **Benefit**: Code reuse, consistent behavior
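
To make the skeleton/hook split concrete, here is a sketch of what a `BaseExtractor` could look like. The `supported_extensions` constructor argument and the `_extract_text` hook match the HTML example later in this document; the public `extract` method and its validation steps are assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class UnsupportedFileTypeError(Exception):
    """Stand-in for the domain exception of the same name."""


class EmptyContentError(Exception):
    """Stand-in for the domain exception of the same name."""


class BaseExtractor(ABC):
    def __init__(self, supported_extensions) -> None:
        self._supported = {ext.lower() for ext in supported_extensions}

    def extract(self, file_path: Path) -> str:
        """Template method: the steps are fixed, only _extract_text varies."""
        ext = file_path.suffix.lstrip(".").lower()
        if ext not in self._supported:
            raise UnsupportedFileTypeError(ext)
        text = self._extract_text(file_path)
        if not text.strip():
            raise EmptyContentError(str(file_path))
        return text

    @abstractmethod
    def _extract_text(self, file_path: Path) -> str:
        """Format-specific step supplied by each subclass."""


class TxtExtractor(BaseExtractor):
    def __init__(self) -> None:
        super().__init__(supported_extensions=["txt"])

    def _extract_text(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")
```

Every subclass gets the extension check and the empty-content check for free and only implements the format-specific read.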
## SOLID Principles Application
### Single Responsibility Principle (SRP)
- Each extractor handles ONE file type
- Each chunker handles ONE strategy
- Each service method does ONE thing
- Functions are kept short, at most 15-20 lines
### Open/Closed Principle (OCP)
- Add new extractors without modifying core
- Add new chunkers without modifying service
- Extend via interfaces, not modification
### Liskov Substitution Principle (LSP)
- All IExtractor implementations are interchangeable
- All IChunker implementations are interchangeable
- Callers can use any implementation through its interface without type checks
### Interface Segregation Principle (ISP)
- Small, focused interfaces
- IExtractor: Only extraction concerns
- IChunker: Only chunking concerns
- No fat interfaces
### Dependency Inversion Principle (DIP)
- Core depends on IExtractor (abstraction)
- Core does NOT depend on PDFExtractor (concrete)
- High-level modules don't depend on low-level modules
## Error Handling Strategy
### Domain Exceptions
All external errors are caught and wrapped in domain exceptions:
```python
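from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError

try:
    reader = PdfReader(file)  # External library
except PdfReadError as e:
    raise ExtractionError(  # Domain exception
        message="Invalid PDF",
        details=str(e),
    ) from e  # keep the original error as the cause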
```
### Exception Hierarchy
```
DomainException (Base)
├── ExtractionError
│ ├── UnsupportedFileTypeError
│ └── EmptyContentError
├── ChunkingError
├── ProcessingError
├── ValidationError
└── RepositoryError
└── DocumentNotFoundError
```
### HTTP Error Mapping
FastAPI adapter maps domain exceptions to HTTP status codes:
- `UnsupportedFileTypeError` → 400 Bad Request
- `ExtractionError` → 422 Unprocessable Entity
- `DocumentNotFoundError` → 404 Not Found
- `ProcessingError` → 500 Internal Server Error
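
The mapping above can be expressed as a lookup that walks the exception's class hierarchy, so subclasses inherit their parent's status unless they have their own entry. This is a framework-free sketch: the exception classes are minimal stand-ins, and the 500 fallback for unmapped errors is an assumption.

```python
class DomainException(Exception):
    pass


class ExtractionError(DomainException):
    pass


class UnsupportedFileTypeError(ExtractionError):
    pass


class EmptyContentError(ExtractionError):
    pass


class RepositoryError(DomainException):
    pass


class DocumentNotFoundError(RepositoryError):
    pass


class ProcessingError(DomainException):
    pass


STATUS_BY_EXCEPTION = {
    UnsupportedFileTypeError: 400,
    ExtractionError: 422,
    DocumentNotFoundError: 404,
    ProcessingError: 500,
}


def status_code_for(exc: Exception) -> int:
    """Return the status of the most specific mapped class in the MRO."""
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_EXCEPTION:
            return STATUS_BY_EXCEPTION[cls]
    return 500  # assumed default for anything unmapped
```

Walking the MRO is what lets `UnsupportedFileTypeError` return 400 even though its parent `ExtractionError` maps to 422. In the FastAPI adapter, a function like this could back a single exception handler registered for `DomainException`.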
## Testing Strategy
### Unit Tests (Core)
- Test domain models in isolation
- Test logic utils (pure functions)
- Test services with mock ports
### Integration Tests (Adapters)
- Test extractors with real files
- Test chunkers with real text
- Test repository operations
### API Tests (End-to-End)
- Test FastAPI routes
- Test complete workflows
- Test error scenarios
### Example Test Structure
```python
def test_document_processor_service():
    # Arrange: create mocks and inject them into the service
    mock_repository = MockRepository()
    mock_factory = MockExtractorFactory()
    mock_context = MockChunkingContext()
    service = DocumentProcessorService(
        extractor_factory=mock_factory,
        chunking_context=mock_context,
        repository=mock_repository,
    )

    # Act: exercise the behavior under test
    result = service.process_document(...)

    # Assert: verify the outcome
    assert result.is_processed
```
## Extensibility Examples
### Adding a New Extractor (HTML)
1. Create `html_extractor.py`:
```python
from pathlib import Path

from bs4 import BeautifulSoup  # third-party: beautifulsoup4


class HTMLExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['html', 'htm'])

    def _extract_text(self, file_path: Path) -> str:
        # Assumes UTF-8 input; adjust if files use another encoding
        html = file_path.read_text(encoding='utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text()
```
2. Register in `bootstrap.py`:
```python
factory.register_extractor(HTMLExtractor())
```
### Adding a New Chunking Strategy (Sentence)
1. Create `sentence_chunker.py`:
```python
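from typing import List

import nltk  # third-party; sent_tokenize requires NLTK's sentence tokenizer data


class SentenceChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="sentence")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> List[tuple[str, int, int]]:
        # Use NLTK to split into sentences
        sentences = nltk.sent_tokenize(text)
        # Group sentences to reach chunk_size (grouping logic omitted here)
        grouped_segments: List[tuple[str, int, int]] = []
        return grouped_segments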
```
2. Register in `bootstrap.py`:
```python
context.register_chunker(SentenceChunker())
```
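
The "group sentences to reach chunk_size" step is left open in the snippet above. One possible approach is sketched below. It uses a naive regex splitter in place of `nltk.sent_tokenize` so that it runs without extra downloads, and it assumes the two integers in each tuple are start and end character offsets; both choices are assumptions, not the project's implementation.

```python
import re
from typing import Iterator, List, Tuple

# Naive stand-in for nltk.sent_tokenize: text up to a run of . ! or ?
_SENTENCE = re.compile(r"[^.!?]*[.!?]+|[^.!?]+$")


def sentence_spans(text: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) offsets of each sentence, whitespace trimmed."""
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        start = match.start() + len(raw) - len(raw.lstrip())
        end = match.end() - (len(raw) - len(raw.rstrip()))
        if start < end:
            yield start, end


def group_sentences(text: str, chunk_size: int) -> List[Tuple[str, int, int]]:
    """Pack whole sentences into segments of at most chunk_size characters.

    A single sentence longer than chunk_size becomes its own oversized segment.
    """
    segments = []
    seg_start = seg_end = None
    for start, end in sentence_spans(text):
        if seg_start is not None and end - seg_start > chunk_size:
            segments.append((text[seg_start:seg_end], seg_start, seg_end))
            seg_start = None
        if seg_start is None:
            seg_start = start
        seg_end = end
    if seg_start is not None:
        segments.append((text[seg_start:seg_end], seg_start, seg_end))
    return segments
```

The regex splitter mishandles abbreviations and decimals such as "3.14", which is one reason a real implementation would likely keep NLTK for the tokenizing step.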
### Adding Database Persistence
1. Create `postgres_repository.py`:
```python
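from sqlalchemy import create_engine  # third-party: SQLAlchemy


class PostgresDocumentRepository(IDocumentRepository):
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)

    def save(self, document: Document) -> Document:
        # Save to PostgreSQL (left unimplemented in this example)
        raise NotImplementedError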
```
2. Swap in `bootstrap.py`:
```python
def _create_repository(self):
    return PostgresDocumentRepository("postgresql://...")
```
## Performance Considerations
### Current Implementation
- In-memory storage: O(1) lookups, limited by RAM
- Synchronous processing: Sequential file processing
- Thread-safe: Uses locks for concurrent access
### Future Optimizations
- **Async Processing**: Use `asyncio` for concurrent document processing
- **Caching**: Add Redis for frequently accessed documents
- **Streaming**: Process large files in chunks
- **Database**: Use PostgreSQL with indexes for better queries
- **Message Queue**: Use Celery with a RabbitMQ broker for background processing
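
As an example of the async option, the existing synchronous pipeline could be run concurrently without rewriting it by pushing each call onto a worker thread. This is a sketch only: `process_document` here is a trivial stand-in for the blocking service call, and `max_concurrency` is a hypothetical tuning knob.

```python
import asyncio


def process_document(path: str) -> str:
    """Stand-in for the synchronous, blocking processing pipeline."""
    return path.upper()


async def process_many(paths, max_concurrency: int = 4):
    """Process several documents concurrently, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> str:
        async with semaphore:
            # Run the blocking call in a thread so the event loop stays free.
            return await asyncio.to_thread(process_document, path)

    return await asyncio.gather(*(worker(p) for p in paths))
```

Usage: `asyncio.run(process_many(["a.pdf", "b.txt"]))`. Threads help mostly with I/O-bound work; CPU-heavy extraction would need a process pool or a task queue instead.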
## Deployment Considerations
### Configuration
- Use environment variables for settings
- Externalize file paths, database connections
- Use `pydantic-settings` for config management
### Monitoring
- Add structured logging (JSON format)
- Track metrics: processing time, error rates
- Use APM tools (DataDog, New Relic)
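
Structured logging needs no extra dependency. The sketch below is a JSON formatter built on the standard `logging` module; the extra field names (`document_id`, `duration_ms`) are illustrative assumptions about which metrics would be tracked.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    EXTRA_FIELDS = ("document_id", "duration_ms")  # hypothetical metric fields

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy over fields passed via logger.info(..., extra={...}).
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(JsonFormatter())` and log with `logger.info("processed", extra={"document_id": "42", "duration_ms": 12})` to emit one JSON line per event.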
### Scaling
- Horizontal: Run multiple FastAPI instances behind a load balancer
- Vertical: Increase resources for compute-heavy extraction
- Database: Use connection pooling, read replicas