# Architecture Documentation

## Hexagonal Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                          INCOMING ADAPTERS                          │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  FastAPI Routes (HTTP)                                       │   │
│  │  - ProcessDocumentRequest → API Schemas                      │   │
│  │  - ExtractAndChunkRequest → API Schemas                      │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                             CORE DOMAIN                             │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  PORTS (Interfaces)                                          │   │
│  │  ┌────────────────────┐    ┌───────────────────────────┐     │   │
│  │  │  Incoming Ports    │    │  Outgoing Ports           │     │   │
│  │  │  - ITextProcessor  │    │  - IExtractor             │     │   │
│  │  │                    │    │  - IChunker               │     │   │
│  │  │                    │    │  - IDocumentRepository    │     │   │
│  │  └────────────────────┘    └───────────────────────────┘     │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  SERVICES (Business Logic)                                   │   │
│  │  - DocumentProcessorService                                  │   │
│  │    • Orchestrates Extract → Clean → Chunk → Save             │   │
│  │    • Depends ONLY on Port interfaces                         │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN MODELS (Rich Entities)                               │   │
│  │  - Document (with validation & business methods)             │   │
│  │  - Chunk (immutable value object)                            │   │
│  │  - ChunkingStrategy (configuration)                          │   │
│  │  - DocumentMetadata                                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN LOGIC (Pure Functions)                               │   │
│  │  - normalize_whitespace()                                    │   │
│  │  - clean_text()                                              │   │
│  │  - split_into_paragraphs()                                   │   │
│  │  - find_sentence_boundary_before()                           │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXCEPTIONS (Domain Errors)                                  │   │
│  │  - ExtractionError, ChunkingError, ProcessingError           │   │
│  │  - ValidationError, RepositoryError                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          OUTGOING ADAPTERS                          │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXTRACTORS (Implements IExtractor)                          │   │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐     │   │
│  │  │ PDFExtractor  │  │ DocxExtractor │  │ TxtExtractor  │     │   │
│  │  │ (PyPDF2)      │  │ (python-docx) │  │ (built-in)    │     │   │
│  │  └───────────────┘  └───────────────┘  └───────────────┘     │   │
│  │  - Managed by ExtractorFactory (Factory Pattern)             │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  CHUNKERS (Implements IChunker)                              │   │
│  │  ┌────────────────────┐    ┌────────────────────┐            │   │
│  │  │  FixedSizeChunker  │    │  ParagraphChunker  │            │   │
│  │  │  - Fixed chunks    │    │  - Respect         │            │   │
│  │  │  - With overlap    │    │    paragraphs      │            │   │
│  │  └────────────────────┘    └────────────────────┘            │   │
│  │  - Managed by ChunkingContext (Strategy Pattern)             │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  REPOSITORY (Implements IDocumentRepository)                 │   │
│  │  ┌──────────────────────────────────┐                        │   │
│  │  │  InMemoryDocumentRepository      │                        │   │
│  │  │  - Thread-safe Dict storage      │                        │   │
│  │  │  - Easy to swap for PostgreSQL   │                        │   │
│  │  └──────────────────────────────────┘                        │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                         BOOTSTRAP (Wiring)                          │
│  ApplicationContainer:                                              │
│    - Creates all adapters                                           │
│    - Injects dependencies into core                                 │
│    - ONLY place where adapters are instantiated                     │
└─────────────────────────────────────────────────────────────────────┘
```

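The "DOMAIN LOGIC (Pure Functions)" box lists helpers that take text and return text with no I/O. A minimal sketch of what three of them might look like follows; the function names come from the diagram, but the exact cleaning rules (collapsing spaces and tabs, trimming lines, limiting blank lines) are assumptions made for illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs into one space, keeping line breaks."""
    return re.sub(r"[ \t]+", " ", text)


def clean_text(text: str) -> str:
    """Normalize line endings and whitespace, then trim every line."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = normalize_whitespace(text)
    lines = [line.strip() for line in text.split("\n")]
    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def split_into_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


cleaned = clean_text("Hello   world\r\n\r\n\r\n\r\nNext\tline  ")
paragraphs = split_into_paragraphs(cleaned)
```

Because these functions depend only on their arguments, they can be unit tested without mocks or fixtures.
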
## Data Flow: Process Document

```
1. HTTP Request
   │
   ▼
2. FastAPI Route (Incoming Adapter)
   │ - Validates request schema
   ▼
3. DocumentProcessorService (Core)
   │ - Calls ExtractorFactory
   ▼
4. PDFExtractor (Outgoing Adapter)
   │ - Extracts text using PyPDF2
   │ - Maps PyPDF2 exceptions → Domain exceptions
   ▼
5. DocumentProcessorService
   │ - Cleans text using domain logic utils
   │ - Validates Document
   ▼
6. InMemoryRepository (Outgoing Adapter)
   │ - Saves Document
   ▼
7. DocumentProcessorService
   │ - Returns Document
   ▼
8. FastAPI Route
   │ - Converts Document → DocumentResponse
   ▼
9. HTTP Response
```

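Steps 3 to 7 of this flow can be sketched as a service that talks only to port interfaces. This is a simplified illustration, not the project's actual code: the `Document` fields, the `get_extractor()` method, the inline whitespace cleaning, and the `Fake*` classes are all assumptions introduced so the example runs on its own.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Document:
    """Simplified stand-in for the domain model."""
    source: str
    content: str

    def validate(self) -> None:
        if not self.content.strip():
            raise ValueError("document has no content")


class IExtractor(ABC):
    @abstractmethod
    def extract(self, file_path: Path) -> Document: ...


class IDocumentRepository(ABC):
    @abstractmethod
    def save(self, document: Document) -> Document: ...


class DocumentProcessorService:
    """Core service: knows ports, never concrete adapters."""

    def __init__(self, extractor_factory, repository: IDocumentRepository) -> None:
        self._extractor_factory = extractor_factory
        self._repository = repository

    def process_document(self, file_path: Path) -> Document:
        extractor = self._extractor_factory.get_extractor(file_path)  # step 3
        document = extractor.extract(file_path)                       # step 4
        document.content = " ".join(document.content.split())         # step 5: clean
        document.validate()                                           # step 5: validate
        return self._repository.save(document)                        # steps 6-7


# Hypothetical in-process fakes standing in for the outgoing adapters.
class FakeExtractor(IExtractor):
    def extract(self, file_path: Path) -> Document:
        return Document(source=str(file_path), content="  some   text ")


class FakeFactory:
    def get_extractor(self, file_path: Path) -> IExtractor:
        return FakeExtractor()


class FakeRepository(IDocumentRepository):
    def __init__(self) -> None:
        self.saved: list[Document] = []

    def save(self, document: Document) -> Document:
        self.saved.append(document)
        return document
```
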
## Data Flow: Extract and Chunk

```
1. HTTP Request
   │
   ▼
2. FastAPI Route
   │ - Validates request
   ▼
3. DocumentProcessorService
   │ - Gets extractor from factory
   │ - Extracts text
   ▼
4. Extractor (PDF/DOCX/TXT)
   │ - Returns Document
   ▼
5. DocumentProcessorService
   │ - Cleans text
   │ - Calls ChunkingContext
   ▼
6. ChunkingContext (Strategy Pattern)
   │ - Selects appropriate chunker
   ▼
7. Chunker (FixedSize/Paragraph)
   │ - Splits text into segments
   │ - Creates Chunk entities
   ▼
8. DocumentProcessorService
   │ - Returns List[Chunk]
   ▼
9. FastAPI Route
   │ - Converts Chunks → ChunkResponse[]
   ▼
10. HTTP Response
```

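Step 7 is where text becomes `Chunk` entities. As an illustration of the fixed-size strategy with overlap, here is a small sketch. The `Chunk` fields and the `fixed_size_chunks` function are hypothetical simplifications; the real `FixedSizeChunker` may differ, for example by snapping to sentence boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Immutable value object: the text plus its character offsets."""
    content: str
    start: int
    end: int


def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[Chunk]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append(Chunk(text[start:end], start, end))
        if end == len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
```

With a 10-character input, a chunk size of 4, and an overlap of 1, each window starts 3 characters after the previous one, so neighbouring chunks share one character.
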
## Dependency Rules

### ✅ ALLOWED Dependencies

```
Incoming Adapters → Core Ports (Incoming)
Core Services     → Core Ports (Outgoing)
Core              → Core (Domain Models, Logic Utils, Exceptions)
Bootstrap         → Everything (Wiring only)
```

### ❌ FORBIDDEN Dependencies

```
Core          → Adapters (NEVER!)
Core          → External Libraries (Only in Adapters)
Domain Models → Services
Domain Models → Ports
```

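Rules like these are easier to keep when a test enforces them. One possible approach, sketched below, parses a core module's source with the standard `ast` module and reports imports that start with a forbidden prefix. The package names `adapters` and `core` and the helper name `forbidden_imports` are assumptions for illustration; substitute the project's real package layout.

```python
import ast


def forbidden_imports(source: str, forbidden_prefixes: tuple[str, ...]) -> list[str]:
    """Return imported module names in `source` that match a forbidden prefix."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if any(name == p or name.startswith(p + ".") for p in forbidden_prefixes):
                found.append(name)
    return found


# Hypothetical core module that breaks two of the rules above.
core_module_source = (
    "import PyPDF2\n"
    "from adapters.outgoing import pdf_extractor\n"
    "from core.domain import models\n"
)
violations = forbidden_imports(core_module_source, ("adapters", "PyPDF2"))
```

A test suite could run this over every file in the core package and fail when the returned list is not empty.
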
## Key Design Patterns

### 1. Hexagonal Architecture (Ports & Adapters)
- **Purpose**: Isolate core business logic from external concerns
- **Implementation**:
  - Ports: Interface definitions (ITextProcessor, IExtractor, etc.)
  - Adapters: Concrete implementations (PDFExtractor, FastAPI routes)

### 2. Factory Pattern
- **Class**: `ExtractorFactory`
- **Purpose**: Create appropriate extractor based on file extension
- **Benefit**: Centralized extractor management, easy to add new types

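A minimal sketch of such a factory is shown below. `register_extractor` matches the call used later in this document; `get_extractor`, the `supported_extensions` attribute on extractors, and the `StubTxtExtractor` class are assumptions made so the example is self-contained.

```python
from pathlib import Path


class UnsupportedFileTypeError(Exception):
    """Stand-in for the domain exception of the same name."""


class ExtractorFactory:
    def __init__(self) -> None:
        self._by_extension: dict[str, object] = {}

    def register_extractor(self, extractor) -> None:
        # Assumes each extractor exposes `supported_extensions`, e.g. ["pdf"].
        for ext in extractor.supported_extensions:
            self._by_extension[ext.lower().lstrip(".")] = extractor

    def get_extractor(self, file_path: Path):
        ext = file_path.suffix.lower().lstrip(".")
        try:
            return self._by_extension[ext]
        except KeyError:
            raise UnsupportedFileTypeError(f"No extractor for '.{ext}' files") from None


class StubTxtExtractor:
    """Hypothetical extractor used only to demonstrate registration."""
    supported_extensions = ["txt"]


factory = ExtractorFactory()
txt_extractor = StubTxtExtractor()
factory.register_extractor(txt_extractor)
```

Looking up by lower-cased extension means `notes.TXT` and `notes.txt` resolve to the same extractor.
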
### 3. Strategy Pattern
- **Class**: `ChunkingContext`
- **Purpose**: Switch between chunking strategies at runtime
- **Strategies**: FixedSizeChunker, ParagraphChunker
- **Benefit**: Easy to add new chunking algorithms

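The idea can be sketched as a context that keeps chunkers in a dictionary keyed by strategy name and delegates to whichever one the caller selects. `register_chunker` matches the call used later in this document; the `chunk` method signature, the `strategy_name` attribute, and the two toy chunkers are assumptions for illustration only.

```python
class FixedSizeChunker:
    strategy_name = "fixed_size"

    def chunk(self, text: str, chunk_size: int = 100) -> list[str]:
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


class ParagraphChunker:
    strategy_name = "paragraph"

    def chunk(self, text: str) -> list[str]:
        return [p.strip() for p in text.split("\n\n") if p.strip()]


class ChunkingContext:
    def __init__(self) -> None:
        self._chunkers: dict[str, object] = {}

    def register_chunker(self, chunker) -> None:
        self._chunkers[chunker.strategy_name] = chunker

    def chunk(self, text: str, strategy_name: str, **options) -> list[str]:
        try:
            chunker = self._chunkers[strategy_name]
        except KeyError:
            raise ValueError(f"Unknown chunking strategy: {strategy_name!r}") from None
        # Delegate to the selected strategy; the context holds no chunking logic.
        return chunker.chunk(text, **options)


context = ChunkingContext()
context.register_chunker(FixedSizeChunker())
context.register_chunker(ParagraphChunker())
```
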
### 4. Repository Pattern
- **Interface**: `IDocumentRepository`
- **Implementation**: `InMemoryDocumentRepository`
- **Purpose**: Abstract data persistence
- **Benefit**: Easy to swap storage (memory → PostgreSQL → MongoDB)

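As a rough illustration of the in-memory variant, the sketch below guards a plain dictionary with a lock. The `Document` fields and the `find_by_id` method are hypothetical; only `save` appears elsewhere in this document.

```python
import threading
from dataclasses import dataclass


class DocumentNotFoundError(Exception):
    """Stand-in for the domain exception of the same name."""


@dataclass
class Document:
    """Simplified stand-in for the domain model."""
    id: str
    content: str


class InMemoryDocumentRepository:
    def __init__(self) -> None:
        self._documents: dict[str, Document] = {}
        self._lock = threading.Lock()  # serializes access from concurrent requests

    def save(self, document: Document) -> Document:
        with self._lock:
            self._documents[document.id] = document
        return document

    def find_by_id(self, document_id: str) -> Document:
        with self._lock:
            try:
                return self._documents[document_id]
            except KeyError:
                raise DocumentNotFoundError(document_id) from None
```

A PostgreSQL-backed class that offers the same methods could replace this one without changes to the core service.
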
### 5. Dependency Injection
- **Class**: `ApplicationContainer`
- **Purpose**: Wire all dependencies at startup
- **Benefit**: Loose coupling, easy testing

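A stripped-down sketch of the container idea follows. The `_create_repository` hook mirrors the method shown under "Adding Database Persistence"; the `document_processor` property, the stand-in classes, and `TestingContainer` are assumptions added for the example.

```python
class InMemoryDocumentRepository:
    """Stand-in adapter; the real one implements IDocumentRepository."""

    def save(self, document):
        return document


class FakeRepository:
    """Hypothetical test double."""

    def save(self, document):
        return document


class DocumentProcessorService:
    """Stand-in core service; it only sees whatever repository is injected."""

    def __init__(self, repository) -> None:
        self.repository = repository


class ApplicationContainer:
    def __init__(self) -> None:
        # Adapters are instantiated here and nowhere else.
        self._repository = self._create_repository()
        self._service = DocumentProcessorService(repository=self._repository)

    def _create_repository(self):
        return InMemoryDocumentRepository()

    @property
    def document_processor(self) -> DocumentProcessorService:
        return self._service


class TestingContainer(ApplicationContainer):
    """Overrides one factory method to swap in a test double."""

    def _create_repository(self):
        return FakeRepository()
```

Because the service receives its repository through the constructor, swapping storage or injecting a fake touches only the container.
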
### 6. Template Method Pattern
- **Classes**: `BaseExtractor`, `BaseChunker`
- **Purpose**: Define algorithm skeleton, let subclasses fill in details
- **Benefit**: Code reuse, consistent behavior

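To make the skeleton-plus-hook idea concrete, here is a sketch of a base extractor whose public `extract` method fixes the order of steps while subclasses supply `_extract_text`. The constructor argument and the hook name match the extensibility examples in this document; the validation steps and the `str` return type (the real extractors return a `Document`) are simplifying assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ExtractionError(Exception):
    """Stand-in for the domain exception."""


class UnsupportedFileTypeError(ExtractionError):
    pass


class EmptyContentError(ExtractionError):
    pass


class BaseExtractor(ABC):
    def __init__(self, supported_extensions: list[str]) -> None:
        self.supported_extensions = [e.lower() for e in supported_extensions]

    def extract(self, file_path: Path) -> str:
        """Template method: the steps are fixed, one of them is a subclass hook."""
        ext = file_path.suffix.lower().lstrip(".")
        if ext not in self.supported_extensions:
            raise UnsupportedFileTypeError(ext)
        text = self._extract_text(file_path)  # subclass-specific step
        if not text.strip():
            raise EmptyContentError(str(file_path))
        return text

    @abstractmethod
    def _extract_text(self, file_path: Path) -> str:
        """Return the raw text of the file."""


class TxtExtractor(BaseExtractor):
    def __init__(self) -> None:
        super().__init__(supported_extensions=["txt"])

    def _extract_text(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")
```
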
## SOLID Principles Application

### Single Responsibility Principle (SRP)
- Each extractor handles ONE file type
- Each chunker handles ONE strategy
- Each service method does ONE thing
- Functions are kept to 15-20 lines at most

### Open/Closed Principle (OCP)
- Add new extractors without modifying core
- Add new chunkers without modifying service
- Extend via interfaces, not modification

### Liskov Substitution Principle (LSP)
- All IExtractor implementations are interchangeable
- All IChunker implementations are interchangeable
- Callers use any implementation through its interface, without type checks

### Interface Segregation Principle (ISP)
- Small, focused interfaces
- IExtractor: Only extraction concerns
- IChunker: Only chunking concerns
- No fat interfaces

### Dependency Inversion Principle (DIP)
- Core depends on IExtractor (abstraction)
- Core does NOT depend on PDFExtractor (concrete)
- High-level modules don't depend on low-level modules

## Error Handling Strategy

### Domain Exceptions
All external errors are caught and wrapped in domain exceptions:

```python
import PyPDF2

try:
    reader = PyPDF2.PdfReader(file)  # External library
except PyPDF2.errors.PdfReadError as e:
    # Chain with "from e" so the original traceback is preserved
    raise ExtractionError(  # Domain exception
        message="Invalid PDF",
        details=str(e),
    ) from e
```

### Exception Hierarchy
```
DomainException (Base)
├── ExtractionError
│   ├── UnsupportedFileTypeError
│   └── EmptyContentError
├── ChunkingError
├── ProcessingError
├── ValidationError
└── RepositoryError
    └── DocumentNotFoundError
```

### HTTP Error Mapping
FastAPI adapter maps domain exceptions to HTTP status codes:
- `UnsupportedFileTypeError` → 400 Bad Request
- `ExtractionError` → 422 Unprocessable Entity
- `DocumentNotFoundError` → 404 Not Found
- `ProcessingError` → 500 Internal Server Error

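Since `UnsupportedFileTypeError` is a subclass of `ExtractionError`, the lookup has to prefer the most specific class. One way to do that, sketched below without any FastAPI code, is to walk the exception's method resolution order. The `http_status_for` helper and the 500 fallback for unmapped exceptions are assumptions for illustration.

```python
class DomainException(Exception):
    pass


class ExtractionError(DomainException):
    pass


class UnsupportedFileTypeError(ExtractionError):
    pass


class RepositoryError(DomainException):
    pass


class DocumentNotFoundError(RepositoryError):
    pass


class ProcessingError(DomainException):
    pass


STATUS_BY_EXCEPTION = {
    UnsupportedFileTypeError: 400,
    ExtractionError: 422,
    DocumentNotFoundError: 404,
    ProcessingError: 500,
}


def http_status_for(exc: DomainException) -> int:
    """Return the status of the most specific mapped class in the exception's MRO."""
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_EXCEPTION:
            return STATUS_BY_EXCEPTION[cls]
    return 500  # assumed fallback for unmapped domain exceptions
```

In the FastAPI adapter, a single exception handler registered for `DomainException` could call a helper like this to choose the response status.
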
## Testing Strategy

### Unit Tests (Core)
- Test domain models in isolation
- Test logic utils (pure functions)
- Test services with mock ports

### Integration Tests (Adapters)
- Test extractors with real files
- Test chunkers with real text
- Test repository operations

### API Tests (End-to-End)
- Test FastAPI routes
- Test complete workflows
- Test error scenarios

### Example Test Structure
```python
def test_document_processor_service():
    # Arrange: create mocks and inject them into the service
    mock_repository = MockRepository()
    mock_factory = MockExtractorFactory()
    mock_context = MockChunkingContext()

    service = DocumentProcessorService(
        extractor_factory=mock_factory,
        chunking_context=mock_context,
        repository=mock_repository,
    )

    # Act: call the behavior under test
    result = service.process_document(...)

    # Assert: verify the outcome
    assert result.is_processed
```

## Extensibility Examples

### Adding a New Extractor (HTML)
1. Create `html_extractor.py`:
```python
from pathlib import Path

from bs4 import BeautifulSoup  # third-party: beautifulsoup4


class HTMLExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['html', 'htm'])

    def _extract_text(self, file_path: Path) -> str:
        # Assumes UTF-8 input; adjust if files use another encoding
        html = file_path.read_text(encoding='utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text()
```

2. Register in `bootstrap.py`:
```python
factory.register_extractor(HTMLExtractor())
```

### Adding a New Chunking Strategy (Sentence)
1. Create `sentence_chunker.py`:
```python
from typing import List

import nltk  # third-party; sent_tokenize needs NLTK's Punkt tokenizer data


class SentenceChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="sentence")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> List[tuple[str, int, int]]:
        # Use NLTK to split into sentences
        sentences = nltk.sent_tokenize(text)
        # Group sentences to reach chunk_size (grouping logic omitted in this outline)
        grouped_segments: List[tuple[str, int, int]] = []
        return grouped_segments
```

2. Register in `bootstrap.py`:
```python
context.register_chunker(SentenceChunker())
```

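The grouping step in `_split_text` is only outlined above. The sketch below shows one greedy way it could work, under two assumptions: each tuple is `(text, start, end)` in character offsets, and a naive regular expression stands in for `nltk.sent_tokenize` so the example has no third-party dependency. Both helper names are hypothetical.

```python
import re


def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Naive splitter: a sentence ends at a run of '.', '!' or '?'."""
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        start = match.start() + len(raw) - len(raw.lstrip())
        end = match.start() + len(raw.rstrip())
        if start < end:
            spans.append((start, end))
    return spans


def group_sentences(text: str, chunk_size: int) -> list[tuple[str, int, int]]:
    """Greedily pack whole sentences into segments of at most chunk_size characters.

    A single sentence longer than chunk_size becomes its own segment.
    """
    segments: list[tuple[str, int, int]] = []
    seg_start = seg_end = None
    for start, end in sentence_spans(text):
        if seg_start is not None and end - seg_start > chunk_size:
            # Adding this sentence would overflow, so close the current segment.
            segments.append((text[seg_start:seg_end], seg_start, seg_end))
            seg_start = None
        if seg_start is None:
            seg_start = start
        seg_end = end
    if seg_start is not None:
        segments.append((text[seg_start:seg_end], seg_start, seg_end))
    return segments


segments = group_sentences("One. Two three. Four!", chunk_size=15)
```

The regular expression mis-splits abbreviations such as "e.g.", which is one reason a real tokenizer like NLTK's is the better choice in practice.
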
### Adding Database Persistence
1. Create `postgres_repository.py`:
```python
from sqlalchemy import create_engine  # assumed to be SQLAlchemy's create_engine


class PostgresDocumentRepository(IDocumentRepository):
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)

    def save(self, document: Document) -> Document:
        # Save to PostgreSQL
        pass
```

2. Swap in `bootstrap.py`:
```python
def _create_repository(self):
    return PostgresDocumentRepository("postgresql://...")
```

## Performance Considerations

### Current Implementation
- In-memory storage: O(1) average-case lookups by ID, limited by RAM
- Synchronous processing: files are processed one at a time
- Thread-safe: Uses locks for concurrent access

### Future Optimizations
- **Async Processing**: Use `asyncio` for concurrent document processing
- **Caching**: Add Redis for frequently accessed documents
- **Streaming**: Process large files in chunks
- **Database**: Use PostgreSQL with indexes for better queries
- **Message Queue**: Use Celery/RabbitMQ for background processing

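As a sketch of the async processing idea, the example below keeps the existing synchronous code and runs it in worker threads with `asyncio.to_thread` (Python 3.9+), capped by a semaphore. The `process_document` stand-in and the `process_many` helper are hypothetical. Threads help mainly when extraction spends its time waiting on I/O; CPU-bound parsing would more likely need a process pool or a task queue.

```python
import asyncio
import time


def process_document(path: str) -> str:
    """Hypothetical stand-in for the synchronous, blocking pipeline."""
    time.sleep(0.05)  # simulates file I/O and parsing
    return f"processed:{path}"


async def process_many(paths: list[str], max_concurrency: int = 4) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> str:
        async with semaphore:
            # Run the blocking call in a thread so the event loop stays responsive.
            return await asyncio.to_thread(process_document, path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))


results = asyncio.run(process_many(["a.pdf", "b.docx", "c.txt"]))
```
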
## Deployment Considerations

### Configuration
- Use environment variables for settings
- Externalize file paths, database connections
- Use `pydantic-settings` for config management

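The shape of environment-driven configuration can be sketched with the standard library alone. This is a stand-in for `pydantic-settings`, not a use of it, and every setting name, variable name, and default below is a made-up example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    upload_dir: str
    database_url: str
    max_chunk_size: int

    @classmethod
    def from_env(cls, env=None) -> "Settings":
        """Build settings from a mapping, defaulting to the process environment."""
        env = os.environ if env is None else env
        return cls(
            upload_dir=env.get("APP_UPLOAD_DIR", "./uploads"),
            database_url=env.get("APP_DATABASE_URL", ""),
            max_chunk_size=int(env.get("APP_MAX_CHUNK_SIZE", "1000")),
        )


settings = Settings.from_env({"APP_MAX_CHUNK_SIZE": "500"})
```

Passing the mapping as a parameter keeps the loader testable without touching the real environment. `pydantic-settings` adds type validation and `.env` file support on top of the same idea.
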
### Monitoring
- Add structured logging (JSON format)
- Track metrics: processing time, error rates
- Use APM tools (DataDog, New Relic)

### Scaling
- Horizontal: Run multiple FastAPI instances behind load balancer
- Vertical: Increase resources for compute-heavy extraction
- Database: Use connection pooling, read replicas