m.dabbagh 70f5b1478c init

2026-01-07 19:15:46 +03:30

Architecture Documentation

Hexagonal Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         INCOMING ADAPTERS                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  FastAPI Routes (HTTP)                                       │   │
│  │  - ProcessDocumentRequest → API Schemas                      │   │
│  │  - ExtractAndChunkRequest → API Schemas                      │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         CORE DOMAIN                                 │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  PORTS (Interfaces)                                          │   │
│  │  ┌────────────────────┐    ┌───────────────────────────┐    │   │
│  │  │  Incoming Ports    │    │  Outgoing Ports           │    │   │
│  │  │  - ITextProcessor  │    │  - IExtractor             │    │   │
│  │  │                    │    │  - IChunker               │    │   │
│  │  │                    │    │  - IDocumentRepository    │    │   │
│  │  └────────────────────┘    └───────────────────────────┘    │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  SERVICES (Business Logic)                                   │   │
│  │  - DocumentProcessorService                                  │   │
│  │    • Orchestrates Extract → Clean → Chunk → Save            │   │
│  │    • Depends ONLY on Port interfaces                         │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN MODELS (Rich Entities)                               │   │
│  │  - Document (with validation & business methods)             │   │
│  │  - Chunk (immutable value object)                            │   │
│  │  - ChunkingStrategy (configuration)                          │   │
│  │  - DocumentMetadata                                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN LOGIC (Pure Functions)                               │   │
│  │  - normalize_whitespace()                                    │   │
│  │  - clean_text()                                              │   │
│  │  - split_into_paragraphs()                                   │   │
│  │  - find_sentence_boundary_before()                           │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXCEPTIONS (Domain Errors)                                  │   │
│  │  - ExtractionError, ChunkingError, ProcessingError          │   │
│  │  - ValidationError, RepositoryError                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         OUTGOING ADAPTERS                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXTRACTORS (Implements IExtractor)                          │   │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐             │   │
│  │  │ PDFExtractor│  │DocxExtractor│ │TxtExtractor│             │   │
│  │  │  (PyPDF2)   │  │(python-docx)│ │ (built-in) │             │   │
│  │  └────────────┘  └────────────┘  └────────────┘             │   │
│  │  - Managed by ExtractorFactory (Factory Pattern)            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  CHUNKERS (Implements IChunker)                              │   │
│  │  ┌─────────────────┐  ┌──────────────────┐                  │   │
│  │  │ FixedSizeChunker│  │ParagraphChunker  │                  │   │
│  │  │  - Fixed chunks │  │ - Respect        │                  │   │
│  │  │  - With overlap │  │   paragraphs     │                  │   │
│  │  └─────────────────┘  └──────────────────┘                  │   │
│  │  - Managed by ChunkingContext (Strategy Pattern)            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  REPOSITORY (Implements IDocumentRepository)                 │   │
│  │  ┌──────────────────────────────────┐                        │   │
│  │  │  InMemoryDocumentRepository      │                        │   │
│  │  │  - Thread-safe Dict storage      │                        │   │
│  │  │  - Easy to swap for PostgreSQL   │                        │   │
│  │  └──────────────────────────────────┘                        │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                         BOOTSTRAP (Wiring)                          │
│  ApplicationContainer:                                              │
│    - Creates all adapters                                           │
│    - Injects dependencies into core                                 │
│    - ONLY place where adapters are instantiated                     │
└─────────────────────────────────────────────────────────────────────┘

Data Flow: Process Document

1. HTTP Request
   │
   ▼
2. FastAPI Route (Incoming Adapter)
   │ - Validates request schema
   ▼
3. DocumentProcessorService (Core)
   │ - Calls ExtractorFactory
   ▼
4. PDFExtractor (Outgoing Adapter)
   │ - Extracts text using PyPDF2
   │ - Maps PyPDF2 exceptions → Domain exceptions
   ▼
5. DocumentProcessorService
   │ - Cleans text using domain logic utils
   │ - Validates Document
   ▼
6. InMemoryRepository (Outgoing Adapter)
   │ - Saves Document
   ▼
7. DocumentProcessorService
   │ - Returns Document
   ▼
8. FastAPI Route
   │ - Converts Document → DocumentResponse
   ▼
9. HTTP Response
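
The core of this flow (steps 3 to 7) can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `Document` fields, the `extract` method name, the stub adapters, and the use of `ValueError` for validation are assumptions, and the factory lookup is replaced by injecting an extractor directly.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Document:
    """Simplified stand-in for the domain model (field names are assumed)."""
    id: str
    content: str
    is_processed: bool = False


def clean_text(text: str) -> str:
    """Stand-in for the domain logic: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


class StubExtractor:
    """Outgoing adapter stand-in; a real one would read the file."""

    def extract(self, file_path: Path) -> str:
        return "  Hello   hexagonal \n world  "


class DictRepository:
    """Outgoing adapter stand-in for the in-memory repository."""

    def __init__(self) -> None:
        self.saved: dict[str, Document] = {}

    def save(self, document: Document) -> Document:
        self.saved[document.id] = document
        return document


class DocumentProcessorService:
    """Core service: it only sees the objects injected into it."""

    def __init__(self, extractor, repository) -> None:
        self._extractor = extractor
        self._repository = repository

    def process_document(self, file_path: Path) -> Document:
        raw = self._extractor.extract(file_path)      # step 4: extract
        content = clean_text(raw)                     # step 5: clean
        if not content:                               # step 5: validate
            raise ValueError("document has no content")
        document = Document(id=file_path.stem, content=content, is_processed=True)
        return self._repository.save(document)        # steps 6-7: save, return


repo = DictRepository()
service = DocumentProcessorService(StubExtractor(), repo)
result = service.process_document(Path("report.pdf"))
print(result.content)  # Hello hexagonal world
```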

Data Flow: Extract and Chunk

1. HTTP Request
   │
   ▼
2. FastAPI Route
   │ - Validates request
   ▼
3. DocumentProcessorService
   │ - Gets extractor from factory
   │ - Extracts text
   ▼
4. Extractor (PDF/DOCX/TXT)
   │ - Returns Document
   ▼
5. DocumentProcessorService
   │ - Cleans text
   │ - Calls ChunkingContext
   ▼
6. ChunkingContext (Strategy Pattern)
   │ - Selects appropriate chunker
   ▼
7. Chunker (FixedSize/Paragraph)
   │ - Splits text into segments
   │ - Creates Chunk entities
   ▼
8. DocumentProcessorService
   │ - Returns List[Chunk]
   ▼
9. FastAPI Route
   │ - Converts Chunks → ChunkResponse[]
   ▼
10. HTTP Response
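
Step 7 is where text becomes `Chunk` entities. The sketch below shows one plausible way a fixed-size chunker with overlap could work; the `Chunk` field names and the `fixed_size_chunks` function are hypothetical and only illustrate the idea of overlapping windows with recorded offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Immutable value object; the field names here are assumed."""
    text: str
    start: int
    end: int


def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[Chunk]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append(Chunk(text[start:end], start, end))
        if end == len(text):  # the last window reached the end of the text
            break
    return chunks


chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
print([c.text for c in chunks])  # ['abcd', 'defg', 'ghij']
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so neighbouring chunks share exactly `overlap` characters.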

Dependency Rules

✅ ALLOWED Dependencies

Incoming Adapters → Core Ports (Incoming)
Core Services → Core Ports (Outgoing)
Core → Core (Domain Models, Logic Utils, Exceptions)
Bootstrap → Everything (Wiring only)

❌ FORBIDDEN Dependencies

Core → Adapters (NEVER!)
Core → External Libraries (external libraries are used only in Adapters)
Domain Models → Services
Domain Models → Ports

Key Design Patterns

1. Hexagonal Architecture (Ports & Adapters)

Purpose: Isolate core business logic from external concerns
Implementation:
- Ports: Interface definitions (ITextProcessor, IExtractor, etc.)
- Adapters: Concrete implementations (PDFExtractor, FastAPI routes)
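
As a rough illustration of the port/adapter split, the sketch below defines a port as an abstract base class and a standard-library adapter that fulfils it. The method names `supports` and `extract` and the helper `count_words` are assumptions; the real `IExtractor` may look different.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class IExtractor(ABC):
    """Outgoing port: the core states what it needs, not how it is done."""

    @abstractmethod
    def supports(self, extension: str) -> bool: ...

    @abstractmethod
    def extract(self, file_path: Path) -> str: ...


class TxtExtractor(IExtractor):
    """Adapter: fulfils the port using only the standard library."""

    def supports(self, extension: str) -> bool:
        return extension.lower() == "txt"

    def extract(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")


def count_words(extractor: IExtractor, file_path: Path) -> int:
    """Core-side code: typed against the port, unaware of the adapter."""
    return len(extractor.extract(file_path).split())


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("ports and adapters", encoding="utf-8")
    word_count = count_words(TxtExtractor(), sample)

print(word_count)  # 3
```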

2. Factory Pattern

Class: ExtractorFactory
Purpose: Create appropriate extractor based on file extension
Benefit: Centralized extractor management, easy to add new types
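
A factory of this kind can be as small as a dictionary keyed by file extension. In the sketch below, `register_extractor` and `supported_extensions` follow names used elsewhere in this document, while `get_extractor` and `StubExtractor` are hypothetical.

```python
from pathlib import Path


class UnsupportedFileTypeError(Exception):
    """Stand-in for the domain exception of the same name."""


class StubExtractor:
    """Placeholder extractor that only declares what it supports."""

    def __init__(self, supported_extensions):
        self.supported_extensions = [ext.lower() for ext in supported_extensions]


class ExtractorFactory:
    def __init__(self) -> None:
        self._by_extension = {}

    def register_extractor(self, extractor) -> None:
        # One extractor may serve several extensions (e.g. html and htm).
        for extension in extractor.supported_extensions:
            self._by_extension[extension] = extractor

    def get_extractor(self, file_path: Path):
        extension = file_path.suffix.lstrip(".").lower()
        try:
            return self._by_extension[extension]
        except KeyError:
            raise UnsupportedFileTypeError(
                f"no extractor registered for '.{extension}'"
            ) from None


factory = ExtractorFactory()
pdf = StubExtractor(["pdf"])
html = StubExtractor(["html", "htm"])
factory.register_extractor(pdf)
factory.register_extractor(html)

print(factory.get_extractor(Path("docs/Report.PDF")) is pdf)  # True
```

Adding a file type then means registering one more object; the lookup code does not change.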

3. Strategy Pattern

Class: ChunkingContext
Purpose: Switch between chunking strategies at runtime
Strategies: FixedSizeChunker, ParagraphChunker
Benefit: Easy to add new chunking algorithms
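
The runtime switch might look like the sketch below. `register_chunker` and `strategy_name` follow names used elsewhere in this document; the `chunk` method, its signature, and the simplified chunkers (no overlap, plain strings instead of `Chunk` entities) are assumptions.

```python
class ParagraphChunker:
    strategy_name = "paragraph"

    def chunk(self, text: str) -> list[str]:
        # Paragraphs are separated by a blank line.
        return [p.strip() for p in text.split("\n\n") if p.strip()]


class FixedSizeChunker:
    strategy_name = "fixed_size"

    def __init__(self, chunk_size: int = 5) -> None:
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> list[str]:
        size = self.chunk_size
        return [text[i:i + size] for i in range(0, len(text), size)]


class ChunkingContext:
    """Holds the registered strategies and delegates to the chosen one."""

    def __init__(self) -> None:
        self._chunkers = {}

    def register_chunker(self, chunker) -> None:
        self._chunkers[chunker.strategy_name] = chunker

    def chunk(self, text: str, strategy_name: str) -> list[str]:
        try:
            chunker = self._chunkers[strategy_name]
        except KeyError:
            raise ValueError(f"unknown chunking strategy: {strategy_name}") from None
        return chunker.chunk(text)


context = ChunkingContext()
context.register_chunker(ParagraphChunker())
context.register_chunker(FixedSizeChunker(chunk_size=5))

print(context.chunk("one two\n\nthree", "paragraph"))  # ['one two', 'three']
print(context.chunk("abcdefg", "fixed_size"))          # ['abcde', 'fg']
```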

4. Repository Pattern

Interface: IDocumentRepository
Implementation: InMemoryDocumentRepository
Purpose: Abstract data persistence
Benefit: Easy to swap storage (memory → PostgreSQL → MongoDB)
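
A lock-guarded dictionary is one straightforward way to build such a repository. The sketch below is an assumption about how `InMemoryDocumentRepository` could be written; `get_by_id`, `count`, and the `Document` fields are hypothetical names.

```python
import threading
from dataclasses import dataclass


class DocumentNotFoundError(Exception):
    """Stand-in for the domain exception of the same name."""


@dataclass
class Document:
    id: str
    content: str


class InMemoryDocumentRepository:
    def __init__(self) -> None:
        self._documents: dict[str, Document] = {}
        self._lock = threading.Lock()  # guards every access to the dict

    def save(self, document: Document) -> Document:
        with self._lock:
            self._documents[document.id] = document
        return document

    def get_by_id(self, document_id: str) -> Document:
        with self._lock:
            try:
                return self._documents[document_id]
            except KeyError:
                raise DocumentNotFoundError(document_id) from None

    def count(self) -> int:
        with self._lock:
            return len(self._documents)


repository = InMemoryDocumentRepository()


def save_batch(worker: int) -> None:
    for i in range(50):
        repository.save(Document(id=f"{worker}-{i}", content="..."))


threads = [threading.Thread(target=save_batch, args=(w,)) for w in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(repository.count())  # 400
```

A PostgreSQL-backed class exposing the same methods could replace this one without any change to the service.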

5. Dependency Injection

Class: ApplicationContainer
Purpose: Wire all dependencies at startup
Benefit: Loose coupling, easy testing
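
The wiring idea can be shown with a deliberately reduced container. In this sketch only the repository is wired, and the optional `repository` constructor argument is an assumption added to show how a test could substitute a double; `_create_repository` follows the name used later in this document.

```python
class InMemoryDocumentRepository:
    """Stand-in adapter; the real one stores documents."""

    def __init__(self) -> None:
        self.documents = {}


class DocumentProcessorService:
    """Stand-in core service that receives its dependency from outside."""

    def __init__(self, repository) -> None:
        self.repository = repository


class ApplicationContainer:
    """The single place where adapters are created and injected."""

    def __init__(self, repository=None) -> None:
        if repository is None:
            repository = self._create_repository()
        self.repository = repository
        self.service = DocumentProcessorService(repository=self.repository)

    def _create_repository(self):
        return InMemoryDocumentRepository()


container = ApplicationContainer()
print(type(container.service.repository).__name__)  # InMemoryDocumentRepository

# A test can hand in its own double instead of the default adapter.
test_double = object()
test_container = ApplicationContainer(repository=test_double)
print(test_container.service.repository is test_double)  # True
```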

6. Template Method Pattern

Classes: BaseExtractor, BaseChunker
Purpose: Define algorithm skeleton, let subclasses fill in details
Benefit: Code reuse, consistent behavior
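
A sketch of what the skeleton in `BaseExtractor` might look like: the public `extract` method fixes the order of the steps, and subclasses only implement `_extract_text`. The exact checks performed by the real base class are not documented here, so the extension check and empty-content check are assumptions, as is `FakeExtractor`.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ExtractionError(Exception):
    """Stand-in for the domain exception."""


class EmptyContentError(ExtractionError):
    """Raised when an extractor yields no usable text."""


class BaseExtractor(ABC):
    def __init__(self, supported_extensions) -> None:
        self.supported_extensions = [ext.lower() for ext in supported_extensions]

    def extract(self, file_path: Path) -> str:
        """Template method: the fixed skeleton shared by every extractor."""
        extension = file_path.suffix.lstrip(".").lower()
        if extension not in self.supported_extensions:
            raise ExtractionError(f"unsupported extension: '{extension}'")
        text = self._extract_text(file_path)  # the only step that varies
        if not text.strip():
            raise EmptyContentError(f"no text found in {file_path.name}")
        return text

    @abstractmethod
    def _extract_text(self, file_path: Path) -> str:
        """Hook: subclasses supply the format-specific extraction."""


class FakeExtractor(BaseExtractor):
    """Illustrative subclass that returns canned text instead of reading a file."""

    def __init__(self, canned_text: str) -> None:
        super().__init__(supported_extensions=["txt"])
        self._canned_text = canned_text

    def _extract_text(self, file_path: Path) -> str:
        return self._canned_text


print(FakeExtractor("hello").extract(Path("a.txt")))  # hello
```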

SOLID Principles Application

Single Responsibility Principle (SRP)

Each extractor handles ONE file type
Each chunker handles ONE strategy
Each service method does ONE thing
Functions are at most 15-20 lines long

Open/Closed Principle (OCP)

Add new extractors without modifying core
Add new chunkers without modifying service
Extend via interfaces, not modification

Liskov Substitution Principle (LSP)

All IExtractor implementations are interchangeable
All IChunker implementations are interchangeable
Polymorphism works correctly

Interface Segregation Principle (ISP)

Small, focused interfaces
IExtractor: Only extraction concerns
IChunker: Only chunking concerns
No fat interfaces

Dependency Inversion Principle (DIP)

Core depends on IExtractor (abstraction)
Core does NOT depend on PDFExtractor (concrete)
High-level modules don't depend on low-level modules

Error Handling Strategy

Domain Exceptions

All external errors are caught and wrapped in domain exceptions:

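import PyPDF2
from PyPDF2.errors import PdfReadError

try:
    reader = PyPDF2.PdfReader(file)  # External library
except PdfReadError as e:
    # Wrap the library error; "from e" keeps the original traceback
    raise ExtractionError(  # Domain exception
        message="Invalid PDF",
        details=str(e),
    ) from e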

Exception Hierarchy

DomainException (Base)
├── ExtractionError
│   ├── UnsupportedFileTypeError
│   └── EmptyContentError
├── ChunkingError
├── ProcessingError
├── ValidationError
└── RepositoryError
    └── DocumentNotFoundError
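
Expressed as classes, the tree above could look like the following sketch. The `message` and `details` attributes mirror the keyword arguments used in the wrapping example; their exact types, and the `describe` helper, are assumptions.

```python
from typing import Optional


class DomainException(Exception):
    """Base class for every error the core is allowed to raise."""

    def __init__(self, message: str, details: Optional[str] = None) -> None:
        super().__init__(message)
        self.message = message
        self.details = details


class ExtractionError(DomainException): ...
class UnsupportedFileTypeError(ExtractionError): ...
class EmptyContentError(ExtractionError): ...
class ChunkingError(DomainException): ...
class ProcessingError(DomainException): ...
class ValidationError(DomainException): ...
class RepositoryError(DomainException): ...
class DocumentNotFoundError(RepositoryError): ...


def describe(error: DomainException) -> str:
    return f"{type(error).__name__}: {error.message} ({error.details})"


try:
    raise EmptyContentError("No text found", details="scan.pdf")
except ExtractionError as caught:  # a parent class catches its children
    summary = describe(caught)

print(summary)  # EmptyContentError: No text found (scan.pdf)
```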

HTTP Error Mapping

FastAPI adapter maps domain exceptions to HTTP status codes:

UnsupportedFileTypeError → 400 Bad Request
ExtractionError → 422 Unprocessable Entity
DocumentNotFoundError → 404 Not Found
ProcessingError → 500 Internal Server Error
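
Because `UnsupportedFileTypeError` is a subclass of `ExtractionError`, the lookup has to prefer the most specific class. One way to get that, sketched here without FastAPI, is to walk the exception's method resolution order. The `status_for` function and the 500 fallback for unmapped exceptions are assumptions; the class definitions are repeated only to keep the example self-contained.

```python
class DomainException(Exception): ...
class ExtractionError(DomainException): ...
class UnsupportedFileTypeError(ExtractionError): ...
class EmptyContentError(ExtractionError): ...
class ProcessingError(DomainException): ...
class RepositoryError(DomainException): ...
class DocumentNotFoundError(RepositoryError): ...


STATUS_BY_EXCEPTION = {
    UnsupportedFileTypeError: 400,
    ExtractionError: 422,
    DocumentNotFoundError: 404,
    ProcessingError: 500,
}


def status_for(error: Exception) -> int:
    """Return the status of the closest mapped class in the error's MRO."""
    for cls in type(error).__mro__:
        if cls in STATUS_BY_EXCEPTION:
            return STATUS_BY_EXCEPTION[cls]
    return 500  # assumed fallback for anything unmapped


print(status_for(UnsupportedFileTypeError()))  # 400
print(status_for(EmptyContentError()))         # 422
print(status_for(DocumentNotFoundError()))     # 404
```

In a FastAPI application, a function like this would typically be called from a handler registered with `@app.exception_handler(DomainException)`.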

Testing Strategy

Unit Tests (Core)

Test domain models in isolation
Test logic utils (pure functions)
Test services with mock ports

Integration Tests (Adapters)

Test extractors with real files
Test chunkers with real text
Test repository operations

API Tests (End-to-End)

Test FastAPI routes
Test complete workflows
Test error scenarios

Example Test Structure

def test_document_processor_service():
    # Arrange: create mocks and inject them into the service
    mock_repository = MockRepository()
    mock_factory = MockExtractorFactory()
    mock_context = MockChunkingContext()

    service = DocumentProcessorService(
        extractor_factory=mock_factory,
        chunking_context=mock_context,
        repository=mock_repository,
    )

    # Act: call the behavior under test
    result = service.process_document(...)

    # Assert: verify the outcome
    assert result.is_processed

Extensibility Examples

Adding a New Extractor (HTML)

Create html_extractor.py:

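from pathlib import Path


class HTMLExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['html', 'htm'])

    def _extract_text(self, file_path: Path) -> str:
        # Imported lazily so bs4 is only required when HTML is processed
        from bs4 import BeautifulSoup
        html = file_path.read_text(encoding='utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        # Return only the visible text, without tags
        return soup.get_text()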

Register it with the factory:

factory.register_extractor(HTMLExtractor())

Adding a New Chunking Strategy (Sentence)

Create sentence_chunker.py:

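from typing import List

import nltk  # third-party; sent_tokenize needs NLTK's tokenizer data downloaded


class SentenceChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="sentence")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> List[tuple[str, int, int]]:
        # Use NLTK to split into sentences
        sentences = nltk.sent_tokenize(text)
        # Group sentences to reach chunk_size (grouping logic omitted here)
        grouped_segments: List[tuple[str, int, int]] = []
        return grouped_segments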

Register it with the chunking context:

context.register_chunker(SentenceChunker())

Adding Database Persistence

Create postgres_repository.py:

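from sqlalchemy import create_engine  # third-party: SQLAlchemy


class PostgresDocumentRepository(IDocumentRepository):
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)

    def save(self, document: Document) -> Document:
        # Save to PostgreSQL (persistence logic omitted)
        raise NotImplementedError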

Swap in bootstrap.py:

def _create_repository(self):
    return PostgresDocumentRepository("postgresql://...")

Performance Considerations

Current Implementation

In-memory storage: O(1) average-case lookups, limited by RAM
Synchronous processing: Sequential file processing
Thread-safe: Uses locks for concurrent access

Future Optimizations

Async Processing: Use asyncio for concurrent document processing
Caching: Add Redis for frequently accessed documents
Streaming: Process large files in chunks
Database: Use PostgreSQL with indexes for better queries
Message Queue: Use Celery/RabbitMQ for background processing
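
As a hedged illustration of the async idea, blocking extraction calls can be moved to worker threads with `asyncio.to_thread` (Python 3.9+) and awaited together. `extract_blocking` is a hypothetical stand-in for a synchronous extractor call; the sleep only simulates I/O.

```python
import asyncio
import time


def extract_blocking(name: str) -> str:
    """Stand-in for a synchronous, I/O-bound extraction call."""
    time.sleep(0.1)
    return f"text of {name}"


async def process_all(names: list[str]) -> list[str]:
    # Each blocking call runs in a worker thread; gather keeps input order.
    tasks = [asyncio.to_thread(extract_blocking, name) for name in names]
    return await asyncio.gather(*tasks)


results = asyncio.run(process_all(["a.pdf", "b.docx", "c.txt"]))
print(results)  # ['text of a.pdf', 'text of b.docx', 'text of c.txt']
```

Threads help when extraction waits on I/O; CPU-heavy parsing would be a better fit for a process pool or the background queue mentioned above.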

Deployment Considerations

Configuration

Use environment variables for settings
Externalize file paths, database connections
Use pydantic-settings for config management
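
The sketch below shows the underlying idea with only the standard library; pydantic-settings would add validation and type coercion on top. The setting names, the `APP_` prefix, and the default values are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    upload_dir: str
    database_url: str
    max_chunk_size: int

    @classmethod
    def from_env(cls, env=None) -> "Settings":
        """Build settings from a mapping, defaulting to the process environment."""
        env = os.environ if env is None else env
        return cls(
            upload_dir=env.get("APP_UPLOAD_DIR", "./uploads"),
            database_url=env.get("APP_DATABASE_URL", ""),
            max_chunk_size=int(env.get("APP_MAX_CHUNK_SIZE", "1000")),
        )


settings = Settings.from_env({"APP_MAX_CHUNK_SIZE": "256"})
print(settings.max_chunk_size)  # 256
print(settings.upload_dir)      # ./uploads
```

Accepting the mapping as a parameter keeps the loader testable without touching the real environment.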

Monitoring

Add structured logging (JSON format)
Track metrics: processing time, error rates
Use APM tools (DataDog, New Relic)
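
Structured logging and a processing-time metric can both be had from the standard `logging` module. In this sketch the `JsonFormatter` class, the logger name, and the `metrics` key passed through `extra` are assumptions; a dedicated JSON logging library could be used instead.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"metrics": {...}} are merged in.
        payload.update(getattr(record, "metrics", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("document_processor")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

start = time.perf_counter()
# ... document processing would happen here ...
elapsed_ms = (time.perf_counter() - start) * 1000

logger.info(
    "document processed",
    extra={"metrics": {"document_id": "doc-1", "elapsed_ms": round(elapsed_ms, 3)}},
)

entry = json.loads(stream.getvalue())
print(entry["message"], entry["document_id"])  # document processed doc-1
```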

Scaling

Horizontal: Run multiple FastAPI instances behind load balancer
Vertical: Increase resources for compute-heavy extraction
Database: Use connection pooling, read replicas
