text_processor/ARCHITECTURE.md
m.dabbagh 70f5b1478c init
2026-01-07 19:15:46 +03:30

Architecture Documentation

Hexagonal Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         INCOMING ADAPTERS                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  FastAPI Routes (HTTP)                                       │   │
│  │  - ProcessDocumentRequest → API Schemas                      │   │
│  │  - ExtractAndChunkRequest → API Schemas                      │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         CORE DOMAIN                                 │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  PORTS (Interfaces)                                          │   │
│  │  ┌────────────────────┐    ┌───────────────────────────┐    │   │
│  │  │  Incoming Ports    │    │  Outgoing Ports           │    │   │
│  │  │  - ITextProcessor  │    │  - IExtractor             │    │   │
│  │  │                    │    │  - IChunker               │    │   │
│  │  │                    │    │  - IDocumentRepository    │    │   │
│  │  └────────────────────┘    └───────────────────────────┘    │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  SERVICES (Business Logic)                                   │   │
│  │  - DocumentProcessorService                                  │   │
│  │    • Orchestrates Extract → Clean → Chunk → Save            │   │
│  │    • Depends ONLY on Port interfaces                         │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN MODELS (Rich Entities)                               │   │
│  │  - Document (with validation & business methods)             │   │
│  │  - Chunk (immutable value object)                            │   │
│  │  - ChunkingStrategy (configuration)                          │   │
│  │  - DocumentMetadata                                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  DOMAIN LOGIC (Pure Functions)                               │   │
│  │  - normalize_whitespace()                                    │   │
│  │  - clean_text()                                              │   │
│  │  - split_into_paragraphs()                                   │   │
│  │  - find_sentence_boundary_before()                           │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXCEPTIONS (Domain Errors)                                  │   │
│  │  - ExtractionError, ChunkingError, ProcessingError          │   │
│  │  - ValidationError, RepositoryError                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         OUTGOING ADAPTERS                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  EXTRACTORS (Implements IExtractor)                          │   │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐     │   │
│  │  │ PDFExtractor  │  │ DocxExtractor │  │ TxtExtractor  │     │   │
│  │  │ (PyPDF2)      │  │ (python-docx) │  │ (built-in)    │     │   │
│  │  └───────────────┘  └───────────────┘  └───────────────┘     │   │
│  │  - Managed by ExtractorFactory (Factory Pattern)            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  CHUNKERS (Implements IChunker)                              │   │
│  │  ┌─────────────────┐  ┌──────────────────┐                  │   │
│  │  │ FixedSizeChunker│  │ParagraphChunker  │                  │   │
│  │  │  - Fixed chunks │  │ - Respect        │                  │   │
│  │  │  - With overlap │  │   paragraphs     │                  │   │
│  │  └─────────────────┘  └──────────────────┘                  │   │
│  │  - Managed by ChunkingContext (Strategy Pattern)            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  REPOSITORY (Implements IDocumentRepository)                 │   │
│  │  ┌──────────────────────────────────┐                        │   │
│  │  │  InMemoryDocumentRepository      │                        │   │
│  │  │  - Thread-safe Dict storage      │                        │   │
│  │  │  - Easy to swap for PostgreSQL   │                        │   │
│  │  └──────────────────────────────────┘                        │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                         BOOTSTRAP (Wiring)                          │
│  ApplicationContainer:                                              │
│    - Creates all adapters                                           │
│    - Injects dependencies into core                                 │
│    - ONLY place where adapters are instantiated                     │
└─────────────────────────────────────────────────────────────────────┘

Data Flow: Process Document

1. HTTP Request
   │
   ▼
2. FastAPI Route (Incoming Adapter)
   │ - Validates request schema
   ▼
3. DocumentProcessorService (Core)
   │ - Calls ExtractorFactory
   ▼
4. PDFExtractor (Outgoing Adapter)
   │ - Extracts text using PyPDF2
   │ - Maps PyPDF2 exceptions → Domain exceptions
   ▼
5. DocumentProcessorService
   │ - Cleans text using domain logic utils
   │ - Validates Document
   ▼
6. InMemoryRepository (Outgoing Adapter)
   │ - Saves Document
   ▼
7. DocumentProcessorService
   │ - Returns Document
   ▼
8. FastAPI Route
   │ - Converts Document → DocumentResponse
   ▼
9. HTTP Response
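
The core part of this flow (steps 3 to 7) can be sketched with stand-in ports. This is a minimal illustration, not the project's actual code: the `Document` fields, the `extract`/`save` signatures, and the `StubExtractor` and `DictRepository` helpers are assumptions, and the service takes an extractor directly instead of going through `ExtractorFactory` to keep the example short.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class Document:
    # Simplified stand-in for the real domain model (fields are assumed)
    path: Path
    content: str
    is_processed: bool = False


class IExtractor(Protocol):
    def extract(self, file_path: Path) -> Document: ...


class IDocumentRepository(Protocol):
    def save(self, document: Document) -> Document: ...


class DocumentProcessorService:
    """Core service: sees only the port interfaces, never an adapter class."""

    def __init__(self, extractor: IExtractor, repository: IDocumentRepository):
        self._extractor = extractor
        self._repository = repository

    def process_document(self, file_path: Path) -> Document:
        document = self._extractor.extract(file_path)          # steps 3-4
        document.content = " ".join(document.content.split())  # step 5: clean
        if not document.content:                                # step 5: validate
            raise ValueError("document has no content")
        document.is_processed = True
        return self._repository.save(document)                 # steps 6-7


class StubExtractor:
    # Hypothetical adapter that returns fixed text instead of reading a file
    def extract(self, file_path: Path) -> Document:
        return Document(path=file_path, content="  Hello \n\n  world  ")


class DictRepository:
    # Hypothetical adapter backed by a plain dict
    def __init__(self):
        self.saved = {}

    def save(self, document: Document) -> Document:
        self.saved[str(document.path)] = document
        return document
```

Because the service only names the two protocols, either stand-in can be replaced by a real adapter without touching `process_document`.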

Data Flow: Extract and Chunk

1. HTTP Request
   │
   ▼
2. FastAPI Route
   │ - Validates request
   ▼
3. DocumentProcessorService
   │ - Gets extractor from factory
   │ - Extracts text
   ▼
4. Extractor (PDF/DOCX/TXT)
   │ - Returns Document
   ▼
5. DocumentProcessorService
   │ - Cleans text
   │ - Calls ChunkingContext
   ▼
6. ChunkingContext (Strategy Pattern)
   │ - Selects appropriate chunker
   ▼
7. Chunker (FixedSize/Paragraph)
   │ - Splits text into segments
   │ - Creates Chunk entities
   ▼
8. DocumentProcessorService
   │ - Returns List[Chunk]
   ▼
9. FastAPI Route
   │ - Converts Chunks → ChunkResponse[]
   ▼
10. HTTP Response
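
Step 7 is where text becomes `Chunk` entities. As a rough sketch of what a fixed-size chunker with overlap might do, assuming character-based sizes and a `Chunk` with `text`, `start` and `end` fields (the real field names and the `fixed_size_chunks` helper are not taken from the project):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Immutable value object; the field names are assumptions
    text: str
    start: int
    end: int


def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[Chunk]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append(Chunk(text[start:end], start, end))
        if end == len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk starts three characters after the previous one, so the last character of one chunk is repeated as the first character of the next.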

Dependency Rules

ALLOWED Dependencies

Incoming Adapters → Core Ports (Incoming)
Core Services → Core Ports (Outgoing)
Core → Core (Domain Models, Logic Utils, Exceptions)
Bootstrap → Everything (Wiring only)

FORBIDDEN Dependencies

Core → Adapters (NEVER!)
Core → Third-Party Libraries (allowed only in Adapters)
Domain Models → Services
Domain Models → Ports
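
Rules like these are easier to keep when a test enforces them. One possible approach, sketched here with the standard `ast` module, is to scan core source files for imports that start with a forbidden prefix. The prefixes used in the usage note (`adapters`, `PyPDF2`) are illustrative and depend on the actual package layout.

```python
import ast


def forbidden_imports(source: str, forbidden_prefixes: tuple[str, ...]) -> list[str]:
    """Return imported module names in `source` that start with a forbidden prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        found.extend(
            name for name in names
            if any(name == p or name.startswith(p + ".") for p in forbidden_prefixes)
        )
    return found
```

A test could read every file under the core package and assert that `forbidden_imports(source, ("adapters", "PyPDF2"))` returns an empty list.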

Key Design Patterns

1. Hexagonal Architecture (Ports & Adapters)

  • Purpose: Isolate core business logic from external concerns
  • Implementation:
    • Ports: Interface definitions (ITextProcessor, IExtractor, etc.)
    • Adapters: Concrete implementations (PDFExtractor, FastAPI routes)

2. Factory Pattern

  • Class: ExtractorFactory
  • Purpose: Create appropriate extractor based on file extension
  • Benefit: Centralized extractor management, easy to add new types
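
A minimal sketch of how such a factory could work. `register_extractor` appears in the bootstrap example later in this document; the `get_extractor` method, the lookup by lower-cased suffix, and the stand-in classes are assumptions made for illustration.

```python
from pathlib import Path


class UnsupportedFileTypeError(Exception):
    """Stand-in for the domain exception of the same name."""


class ExtractorFactory:
    def __init__(self):
        self._by_extension = {}

    def register_extractor(self, extractor) -> None:
        # Assumes each extractor exposes a `supported_extensions` list
        for extension in extractor.supported_extensions:
            self._by_extension[extension.lower()] = extractor

    def get_extractor(self, file_path: Path):
        # Hypothetical lookup method: match on the lower-cased file suffix
        extension = file_path.suffix.lstrip(".").lower()
        try:
            return self._by_extension[extension]
        except KeyError:
            raise UnsupportedFileTypeError(
                f"No extractor registered for '.{extension}' files"
            ) from None


class TxtExtractor:
    # Placeholder adapter; a real one would also implement extraction
    supported_extensions = ["txt"]
```

Adding a new file type then means registering one more extractor; the lookup code stays unchanged.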

3. Strategy Pattern

  • Class: ChunkingContext
  • Purpose: Switch between chunking strategies at runtime
  • Strategies: FixedSizeChunker, ParagraphChunker
  • Benefit: Easy to add new chunking algorithms
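
The runtime switch might look roughly like this. `register_chunker` and `strategy_name` come from the extensibility example later in this document; the `chunk` method names and the simplified chunkers (which return plain strings instead of `Chunk` entities) are assumptions.

```python
from typing import Protocol


class IChunker(Protocol):
    strategy_name: str

    def chunk(self, text: str) -> list[str]: ...


class FixedSizeChunker:
    strategy_name = "fixed_size"

    def __init__(self, size: int):
        self._size = size

    def chunk(self, text: str) -> list[str]:
        return [text[i:i + self._size] for i in range(0, len(text), self._size)]


class ParagraphChunker:
    strategy_name = "paragraph"

    def chunk(self, text: str) -> list[str]:
        # Treat blank lines as paragraph separators
        return [p.strip() for p in text.split("\n\n") if p.strip()]


class ChunkingContext:
    def __init__(self):
        self._chunkers: dict[str, IChunker] = {}

    def register_chunker(self, chunker: IChunker) -> None:
        self._chunkers[chunker.strategy_name] = chunker

    def chunk(self, text: str, strategy_name: str) -> list[str]:
        # Pick the strategy by name at call time
        if strategy_name not in self._chunkers:
            raise ValueError(f"Unknown chunking strategy: {strategy_name!r}")
        return self._chunkers[strategy_name].chunk(text)
```

The caller chooses a strategy per request by name, so the service never branches on chunker types itself.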

4. Repository Pattern

  • Interface: IDocumentRepository
  • Implementation: InMemoryDocumentRepository
  • Purpose: Abstract data persistence
  • Benefit: Easy to swap storage (memory → PostgreSQL → MongoDB)
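
As an illustration of the in-memory variant, the sketch below guards a dict with a lock. Only `save` is confirmed by this document; `find_by_id`, `count`, and the `Document` fields are hypothetical.

```python
import threading
from dataclasses import dataclass


class DocumentNotFoundError(Exception):
    """Stand-in for the domain exception of the same name."""


@dataclass
class Document:
    # Simplified stand-in; the real model carries more fields
    id: str
    content: str


class InMemoryDocumentRepository:
    def __init__(self):
        self._documents: dict[str, Document] = {}
        self._lock = threading.Lock()  # serializes access from concurrent requests

    def save(self, document: Document) -> Document:
        with self._lock:
            self._documents[document.id] = document
        return document

    def find_by_id(self, document_id: str) -> Document:
        with self._lock:
            try:
                return self._documents[document_id]
            except KeyError:
                raise DocumentNotFoundError(document_id) from None

    def count(self) -> int:
        with self._lock:
            return len(self._documents)
```

A database-backed repository would expose the same methods, which is what makes the swap invisible to the core.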

5. Dependency Injection

  • Class: ApplicationContainer
  • Purpose: Wire all dependencies at startup
  • Benefit: Loose coupling, easy testing
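
A container of this kind can be very small. In the sketch below the adapter classes are empty placeholders, and the `_create_extractor_factory` and `_create_chunking_context` method names are assumptions (only `_create_repository` appears elsewhere in this document).

```python
class InMemoryDocumentRepository:
    """Placeholder outgoing adapter."""


class ExtractorFactory:
    """Placeholder; the real factory holds the registered extractors."""


class ChunkingContext:
    """Placeholder; the real context holds the registered chunkers."""


class DocumentProcessorService:
    def __init__(self, extractor_factory, chunking_context, repository):
        self.extractor_factory = extractor_factory
        self.chunking_context = chunking_context
        self.repository = repository


class ApplicationContainer:
    """Composition root: the one place that names concrete adapter classes."""

    def __init__(self):
        self.text_processor = DocumentProcessorService(
            extractor_factory=self._create_extractor_factory(),
            chunking_context=self._create_chunking_context(),
            repository=self._create_repository(),
        )

    def _create_extractor_factory(self):
        return ExtractorFactory()

    def _create_chunking_context(self):
        return ChunkingContext()

    def _create_repository(self):
        return InMemoryDocumentRepository()
```

Tests can subclass the container and override a single `_create_*` method to substitute a fake, leaving the rest of the wiring as it is.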

6. Template Method Pattern

  • Classes: BaseExtractor, BaseChunker
  • Purpose: Define algorithm skeleton, let subclasses fill in details
  • Benefit: Code reuse, consistent behavior
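
For the extractor side, the skeleton might be shaped like this. The constructor argument and the `_extract_text` hook match the extensibility example later in this document; the public `extract` method and its two checks are assumptions about what the shared steps could be.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ExtractionError(Exception):
    """Stand-in for the domain exception of the same name."""


class BaseExtractor(ABC):
    def __init__(self, supported_extensions: list[str]):
        self.supported_extensions = supported_extensions

    def extract(self, file_path: Path) -> str:
        """Template method: the skeleton is fixed, one step is left to subclasses."""
        extension = file_path.suffix.lstrip(".").lower()
        if extension not in self.supported_extensions:
            raise ExtractionError(f"Unsupported extension: {extension!r}")
        text = self._extract_text(file_path)  # the subclass-specific step
        if not text.strip():
            raise ExtractionError(f"No text found in {file_path.name}")
        return text

    @abstractmethod
    def _extract_text(self, file_path: Path) -> str:
        """Return the raw text of the file."""


class TxtExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=["txt"])

    def _extract_text(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")  # assumes UTF-8 input
```

Every subclass gets the extension check and the empty-content check for free and only supplies the format-specific read.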

SOLID Principles Application

Single Responsibility Principle (SRP)

  • Each extractor handles ONE file type
  • Each chunker handles ONE strategy
  • Each service method does ONE thing
  • Functions are kept to 15-20 lines at most

Open/Closed Principle (OCP)

  • Add new extractors without modifying core
  • Add new chunkers without modifying service
  • Extend via interfaces, not modification

Liskov Substitution Principle (LSP)

  • All IExtractor implementations are interchangeable
  • All IChunker implementations are interchangeable
  • Callers can substitute any implementation without changing their own code

Interface Segregation Principle (ISP)

  • Small, focused interfaces
  • IExtractor: Only extraction concerns
  • IChunker: Only chunking concerns
  • No fat interfaces

Dependency Inversion Principle (DIP)

  • Core depends on IExtractor (abstraction)
  • Core does NOT depend on PDFExtractor (concrete)
  • High-level modules don't depend on low-level modules; both depend on abstractions

Error Handling Strategy

Domain Exceptions

All external errors are caught and wrapped in domain exceptions:

import PyPDF2

try:
    reader = PyPDF2.PdfReader(file)  # External library
except PyPDF2.errors.PdfReadError as e:
    raise ExtractionError(  # Domain exception
        message="Invalid PDF",
        details=str(e),
    ) from e  # Chain the original error so its traceback is preserved

Exception Hierarchy

DomainException (Base)
├── ExtractionError
│   ├── UnsupportedFileTypeError
│   └── EmptyContentError
├── ChunkingError
├── ProcessingError
├── ValidationError
└── RepositoryError
    └── DocumentNotFoundError
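
In Python this tree maps directly onto class inheritance. The sketch below assumes the base class accepts the `message` and `details` arguments used in the wrapping example above; the real constructor may differ.

```python
from typing import Optional


class DomainException(Exception):
    def __init__(self, message: str, details: Optional[str] = None):
        super().__init__(message)
        self.message = message
        self.details = details


class ExtractionError(DomainException): ...
class UnsupportedFileTypeError(ExtractionError): ...
class EmptyContentError(ExtractionError): ...
class ChunkingError(DomainException): ...
class ProcessingError(DomainException): ...
class ValidationError(DomainException): ...
class RepositoryError(DomainException): ...
class DocumentNotFoundError(RepositoryError): ...
```

Catching `ExtractionError` also catches `UnsupportedFileTypeError` and `EmptyContentError`, so callers can choose how specific their handling needs to be.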

HTTP Error Mapping

FastAPI adapter maps domain exceptions to HTTP status codes:

  • UnsupportedFileTypeError → 400 Bad Request
  • ExtractionError → 422 Unprocessable Entity
  • DocumentNotFoundError → 404 Not Found
  • ProcessingError → 500 Internal Server Error
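
Since `UnsupportedFileTypeError` is a subclass of `ExtractionError`, the lookup has to prefer the most specific class. One way to do that, sketched here without FastAPI so it runs on its own, is to walk the exception's method resolution order. The `status_code_for` helper and the 500 default for unmapped exceptions are assumptions; in the real adapter a lookup like this would presumably sit inside an exception handler.

```python
class DomainException(Exception): ...
class ExtractionError(DomainException): ...
class UnsupportedFileTypeError(ExtractionError): ...
class RepositoryError(DomainException): ...
class DocumentNotFoundError(RepositoryError): ...
class ProcessingError(DomainException): ...


STATUS_BY_EXCEPTION = {
    UnsupportedFileTypeError: 400,
    ExtractionError: 422,
    DocumentNotFoundError: 404,
    ProcessingError: 500,
}


def status_code_for(error: Exception, default: int = 500) -> int:
    """Walk the MRO so a subclass's own entry wins over its parent's."""
    for cls in type(error).__mro__:
        if cls in STATUS_BY_EXCEPTION:
            return STATUS_BY_EXCEPTION[cls]
    return default  # assumed fallback for exceptions not listed above
```

An `UnsupportedFileTypeError` resolves to 400 and not to its parent's 422, because its own class comes first in the MRO.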

Testing Strategy

Unit Tests (Core)

  • Test domain models in isolation
  • Test logic utils (pure functions)
  • Test services with mock ports

Integration Tests (Adapters)

  • Test extractors with real files
  • Test chunkers with real text
  • Test repository operations

API Tests (End-to-End)

  • Test FastAPI routes
  • Test complete workflows
  • Test error scenarios

Example Test Structure

def test_document_processor_service():
    # Arrange: create mocks and inject them into the service
    mock_repository = MockRepository()
    mock_factory = MockExtractorFactory()
    mock_context = MockChunkingContext()
    service = DocumentProcessorService(
        extractor_factory=mock_factory,
        chunking_context=mock_context,
        repository=mock_repository,
    )

    # Act: exercise the behavior under test
    result = service.process_document(...)

    # Assert: check the outcome
    assert result.is_processed
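
If hand-written mock classes are not available, `unittest.mock.Mock` from the standard library can play the same role. The sketch below is self-contained, so it uses a minimal stand-in for `DocumentProcessorService`; the `get_extractor` and `extract` method names are assumptions.

```python
from unittest.mock import Mock


class DocumentProcessorService:
    # Minimal stand-in so the test below runs on its own
    def __init__(self, extractor_factory, chunking_context, repository):
        self._extractor_factory = extractor_factory
        self._chunking_context = chunking_context
        self._repository = repository

    def process_document(self, file_path):
        extractor = self._extractor_factory.get_extractor(file_path)
        return self._repository.save(extractor.extract(file_path))


def test_process_document_saves_what_was_extracted():
    # Arrange: mocks stand in for all three ports
    extracted = Mock(name="document")
    factory, context, repository = Mock(), Mock(), Mock()
    factory.get_extractor.return_value.extract.return_value = extracted
    repository.save.side_effect = lambda document: document
    service = DocumentProcessorService(factory, context, repository)

    # Act
    result = service.process_document("report.pdf")

    # Assert: the extracted document was saved and returned
    assert result is extracted
    factory.get_extractor.assert_called_once_with("report.pdf")
    repository.save.assert_called_once_with(extracted)
```

The assertions check interactions with the ports, which is usually what a service-level unit test cares about.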

Extensibility Examples

Adding a New Extractor (HTML)

  1. Create html_extractor.py:
from pathlib import Path

class HTMLExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['html', 'htm'])

    def _extract_text(self, file_path: Path) -> str:
        # Imported here so beautifulsoup4 is only needed when HTML is processed
        from bs4 import BeautifulSoup
        html = file_path.read_text(encoding='utf-8')  # assumes UTF-8 input
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text()
  2. Register in bootstrap.py:
factory.register_extractor(HTMLExtractor())

Adding a New Chunking Strategy (Sentence)

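  1. Create sentence_chunker.py:
import nltk  # sent_tokenize requires the "punkt" tokenizer data

class SentenceChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="sentence")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        # Group sentences up to strategy.chunk_size; segments are assumed to be (text, start, end)
        grouped_segments, start, cursor = [], 0, 0
        for sentence in nltk.sent_tokenize(text):
            begin = text.index(sentence, cursor)
            if cursor > start and begin + len(sentence) - start > strategy.chunk_size:
                grouped_segments.append((text[start:cursor], start, cursor))
                start = begin
            cursor = begin + len(sentence)
        if cursor > start:
            grouped_segments.append((text[start:cursor], start, cursor))
        return grouped_segments
  2. Register in bootstrap.py:
context.register_chunker(SentenceChunker())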

Adding Database Persistence

  1. Create postgres_repository.py:
from sqlalchemy import create_engine  # assumes SQLAlchemy as the database layer

class PostgresDocumentRepository(IDocumentRepository):
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)

    def save(self, document: Document) -> Document:
        # Save to PostgreSQL (persistence logic not shown)
        raise NotImplementedError
  2. Swap in bootstrap.py:
def _create_repository(self):
    return PostgresDocumentRepository("postgresql://...")

Performance Considerations

Current Implementation

  • In-memory storage: O(1) lookups, limited by RAM
  • Synchronous processing: Sequential file processing
  • Thread-safe: Uses locks for concurrent access

Future Optimizations

  • Async Processing: Use asyncio for concurrent document processing
  • Caching: Add Redis for frequently accessed documents
  • Streaming: Process large files in chunks
  • Database: Use PostgreSQL with indexes for better queries
  • Message Queue: Use Celery/RabbitMQ for background processing
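
As a rough idea of the first item, existing synchronous processing code can be run concurrently from asyncio by pushing each call onto a worker thread. The `process_document` function below is a placeholder that sleeps instead of doing real work. Threads help mostly when processing waits on I/O; CPU-bound extraction would likely need processes or a task queue instead.

```python
import asyncio
import time


def process_document(name: str) -> str:
    # Placeholder for the existing synchronous processing call
    time.sleep(0.05)
    return name.upper()


async def process_all(names: list[str]) -> list[str]:
    # Run each blocking call in a worker thread; results keep the input order
    return await asyncio.gather(
        *(asyncio.to_thread(process_document, name) for name in names)
    )
```

`asyncio.run(process_all(["a.pdf", "b.txt"]))` processes both files in parallel threads and returns the results in the order the names were given.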

Deployment Considerations

Configuration

  • Use environment variables for settings
  • Externalize file paths, database connections
  • Use pydantic-settings for config management
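
To show the idea without extra dependencies, the sketch below reads settings from environment variables using only the standard library rather than pydantic-settings. The setting names, their defaults, and the `TP_` prefix are all hypothetical.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Hypothetical settings; the defaults apply when a variable is not set
    upload_dir: str = "./uploads"
    database_url: str = ""
    chunk_size: int = 1000

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        return cls(
            upload_dir=env.get("TP_UPLOAD_DIR", cls.upload_dir),
            database_url=env.get("TP_DATABASE_URL", cls.database_url),
            chunk_size=int(env.get("TP_CHUNK_SIZE", cls.chunk_size)),
        )
```

Accepting the environment as a parameter keeps the loader testable: a test can pass a plain dict and leave the process environment untouched.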

Monitoring

  • Add structured logging (JSON format)
  • Track metrics: processing time, error rates
  • Use APM tools (DataDog, New Relic)
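
Structured logging needs nothing beyond the standard `logging` module. In this sketch the JSON keys and the two extra fields (`document_id`, `duration_ms`) are illustrative choices, not an established schema for this project.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    EXTRA_FIELDS = ("document_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)
```

After attaching the formatter to a handler, a call such as `logger.info("processed", extra={"duration_ms": 12})` produces one JSON line that log aggregators can index by field.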

Scaling

  • Horizontal: Run multiple FastAPI instances behind a load balancer
  • Vertical: Increase resources for compute-heavy extraction
  • Database: Use connection pooling, read replicas