
Text Processor - Hexagonal Architecture

A production-ready text extraction and chunking system built with Hexagonal Architecture (Ports & Adapters pattern).

Architecture Overview

This project demonstrates a "Gold Standard" implementation of Clean Architecture principles. The layout below shows how the core, ports, and adapters are separated.

Project Structure

text_processor_hex/
├── src/
│   ├── core/                      # Domain Layer (Pure Business Logic)
│   │   ├── domain/
│   │   │   ├── models.py          # Rich Pydantic v2 entities
│   │   │   ├── exceptions.py      # Custom domain exceptions
│   │   │   └── logic_utils.py     # Pure functions for text processing
│   │   ├── ports/
│   │   │   ├── incoming/          # Service Interfaces (Use Cases)
│   │   │   └── outgoing/          # SPIs (Extractor, Chunker, Repository)
│   │   └── services/              # Business logic orchestration
│   ├── adapters/
│   │   ├── incoming/              # FastAPI routes & schemas
│   │   └── outgoing/
│   │       ├── extractors/        # PDF/DOCX/TXT implementations
│   │       ├── chunkers/          # Chunking strategy implementations
│   │       └── persistence/       # Repository implementations
│   ├── shared/                    # Cross-cutting concerns (logging)
│   └── bootstrap.py               # Dependency Injection wiring
├── main.py                        # Application entry point
└── requirements.txt

Key Design Patterns

  1. Hexagonal Architecture: Core domain is isolated from external concerns
  2. Dependency Inversion: Core depends on abstractions (ports), not implementations
  3. Strategy Pattern: Pluggable chunking strategies (FixedSize, Paragraph)
  4. Factory Pattern: Dynamic extractor selection based on file type
  5. Repository Pattern: Abstract data persistence
  6. Rich Domain Models: Entities with validation and business logic
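
As a minimal sketch of how patterns 1, 2, and 5 might fit together, the example below assumes a hypothetical DocumentRepository port, an InMemoryRepository adapter, and a DocumentService. None of these names or signatures are taken from the project's source; they only illustrate the direction of the dependencies.

```python
from __future__ import annotations

import threading
from typing import Protocol


class DocumentRepository(Protocol):
    """Outgoing port (hypothetical): what the core needs from persistence."""

    def save(self, document_id: str, text: str) -> None: ...

    def find(self, document_id: str) -> str | None: ...


class InMemoryRepository:
    """Adapter (hypothetical): a dict guarded by a lock for thread safety."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}
        self._lock = threading.Lock()

    def save(self, document_id: str, text: str) -> None:
        with self._lock:
            self._items[document_id] = text

    def find(self, document_id: str) -> str | None:
        with self._lock:
            return self._items.get(document_id)


class DocumentService:
    """Core service: depends only on the port, never on a concrete adapter."""

    def __init__(self, repository: DocumentRepository) -> None:
        self._repository = repository

    def store(self, document_id: str, text: str) -> int:
        # Normalize whitespace in the core, then delegate persistence to the port.
        cleaned = " ".join(text.split())
        self._repository.save(document_id, cleaned)
        return len(cleaned)


service = DocumentService(repository=InMemoryRepository())
stored_length = service.store("doc-1", "  hello   world ")
```

Because DocumentService only knows the port, a database-backed adapter with the same two methods could replace InMemoryRepository without touching the core.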

SOLID Principles

  • Single Responsibility: Each class has one reason to change
  • Open/Closed: Extensible via strategies and factories
  • Liskov Substitution: All adapters are substitutable
  • Interface Segregation: Focused port interfaces
  • Dependency Inversion: Core depends on abstractions

Features

  • Extract text from PDF, DOCX, and TXT files
  • Multiple chunking strategies:
    • Fixed Size: Split text into equal-sized chunks with overlap
    • Paragraph: Respect document structure and paragraph boundaries
  • Rich domain models with validation
  • Comprehensive error handling with domain exceptions
  • RESTful API with FastAPI
  • Thread-safe in-memory repository
  • Fully typed with Python 3.10+ type hints
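
To make the Fixed Size strategy concrete, here is a minimal sketch of overlapping windows. The function name chunk_fixed_size and the (chunk, start, end) tuple layout are assumptions for illustration; the project's own chunker may differ, for example in how it respects boundaries.

```python
from __future__ import annotations


def chunk_fixed_size(text: str, chunk_size: int, overlap_size: int) -> list[tuple[str, int, int]]:
    """Split text into (chunk, start, end) windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    step = chunk_size - overlap_size  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((text[start:end], start, end))
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


chunks = chunk_fixed_size("abcdefghij", chunk_size=4, overlap_size=1)
# chunks == [("abcd", 0, 4), ("defg", 3, 7), ("ghij", 6, 10)]
```

Each window starts chunk_size - overlap_size characters after the previous one, so consecutive chunks share overlap_size characters. The final chunk can be shorter than chunk_size when the text length does not line up.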

Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Running the Application

# Start the FastAPI server
python main.py

# Or use uvicorn directly
uvicorn main:app --reload --host 0.0.0.0 --port 8000

With the uvicorn command above, the API will be available on port 8000 (for example, http://localhost:8000 when run locally).

API Endpoints

Process Document

POST /api/v1/process
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "fixed_size",
    "chunk_size": 1000,
    "overlap_size": 100,
    "respect_boundaries": true
  }
}
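
The request above can be sent with any HTTP client. As a sketch using only the standard library, the helper below builds the request without sending it. The base URL assumes the server was started on port 8000 as in the uvicorn command, and build_process_request is a hypothetical helper name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server started with --port 8000


def build_process_request(
    file_path: str, chunk_size: int = 1000, overlap_size: int = 100
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/process request."""
    payload = {
        "file_path": file_path,
        "chunking_strategy": {
            "strategy_name": "fixed_size",
            "chunk_size": chunk_size,
            "overlap_size": overlap_size,
            "respect_boundaries": True,
        },
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_process_request("/path/to/document.pdf")

# To actually send it, with the server running (assumes a JSON response):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```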

Extract and Chunk

POST /api/v1/extract-and-chunk
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "paragraph",
    "chunk_size": 1000,
    "overlap_size": 0,
    "respect_boundaries": true
  }
}
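
For intuition about what a paragraph strategy with a chunk_size limit might do, the sketch below packs whole paragraphs into chunks. chunk_by_paragraph is a hypothetical function, and treating blank lines as paragraph boundaries is an assumption; the project's paragraph chunker may behave differently.

```python
from __future__ import annotations

import re


def chunk_by_paragraph(text: str, chunk_size: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole rather than split.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond one.\n\n\nThird paragraph is here."
chunks = chunk_by_paragraph(sample, chunk_size=30)
# chunks == ["First paragraph.\n\nSecond one.", "Third paragraph is here."]
```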

Get Document

GET /api/v1/documents/{document_id}

List Documents

GET /api/v1/documents?limit=100&offset=0

Delete Document

DELETE /api/v1/documents/{document_id}

Health Check

GET /api/v1/health

Programmatic Usage

from pathlib import Path
from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Create application container
container = create_application(log_level="INFO")

# Get the service
service = container.text_processor_service

# Process a document
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)

document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

print(f"Processed: {document.get_metadata_summary()}")
print(f"Preview: {document.get_content_preview()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")

Adding New Extractors

To add support for a new file type:

  1. Create a new extractor in src/adapters/outgoing/extractors/:
from pathlib import Path

from .base import BaseExtractor

class MyExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['myext'])

    def _extract_text(self, file_path: Path) -> str:
        # Your extraction logic here; reading the file as UTF-8 is only a placeholder
        extracted_text = file_path.read_text(encoding="utf-8")
        return extracted_text
  2. Register in src/bootstrap.py:
factory.register_extractor(MyExtractor())

Adding New Chunking Strategies

To add a new chunking strategy:

  1. Create a new chunker in src/adapters/outgoing/chunkers/:
from src.core.domain.models import ChunkingStrategy

from .base import BaseChunker

class MyChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="my_strategy")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        # Your chunking logic here; the empty list is only a placeholder
        segments: list[tuple[str, int, int]] = []
        return segments
  2. Register in src/bootstrap.py:
context.register_chunker(MyChunker())

Testing

The architecture is designed for easy testing:

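# Mock the repository
from src.core.ports.outgoing.repository import IDocumentRepository

class MockRepository(IDocumentRepository):
    # Implement every method of the port here; `pass` is only a placeholder
    # and will not instantiate if the port declares abstract methods
    pass

# Inject mock in service
# (import DocumentProcessorService and build extractor_factory and
# chunking_context the same way src/bootstrap.py does)
service = DocumentProcessorService(
    extractor_factory=extractor_factory,
    chunking_context=chunking_context,
    repository=MockRepository(),  # Mock injected here
)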
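
Since the real port and service signatures are not shown here, the following self-contained sketch uses hypothetical stand-ins (DocumentRepositoryPort, RecordingRepository, TinyService) to show the same idea end to end: a test double records calls, and the test asserts on what the service asked it to store.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocumentRepositoryPort(ABC):
    """Hypothetical stand-in for IDocumentRepository; the real port may differ."""

    @abstractmethod
    def save(self, document_id: str, content: str) -> None: ...


class RecordingRepository(DocumentRepositoryPort):
    """Test double that records every save call instead of persisting."""

    def __init__(self) -> None:
        self.saved: list[tuple[str, str]] = []

    def save(self, document_id: str, content: str) -> None:
        self.saved.append((document_id, content))


class TinyService:
    """Hypothetical, simplified service used only to show the injection."""

    def __init__(self, repository: DocumentRepositoryPort) -> None:
        self._repository = repository

    def process(self, document_id: str, raw_text: str) -> None:
        # Strip surrounding whitespace, then hand the result to the port.
        self._repository.save(document_id, raw_text.strip())


def test_process_saves_stripped_text() -> None:
    repository = RecordingRepository()
    TinyService(repository=repository).process("doc-1", "  hello  ")
    assert repository.saved == [("doc-1", "hello")]


test_process_saves_stripped_text()
```

No file system, database, or web framework is involved, which is the main testing benefit of keeping the core behind ports.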

Design Decisions

Why Hexagonal Architecture?

  1. Testability: Core business logic can be tested without any infrastructure
  2. Flexibility: Easy to swap implementations (e.g., switch from in-memory to PostgreSQL)
  3. Maintainability: Clear separation of concerns
  4. Scalability: Add new features without modifying core

Why Pydantic v2?

  • Runtime validation of domain models
  • Type safety
  • Automatic serialization/deserialization
  • Performance improvements over v1

Why Strategy Pattern for Chunking?

  • Runtime strategy selection
  • Easy to add new strategies
  • Each strategy isolated and testable
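
A minimal sketch of runtime strategy selection, assuming Python 3.10+ as noted under Features. ChunkingContext here is a hypothetical, simplified class: it registers plain callables by name, whereas the project's register_chunker takes a chunker object, so treat the signatures as illustrative only.

```python
from __future__ import annotations

from typing import Callable

Chunker = Callable[[str, int], list[str]]  # (text, chunk_size) -> chunks


def fixed_size(text: str, chunk_size: int) -> list[str]:
    """Cut the text every chunk_size characters, without overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def by_line(text: str, chunk_size: int) -> list[str]:
    """Return non-empty lines; chunk_size is unused but keeps one shared signature."""
    return [line for line in text.splitlines() if line]


class ChunkingContext:
    """Hypothetical context that maps strategy names to chunker callables."""

    def __init__(self) -> None:
        self._chunkers: dict[str, Chunker] = {}

    def register_chunker(self, name: str, chunker: Chunker) -> None:
        self._chunkers[name] = chunker

    def chunk(self, strategy_name: str, text: str, chunk_size: int) -> list[str]:
        try:
            chunker = self._chunkers[strategy_name]
        except KeyError:
            raise ValueError(f"Unknown chunking strategy: {strategy_name}") from None
        return chunker(text, chunk_size)


context = ChunkingContext()
context.register_chunker("fixed_size", fixed_size)
context.register_chunker("line", by_line)
pieces = context.chunk("fixed_size", "abcdefg", chunk_size=3)
# pieces == ["abc", "def", "g"]
```

Adding a strategy means writing one new callable and one register_chunker call; the context and existing strategies stay unchanged.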

Why Factory Pattern for Extractors?

  • Automatic extractor selection based on file type
  • Easy to add support for new file types
  • Centralized extractor management
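
As a sketch of extension-based selection, the hypothetical ExtractorFactory below keys extractors by lowercase file extension. The method names and the use of plain callables are assumptions for illustration; the project's factory registers extractor objects that declare their own supported_extensions.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable

Extractor = Callable[[Path], str]


class ExtractorFactory:
    """Hypothetical factory keyed by lowercase file extension (no dot)."""

    def __init__(self) -> None:
        self._extractors: dict[str, Extractor] = {}

    def register_extractor(self, extension: str, extractor: Extractor) -> None:
        self._extractors[extension.lower().lstrip(".")] = extractor

    def get_extractor(self, file_path: Path) -> Extractor:
        # Path.suffix includes the leading dot, e.g. ".TXT" for "notes.TXT".
        extension = file_path.suffix.lower().lstrip(".")
        if extension not in self._extractors:
            raise ValueError(f"Unsupported file type: {extension or '(none)'}")
        return self._extractors[extension]


def extract_txt(file_path: Path) -> str:
    """Read a plain-text file as UTF-8."""
    return file_path.read_text(encoding="utf-8")


factory = ExtractorFactory()
factory.register_extractor("txt", extract_txt)
selected = factory.get_extractor(Path("notes.TXT"))  # matching is case-insensitive
```

Callers never branch on file type themselves; they ask the factory and either get an extractor or a clear error for unsupported extensions.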

Code Quality Standards

  • Type Hints: 100% type coverage
  • Docstrings: Google-style documentation on all public APIs
  • Function Size: Functions are kept to roughly 15-20 lines at most
  • Single Responsibility: Each class/function does ONE thing
  • DRY: No code duplication
  • KISS: Simple, readable solutions

Future Enhancements

  • Database persistence (PostgreSQL, MongoDB)
  • Async document processing
  • Caching layer (Redis)
  • Sentence chunking strategy
  • Semantic chunking with embeddings
  • Batch processing API
  • Document versioning
  • Full-text search integration

License

MIT License
