# Text Processor - Hexagonal Architecture

A production-ready text extraction and chunking system built with **Hexagonal Architecture** (Ports & Adapters pattern).

## Architecture Overview

This project demonstrates a "Gold Standard" implementation of Clean Architecture principles.

### Project Structure

```
text_processor_hex/
├── src/
│   ├── core/                    # Domain Layer (Pure Business Logic)
│   │   ├── domain/
│   │   │   ├── models.py        # Rich Pydantic v2 entities
│   │   │   ├── exceptions.py    # Custom domain exceptions
│   │   │   └── logic_utils.py   # Pure functions for text processing
│   │   ├── ports/
│   │   │   ├── incoming/        # Service Interfaces (Use Cases)
│   │   │   └── outgoing/        # SPIs (Extractor, Chunker, Repository)
│   │   └── services/            # Business logic orchestration
│   ├── adapters/
│   │   ├── incoming/            # FastAPI routes & schemas
│   │   └── outgoing/
│   │       ├── extractors/      # PDF/DOCX/TXT implementations
│   │       ├── chunkers/        # Chunking strategy implementations
│   │       └── persistence/     # Repository implementations
│   ├── shared/                  # Cross-cutting concerns (logging)
│   └── bootstrap.py             # Dependency Injection wiring
├── main.py                      # Application entry point
└── requirements.txt
```

## Key Design Patterns

1. **Hexagonal Architecture**: Core domain is isolated from external concerns
2. **Dependency Inversion**: Core depends on abstractions (ports), not implementations
3. **Strategy Pattern**: Pluggable chunking strategies (FixedSize, Paragraph)
4. **Factory Pattern**: Dynamic extractor selection based on file type
5. **Repository Pattern**: Abstract data persistence
6. **Rich Domain Models**: Entities with validation and business logic

## SOLID Principles

- **S**ingle Responsibility: Each class has one reason to change
- **O**pen/Closed: Extensible via strategies and factories
- **L**iskov Substitution: All adapters are substitutable
- **I**nterface Segregation: Focused port interfaces
- **D**ependency Inversion: Core depends on abstractions

## Features

- Extract text from PDF, DOCX, and TXT files
- Multiple chunking strategies:
  - **Fixed Size**: Split text into equal-sized chunks with overlap
  - **Paragraph**: Respect document structure and paragraph boundaries
- Rich domain models with validation
- Comprehensive error handling with domain exceptions
- RESTful API with FastAPI
- Thread-safe in-memory repository
- Fully typed with Python 3.10+ type hints

## Installation

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Running the Application

```bash
# Start the FastAPI server
python main.py

# Or use uvicorn directly
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at:

- API: http://localhost:8000/api/v1
- Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

## API Endpoints

### Process Document

```http
POST /api/v1/process

{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "fixed_size",
    "chunk_size": 1000,
    "overlap_size": 100,
    "respect_boundaries": true
  }
}
```

### Extract and Chunk

```http
POST /api/v1/extract-and-chunk

{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "paragraph",
    "chunk_size": 1000,
    "overlap_size": 0,
    "respect_boundaries": true
  }
}
```

### Get Document

```http
GET /api/v1/documents/{document_id}
```

### List Documents

```http
GET /api/v1/documents?limit=100&offset=0
```

### Delete Document

```http
DELETE /api/v1/documents/{document_id}
```

### Health Check

```http
GET /api/v1/health
```

## Programmatic Usage

```python
from pathlib import Path

from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Create application container
container = create_application(log_level="INFO")

# Get the service
service = container.text_processor_service

# Process a document
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)

document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

print(f"Processed: {document.get_metadata_summary()}")
print(f"Preview: {document.get_content_preview()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
```

## Adding New Extractors

To add support for a new file type:

1. Create a new extractor in `src/adapters/outgoing/extractors/`:

   ```python
   from pathlib import Path

   from .base import BaseExtractor


   class MyExtractor(BaseExtractor):
       def __init__(self):
           super().__init__(supported_extensions=['myext'])

       def _extract_text(self, file_path: Path) -> str:
           # Your extraction logic here; reading the file as UTF-8 is only a placeholder
           extracted_text = file_path.read_text(encoding="utf-8")
           return extracted_text
   ```

2. Register in `src/bootstrap.py`:

   ```python
   factory.register_extractor(MyExtractor())
   ```

## Adding New Chunking Strategies

To add a new chunking strategy:

1. Create a new chunker in `src/adapters/outgoing/chunkers/`:

   ```python
   from src.core.domain.models import ChunkingStrategy

   from .base import BaseChunker


   class MyChunker(BaseChunker):
       def __init__(self):
           super().__init__(strategy_name="my_strategy")

       def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
           # Your chunking logic here; a single segment spanning the whole text is
           # only a placeholder (tuple layout assumed to be: text, start, end)
           segments = [(text, 0, len(text))]
           return segments
   ```

2. Register in `src/bootstrap.py`:

   ```python
   context.register_chunker(MyChunker())
   ```

## Testing

Because the service receives its ports through constructor injection, any adapter can be replaced with a test double:

```python
# Mock the repository
from src.core.ports.outgoing.repository import IDocumentRepository


class MockRepository(IDocumentRepository):
    # Implement every method required by the port before instantiating
    pass


# Inject mock in service.
# DocumentProcessorService, extractor_factory, and chunking_context are
# assumed to be imported or built elsewhere (see src/bootstrap.py).
service = DocumentProcessorService(
    extractor_factory=extractor_factory,
    chunking_context=chunking_context,
    repository=MockRepository(),  # Mock injected here
)
```

## Design Decisions

### Why Hexagonal Architecture?

1. **Testability**: Core business logic can be tested without any infrastructure
2. **Flexibility**: Easy to swap implementations (e.g., switch from in-memory to PostgreSQL)
3. **Maintainability**: Clear separation of concerns
4. **Scalability**: Add new features without modifying core

### Why Pydantic v2?

- Runtime validation of domain models
- Type safety
- Automatic serialization/deserialization
- Performance improvements over v1

### Why Strategy Pattern for Chunking?

- Runtime strategy selection
- Easy to add new strategies
- Each strategy isolated and testable

### Why Factory Pattern for Extractors?

- Automatic extractor selection based on file type
- Easy to add support for new file types
- Centralized extractor management

## Code Quality Standards

- **Type Hints**: 100% type coverage
- **Docstrings**: Google-style documentation on all public APIs
- **Function Size**: Maximum 15-20 lines per function
- **Single Responsibility**: Each class/function does ONE thing
- **DRY**: No code duplication
- **KISS**: Simple, readable solutions

## Future Enhancements

- Database persistence (PostgreSQL, MongoDB)
- Async document processing
- Caching layer (Redis)
- Sentence chunking strategy
- Semantic chunking with embeddings
- Batch processing API
- Document versioning
- Full-text search integration

## License

MIT License
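
## Appendix: Fixed-Size Chunking Sketch

The **Fixed Size** strategy listed under Features splits text into equal-sized chunks with overlap. The sketch below illustrates that idea in isolation. It is not the project's actual chunker: the function name `split_fixed_size` is hypothetical, and the `(text, start, end)` tuple layout is an assumption inferred from the `_split_text` return type shown in "Adding New Chunking Strategies".

```python
def split_fixed_size(
    text: str, chunk_size: int, overlap_size: int
) -> list[tuple[str, int, int]]:
    """Split text into windows of chunk_size that share overlap_size characters.

    Hypothetical helper for illustration; returns (chunk_text, start, end) tuples
    where end is exclusive.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    step = chunk_size - overlap_size  # how far each new window advances
    segments: list[tuple[str, int, int]] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        segments.append((text[start:end], start, end))
        if end == len(text):
            break  # the last window already reaches the end of the text
        start += step
    return segments


if __name__ == "__main__":
    for chunk, start, end in split_fixed_size("abcdefghij", chunk_size=4, overlap_size=1):
        print(f"[{start}:{end}] {chunk}")
```

Requiring `overlap_size < chunk_size` keeps the step positive, so the loop always terminates. The final chunk may be shorter than `chunk_size` when the text length is not an exact multiple of the step.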
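
## Appendix: Test Double Sketch

The Testing section injects a `MockRepository` into the service. The sketch below shows what such a test double might look like as a self-contained, thread-safe in-memory fake. Every name here (`FakeDocument`, `DocumentRepositoryPort`, `InMemoryFakeRepository`, and the `save`/`find_by_id`/`delete` methods) is hypothetical; the real `IDocumentRepository` port may define different methods, so treat this as a pattern rather than a drop-in implementation.

```python
import threading
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class FakeDocument:
    """Stand-in for the real domain entity (assumed shape)."""
    id: str
    content: str


class DocumentRepositoryPort(Protocol):
    """Hypothetical outgoing port; the real interface may differ."""

    def save(self, document: FakeDocument) -> None: ...
    def find_by_id(self, document_id: str) -> Optional[FakeDocument]: ...
    def delete(self, document_id: str) -> bool: ...


class InMemoryFakeRepository:
    """Dictionary-backed fake guarded by a lock for thread safety."""

    def __init__(self) -> None:
        self._documents: dict[str, FakeDocument] = {}
        self._lock = threading.Lock()

    def save(self, document: FakeDocument) -> None:
        with self._lock:
            self._documents[document.id] = document

    def find_by_id(self, document_id: str) -> Optional[FakeDocument]:
        with self._lock:
            return self._documents.get(document_id)

    def delete(self, document_id: str) -> bool:
        # Returns True only if a document was actually removed
        with self._lock:
            return self._documents.pop(document_id, None) is not None


if __name__ == "__main__":
    repo: DocumentRepositoryPort = InMemoryFakeRepository()
    repo.save(FakeDocument(id="doc-1", content="hello"))
    print(repo.find_by_id("doc-1"))
```

A fake like this keeps tests free of infrastructure: the core service only sees the port, so the same test suite can later run against a database-backed adapter.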