Text Processor - Hexagonal Architecture
A production-ready text extraction and chunking system built with Hexagonal Architecture (Ports & Adapters pattern).
Architecture Overview
This project demonstrates a "gold standard" implementation of Clean Architecture principles.
Project Structure
text_processor_hex/
├── src/
│   ├── core/                   # Domain Layer (Pure Business Logic)
│   │   ├── domain/
│   │   │   ├── models.py       # Rich Pydantic v2 entities
│   │   │   ├── exceptions.py   # Custom domain exceptions
│   │   │   └── logic_utils.py  # Pure functions for text processing
│   │   ├── ports/
│   │   │   ├── incoming/       # Service Interfaces (Use Cases)
│   │   │   └── outgoing/       # SPIs (Extractor, Chunker, Repository)
│   │   └── services/           # Business logic orchestration
│   ├── adapters/
│   │   ├── incoming/           # FastAPI routes & schemas
│   │   └── outgoing/
│   │       ├── extractors/     # PDF/DOCX/TXT implementations
│   │       ├── chunkers/       # Chunking strategy implementations
│   │       └── persistence/    # Repository implementations
│   ├── shared/                 # Cross-cutting concerns (logging)
│   └── bootstrap.py            # Dependency Injection wiring
├── main.py                     # Application entry point
└── requirements.txt
Key Design Patterns
- Hexagonal Architecture: Core domain is isolated from external concerns
- Dependency Inversion: Core depends on abstractions (ports), not implementations (see the sketch after this list)
- Strategy Pattern: Pluggable chunking strategies (FixedSize, Paragraph)
- Factory Pattern: Dynamic extractor selection based on file type
- Repository Pattern: Abstract data persistence
- Rich Domain Models: Entities with validation and business logic
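A minimal sketch of how an outgoing port, an adapter, and a core service relate under Dependency Inversion. The class names below are illustrative only; the project's actual contracts live under src/core/ports/ and the real implementations under src/adapters/:

from abc import ABC, abstractmethod
from pathlib import Path


class TextExtractorPort(ABC):
    """Outgoing port: the core defines the contract it needs."""

    @abstractmethod
    def extract(self, file_path: Path) -> str: ...


class PlainTextExtractor(TextExtractorPort):
    """Adapter: an infrastructure detail that satisfies the port."""

    def extract(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")


class ProcessingService:
    """Core service: depends only on the abstraction, never on an adapter."""

    def __init__(self, extractor: TextExtractorPort) -> None:
        self._extractor = extractor

    def run(self, file_path: Path) -> str:
        return self._extractor.extract(file_path)

Swapping PlainTextExtractor for any other implementation of the port (a PDF extractor, a remote service) requires no change to ProcessingService, which is the property the architecture is built around.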
SOLID Principles
- Single Responsibility: Each class has one reason to change
- Open/Closed: Extensible via strategies and factories
- Liskov Substitution: All adapters are substitutable
- Interface Segregation: Focused port interfaces
- Dependency Inversion: Core depends on abstractions
Features
- Extract text from PDF, DOCX, and TXT files
- Multiple chunking strategies (see the sketch after this list):
  - Fixed Size: Split text into equal-sized chunks with overlap
  - Paragraph: Respect document structure and paragraph boundaries
- Rich domain models with validation
- Comprehensive error handling with domain exceptions
- RESTful API with FastAPI
- Thread-safe in-memory repository
- Fully typed with Python 3.10+ type hints
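The fixed-size strategy can be pictured as a sliding window: each chunk is at most chunk_size characters long and begins overlap_size characters before the previous chunk ends. A rough standalone sketch of the idea (the real implementation lives in src/adapters/outgoing/chunkers/ and also handles boundary respecting):

def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size characters."""
    if chunk_size <= overlap_size:
        raise ValueError("chunk_size must be larger than overlap_size")
    step = chunk_size - overlap_size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

For example, fixed_size_chunks("a" * 2500) yields three chunks of 1000, 1000, and 700 characters, with each chunk sharing its first 100 characters with the previous one.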
Installation
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Running the Application
# Start the FastAPI server
python main.py
# Or use uvicorn directly
uvicorn main:app --reload --host 0.0.0.0 --port 8000
The API will be available at:
- API: http://localhost:8000/api/v1
- Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
API Endpoints
Process Document
POST /api/v1/process
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "fixed_size",
    "chunk_size": 1000,
    "overlap_size": 100,
    "respect_boundaries": true
  }
}
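For example, this request can be issued against a locally running server with the standard library alone (the response schema is defined by the FastAPI routes in src/adapters/incoming/; here it is simply printed):

import json
import urllib.request

payload = {
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
        "strategy_name": "fixed_size",
        "chunk_size": 1000,
        "overlap_size": 100,
        "respect_boundaries": True,
    },
}
request = urllib.request.Request(
    "http://localhost:8000/api/v1/process",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))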
Extract and Chunk
POST /api/v1/extract-and-chunk
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "paragraph",
    "chunk_size": 1000,
    "overlap_size": 0,
    "respect_boundaries": true
  }
}
Get Document
GET /api/v1/documents/{document_id}
List Documents
GET /api/v1/documents?limit=100&offset=0
Delete Document
DELETE /api/v1/documents/{document_id}
Health Check
GET /api/v1/health
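The read-only endpoints can be exercised the same way, for example checking health and paging through stored documents (again, the exact response bodies depend on the schemas in src/adapters/incoming/):

import json
import urllib.request

BASE = "http://localhost:8000/api/v1"

with urllib.request.urlopen(f"{BASE}/health") as response:
    print(json.loads(response.read()))

with urllib.request.urlopen(f"{BASE}/documents?limit=10&offset=0") as response:
    print(json.loads(response.read()))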
Programmatic Usage
from pathlib import Path

from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Create application container
container = create_application(log_level="INFO")

# Get the service
service = container.text_processor_service

# Process a document
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)
document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)
print(f"Processed: {document.get_metadata_summary()}")
print(f"Preview: {document.get_content_preview()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)
for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
Adding New Extractors
To add support for a new file type:
- Create a new extractor in src/adapters/outgoing/extractors/:

from pathlib import Path

from .base import BaseExtractor


class MyExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['myext'])

    def _extract_text(self, file_path: Path) -> str:
        # Your extraction logic here; reading as UTF-8 text is only a placeholder
        extracted_text = file_path.read_text(encoding="utf-8")
        return extracted_text

- Register it in src/bootstrap.py:

factory.register_extractor(MyExtractor())
Adding New Chunking Strategies
To add a new chunking strategy:
- Create a new chunker in src/adapters/outgoing/chunkers/:

from src.core.domain.models import ChunkingStrategy

from .base import BaseChunker


class MyChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="my_strategy")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        # Your chunking logic here; each tuple is assumed to be (chunk_text, start, end)
        segments = [(text, 0, len(text))]  # placeholder: one segment covering the whole text
        return segments

- Register it in src/bootstrap.py:

context.register_chunker(MyChunker())
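As a concrete illustration, a naive sentence-based chunker (listed under Future Enhancements) could look like the following. The (text, start, end) interpretation of the tuples and the sentence-splitting regex are assumptions made for this sketch:

import re

from src.core.domain.models import ChunkingStrategy

from .base import BaseChunker


class SentenceChunker(BaseChunker):
    """Hypothetical chunker that emits one segment per sentence."""

    def __init__(self):
        super().__init__(strategy_name="sentence")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        segments = []
        # Split after '.', '!' or '?'; a real implementation would handle
        # abbreviations, quotes, and the limits carried by `strategy`.
        for match in re.finditer(r"[^.!?]*[.!?]+|[^.!?]+$", text):
            sentence = match.group().strip()
            if sentence:
                # Offsets include surrounding whitespace from the raw match
                segments.append((sentence, match.start(), match.end()))
        return segments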
Testing
The architecture is designed for easy testing:
# Mock the repository
from src.core.ports.outgoing.repository import IDocumentRepository


class MockRepository(IDocumentRepository):
    # Implement interface for testing
    pass


# Inject mock in service
service = DocumentProcessorService(
    extractor_factory=extractor_factory,
    chunking_context=chunking_context,
    repository=MockRepository(),  # Mock injected here
)
Design Decisions
Why Hexagonal Architecture?
- Testability: Core business logic can be tested without any infrastructure
- Flexibility: Easy to swap implementations (e.g., switch from in-memory to PostgreSQL)
- Maintainability: Clear separation of concerns
- Scalability: Add new features without modifying core
Why Pydantic v2?
- Runtime validation of domain models (sketched below)
- Type safety
- Automatic serialization/deserialization
- Performance improvements over v1
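What runtime validation buys in practice, sketched with a standalone model. The real ChunkingStrategy lives in src/core/domain/models.py; the field constraints shown here are assumptions for the example:

from pydantic import BaseModel, Field, ValidationError


class ChunkingStrategyExample(BaseModel):
    strategy_name: str
    chunk_size: int = Field(gt=0)
    overlap_size: int = Field(ge=0)
    respect_boundaries: bool = True


try:
    ChunkingStrategyExample(strategy_name="fixed_size", chunk_size=-5, overlap_size=0)
except ValidationError as exc:
    # The invalid chunk_size is rejected at construction time,
    # not deep inside the processing pipeline
    print(exc)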
Why Strategy Pattern for Chunking?
- Runtime strategy selection
- Easy to add new strategies
- Each strategy isolated and testable
Why Factory Pattern for Extractors?
- Automatic extractor selection based on file type (sketched below)
- Easy to add support for new file types
- Centralized extractor management
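The registration half of that factory can be pictured as a mapping from file extension to extractor. A simplified sketch follows; only register_extractor appears in this README, so the lookup method name and the supported_extensions attribute are assumptions:

from pathlib import Path


class ExtractorFactorySketch:
    def __init__(self) -> None:
        self._extractors = {}

    def register_extractor(self, extractor) -> None:
        # Each extractor advertises the extensions it supports
        for extension in extractor.supported_extensions:
            self._extractors[extension.lower()] = extractor

    def get_extractor(self, file_path: Path):
        # Hypothetical lookup: select the extractor by file suffix
        extension = file_path.suffix.lstrip(".").lower()
        try:
            return self._extractors[extension]
        except KeyError:
            raise ValueError(f"No extractor registered for '.{extension}' files") from None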
Code Quality Standards
- Type Hints: 100% type coverage
- Docstrings: Google-style documentation on all public APIs
- Function Size: Maximum 15-20 lines per function
- Single Responsibility: Each class/function does ONE thing
- DRY: No code duplication
- KISS: Simple, readable solutions
Future Enhancements
- Database persistence (PostgreSQL, MongoDB)
- Async document processing
- Caching layer (Redis)
- Sentence chunking strategy
- Semantic chunking with embeddings
- Batch processing API
- Document versioning
- Full-text search integration
License
MIT License