# Text Processor - Hexagonal Architecture
A production-ready text extraction and chunking system built with **Hexagonal Architecture** (Ports & Adapters pattern).

## Architecture Overview

This project demonstrates a "Gold Standard" implementation of Clean Architecture principles:

### Project Structure

```
text_processor_hex/
├── src/
│   ├── core/                      # Domain Layer (Pure Business Logic)
│   │   ├── domain/
│   │   │   ├── models.py          # Rich Pydantic v2 entities
│   │   │   ├── exceptions.py      # Custom domain exceptions
│   │   │   └── logic_utils.py     # Pure functions for text processing
│   │   ├── ports/
│   │   │   ├── incoming/          # Service Interfaces (Use Cases)
│   │   │   └── outgoing/          # SPIs (Extractor, Chunker, Repository)
│   │   └── services/              # Business logic orchestration
│   ├── adapters/
│   │   ├── incoming/              # FastAPI routes & schemas
│   │   └── outgoing/
│   │       ├── extractors/        # PDF/DOCX/TXT implementations
│   │       ├── chunkers/          # Chunking strategy implementations
│   │       └── persistence/       # Repository implementations
│   ├── shared/                    # Cross-cutting concerns (logging)
│   └── bootstrap.py               # Dependency Injection wiring
├── main.py                        # Application entry point
└── requirements.txt
```

## Key Design Patterns

1. **Hexagonal Architecture**: Core domain is isolated from external concerns (see the port sketch below)
2. **Dependency Inversion**: Core depends on abstractions (ports), not implementations
3. **Strategy Pattern**: Pluggable chunking strategies (FixedSize, Paragraph)
4. **Factory Pattern**: Dynamic extractor selection based on file type
5. **Repository Pattern**: Abstract data persistence
6. **Rich Domain Models**: Entities with validation and business logic

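The heart of the pattern is that the core sees only interfaces. The real ports live under `src/core/ports/` and the real adapters under `src/adapters/`; the names below are invented for a minimal sketch of the idea, not the project's actual API:

```python
# Invented names for a minimal sketch; see src/core/ports/outgoing/ and
# src/adapters/outgoing/extractors/ for the project's real definitions.
from abc import ABC, abstractmethod
from pathlib import Path


class TextExtractorPort(ABC):
    """Outgoing port (SPI): the only thing the core knows about extraction."""

    @abstractmethod
    def supports(self, file_path: Path) -> bool: ...

    @abstractmethod
    def extract(self, file_path: Path) -> str: ...


class TxtExtractorAdapter(TextExtractorPort):
    """Adapter: a concrete implementation that lives outside the core."""

    def supports(self, file_path: Path) -> bool:
        return file_path.suffix.lower() == ".txt"

    def extract(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")
```
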
## SOLID Principles

- **S**ingle Responsibility: Each class has one reason to change
- **O**pen/Closed: Extensible via strategies and factories
- **L**iskov Substitution: All adapters are substitutable
- **I**nterface Segregation: Focused port interfaces
- **D**ependency Inversion: Core depends on abstractions

## Features

- Extract text from PDF, DOCX, and TXT files
- Multiple chunking strategies:
  - **Fixed Size**: Split text into equal-sized chunks with overlap (see the sketch below)
  - **Paragraph**: Respect document structure and paragraph boundaries
- Rich domain models with validation
- Comprehensive error handling with domain exceptions
- RESTful API with FastAPI
- Thread-safe in-memory repository
- Fully typed with Python 3.10+ type hints

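To build intuition for the fixed-size strategy, here is a minimal sketch of the windowing arithmetic only; the shipped chunker lives in `src/adapters/outgoing/chunkers/` and additionally tracks character offsets and boundary handling:

```python
# Sketch of the windowing arithmetic only, not the shipped chunker.
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Each chunk starts chunk_size - overlap_size characters after the previous one."""
    if chunk_size <= 0 or not 0 <= overlap_size < chunk_size:
        raise ValueError("chunk_size must be positive and overlap_size smaller than chunk_size")
    step = chunk_size - overlap_size
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap_size=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```
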
## Installation

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Running the Application

```bash
# Start the FastAPI server
python main.py

# Or use uvicorn directly
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at:
- API: http://localhost:8000/api/v1
- Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

## API Endpoints

### Process Document
```bash
POST /api/v1/process
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "fixed_size",
    "chunk_size": 1000,
    "overlap_size": 100,
    "respect_boundaries": true
  }
}
```

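As a usage example, assuming the server from "Running the Application" is listening on `localhost:8000`, the endpoint can be called from Python with the `requests` package (an extra dependency for this snippet; the exact response schema is defined by the incoming adapter in `src/adapters/incoming/`):

```python
# Assumes the server is running locally and that `requests` is installed
# (an extra dependency for this snippet, not required by the service itself).
import requests

payload = {
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
        "strategy_name": "fixed_size",
        "chunk_size": 1000,
        "overlap_size": 100,
        "respect_boundaries": True,
    },
}

response = requests.post("http://localhost:8000/api/v1/process", json=payload)
response.raise_for_status()
print(response.json())  # the exact response schema comes from src/adapters/incoming/
```
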
### Extract and Chunk
```bash
POST /api/v1/extract-and-chunk
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "paragraph",
    "chunk_size": 1000,
    "overlap_size": 0,
    "respect_boundaries": true
  }
}
```

### Get Document
```bash
GET /api/v1/documents/{document_id}
```

### List Documents
```bash
GET /api/v1/documents?limit=100&offset=0
```

### Delete Document
```bash
DELETE /api/v1/documents/{document_id}
```

### Health Check
```bash
GET /api/v1/health
```

## Programmatic Usage

```python
from pathlib import Path
from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Create application container
container = create_application(log_level="INFO")

# Get the service
service = container.text_processor_service

# Process a document
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)

document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

print(f"Processed: {document.get_metadata_summary()}")
print(f"Preview: {document.get_content_preview()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
```

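Failures surface as the custom domain exceptions defined in `src/core/domain/exceptions.py`. Their concrete class names are not repeated here, so this defensive sketch catches the broad `Exception` and should be narrowed to those types in real code:

```python
from pathlib import Path

# In real code, import the concrete classes from src.core.domain.exceptions and
# catch those instead of the broad Exception used in this sketch.


def process_or_report(service, file_path: Path, strategy):
    """Process a document and report failures instead of letting them propagate."""
    try:
        return service.process_document(file_path=file_path, chunking_strategy=strategy)
    except Exception as exc:  # narrow this to the project's domain exceptions
        print(f"Processing failed for {file_path}: {exc}")
        return None
```
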
## Adding New Extractors

To add support for a new file type:

1. Create a new extractor in `src/adapters/outgoing/extractors/`:

```python
from pathlib import Path

from .base import BaseExtractor


class MyExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['myext'])

    def _extract_text(self, file_path: Path) -> str:
        # Your extraction logic here; this placeholder just reads the file as UTF-8 text
        extracted_text = file_path.read_text(encoding="utf-8")
        return extracted_text
```

2. Register it in `src/bootstrap.py`:

```python
factory.register_extractor(MyExtractor())
```

## Adding New Chunking Strategies

To add a new chunking strategy:

1. Create a new chunker in `src/adapters/outgoing/chunkers/`:

```python
from src.core.domain.models import ChunkingStrategy

from .base import BaseChunker


class MyChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="my_strategy")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        # Your chunking logic here; this placeholder returns the whole text as a single segment
        segments = [(text, 0, len(text))]
        return segments
```

2. Register it in `src/bootstrap.py`:

```python
context.register_chunker(MyChunker())
```

## Testing

The architecture is designed for easy testing:

```python
# Mock the repository
from src.core.ports.outgoing.repository import IDocumentRepository


class MockRepository(IDocumentRepository):
    # Implement the interface methods your test needs
    pass


# Inject the mock into the service (extractor_factory and chunking_context come
# from the bootstrap container or from other test doubles)
service = DocumentProcessorService(
    extractor_factory=extractor_factory,
    chunking_context=chunking_context,
    repository=MockRepository(),  # Mock injected here
)
```

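Because the domain layer is built from plain models and pure functions, much of it can also be covered with pytest and no infrastructure at all. The helper below is a stand-in pure function (similar to the fixed-size sketch under Features), not the project's real chunker:

```python
# test_chunking_sketch.py -- a stand-in pure function, not the project's chunker.
def fixed_size_chunks(text: str, chunk_size: int, overlap_size: int) -> list[str]:
    step = chunk_size - overlap_size
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def test_chunks_overlap_by_requested_amount() -> None:
    chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap_size=2)
    assert chunks[:2] == ["abcd", "cdef"]  # the second chunk re-uses the last 2 characters
    assert all(len(chunk) <= 4 for chunk in chunks)


def test_empty_text_produces_no_chunks() -> None:
    assert fixed_size_chunks("", chunk_size=4, overlap_size=2) == []
```
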
## Design Decisions

### Why Hexagonal Architecture?

1. **Testability**: Core business logic can be tested without any infrastructure
2. **Flexibility**: Easy to swap implementations (e.g., switch from in-memory to PostgreSQL; see the wiring sketch below)
3. **Maintainability**: Clear separation of concerns
4. **Scalability**: Add new features without modifying the core

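As a rough illustration of the flexibility point, all concrete choices live in the composition root (`src/bootstrap.py` in this project); the class and function names below are invented for the sketch:

```python
# Invented names for illustration; the real wiring lives in src/bootstrap.py.
from typing import Protocol


class DocumentRepositoryPort(Protocol):
    """The abstraction the core depends on."""

    def save(self, document_id: str, content: str) -> None: ...


class InMemoryRepository:
    """Concrete adapter chosen at the edge of the application."""

    def __init__(self) -> None:
        self._documents: dict[str, str] = {}

    def save(self, document_id: str, content: str) -> None:
        self._documents[document_id] = content


def wire_repository() -> DocumentRepositoryPort:
    # Switching to a PostgreSQL-backed adapter means changing this one function;
    # nothing inside src/core/ needs to know.
    return InMemoryRepository()
```
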
### Why Pydantic v2?

- Runtime validation of domain models (example below)
- Type safety
- Automatic serialization/deserialization
- Performance improvements over v1

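To show what a rich Pydantic v2 entity can look like, here is an illustrative model; only `sequence_number` and `get_length()` are taken from the usage example above, the other fields are guesses, and the real entities live in `src/core/domain/models.py`:

```python
# Illustrative model; the real entities are defined in src/core/domain/models.py.
from pydantic import BaseModel, Field, field_validator


class Chunk(BaseModel):
    """A contiguous piece of extracted text."""

    sequence_number: int = Field(ge=0)
    content: str
    start_offset: int = Field(ge=0)
    end_offset: int = Field(ge=0)

    @field_validator("content")
    @classmethod
    def content_must_not_be_blank(cls, value: str) -> str:
        if not value.strip():
            raise ValueError("chunk content must not be blank")
        return value

    def get_length(self) -> int:
        # Rich domain model: behavior lives next to the data it validates.
        return len(self.content)
```
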
### Why Strategy Pattern for Chunking?

- Runtime strategy selection (sketched below)
- Easy to add new strategies
- Each strategy is isolated and testable

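The idea in miniature: strategies are registered under a name and looked up at runtime from `strategy_name`. This toy registry is only a sketch of the mechanism, not the project's chunking context:

```python
# Toy registry illustrating runtime selection by strategy_name; the project's own
# chunking context and chunker classes are under src/adapters/outgoing/chunkers/.
from typing import Callable

Chunker = Callable[[str], list[str]]

_registry: dict[str, Chunker] = {}


def register(name: str, chunker: Chunker) -> None:
    _registry[name] = chunker


def chunk(text: str, strategy_name: str) -> list[str]:
    try:
        return _registry[strategy_name](text)
    except KeyError:
        raise ValueError(f"unknown chunking strategy: {strategy_name}") from None


register("paragraph", lambda text: [p for p in text.split("\n\n") if p.strip()])
register("whole_text", lambda text: [text])

print(chunk("first paragraph\n\nsecond paragraph", "paragraph"))
# ['first paragraph', 'second paragraph']
```
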
### Why Factory Pattern for Extractors?

- Automatic extractor selection based on file type (sketched below)
- Easy to add support for new file types
- Centralized extractor management

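A simplified sketch of the selection mechanism; `register_extractor` mirrors the bootstrap example above, while the other names are invented:

```python
# Simplified sketch; only register_extractor mirrors the bootstrap example above,
# the other names are invented.
from pathlib import Path
from typing import Protocol


class Extractor(Protocol):
    def supported_extensions(self) -> set[str]: ...

    def extract(self, file_path: Path) -> str: ...


class ExtractorFactory:
    def __init__(self) -> None:
        self._by_extension: dict[str, Extractor] = {}

    def register_extractor(self, extractor: Extractor) -> None:
        # Centralized management: adding a file type never edits existing code.
        for extension in extractor.supported_extensions():
            self._by_extension[extension.lower()] = extractor

    def for_file(self, file_path: Path) -> Extractor:
        extension = file_path.suffix.lstrip(".").lower()
        try:
            return self._by_extension[extension]
        except KeyError:
            raise ValueError(f"no extractor registered for '.{extension}' files") from None
```
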
## Code Quality Standards

- **Type Hints**: 100% type coverage
- **Docstrings**: Google-style documentation on all public APIs
- **Function Size**: Maximum 15-20 lines per function
- **Single Responsibility**: Each class/function does ONE thing
- **DRY**: No code duplication
- **KISS**: Simple, readable solutions

## Future Enhancements

- Database persistence (PostgreSQL, MongoDB)
- Async document processing
- Caching layer (Redis)
- Sentence chunking strategy
- Semantic chunking with embeddings
- Batch processing API
- Document versioning
- Full-text search integration

## License

MIT License