Text Processor - Hexagonal Architecture
A production-ready text extraction and chunking system built with Hexagonal Architecture (Ports & Adapters pattern).
Architecture Overview
This project demonstrates a "gold standard" implementation of Clean Architecture principles.
Project Structure
text_processor_hex/
├── src/
│   ├── core/                   # Domain Layer (Pure Business Logic)
│   │   ├── domain/
│   │   │   ├── models.py       # Rich Pydantic v2 entities
│   │   │   ├── exceptions.py   # Custom domain exceptions
│   │   │   └── logic_utils.py  # Pure functions for text processing
│   │   ├── ports/
│   │   │   ├── incoming/       # Service Interfaces (Use Cases)
│   │   │   └── outgoing/       # SPIs (Extractor, Chunker, Repository)
│   │   └── services/           # Business logic orchestration
│   ├── adapters/
│   │   ├── incoming/           # FastAPI routes & schemas
│   │   └── outgoing/
│   │       ├── extractors/     # PDF/DOCX/TXT implementations
│   │       ├── chunkers/       # Chunking strategy implementations
│   │       └── persistence/    # Repository implementations
│   ├── shared/                 # Cross-cutting concerns (logging)
│   └── bootstrap.py            # Dependency Injection wiring
├── main.py                     # Application entry point
└── requirements.txt
Key Design Patterns
- Hexagonal Architecture: Core domain is isolated from external concerns
- Dependency Inversion: Core depends on abstractions (ports), not implementations (see the sketch after this list)
- Strategy Pattern: Pluggable chunking strategies (FixedSize, Paragraph)
- Factory Pattern: Dynamic extractor selection based on file type
- Repository Pattern: Abstract data persistence
- Rich Domain Models: Entities with validation and business logic
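A minimal sketch of how an outgoing port, an adapter, and a core service relate under Dependency Inversion. The class names below are illustrative only; the project's actual contracts live under src/core/ports/ and the real implementations under src/adapters/:

from abc import ABC, abstractmethod
from pathlib import Path


class TextExtractorPort(ABC):
    """Outgoing port: the core defines the contract it needs."""

    @abstractmethod
    def extract(self, file_path: Path) -> str: ...


class PlainTextExtractor(TextExtractorPort):
    """Adapter: an infrastructure detail that satisfies the port."""

    def extract(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")


class ProcessingService:
    """Core service: depends only on the abstraction, never on an adapter."""

    def __init__(self, extractor: TextExtractorPort) -> None:
        self._extractor = extractor

    def run(self, file_path: Path) -> str:
        return self._extractor.extract(file_path)

Swapping PlainTextExtractor for any other implementation of the port (a PDF extractor, a remote service) requires no change to ProcessingService, which is the property the architecture is built around.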
SOLID Principles
- Single Responsibility: Each class has one reason to change
- Open/Closed: Extensible via strategies and factories
- Liskov Substitution: All adapters are substitutable
- Interface Segregation: Focused port interfaces
- Dependency Inversion: Core depends on abstractions
Features
- Extract text from PDF, DOCX, and TXT files
- Multiple chunking strategies (see the sketch after this list):
  - Fixed Size: Split text into equal-sized chunks with overlap
  - Paragraph: Respect document structure and paragraph boundaries
- Rich domain models with validation
- Comprehensive error handling with domain exceptions
- RESTful API with FastAPI
- Thread-safe in-memory repository
- Fully typed with Python 3.10+ type hints
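The fixed-size strategy can be pictured as a sliding window: each chunk is at most chunk_size characters long and begins overlap_size characters before the previous chunk ends. A rough standalone sketch of the idea (the real implementation lives in src/adapters/outgoing/chunkers/ and also handles boundary respecting):

def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size characters."""
    if chunk_size <= overlap_size:
        raise ValueError("chunk_size must be larger than overlap_size")
    step = chunk_size - overlap_size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

For example, fixed_size_chunks("a" * 2500) yields three chunks of 1000, 1000, and 700 characters, with each chunk sharing its first 100 characters with the previous one.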
Installation
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Running the Application
# Start the FastAPI server
python main.py
# Or use uvicorn directly
uvicorn main:app --reload --host 0.0.0.0 --port 8000
The API will be available at:
- API: http://localhost:8000/api/v1
- Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
API Endpoints
Process Document
POST /api/v1/process
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "fixed_size",
    "chunk_size": 1000,
    "overlap_size": 100,
    "respect_boundaries": true
  }
}
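For example, this request can be issued against a locally running server with the standard library alone (the response schema is defined by the FastAPI routes in src/adapters/incoming/; here it is simply printed):

import json
import urllib.request

payload = {
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
        "strategy_name": "fixed_size",
        "chunk_size": 1000,
        "overlap_size": 100,
        "respect_boundaries": True,
    },
}
request = urllib.request.Request(
    "http://localhost:8000/api/v1/process",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))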
Extract and Chunk
POST /api/v1/extract-and-chunk
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "paragraph",
    "chunk_size": 1000,
    "overlap_size": 0,
    "respect_boundaries": true
  }
}
Get Document
GET /api/v1/documents/{document_id}
List Documents
GET /api/v1/documents?limit=100&offset=0
Delete Document
DELETE /api/v1/documents/{document_id}
Health Check
GET /api/v1/health
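The read-only endpoints can be exercised the same way, for example checking health and paging through stored documents (again, the exact response bodies depend on the schemas in src/adapters/incoming/):

import json
import urllib.request

BASE = "http://localhost:8000/api/v1"

with urllib.request.urlopen(f"{BASE}/health") as response:
    print(json.loads(response.read()))

with urllib.request.urlopen(f"{BASE}/documents?limit=10&offset=0") as response:
    print(json.loads(response.read()))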
Programmatic Usage
from pathlib import Path

from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Create application container
container = create_application(log_level="INFO")

# Get the service
service = container.text_processor_service

# Process a document
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)
document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)
print(f"Processed: {document.get_metadata_summary()}")
print(f"Preview: {document.get_content_preview()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)
for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
Adding New Extractors
To add support for a new file type:
- Create a new extractor in src/adapters/outgoing/extractors/:

from pathlib import Path

from .base import BaseExtractor


class MyExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['myext'])

    def _extract_text(self, file_path: Path) -> str:
        # Your extraction logic here; reading as UTF-8 text is only a placeholder
        extracted_text = file_path.read_text(encoding="utf-8")
        return extracted_text

- Register it in src/bootstrap.py:

factory.register_extractor(MyExtractor())
Adding New Chunking Strategies
To add a new chunking strategy:
- Create a new chunker in src/adapters/outgoing/chunkers/:

from src.core.domain.models import ChunkingStrategy

from .base import BaseChunker


class MyChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="my_strategy")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        # Your chunking logic here; each tuple is assumed to be (chunk_text, start, end)
        segments = [(text, 0, len(text))]  # placeholder: one segment covering the whole text
        return segments

- Register it in src/bootstrap.py:

context.register_chunker(MyChunker())
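As a concrete illustration, a naive sentence-based chunker (listed under Future Enhancements) could look like the following. The (text, start, end) interpretation of the tuples and the sentence-splitting regex are assumptions made for this sketch:

import re

from src.core.domain.models import ChunkingStrategy

from .base import BaseChunker


class SentenceChunker(BaseChunker):
    """Hypothetical chunker that emits one segment per sentence."""

    def __init__(self):
        super().__init__(strategy_name="sentence")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        segments = []
        # Split after '.', '!' or '?'; a real implementation would handle
        # abbreviations, quotes, and the limits carried by `strategy`.
        for match in re.finditer(r"[^.!?]*[.!?]+|[^.!?]+$", text):
            sentence = match.group().strip()
            if sentence:
                # Offsets include surrounding whitespace from the raw match
                segments.append((sentence, match.start(), match.end()))
        return segments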
Testing
The architecture is designed for easy testing:
# Mock the repository
from src.core.ports.outgoing.repository import IDocumentRepository


class MockRepository(IDocumentRepository):
    # Implement interface for testing
    pass


# Inject mock in service
service = DocumentProcessorService(
    extractor_factory=extractor_factory,
    chunking_context=chunking_context,
    repository=MockRepository(),  # Mock injected here
)
Design Decisions
Why Hexagonal Architecture?
- Testability: Core business logic can be tested without any infrastructure
- Flexibility: Easy to swap implementations (e.g., switch from in-memory to PostgreSQL)
- Maintainability: Clear separation of concerns
- Scalability: Add new features without modifying core
Why Pydantic v2?
- Runtime validation of domain models (sketched below)
- Type safety
- Automatic serialization/deserialization
- Performance improvements over v1
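What runtime validation buys in practice, sketched with a standalone model. The real ChunkingStrategy lives in src/core/domain/models.py; the field constraints shown here are assumptions for the example:

from pydantic import BaseModel, Field, ValidationError


class ChunkingStrategyExample(BaseModel):
    strategy_name: str
    chunk_size: int = Field(gt=0)
    overlap_size: int = Field(ge=0)
    respect_boundaries: bool = True


try:
    ChunkingStrategyExample(strategy_name="fixed_size", chunk_size=-5, overlap_size=0)
except ValidationError as exc:
    # The invalid chunk_size is rejected at construction time,
    # not deep inside the processing pipeline
    print(exc)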
Why Strategy Pattern for Chunking?
- Runtime strategy selection
- Easy to add new strategies
- Each strategy isolated and testable
Why Factory Pattern for Extractors?
- Automatic extractor selection based on file type (sketched below)
- Easy to add support for new file types
- Centralized extractor management
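The registration half of that factory can be pictured as a mapping from file extension to extractor. A simplified sketch follows; only register_extractor appears in this README, so the lookup method name and the supported_extensions attribute are assumptions:

from pathlib import Path


class ExtractorFactorySketch:
    def __init__(self) -> None:
        self._extractors = {}

    def register_extractor(self, extractor) -> None:
        # Each extractor advertises the extensions it supports
        for extension in extractor.supported_extensions:
            self._extractors[extension.lower()] = extractor

    def get_extractor(self, file_path: Path):
        # Hypothetical lookup: select the extractor by file suffix
        extension = file_path.suffix.lstrip(".").lower()
        try:
            return self._extractors[extension]
        except KeyError:
            raise ValueError(f"No extractor registered for '.{extension}' files") from None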
Code Quality Standards
- Type Hints: 100% type coverage
- Docstrings: Google-style documentation on all public APIs
- Function Size: Maximum 15-20 lines per function
- Single Responsibility: Each class/function does ONE thing
- DRY: No code duplication
- KISS: Simple, readable solutions
Future Enhancements
- Database persistence (PostgreSQL, MongoDB)
- Async document processing
- Caching layer (Redis)
- Sentence chunking strategy
- Semantic chunking with embeddings
- Batch processing API
- Document versioning
- Full-text search integration
License
MIT License