# Text Processor - Hexagonal Architecture
A production-ready text extraction and chunking system built with **Hexagonal Architecture** (Ports & Adapters pattern).

## Architecture Overview

This project demonstrates a "Gold Standard" implementation of Clean Architecture principles:

### Project Structure

```
text_processor_hex/
├── src/
│   ├── core/                      # Domain Layer (Pure Business Logic)
│   │   ├── domain/
│   │   │   ├── models.py          # Rich Pydantic v2 entities
│   │   │   ├── exceptions.py      # Custom domain exceptions
│   │   │   └── logic_utils.py     # Pure functions for text processing
│   │   ├── ports/
│   │   │   ├── incoming/          # Service Interfaces (Use Cases)
│   │   │   └── outgoing/          # SPIs (Extractor, Chunker, Repository)
│   │   └── services/              # Business logic orchestration
│   ├── adapters/
│   │   ├── incoming/              # FastAPI routes & schemas
│   │   └── outgoing/
│   │       ├── extractors/        # PDF/DOCX/TXT implementations
│   │       ├── chunkers/          # Chunking strategy implementations
│   │       └── persistence/       # Repository implementations
│   ├── shared/                    # Cross-cutting concerns (logging)
│   └── bootstrap.py               # Dependency Injection wiring
├── main.py                        # Application entry point
└── requirements.txt
```

## Key Design Patterns

1. **Hexagonal Architecture**: Core domain is isolated from external concerns (see the port sketch below)
2. **Dependency Inversion**: Core depends on abstractions (ports), not implementations
3. **Strategy Pattern**: Pluggable chunking strategies (FixedSize, Paragraph)
4. **Factory Pattern**: Dynamic extractor selection based on file type
5. **Repository Pattern**: Abstract data persistence
6. **Rich Domain Models**: Entities with validation and business logic

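The heart of the pattern is that the core sees only interfaces. The real ports live under `src/core/ports/` and the real adapters under `src/adapters/`; the names below are invented for a minimal sketch of the idea, not the project's actual API:

```python
# Invented names for a minimal sketch; see src/core/ports/outgoing/ and
# src/adapters/outgoing/extractors/ for the project's real definitions.
from abc import ABC, abstractmethod
from pathlib import Path


class TextExtractorPort(ABC):
    """Outgoing port (SPI): the only thing the core knows about extraction."""

    @abstractmethod
    def supports(self, file_path: Path) -> bool: ...

    @abstractmethod
    def extract(self, file_path: Path) -> str: ...


class TxtExtractorAdapter(TextExtractorPort):
    """Adapter: a concrete implementation that lives outside the core."""

    def supports(self, file_path: Path) -> bool:
        return file_path.suffix.lower() == ".txt"

    def extract(self, file_path: Path) -> str:
        return file_path.read_text(encoding="utf-8")
```
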
## SOLID Principles

- **S**ingle Responsibility: Each class has one reason to change
- **O**pen/Closed: Extensible via strategies and factories
- **L**iskov Substitution: All adapters are substitutable
- **I**nterface Segregation: Focused port interfaces
- **D**ependency Inversion: Core depends on abstractions

## Features

- Extract text from PDF, DOCX, and TXT files
- Multiple chunking strategies:
  - **Fixed Size**: Split text into equal-sized chunks with overlap (see the sketch below)
  - **Paragraph**: Respect document structure and paragraph boundaries
- Rich domain models with validation
- Comprehensive error handling with domain exceptions
- RESTful API with FastAPI
- Thread-safe in-memory repository
- Fully typed with Python 3.10+ type hints

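To build intuition for the fixed-size strategy, here is a minimal sketch of the windowing arithmetic only; the shipped chunker lives in `src/adapters/outgoing/chunkers/` and additionally tracks character offsets and boundary handling:

```python
# Sketch of the windowing arithmetic only, not the shipped chunker.
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Each chunk starts chunk_size - overlap_size characters after the previous one."""
    if chunk_size <= 0 or not 0 <= overlap_size < chunk_size:
        raise ValueError("chunk_size must be positive and overlap_size smaller than chunk_size")
    step = chunk_size - overlap_size
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap_size=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```
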
## Installation

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Running the Application

```bash
# Start the FastAPI server
python main.py

# Or use uvicorn directly
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at:
- API: http://localhost:8000/api/v1
- Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

## API Endpoints

### Process Document
```bash
POST /api/v1/process
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "fixed_size",
    "chunk_size": 1000,
    "overlap_size": 100,
    "respect_boundaries": true
  }
}
```

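As a usage example, assuming the server from "Running the Application" is listening on `localhost:8000`, the endpoint can be called from Python with the `requests` package (an extra dependency for this snippet; the exact response schema is defined by the incoming adapter in `src/adapters/incoming/`):

```python
# Assumes the server is running locally and that `requests` is installed
# (an extra dependency for this snippet, not required by the service itself).
import requests

payload = {
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
        "strategy_name": "fixed_size",
        "chunk_size": 1000,
        "overlap_size": 100,
        "respect_boundaries": True,
    },
}

response = requests.post("http://localhost:8000/api/v1/process", json=payload)
response.raise_for_status()
print(response.json())  # the exact response schema comes from src/adapters/incoming/
```
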
### Extract and Chunk
```bash
POST /api/v1/extract-and-chunk
{
  "file_path": "/path/to/document.pdf",
  "chunking_strategy": {
    "strategy_name": "paragraph",
    "chunk_size": 1000,
    "overlap_size": 0,
    "respect_boundaries": true
  }
}
```

### Get Document
```bash
GET /api/v1/documents/{document_id}
```

### List Documents
```bash
GET /api/v1/documents?limit=100&offset=0
```

### Delete Document
```bash
DELETE /api/v1/documents/{document_id}
```

### Health Check
```bash
GET /api/v1/health
```

## Programmatic Usage

```python
from pathlib import Path
from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Create application container
container = create_application(log_level="INFO")

# Get the service
service = container.text_processor_service

# Process a document
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)

document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

print(f"Processed: {document.get_metadata_summary()}")
print(f"Preview: {document.get_content_preview()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
```

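Failures surface as the custom domain exceptions defined in `src/core/domain/exceptions.py`. Their concrete class names are not repeated here, so this defensive sketch catches the broad `Exception` and should be narrowed to those types in real code:

```python
from pathlib import Path

# In real code, import the concrete classes from src.core.domain.exceptions and
# catch those instead of the broad Exception used in this sketch.


def process_or_report(service, file_path: Path, strategy):
    """Process a document and report failures instead of letting them propagate."""
    try:
        return service.process_document(file_path=file_path, chunking_strategy=strategy)
    except Exception as exc:  # narrow this to the project's domain exceptions
        print(f"Processing failed for {file_path}: {exc}")
        return None
```
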
## Adding New Extractors

To add support for a new file type:

1. Create a new extractor in `src/adapters/outgoing/extractors/`:

```python
from pathlib import Path

from .base import BaseExtractor


class MyExtractor(BaseExtractor):
    def __init__(self):
        super().__init__(supported_extensions=['myext'])

    def _extract_text(self, file_path: Path) -> str:
        # Your extraction logic here; this placeholder just reads the file as UTF-8 text
        extracted_text = file_path.read_text(encoding="utf-8")
        return extracted_text
```

2. Register it in `src/bootstrap.py`:

```python
factory.register_extractor(MyExtractor())
```

## Adding New Chunking Strategies

To add a new chunking strategy:

1. Create a new chunker in `src/adapters/outgoing/chunkers/`:

```python
from src.core.domain.models import ChunkingStrategy

from .base import BaseChunker


class MyChunker(BaseChunker):
    def __init__(self):
        super().__init__(strategy_name="my_strategy")

    def _split_text(self, text: str, strategy: ChunkingStrategy) -> list[tuple[str, int, int]]:
        # Your chunking logic here; this placeholder returns the whole text as a single segment
        segments = [(text, 0, len(text))]
        return segments
```

2. Register it in `src/bootstrap.py`:

```python
context.register_chunker(MyChunker())
```

## Testing

The architecture is designed for easy testing:

```python
# Mock the repository
from src.core.ports.outgoing.repository import IDocumentRepository


class MockRepository(IDocumentRepository):
    # Implement the interface methods your test needs
    pass


# Inject the mock into the service (extractor_factory and chunking_context come
# from the bootstrap container or from other test doubles)
service = DocumentProcessorService(
    extractor_factory=extractor_factory,
    chunking_context=chunking_context,
    repository=MockRepository(),  # Mock injected here
)
```

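Because the domain layer is built from plain models and pure functions, much of it can also be covered with pytest and no infrastructure at all. The helper below is a stand-in pure function (similar to the fixed-size sketch under Features), not the project's real chunker:

```python
# test_chunking_sketch.py -- a stand-in pure function, not the project's chunker.
def fixed_size_chunks(text: str, chunk_size: int, overlap_size: int) -> list[str]:
    step = chunk_size - overlap_size
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def test_chunks_overlap_by_requested_amount() -> None:
    chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap_size=2)
    assert chunks[:2] == ["abcd", "cdef"]  # the second chunk re-uses the last 2 characters
    assert all(len(chunk) <= 4 for chunk in chunks)


def test_empty_text_produces_no_chunks() -> None:
    assert fixed_size_chunks("", chunk_size=4, overlap_size=2) == []
```
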
## Design Decisions

### Why Hexagonal Architecture?

1. **Testability**: Core business logic can be tested without any infrastructure
2. **Flexibility**: Easy to swap implementations (e.g., switch from in-memory to PostgreSQL; see the wiring sketch below)
3. **Maintainability**: Clear separation of concerns
4. **Scalability**: Add new features without modifying the core

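As a rough illustration of the flexibility point, all concrete choices live in the composition root (`src/bootstrap.py` in this project); the class and function names below are invented for the sketch:

```python
# Invented names for illustration; the real wiring lives in src/bootstrap.py.
from typing import Protocol


class DocumentRepositoryPort(Protocol):
    """The abstraction the core depends on."""

    def save(self, document_id: str, content: str) -> None: ...


class InMemoryRepository:
    """Concrete adapter chosen at the edge of the application."""

    def __init__(self) -> None:
        self._documents: dict[str, str] = {}

    def save(self, document_id: str, content: str) -> None:
        self._documents[document_id] = content


def wire_repository() -> DocumentRepositoryPort:
    # Switching to a PostgreSQL-backed adapter means changing this one function;
    # nothing inside src/core/ needs to know.
    return InMemoryRepository()
```
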
### Why Pydantic v2?

- Runtime validation of domain models (example below)
- Type safety
- Automatic serialization/deserialization
- Performance improvements over v1

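To show what a rich Pydantic v2 entity can look like, here is an illustrative model; only `sequence_number` and `get_length()` are taken from the usage example above, the other fields are guesses, and the real entities live in `src/core/domain/models.py`:

```python
# Illustrative model; the real entities are defined in src/core/domain/models.py.
from pydantic import BaseModel, Field, field_validator


class Chunk(BaseModel):
    """A contiguous piece of extracted text."""

    sequence_number: int = Field(ge=0)
    content: str
    start_offset: int = Field(ge=0)
    end_offset: int = Field(ge=0)

    @field_validator("content")
    @classmethod
    def content_must_not_be_blank(cls, value: str) -> str:
        if not value.strip():
            raise ValueError("chunk content must not be blank")
        return value

    def get_length(self) -> int:
        # Rich domain model: behavior lives next to the data it validates.
        return len(self.content)
```
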
### Why Strategy Pattern for Chunking?

- Runtime strategy selection (sketched below)
- Easy to add new strategies
- Each strategy is isolated and testable

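The idea in miniature: strategies are registered under a name and looked up at runtime from `strategy_name`. This toy registry is only a sketch of the mechanism, not the project's chunking context:

```python
# Toy registry illustrating runtime selection by strategy_name; the project's own
# chunking context and chunker classes are under src/adapters/outgoing/chunkers/.
from typing import Callable

Chunker = Callable[[str], list[str]]

_registry: dict[str, Chunker] = {}


def register(name: str, chunker: Chunker) -> None:
    _registry[name] = chunker


def chunk(text: str, strategy_name: str) -> list[str]:
    try:
        return _registry[strategy_name](text)
    except KeyError:
        raise ValueError(f"unknown chunking strategy: {strategy_name}") from None


register("paragraph", lambda text: [p for p in text.split("\n\n") if p.strip()])
register("whole_text", lambda text: [text])

print(chunk("first paragraph\n\nsecond paragraph", "paragraph"))
# ['first paragraph', 'second paragraph']
```
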
### Why Factory Pattern for Extractors?

- Automatic extractor selection based on file type (sketched below)
- Easy to add support for new file types
- Centralized extractor management

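A simplified sketch of the selection mechanism; `register_extractor` mirrors the bootstrap example above, while the other names are invented:

```python
# Simplified sketch; only register_extractor mirrors the bootstrap example above,
# the other names are invented.
from pathlib import Path
from typing import Protocol


class Extractor(Protocol):
    def supported_extensions(self) -> set[str]: ...

    def extract(self, file_path: Path) -> str: ...


class ExtractorFactory:
    def __init__(self) -> None:
        self._by_extension: dict[str, Extractor] = {}

    def register_extractor(self, extractor: Extractor) -> None:
        # Centralized management: adding a file type never edits existing code.
        for extension in extractor.supported_extensions():
            self._by_extension[extension.lower()] = extractor

    def for_file(self, file_path: Path) -> Extractor:
        extension = file_path.suffix.lstrip(".").lower()
        try:
            return self._by_extension[extension]
        except KeyError:
            raise ValueError(f"no extractor registered for '.{extension}' files") from None
```
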
## Code Quality Standards

- **Type Hints**: 100% type coverage
- **Docstrings**: Google-style documentation on all public APIs
- **Function Size**: Maximum 15-20 lines per function
- **Single Responsibility**: Each class/function does ONE thing
- **DRY**: No code duplication
- **KISS**: Simple, readable solutions

## Future Enhancements

- Database persistence (PostgreSQL, MongoDB)
- Async document processing
- Caching layer (Redis)
- Sentence chunking strategy
- Semantic chunking with embeddings
- Batch processing API
- Document versioning
- Full-text search integration

## License

MIT License