# Quick Start Guide

## Installation

```bash
# Navigate to project directory
cd text_processor_hex

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Run the Application

### Option 1: FastAPI Server
```bash
python main.py
```
Then open the interactive API docs at http://localhost:8000/docs

### Option 2: Programmatic Usage
```bash
python example_usage.py
```

## Basic Usage Examples

### 1. Using the API (cURL)

**Process a Document:**
```bash
curl -X POST "http://localhost:8000/api/v1/process" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "fixed_size",
      "chunk_size": 1000,
      "overlap_size": 100,
      "respect_boundaries": true
    }
  }'
```

**Extract and Chunk:**
```bash
curl -X POST "http://localhost:8000/api/v1/extract-and-chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "paragraph",
      "chunk_size": 1000,
      "overlap_size": 0,
      "respect_boundaries": true
    }
  }'
```

**Get Document:**
```bash
# Replace {document_id} with the ID of a stored document
curl -X GET "http://localhost:8000/api/v1/documents/{document_id}"
```

**List Documents:**
```bash
curl -X GET "http://localhost:8000/api/v1/documents?limit=10&offset=0"
```

**Delete Document:**
```bash
# Replace {document_id} with the ID of the document to delete
curl -X DELETE "http://localhost:8000/api/v1/documents/{document_id}"
```
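
The same requests can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the server from Option 1 is running on `localhost:8000`. The helper names `build_process_request` and `send` are hypothetical and not part of the project, and since the response schema is not described here, the body is simply returned as parsed JSON.

```python
import json
import urllib.request

# Assumed base URL, taken from the cURL examples in this guide.
BASE_URL = "http://localhost:8000/api/v1"


def build_process_request(
    file_path: str,
    strategy_name: str = "fixed_size",
    chunk_size: int = 1000,
    overlap_size: int = 100,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /process endpoint."""
    payload = {
        "file_path": file_path,
        "chunking_strategy": {
            "strategy_name": strategy_name,
            "chunk_size": chunk_size,
            "overlap_size": overlap_size,
            "respect_boundaries": True,
        },
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_process_request("/path/to/document.pdf"))` performs the same call as the first cURL example.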

### 2. Using Python Code

```python
from pathlib import Path
from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Initialize
container = create_application()
service = container.text_processor_service

# Process a PDF
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)

document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

print(f"Document ID: {document.id}")
print(f"Metadata: {document.get_metadata_summary()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
```

## Available Chunking Strategies

### 1. Fixed Size
Splits text into chunks of roughly `chunk_size` characters, with optional overlap between consecutive chunks.

```python
ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,         # Target size in characters
    overlap_size=100,        # Characters shared between consecutive chunks
    respect_boundaries=True, # Try to break at sentence boundaries
)
```
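
To illustrate how `chunk_size` and `overlap_size` interact, here is a minimal sketch of fixed-size chunking with overlap. It is not the project's chunker: the function name `fixed_size_chunks` is hypothetical, and sentence-boundary handling (`respect_boundaries`) is deliberately left out.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by overlap_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text, so no
        # trailing chunk consists purely of already-covered overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=4` and `overlap_size=1`, the string `"abcdefghij"` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one.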

### 2. Paragraph
Splits text at paragraph boundaries, then combines consecutive paragraphs until the target size is reached.

```python
ChunkingStrategy(
    strategy_name="paragraph",
    chunk_size=1000,
    overlap_size=0,
    respect_boundaries=True
)
```
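
The combining step can be sketched as a greedy merge. This is an illustration only, assuming paragraphs are separated by blank lines; `paragraph_chunks` is a hypothetical name, and the project's chunker may handle oversized paragraphs differently.

```python
import re


def paragraph_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedily merge consecutive paragraphs while the result fits in chunk_size."""
    # Assumption: a paragraph break is one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would exceed the target: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than `chunk_size` is kept whole in this sketch rather than split mid-paragraph.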

## Supported File Types

- **PDF** (.pdf) - using PyPDF2
- **DOCX** (.docx) - using python-docx
- **Text** (.txt, .md, .text) - native Python

## Project Structure

```
text_processor_hex/
├── main.py                    # FastAPI entry point
├── example_usage.py           # Usage examples
├── requirements.txt           # Dependencies
│
└── src/
    ├── core/                  # Business logic (NO external dependencies)
    │   ├── domain/            # Models, exceptions, logic
    │   ├── ports/             # Interface definitions
    │   └── services/          # Orchestration
    │
    ├── adapters/              # External integrations
    │   ├── incoming/          # FastAPI routes
    │   └── outgoing/          # Extractors, chunkers, storage
    │
    ├── shared/                # Utilities
    └── bootstrap.py           # Dependency injection
```

## Common Tasks

### Add a New File Type
1. Create extractor in `src/adapters/outgoing/extractors/`
2. Extend `BaseExtractor`
3. Register in `bootstrap.py`
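
As a rough sketch of steps 1 and 2, the following adds HTML support using only the standard library. The `BaseExtractor` defined here is a stand-in written for this example: the real base class in the project may expose different method names and signatures, so check it before copying. `HtmlExtractor`, `supports`, and `extract` are likewise assumed names.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from pathlib import Path


class BaseExtractor(ABC):
    """Stand-in for the project's BaseExtractor; the real interface may differ."""

    supported_extensions: tuple[str, ...] = ()

    def supports(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in self.supported_extensions

    @abstractmethod
    def extract(self, file_path: Path) -> str:
        """Return the plain text content of the file."""


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes (this includes script/style bodies)."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


class HtmlExtractor(BaseExtractor):
    supported_extensions = (".html", ".htm")

    def extract(self, file_path: Path) -> str:
        collector = _TextCollector()
        collector.feed(file_path.read_text(encoding="utf-8"))
        collector.close()
        return "\n".join(collector.parts)
```

Step 3, registration, depends on how `bootstrap.py` wires extractors together and is not shown here.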

### Add a New Chunking Strategy
1. Create chunker in `src/adapters/outgoing/chunkers/`
2. Extend `BaseChunker`
3. Register in `bootstrap.py`
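
The steps above might look like the following for a sentence-based strategy. Everything here is an assumption made for illustration: `BaseChunker` is a stand-in for the project's base class, and `CHUNKER_REGISTRY` and `register_chunker` are hypothetical substitutes for whatever registration mechanism `bootstrap.py` actually uses.

```python
import re
from abc import ABC, abstractmethod


class BaseChunker(ABC):
    """Stand-in for the project's BaseChunker; the real interface may differ."""

    strategy_name: str = ""

    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split text into chunks."""


class SentenceChunker(BaseChunker):
    """Emits one chunk per sentence, splitting after '.', '!' or '?'."""

    strategy_name = "sentence"

    def chunk(self, text: str) -> list[str]:
        # Split on whitespace that follows sentence-ending punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return [s for s in sentences if s]


# Hypothetical registry keyed by strategy name.
CHUNKER_REGISTRY: dict[str, BaseChunker] = {}


def register_chunker(chunker: BaseChunker) -> None:
    CHUNKER_REGISTRY[chunker.strategy_name] = chunker


register_chunker(SentenceChunker())
```

Keying the registry by `strategy_name` mirrors the way requests select a strategy through the `strategy_name` field.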

### Change Storage
1. Implement `IDocumentRepository` interface
2. Swap implementation in `bootstrap.py`
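
A minimal in-memory implementation could look like the sketch below. The method names on `IDocumentRepository` are assumptions modeled on the get, list, and delete endpoints shown earlier, and `Document` is a simplified stand-in for the project's domain model; adapt both to the actual interface in `src/core/ports/`.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Document:
    """Simplified stand-in for the project's document model."""

    id: str
    content: str


class IDocumentRepository(Protocol):
    """Assumed shape of the repository port; the real one may differ."""

    def save(self, document: Document) -> None: ...
    def get(self, document_id: str) -> Optional[Document]: ...
    def list_documents(self, limit: int, offset: int) -> list[Document]: ...
    def delete(self, document_id: str) -> bool: ...


class InMemoryDocumentRepository:
    """Keeps documents in a dict; contents are lost when the process exits."""

    def __init__(self) -> None:
        self._documents: dict[str, Document] = {}

    def save(self, document: Document) -> None:
        self._documents[document.id] = document

    def get(self, document_id: str) -> Optional[Document]:
        return self._documents.get(document_id)

    def list_documents(self, limit: int = 10, offset: int = 0) -> list[Document]:
        # Dicts preserve insertion order, so pagination is stable.
        return list(self._documents.values())[offset:offset + limit]

    def delete(self, document_id: str) -> bool:
        # Returns True only if a document was actually removed.
        return self._documents.pop(document_id, None) is not None
```

Because the core depends only on the interface, replacing this class with a database-backed one should not require changes outside `bootstrap.py`.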

## Testing

```bash
# Run example
python example_usage.py

# Check the health endpoint (the server must be running)
curl http://localhost:8000/health

# Check API docs
# Visit: http://localhost:8000/docs
```

## Troubleshooting

### Import Errors
```bash
# Make sure you're in the right directory
cd text_processor_hex

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

### Missing Dependencies
```bash
pip install -r requirements.txt
```

### File Not Found Errors
Use an absolute path for `file_path` in API requests:
```json
{
  "file_path": "/absolute/path/to/file.pdf"
}
```

## Architecture Highlights

**Hexagonal Architecture:**
- Core business logic is isolated
- Easy to test without infrastructure
- Easy to swap implementations

**Design Patterns:**
- Factory: ExtractorFactory selects extractor by file type
- Strategy: ChunkingContext selects chunking strategy
- Repository: Abstract data storage
- Dependency Injection: All dependencies injected via bootstrap
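
To make the Factory pattern concrete, here is a simplified sketch of selecting an extractor by file extension. Only the name `ExtractorFactory` comes from this guide; its methods (`register`, `for_file`), the `build_factory` wiring function, and `read_plain_text` are hypothetical, and the project's real factory may be structured differently.

```python
from pathlib import Path
from typing import Callable

# For brevity an extractor is modeled as a plain function from path to text.
Extractor = Callable[[Path], str]


def read_plain_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


class ExtractorFactory:
    """Maps file suffixes to extractors and picks one per file."""

    def __init__(self) -> None:
        self._extractors: dict[str, Extractor] = {}

    def register(self, suffix: str, extractor: Extractor) -> None:
        self._extractors[suffix.lower()] = extractor

    def for_file(self, path: Path) -> Extractor:
        try:
            return self._extractors[path.suffix.lower()]
        except KeyError:
            raise ValueError(f"Unsupported file type: {path.suffix!r}") from None


def build_factory() -> ExtractorFactory:
    """Plays the role of bootstrap wiring: all registration happens in one place."""
    factory = ExtractorFactory()
    for suffix in (".txt", ".md", ".text"):
        factory.register(suffix, read_plain_text)
    return factory
```

Callers ask the factory for an extractor and never name a concrete implementation, which is what allows new file types to be added without touching the core.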

**SOLID Principles:**
- Single Responsibility: Each class does one thing
- Open/Closed: Add features without modifying core
- Dependency Inversion: Core depends on abstractions

## Next Steps

1. Read `README.md` for detailed documentation
2. Read `ARCHITECTURE.md` for architecture details
3. Run `example_usage.py` to see it in action
4. Explore the code starting from `bootstrap.py`
5. Try the API using the Swagger docs at `/docs`

## Need Help?

- Check `README.md` for detailed docs
- Check `ARCHITECTURE.md` for architecture diagrams
- Check `PROJECT_SUMMARY.md` for complete overview
- Look at `example_usage.py` for usage patterns