text_processor/QUICK_START.md
m.dabbagh 70f5b1478c init
2026-01-07 19:15:46 +03:30

Quick Start Guide

Installation

# Navigate to project directory
cd text_processor_hex

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Run the Application

Option 1: FastAPI Server

python main.py

Then open the interactive API docs at http://localhost:8000/docs

Option 2: Programmatic Usage

python example_usage.py

Basic Usage Examples

1. Using the API (cURL)

Process a Document:

curl -X POST "http://localhost:8000/api/v1/process" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "fixed_size",
      "chunk_size": 1000,
      "overlap_size": 100,
      "respect_boundaries": true
    }
  }'

Extract and Chunk:

curl -X POST "http://localhost:8000/api/v1/extract-and-chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "paragraph",
      "chunk_size": 1000,
      "overlap_size": 0,
      "respect_boundaries": true
    }
  }'

Get Document (replace {document_id} with a real document ID):

curl -X GET "http://localhost:8000/api/v1/documents/{document_id}"

List Documents:

curl -X GET "http://localhost:8000/api/v1/documents?limit=10&offset=0"

Delete Document:

curl -X DELETE "http://localhost:8000/api/v1/documents/{document_id}"
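
The same requests can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server from Option 1 is running on localhost:8000. The helper names `build_process_request` and `send_json` are hypothetical, the payload mirrors the cURL example above, and the response is assumed (not confirmed here) to be JSON:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # same host and prefix as the cURL examples


def build_process_request(file_path, strategy_name="fixed_size",
                          chunk_size=1000, overlap_size=100,
                          respect_boundaries=True):
    """Build (but do not send) the POST request for the /process endpoint."""
    payload = {
        "file_path": file_path,
        "chunking_strategy": {
            "strategy_name": strategy_name,
            "chunk_size": chunk_size,
            "overlap_size": overlap_size,
            "respect_boundaries": respect_boundaries,
        },
    }
    return urllib.request.Request(
        f"{BASE_URL}/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request, timeout=30.0):
    """Send the request and decode the body, assuming the endpoint answers with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_json(build_process_request("/path/to/document.pdf"))` would perform the same call as the first cURL example.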

2. Using Python Code

from pathlib import Path
from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Initialize
container = create_application()
service = container.text_processor_service

# Process a PDF
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)

document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

print(f"Document ID: {document.id}")
print(f"Metadata: {document.get_metadata_summary()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")

Available Chunking Strategies

1. Fixed Size

Splits text into chunks of about chunk_size characters, with optional overlap between consecutive chunks.

ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,        # Target size in characters
    overlap_size=100,       # Overlap between chunks
    respect_boundaries=True # Try to break at sentences
)
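
To make the interaction between chunk_size and overlap_size concrete, here is a rough sketch of a sliding-window splitter. It is not the project's implementation: the function name `fixed_size_chunks` is hypothetical, and the sketch ignores respect_boundaries and cuts purely by character count.

```python
def fixed_size_chunks(text, chunk_size=1000, overlap_size=100):
    """Split text into windows of chunk_size characters that overlap by overlap_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Each window starts chunk_size - overlap_size characters after the previous one, so the last overlap_size characters of one chunk reappear at the start of the next.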

2. Paragraph

Splits text at paragraph boundaries, then combines consecutive paragraphs until a chunk reaches the target size.

ChunkingStrategy(
    strategy_name="paragraph",
    chunk_size=1000,
    overlap_size=0,
    respect_boundaries=True
)
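
One way to picture "combining paragraphs to reach target size" is a greedy packer. The sketch below is an illustration under assumptions, not the project's code: the name `paragraph_chunks` is hypothetical, paragraphs are taken to be separated by blank lines, and a single paragraph longer than chunk_size is kept whole rather than split.

```python
import re


def paragraph_chunks(text, chunk_size=1000):
    """Greedily pack whole paragraphs into chunks of at most chunk_size characters."""
    # A paragraph boundary is one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # adding this paragraph would exceed the target
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```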

Supported File Types

  • PDF (.pdf) - using PyPDF2
  • DOCX (.docx) - using python-docx
  • Text (.txt, .md, .text) - native Python

Project Structure

text_processor_hex/
├── main.py                    # FastAPI entry point
├── example_usage.py           # Usage examples
├── requirements.txt           # Dependencies
│
└── src/
    ├── core/                  # Business logic (NO external dependencies)
    │   ├── domain/            # Models, exceptions, logic
    │   ├── ports/             # Interface definitions
    │   └── services/          # Orchestration
    │
    ├── adapters/              # External integrations
    │   ├── incoming/          # FastAPI routes
    │   └── outgoing/          # Extractors, chunkers, storage
    │
    ├── shared/                # Utilities
    └── bootstrap.py           # Dependency injection

Common Tasks

Add a New File Type

  1. Create extractor in src/adapters/outgoing/extractors/
  2. Extend BaseExtractor
  3. Register in bootstrap.py
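
The steps above might look roughly like this for CSV files. The real BaseExtractor interface is not shown in this guide, so the sketch defines its own stand-in base class. The method names `supports` and `extract` and the class `CsvExtractor` are assumptions for illustration only.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path


class BaseExtractor(ABC):
    """Stand-in for the project's base class; the real interface may differ."""

    supported_extensions: tuple = ()

    def supports(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in self.supported_extensions

    @abstractmethod
    def extract(self, file_path: Path) -> str:
        """Return the plain text contained in file_path."""


class CsvExtractor(BaseExtractor):
    """Hypothetical extractor that flattens each CSV row into one line of text."""

    supported_extensions = (".csv",)

    def extract(self, file_path: Path) -> str:
        with file_path.open(newline="", encoding="utf-8") as handle:
            rows = csv.reader(handle)
            # Join the cells of a row with spaces and the rows with newlines.
            return "\n".join(" ".join(cell.strip() for cell in row) for row in rows)
```

Registration in bootstrap.py depends on how the container is wired, so it is left out here.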

Add a New Chunking Strategy

  1. Create chunker in src/adapters/outgoing/chunkers/
  2. Extend BaseChunker
  3. Register in bootstrap.py
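
As an illustration of these steps, here is a sketch of a sentence-based strategy. The real BaseChunker interface is not documented in this guide, so the base class below is a stand-in, and the `chunk` signature, the `strategy_name` attribute, and `SentenceChunker` itself are all assumptions.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class BaseChunker(ABC):
    """Stand-in for the project's base class; the real interface may differ."""

    strategy_name: str = ""

    @abstractmethod
    def chunk(self, text: str, chunk_size: int) -> List[str]:
        """Split text into chunks of roughly chunk_size characters."""


class SentenceChunker(BaseChunker):
    """Hypothetical strategy that never splits inside a sentence."""

    strategy_name = "sentence"

    def chunk(self, text: str, chunk_size: int) -> List[str]:
        # Split after '.', '!' or '?' when followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        chunks: List[str] = []
        current = ""
        for sentence in sentences:
            candidate = f"{current} {sentence}" if current else sentence
            if current and len(candidate) > chunk_size:
                chunks.append(current)  # the next sentence would not fit
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
        return chunks
```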

Change Storage

  1. Implement IDocumentRepository interface
  2. Swap implementation in bootstrap.py
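
A storage swap could look like the SQLite-backed sketch below. The methods of IDocumentRepository are not listed in this guide, so the port defined here is a stand-in: `save`, `find_by_id`, `find_all`, and `delete` are assumed names chosen to mirror the get, list, and delete endpoints, and documents are represented as plain dicts for simplicity.

```python
import json
import sqlite3
from abc import ABC, abstractmethod
from typing import List, Optional


class IDocumentRepository(ABC):
    """Stand-in for the project's port; method names here are assumptions."""

    @abstractmethod
    def save(self, document_id: str, document: dict) -> None:
        """Insert or overwrite a document."""

    @abstractmethod
    def find_by_id(self, document_id: str) -> Optional[dict]:
        """Return the document, or None if it does not exist."""

    @abstractmethod
    def find_all(self, limit: int = 10, offset: int = 0) -> List[dict]:
        """Return one page of documents."""

    @abstractmethod
    def delete(self, document_id: str) -> bool:
        """Delete a document and report whether it existed."""


class SqliteDocumentRepository(IDocumentRepository):
    """Hypothetical adapter that stores each document as a JSON string."""

    def __init__(self, database: str = ":memory:") -> None:
        self._connection = sqlite3.connect(database)
        self._connection.execute(
            "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def save(self, document_id: str, document: dict) -> None:
        with self._connection:  # commits on success, rolls back on error
            self._connection.execute(
                "INSERT OR REPLACE INTO documents (id, body) VALUES (?, ?)",
                (document_id, json.dumps(document)),
            )

    def find_by_id(self, document_id: str) -> Optional[dict]:
        row = self._connection.execute(
            "SELECT body FROM documents WHERE id = ?", (document_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def find_all(self, limit: int = 10, offset: int = 0) -> List[dict]:
        rows = self._connection.execute(
            "SELECT body FROM documents ORDER BY id LIMIT ? OFFSET ?", (limit, offset)
        ).fetchall()
        return [json.loads(body) for (body,) in rows]

    def delete(self, document_id: str) -> bool:
        with self._connection:
            cursor = self._connection.execute(
                "DELETE FROM documents WHERE id = ?", (document_id,)
            )
        return cursor.rowcount > 0
```

Because the core would depend only on the port, replacing one adapter with another should be confined to the wiring in bootstrap.py.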

Testing

# Run example
python example_usage.py

# Check the health endpoint (the server must be running)
curl http://localhost:8000/health

# Check API docs
# Visit: http://localhost:8000/docs
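
The health check can also be scripted, for example as a smoke test before running other requests. This is a small sketch that only inspects the HTTP status code; the function name `is_healthy` is hypothetical and the body returned by /health is not assumed to have any particular shape.

```python
import urllib.request


def is_healthy(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError (including HTTPError), timeouts, and refused connections
        # are all OSError subclasses.
        return False
```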

Troubleshooting

Import Errors

# Make sure you're in the right directory
cd text_processor_hex

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

Missing Dependencies

pip install -r requirements.txt

File Not Found Errors

Use absolute paths for file_path in API requests:

{
  "file_path": "/absolute/path/to/file.pdf"
}
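
When a request is built from Python, a relative path can be converted before it is sent. A minimal sketch follows, with `to_absolute` as a hypothetical helper name. It resolves against the client's current directory, so it only helps when client and server see the same filesystem.

```python
from pathlib import Path


def to_absolute(file_path):
    """Expand ~ and resolve a possibly relative path against the current directory."""
    return str(Path(file_path).expanduser().resolve())
```

For example, `to_absolute("docs/report.pdf")` returns the path of docs/report.pdf under the current working directory.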

Architecture Highlights

Hexagonal Architecture:

  • Core business logic is isolated
  • Easy to test without infrastructure
  • Easy to swap implementations

Design Patterns:

  • Factory: ExtractorFactory selects extractor by file type
  • Strategy: ChunkingContext selects chunking strategy
  • Repository: Abstract data storage
  • Dependency Injection: All dependencies injected via bootstrap
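
To illustrate the Factory pattern named above, here is a sketch of a registry that picks an extractor from the file suffix. Only the name ExtractorFactory comes from this guide; its `register` and `create` methods, the `UnsupportedFileTypeError` class, and the use of plain callables as extractors are assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict


class UnsupportedFileTypeError(ValueError):
    """Hypothetical error raised when no extractor handles a suffix."""


class ExtractorFactory:
    """Illustrative registry mapping file suffixes to extractor callables."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[Path], str]] = {}

    def register(self, suffix: str, extractor: Callable[[Path], str]) -> None:
        self._registry[suffix.lower()] = extractor

    def create(self, file_path: Path) -> Callable[[Path], str]:
        # Suffix matching is case-insensitive, so "NOTES.MD" maps to ".md".
        try:
            return self._registry[file_path.suffix.lower()]
        except KeyError:
            raise UnsupportedFileTypeError(
                f"No extractor for {file_path.suffix!r}"
            ) from None


def read_plain_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


factory = ExtractorFactory()
for suffix in (".txt", ".md", ".text"):
    factory.register(suffix, read_plain_text)
```

Supporting a new file type then means registering one more entry, without touching the code that calls `create`.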

SOLID Principles:

  • Single Responsibility: Each class does one thing
  • Open/Closed: Add features without modifying core
  • Dependency Inversion: Core depends on abstractions

Next Steps

  1. Read README.md for detailed documentation
  2. Read ARCHITECTURE.md for architecture details
  3. Run example_usage.py to see it in action
  4. Explore the code starting from bootstrap.py
  5. Try the API using the Swagger docs at /docs

Need Help?

  • Check README.md for detailed docs
  • Check ARCHITECTURE.md for architecture diagrams
  • Check PROJECT_SUMMARY.md for complete overview
  • Look at example_usage.py for usage patterns