# Quick Start Guide

## Installation

```bash
# Navigate to project directory
cd text_processor_hex

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Run the Application

### Option 1: FastAPI Server

```bash
python main.py
```

Then visit: http://localhost:8000/docs

### Option 2: Programmatic Usage

```bash
python example_usage.py
```

## Basic Usage Examples

### 1. Using the API (cURL)

**Process a Document:**

```bash
curl -X POST "http://localhost:8000/api/v1/process" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "fixed_size",
      "chunk_size": 1000,
      "overlap_size": 100,
      "respect_boundaries": true
    }
  }'
```

**Extract and Chunk:**

```bash
curl -X POST "http://localhost:8000/api/v1/extract-and-chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "paragraph",
      "chunk_size": 1000,
      "overlap_size": 0,
      "respect_boundaries": true
    }
  }'
```

**Get Document:**

```bash
curl -X GET "http://localhost:8000/api/v1/documents/{document_id}"
```

**List Documents:**

```bash
curl -X GET "http://localhost:8000/api/v1/documents?limit=10&offset=0"
```

**Delete Document:**

```bash
curl -X DELETE "http://localhost:8000/api/v1/documents/{document_id}"
```

### 2. Using Python Code

```python
from pathlib import Path

from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Initialize
container = create_application()
service = container.text_processor_service

# Process a PDF
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)

document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

print(f"Document ID: {document.id}")
print(f"Metadata: {document.get_metadata_summary()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
```

## Available Chunking Strategies

### 1. Fixed Size

Splits text into equal-sized chunks with optional overlap.

```python
ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,          # Target size in characters
    overlap_size=100,         # Overlap between chunks
    respect_boundaries=True   # Try to break at sentences
)
```

### 2. Paragraph

Splits text by paragraph boundaries, combining paragraphs to reach target size.

```python
ChunkingStrategy(
    strategy_name="paragraph",
    chunk_size=1000,
    overlap_size=0,
    respect_boundaries=True
)
```

## Supported File Types

- **PDF** (.pdf) - using PyPDF2
- **DOCX** (.docx) - using python-docx
- **Text** (.txt, .md, .text) - native Python

## Project Structure

```
text_processor_hex/
├── main.py                 # FastAPI entry point
├── example_usage.py        # Usage examples
├── requirements.txt        # Dependencies
│
└── src/
    ├── core/               # Business logic (NO external dependencies)
    │   ├── domain/         # Models, exceptions, logic
    │   ├── ports/          # Interface definitions
    │   └── services/       # Orchestration
    │
    ├── adapters/           # External integrations
    │   ├── incoming/       # FastAPI routes
    │   └── outgoing/       # Extractors, chunkers, storage
    │
    ├── shared/             # Utilities
    └── bootstrap.py        # Dependency injection
```

## Common Tasks

### Add a New File Type

1. Create extractor in `src/adapters/outgoing/extractors/`
2. Extend `BaseExtractor`
3. Register in `bootstrap.py`

### Add a New Chunking Strategy

1. Create chunker in `src/adapters/outgoing/chunkers/`
2. Extend `BaseChunker`
3. Register in `bootstrap.py`

### Change Storage

1. Implement `IDocumentRepository` interface
2. Swap implementation in `bootstrap.py`

## Testing

```bash
# Run example
python example_usage.py

# Test API with curl
curl http://localhost:8000/health

# Check API docs
# Visit: http://localhost:8000/docs
```

## Troubleshooting

### Import Errors

```bash
# Make sure you're in the right directory
cd text_processor_hex

# Activate virtual environment
source venv/bin/activate
```

### Missing Dependencies

```bash
pip install -r requirements.txt
```

### File Not Found Errors

Use absolute paths for `file_path` in API requests:

```json
{
  "file_path": "/absolute/path/to/file.pdf"
}
```

## Architecture Highlights

**Hexagonal Architecture:**

- Core business logic is isolated
- Easy to test without infrastructure
- Easy to swap implementations

**Design Patterns:**

- Factory: ExtractorFactory selects extractor by file type
- Strategy: ChunkingContext selects chunking strategy
- Repository: Abstract data storage
- Dependency Injection: All dependencies injected via bootstrap

**SOLID Principles:**

- Single Responsibility: Each class does one thing
- Open/Closed: Add features without modifying core
- Dependency Inversion: Core depends on abstractions

## Next Steps

1. Read `README.md` for detailed documentation
2. Read `ARCHITECTURE.md` for architecture details
3. Run `example_usage.py` to see it in action
4. Explore the code starting from `bootstrap.py`
5. Try the API using the Swagger docs at `/docs`

## Need Help?

- Check `README.md` for detailed docs
- Check `ARCHITECTURE.md` for architecture diagrams
- Check `PROJECT_SUMMARY.md` for complete overview
- Look at `example_usage.py` for usage patterns
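
## Appendix: Fixed-Size Chunking with Overlap (Sketch)

The `chunk_size` and `overlap_size` parameters of the `fixed_size` strategy can be easier to reason about with a small standalone example. The sketch below is not the project's chunker; `fixed_size_chunks` is a hypothetical helper written for illustration, and it assumes that overlap means each chunk starts `overlap_size` characters before the end of the previous one. It ignores `respect_boundaries` entirely, so it cuts at exact character offsets rather than at sentence breaks.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical illustration only: every chunk after the first begins
    overlap_size characters before the end of the previous chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    step = chunk_size - overlap_size  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Tiny demonstration with a 10-character string, windows of 4, overlap of 1.
for index, chunk in enumerate(fixed_size_chunks("abcdefghij", chunk_size=4, overlap_size=1)):
    print(index, chunk)
# prints:
# 0 abcd
# 1 defg
# 2 ghij
```

With the defaults used throughout this guide (`chunk_size=1000`, `overlap_size=100`), the window advances 900 characters at a time under this assumption, so a 2,500-character text would yield three chunks.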