Quick Start Guide
Installation
# Navigate to project directory
cd text_processor_hex
# Create virtual environment
python -m venv venv
# Activate virtual environment
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Run the Application
Option 1: FastAPI Server
python main.py
Then visit: http://localhost:8000/docs
Option 2: Programmatic Usage
python example_usage.py
Basic Usage Examples
1. Using the API (cURL)
Process a Document:
curl -X POST "http://localhost:8000/api/v1/process" \
-H "Content-Type: application/json" \
-d '{
"file_path": "/path/to/document.pdf",
"chunking_strategy": {
"strategy_name": "fixed_size",
"chunk_size": 1000,
"overlap_size": 100,
"respect_boundaries": true
}
}'
Extract and Chunk:
curl -X POST "http://localhost:8000/api/v1/extract-and-chunk" \
-H "Content-Type: application/json" \
-d '{
"file_path": "/path/to/document.pdf",
"chunking_strategy": {
"strategy_name": "paragraph",
"chunk_size": 1000,
"overlap_size": 0,
"respect_boundaries": true
}
}'
Get Document:
curl -X GET "http://localhost:8000/api/v1/documents/{document_id}"
List Documents:
curl -X GET "http://localhost:8000/api/v1/documents?limit=10&offset=0"
Delete Document:
curl -X DELETE "http://localhost:8000/api/v1/documents/{document_id}"
2. Using Python Code
from pathlib import Path
from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy
# Initialize
container = create_application()
service = container.text_processor_service
# Process a PDF
strategy = ChunkingStrategy(
strategy_name="fixed_size",
chunk_size=1000,
overlap_size=100,
respect_boundaries=True,
)
document = service.process_document(
file_path=Path("example.pdf"),
chunking_strategy=strategy,
)
print(f"Document ID: {document.id}")
print(f"Metadata: {document.get_metadata_summary()}")
# Extract and chunk
chunks = service.extract_and_chunk(
file_path=Path("example.pdf"),
chunking_strategy=strategy,
)
for chunk in chunks:
print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
Available Chunking Strategies
1. Fixed Size
Splits text into chunks of approximately chunk_size characters, with optional overlap between consecutive chunks.
ChunkingStrategy(
strategy_name="fixed_size",
chunk_size=1000, # Target size in characters
overlap_size=100, # Overlap between chunks
respect_boundaries=True # Try to break at sentences
)
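To make the interaction between chunk_size and overlap_size concrete, here is a minimal sketch of fixed-size chunking. It is not the project's chunker: fixed_size_chunks is a hypothetical helper written for illustration, and it ignores respect_boundaries (it always cuts at exact character offsets).

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap_size) characters after the
    previous one, so consecutive chunks share overlap_size characters.
    Hypothetical helper; the project's real chunker may behave differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    step = chunk_size - overlap_size
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text, so the tail
        # is not emitted again as a shorter duplicate.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap_size=1))
    # ['abcd', 'defg', 'ghij']
```

Note that the last chunk can be shorter than chunk_size when the text length does not line up with the step.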
2. Paragraph
Splits text by paragraph boundaries, combining paragraphs to reach target size.
ChunkingStrategy(
strategy_name="paragraph",
chunk_size=1000,
overlap_size=0,
respect_boundaries=True
)
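The idea of combining paragraphs up to a target size can be sketched as follows. This is an assumption-laden illustration, not the project's implementation: paragraph_chunks is a hypothetical helper, paragraphs are assumed to be separated by blank lines, and a single paragraph longer than chunk_size is kept whole rather than split.

```python
def paragraph_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedily merge blank-line-separated paragraphs into chunks.

    A paragraph is appended to the current chunk while the result stays
    within chunk_size characters; otherwise a new chunk is started.
    Hypothetical helper; the project's real chunker may behave differently.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would exceed the target: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(paragraph_chunks("aaaa\n\nbbbb\n\ncccc", chunk_size=10))
    # ['aaaa\n\nbbbb', 'cccc']
```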
Supported File Types
- PDF (.pdf) - using PyPDF2
- DOCX (.docx) - using python-docx
- Text (.txt, .md, .text) - native Python
Project Structure
text_processor_hex/
├── main.py # FastAPI entry point
├── example_usage.py # Usage examples
├── requirements.txt # Dependencies
│
└── src/
├── core/ # Business logic (NO external dependencies)
│ ├── domain/ # Models, exceptions, logic
│ ├── ports/ # Interface definitions
│ └── services/ # Orchestration
│
├── adapters/ # External integrations
│ ├── incoming/ # FastAPI routes
│ └── outgoing/ # Extractors, chunkers, storage
│
├── shared/ # Utilities
└── bootstrap.py # Dependency injection
Common Tasks
Add a New File Type
- Create extractor in src/adapters/outgoing/extractors/
- Extend BaseExtractor
- Register in bootstrap.py
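As a rough illustration of the first two steps, the sketch below adds an HTML extractor. The BaseExtractor shown here is a stand-in defined only so the example runs on its own; its method names (supports, extract) are assumptions, so check the real base class in src/adapters/outgoing/extractors/ before copying anything. HtmlExtractor and _TextCollector are hypothetical names.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from pathlib import Path


class BaseExtractor(ABC):
    """Stand-in for the project's BaseExtractor; the real interface may differ."""

    @abstractmethod
    def supports(self, file_path: Path) -> bool:
        """Return True if this extractor can handle the file."""

    @abstractmethod
    def extract(self, file_path: Path) -> str:
        """Return the plain text contained in the file."""


class _TextCollector(HTMLParser):
    """Collects the non-blank text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


class HtmlExtractor(BaseExtractor):
    """Hypothetical extractor for .html and .htm files (standard library only)."""

    def supports(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in {".html", ".htm"}

    def extract(self, file_path: Path) -> str:
        collector = _TextCollector()
        collector.feed(file_path.read_text(encoding="utf-8"))
        collector.close()
        return "\n".join(collector.parts)
```

This sketch does not filter out script or style content, and the registration step in bootstrap.py is omitted because its wiring is not shown in this guide.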
Add a New Chunking Strategy
- Create chunker in src/adapters/outgoing/chunkers/
- Extend BaseChunker
- Register in bootstrap.py
Change Storage
- Implement IDocumentRepository interface
- Swap implementation in bootstrap.py
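A storage swap might look roughly like the in-memory repository below. Everything here is illustrative: the Document dataclass and the IDocumentRepository protocol are stand-ins with assumed method names (save, get, list_documents, delete), chosen to mirror the API endpoints above. The project's actual interface in src/core/ports/ may differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Document:
    """Stand-in for the project's document model (assumed fields)."""
    id: str
    text: str


class IDocumentRepository(Protocol):
    """Assumed shape of the repository port; verify against the real one."""

    def save(self, document: Document) -> None: ...
    def get(self, document_id: str) -> Optional[Document]: ...
    def list_documents(self, limit: int = 10, offset: int = 0) -> list[Document]: ...
    def delete(self, document_id: str) -> bool: ...


class InMemoryDocumentRepository:
    """Hypothetical implementation that keeps documents in a dict."""

    def __init__(self) -> None:
        self._documents: dict[str, Document] = {}

    def save(self, document: Document) -> None:
        self._documents[document.id] = document

    def get(self, document_id: str) -> Optional[Document]:
        return self._documents.get(document_id)

    def list_documents(self, limit: int = 10, offset: int = 0) -> list[Document]:
        # Dicts preserve insertion order, so paging is stable.
        return list(self._documents.values())[offset:offset + limit]

    def delete(self, document_id: str) -> bool:
        # Returns True only if a document was actually removed.
        return self._documents.pop(document_id, None) is not None
```

Because the core depends only on the interface, replacing this class with a database-backed one should only require changing the wiring in bootstrap.py.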
Testing
# Run example
python example_usage.py
# Test API with curl
curl http://localhost:8000/health
# Check API docs
# Visit: http://localhost:8000/docs
Troubleshooting
Import Errors
# Make sure you're in the right directory
cd text_processor_hex
# Activate virtual environment
source venv/bin/activate # On Windows: venv\Scripts\activate
Missing Dependencies
pip install -r requirements.txt
File Not Found Errors
Use absolute paths for file_path in API requests:
{
"file_path": "/absolute/path/to/file.pdf"
}
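If you build requests from Python, one way to avoid relative-path problems is to resolve the path before sending it. The helper below is a hypothetical sketch (build_process_payload is not part of the project); it only builds the JSON body with the field names used in the cURL examples above and does not send anything.

```python
import json
from pathlib import Path


def build_process_payload(
    file_path: str,
    strategy_name: str = "fixed_size",
    chunk_size: int = 1000,
    overlap_size: int = 100,
) -> str:
    """Return a JSON request body whose file_path is absolute.

    Hypothetical helper: relative paths are resolved against the current
    working directory of the client, which must match the server's view
    of the filesystem for the request to succeed.
    """
    absolute_path = Path(file_path).expanduser().resolve()
    payload = {
        "file_path": str(absolute_path),
        "chunking_strategy": {
            "strategy_name": strategy_name,
            "chunk_size": chunk_size,
            "overlap_size": overlap_size,
            "respect_boundaries": True,
        },
    }
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_process_payload("docs/report.pdf"))
```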
Architecture Highlights
Hexagonal Architecture:
- Core business logic is isolated
- Easy to test without infrastructure
- Easy to swap implementations
Design Patterns:
- Factory: ExtractorFactory selects extractor by file type
- Strategy: ChunkingContext selects chunking strategy
- Repository: Abstract data storage
- Dependency Injection: All dependencies injected via bootstrap
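The Factory pattern listed above can be illustrated with a small registry keyed by file extension. This is a sketch under assumptions, not the project's ExtractorFactory: the register and get_extractor method names are invented for the example, and extractors are modelled as plain callables to keep it self-contained.

```python
from pathlib import Path
from typing import Callable

# An extractor is modelled here as any callable that turns a path into text.
Extractor = Callable[[Path], str]


class ExtractorFactory:
    """Illustrative factory that selects an extractor by file suffix."""

    def __init__(self) -> None:
        self._registry: dict[str, Extractor] = {}

    def register(self, suffix: str, extractor: Extractor) -> None:
        """Register an extractor for a suffix such as '.txt' (leading dot required)."""
        self._registry[suffix.lower()] = extractor

    def get_extractor(self, file_path: Path) -> Extractor:
        """Return the extractor registered for the file's suffix."""
        suffix = file_path.suffix.lower()
        try:
            return self._registry[suffix]
        except KeyError:
            raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


if __name__ == "__main__":
    factory = ExtractorFactory()
    factory.register(".txt", lambda path: path.read_text(encoding="utf-8"))
    extractor = factory.get_extractor(Path("notes.TXT"))  # suffix match is case-insensitive
```

Adding a file type then means registering one more entry, which is the Open/Closed behaviour the architecture aims for.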
SOLID Principles:
- Single Responsibility: Each class does one thing
- Open/Closed: Add features without modifying core
- Dependency Inversion: Core depends on abstractions
Next Steps
- Read README.md for detailed documentation
- Read ARCHITECTURE.md for architecture details
- Run example_usage.py to see it in action
- Explore the code starting from bootstrap.py
- Try the API using the Swagger docs at /docs
Need Help?
- Check README.md for detailed docs
- Check ARCHITECTURE.md for architecture diagrams
- Check PROJECT_SUMMARY.md for complete overview
- Look at example_usage.py for usage patterns