text_processor/QUICK_START.md
m.dabbagh 70f5b1478c init
2026-01-07 19:15:46 +03:30

# Quick Start Guide
## Installation
```bash
# Navigate to project directory
cd text_processor_hex
# Create virtual environment
python -m venv venv
# Activate virtual environment
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
## Run the Application
### Option 1: FastAPI Server
```bash
python main.py
```
Then visit: http://localhost:8000/docs
### Option 2: Programmatic Usage
```bash
python example_usage.py
```
## Basic Usage Examples
### 1. Using the API (cURL)
**Process a Document:**
```bash
curl -X POST "http://localhost:8000/api/v1/process" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "fixed_size",
      "chunk_size": 1000,
      "overlap_size": 100,
      "respect_boundaries": true
    }
  }'
```
**Extract and Chunk:**
```bash
curl -X POST "http://localhost:8000/api/v1/extract-and-chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "paragraph",
      "chunk_size": 1000,
      "overlap_size": 0,
      "respect_boundaries": true
    }
  }'
```
**Get Document:**
```bash
curl -X GET "http://localhost:8000/api/v1/documents/{document_id}"
```
**List Documents:**
```bash
curl -X GET "http://localhost:8000/api/v1/documents?limit=10&offset=0"
```
**Delete Document:**
```bash
curl -X DELETE "http://localhost:8000/api/v1/documents/{document_id}"
```
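The same request can be built from Python with only the standard library. The sketch below mirrors the "Process a Document" cURL call; the helper name `build_process_request` is hypothetical, and the shape of the server's response is not documented here, so the example only builds the request and leaves sending it as a commented step.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # same base URL as the cURL examples


def build_process_request(
    file_path: str,
    strategy_name: str = "fixed_size",
    chunk_size: int = 1000,
    overlap_size: int = 100,
    respect_boundaries: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /process endpoint."""
    payload = {
        "file_path": file_path,
        "chunking_strategy": {
            "strategy_name": strategy_name,
            "chunk_size": chunk_size,
            "overlap_size": overlap_size,
            "respect_boundaries": respect_boundaries,
        },
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_process_request("/path/to/document.pdf")

# To send it against a running server (response format is an assumption):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```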
### 2. Using Python Code
```python
from pathlib import Path
from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy
# Initialize
container = create_application()
service = container.text_processor_service
# Process a PDF
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)
document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)
print(f"Document ID: {document.id}")
print(f"Metadata: {document.get_metadata_summary()}")
# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)
for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
```
## Available Chunking Strategies
### 1. Fixed Size
Splits text into equal-sized chunks with optional overlap.
```python
ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,         # Target size in characters
    overlap_size=100,        # Overlap between chunks
    respect_boundaries=True  # Try to break at sentences
)
```
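To make the interaction between `chunk_size` and `overlap_size` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not the project's chunker: the function name `fixed_size_chunks` is hypothetical, and it ignores `respect_boundaries`, so it always cuts at exact character offsets.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by overlap_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in the range [0, chunk_size)")

    step = chunk_size - overlap_size  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap_size=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts `chunk_size - overlap_size` characters after the previous one, which is why the last character of one chunk reappears as the first character of the next in the output above.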
### 2. Paragraph
Splits text by paragraph boundaries, combining paragraphs to reach target size.
```python
ChunkingStrategy(
    strategy_name="paragraph",
    chunk_size=1000,
    overlap_size=0,
    respect_boundaries=True
)
```
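The "combine paragraphs to reach target size" behavior can be sketched as a greedy merge. This is an assumption about how such a strategy might work, not the project's implementation: the function name `paragraph_chunks` is hypothetical, paragraphs are assumed to be separated by blank lines, and a single paragraph longer than `chunk_size` is kept whole rather than split.

```python
import re


def paragraph_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedily merge consecutive paragraphs while the result fits in chunk_size."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # adding this paragraph would overflow the chunk
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(paragraph_chunks("aaaa\n\nbbbb\n\ncccccccccc", chunk_size=10))
# ['aaaa\n\nbbbb', 'cccccccccc']
```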
## Supported File Types
- **PDF** (.pdf) - using PyPDF2
- **DOCX** (.docx) - using python-docx
- **Text** (.txt, .md, .text) - native Python
## Project Structure
```
text_processor_hex/
├── main.py                 # FastAPI entry point
├── example_usage.py        # Usage examples
├── requirements.txt        # Dependencies
└── src/
    ├── core/               # Business logic (NO external dependencies)
    │   ├── domain/         # Models, exceptions, logic
    │   ├── ports/          # Interface definitions
    │   └── services/       # Orchestration
    ├── adapters/           # External integrations
    │   ├── incoming/       # FastAPI routes
    │   └── outgoing/       # Extractors, chunkers, storage
    ├── shared/             # Utilities
    └── bootstrap.py        # Dependency injection
```
## Common Tasks
### Add a New File Type
1. Create extractor in `src/adapters/outgoing/extractors/`
2. Extend `BaseExtractor`
3. Register in `bootstrap.py`
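As a rough illustration of the steps above, the sketch below adds a CSV extractor. The real `BaseExtractor` interface is not shown in this guide, so the stand-in class here (with its `supported_extensions`, `supports`, and `extract` members) is an assumption; check the actual base class in `src/adapters/outgoing/extractors/` before copying the method names.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path


class BaseExtractor(ABC):
    """Stand-in for the project's BaseExtractor; the real interface may differ."""

    supported_extensions: tuple[str, ...] = ()

    def supports(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in self.supported_extensions

    @abstractmethod
    def extract(self, file_path: Path) -> str:
        """Return the plain text contained in the file."""


class CsvExtractor(BaseExtractor):
    """Hypothetical extractor that flattens CSV rows into lines of text."""

    supported_extensions = (".csv",)

    def extract(self, file_path: Path) -> str:
        with file_path.open(newline="", encoding="utf-8") as handle:
            rows = csv.reader(handle)
            # One line per row, cells separated by a single space.
            return "\n".join(" ".join(row) for row in rows)
```

Registration in `bootstrap.py` depends on how the container wires extractors together, so that step is left out of the sketch.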
### Add a New Chunking Strategy
1. Create chunker in `src/adapters/outgoing/chunkers/`
2. Extend `BaseChunker`
3. Register in `bootstrap.py`
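A new strategy could look like the sentence-based chunker sketched below. `BaseChunker` here is a stand-in written for this example, with an assumed `name` attribute and `chunk(text)` method; the project's real base class in `src/adapters/outgoing/chunkers/` may expect a different signature, for instance one that receives a `ChunkingStrategy`.

```python
import re
from abc import ABC, abstractmethod


class BaseChunker(ABC):
    """Stand-in for the project's BaseChunker; the real interface may differ."""

    name: str = ""

    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split text into an ordered list of chunks."""


class SentenceChunker(BaseChunker):
    """Hypothetical strategy that emits one chunk per sentence."""

    name = "sentence"

    def chunk(self, text: str) -> list[str]:
        # Split on whitespace that follows sentence-ending punctuation.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [part for part in parts if part]


print(SentenceChunker().chunk("One. Two! Three?"))
# ['One.', 'Two!', 'Three?']
```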
### Change Storage
1. Implement `IDocumentRepository` interface
2. Swap implementation in `bootstrap.py`
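For example, an in-memory repository is often enough for tests. The sketch below assumes an interface with `save`, `find_by_id`, `find_all`, and `delete`, chosen to mirror the get, list, and delete endpoints shown earlier; the real `IDocumentRepository` methods and the `Document` model are not shown in this guide, so both classes here are stand-ins.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """Stand-in for the domain model; only the fields this sketch needs."""

    id: str
    content: str


class IDocumentRepository(ABC):
    """Stand-in port; the method names are assumptions."""

    @abstractmethod
    def save(self, document: Document) -> None: ...

    @abstractmethod
    def find_by_id(self, document_id: str) -> Optional[Document]: ...

    @abstractmethod
    def find_all(self, limit: int = 10, offset: int = 0) -> list[Document]: ...

    @abstractmethod
    def delete(self, document_id: str) -> bool: ...


class InMemoryDocumentRepository(IDocumentRepository):
    """Keeps documents in a dict; contents are lost when the process exits."""

    def __init__(self) -> None:
        self._documents: dict[str, Document] = {}

    def save(self, document: Document) -> None:
        self._documents[document.id] = document

    def find_by_id(self, document_id: str) -> Optional[Document]:
        return self._documents.get(document_id)

    def find_all(self, limit: int = 10, offset: int = 0) -> list[Document]:
        # Dicts preserve insertion order, so paging is stable.
        return list(self._documents.values())[offset:offset + limit]

    def delete(self, document_id: str) -> bool:
        return self._documents.pop(document_id, None) is not None
```

Because the core depends only on the interface, swapping this class for a database-backed one should not require changes outside `bootstrap.py`.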
## Testing
```bash
# Run example
python example_usage.py
# Test API with curl
curl http://localhost:8000/health
# Check API docs
# Visit: http://localhost:8000/docs
```
## Troubleshooting
### Import Errors
```bash
# Make sure you're in the right directory
cd text_processor_hex
# Activate virtual environment
source venv/bin/activate
```
### Missing Dependencies
```bash
pip install -r requirements.txt
```
### File Not Found Errors
Use absolute paths for `file_path` in API requests:
```json
{
  "file_path": "/absolute/path/to/file.pdf"
}
```
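When the request is built from Python, `pathlib` can convert a relative path into an absolute one before it goes into the payload. This is a small sketch; the helper name `to_absolute` is hypothetical.

```python
from pathlib import Path


def to_absolute(file_path: str) -> str:
    """Expand '~' and resolve the path against the current working directory."""
    return str(Path(file_path).expanduser().resolve())


payload = {"file_path": to_absolute("example.pdf")}
print(payload["file_path"])
```

The path is resolved on the machine that builds the request, so this only helps when the server can see the same filesystem location.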
## Architecture Highlights
**Hexagonal Architecture:**
- Core business logic is isolated
- Easy to test without infrastructure
- Easy to swap implementations
**Design Patterns:**
- Factory: ExtractorFactory selects extractor by file type
- Strategy: ChunkingContext selects chunking strategy
- Repository: Abstract data storage
- Dependency Injection: All dependencies injected via bootstrap
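The Factory pattern listed above can be sketched as a lookup from file extension to extractor. `ExtractorFactory` is the name used in this guide, but its `register` and `create` methods below are assumptions made for illustration, and extractors are modeled as plain callables to keep the example short.

```python
from pathlib import Path
from typing import Callable

Extractor = Callable[[Path], str]  # simplified: an extractor is any path-to-text callable


class ExtractorFactory:
    """Illustrative factory; the project's real class may be structured differently."""

    def __init__(self) -> None:
        self._extractors: dict[str, Extractor] = {}

    def register(self, extension: str, extractor: Extractor) -> None:
        self._extractors[extension.lower()] = extractor

    def create(self, file_path: Path) -> Extractor:
        extension = file_path.suffix.lower()
        try:
            return self._extractors[extension]
        except KeyError:
            raise ValueError(f"Unsupported file type: {extension!r}") from None


def read_text(file_path: Path) -> str:
    return file_path.read_text(encoding="utf-8")


factory = ExtractorFactory()
for extension in (".txt", ".md", ".text"):
    factory.register(extension, read_text)

extractor = factory.create(Path("notes.MD"))  # lookup is case-insensitive
```

Callers ask the factory for an extractor and never branch on file type themselves, which is what lets a new format be added by registration alone.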
**SOLID Principles:**
- Single Responsibility: Each class does one thing
- Open/Closed: Add features without modifying core
- Dependency Inversion: Core depends on abstractions
## Next Steps
1. Read `README.md` for detailed documentation
2. Read `ARCHITECTURE.md` for architecture details
3. Run `example_usage.py` to see it in action
4. Explore the code starting from `bootstrap.py`
5. Try the API using the Swagger docs at `/docs`
## Need Help?
- Check `README.md` for detailed docs
- Check `ARCHITECTURE.md` for architecture diagrams
- Check `PROJECT_SUMMARY.md` for complete overview
- Look at `example_usage.py` for usage patterns