# Quick Start Guide

## Installation

```bash
# Navigate to project directory
cd text_processor_hex

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Run the Application

### Option 1: FastAPI Server
```bash
python main.py
```
Then open http://localhost:8000/docs in a browser to view the interactive API docs.

### Option 2: Programmatic Usage
```bash
python example_usage.py
```

## Basic Usage Examples

### 1. Using the API (cURL)

**Process a Document:**
```bash
curl -X POST "http://localhost:8000/api/v1/process" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "fixed_size",
      "chunk_size": 1000,
      "overlap_size": 100,
      "respect_boundaries": true
    }
  }'
```

**Extract and Chunk:**
```bash
curl -X POST "http://localhost:8000/api/v1/extract-and-chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "/path/to/document.pdf",
    "chunking_strategy": {
      "strategy_name": "paragraph",
      "chunk_size": 1000,
      "overlap_size": 0,
      "respect_boundaries": true
    }
  }'
```

**Get Document** (replace `{document_id}` with the ID of a stored document):
```bash
curl -X GET "http://localhost:8000/api/v1/documents/{document_id}"
```

**List Documents:**
```bash
curl -X GET "http://localhost:8000/api/v1/documents?limit=10&offset=0"
```

**Delete Document:**
```bash
curl -X DELETE "http://localhost:8000/api/v1/documents/{document_id}"
```

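The "Process a Document" request above can also be sent from Python without extra dependencies. The following is a minimal sketch using only the standard library. The helper names `build_process_request` and `send_request` are illustrative and not part of the project, the endpoint and payload fields are copied from the cURL example, and the assumption that the server answers with a JSON object is not confirmed by this guide.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: server started with `python main.py`


def build_process_request(file_path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /api/v1/process (same payload as the cURL example)."""
    payload = {
        "file_path": file_path,
        "chunking_strategy": {
            "strategy_name": "fixed_size",
            "chunk_size": 1000,
            "overlap_size": 100,
            "respect_boundaries": True,
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request: urllib.request.Request) -> dict:
    """Send the request and decode the response body as JSON (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_request(build_process_request("/absolute/path/to/document.pdf"))` performs the call. Building and sending are kept separate so the payload can be inspected without a live server.
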
### 2. Using Python Code

```python
from pathlib import Path
from src.bootstrap import create_application
from src.core.domain.models import ChunkingStrategy

# Initialize
container = create_application()
service = container.text_processor_service

# Process a PDF
strategy = ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,
    overlap_size=100,
    respect_boundaries=True,
)

document = service.process_document(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

print(f"Document ID: {document.id}")
print(f"Metadata: {document.get_metadata_summary()}")

# Extract and chunk
chunks = service.extract_and_chunk(
    file_path=Path("example.pdf"),
    chunking_strategy=strategy,
)

for chunk in chunks:
    print(f"Chunk {chunk.sequence_number}: {chunk.get_length()} chars")
```

## Available Chunking Strategies

### 1. Fixed Size
Splits text into equal-sized chunks with optional overlap.

```python
ChunkingStrategy(
    strategy_name="fixed_size",
    chunk_size=1000,          # Target size in characters
    overlap_size=100,         # Overlap between chunks
    respect_boundaries=True   # Try to break at sentences
)
```

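To make the interaction between `chunk_size` and `overlap_size` concrete, here is a minimal standalone sketch of fixed-size chunking with overlap. It is not the project's chunker: the function name `fixed_size_chunks` is hypothetical, and the sketch deliberately ignores `respect_boundaries`, so it cuts at exact character offsets.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by overlap_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap_size  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap_size=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts `chunk_size - overlap_size` characters after the previous one, which is why the last character of one chunk reappears as the first character of the next in the example.
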
### 2. Paragraph
Splits text by paragraph boundaries, combining paragraphs to reach target size.

```python
ChunkingStrategy(
    strategy_name="paragraph",
    chunk_size=1000,
    overlap_size=0,
    respect_boundaries=True
)
```

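The idea of combining paragraphs up to a target size can be sketched as a greedy merge. This is an illustration under stated assumptions, not the project's implementation: the function name `paragraph_chunks` is hypothetical, paragraphs are assumed to be separated by blank lines, and a single paragraph longer than `chunk_size` is kept whole rather than split.

```python
import re


def paragraph_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedily merge consecutive paragraphs while the result fits in chunk_size."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # adding this paragraph would overflow
            current = paragraph     # so it starts the next chunk
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(paragraph_chunks("aaaa\n\nbbbb\n\ncccc", chunk_size=10))
# ['aaaa\n\nbbbb', 'cccc']
```
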
## Supported File Types

- **PDF** (`.pdf`) - using PyPDF2
- **DOCX** (`.docx`) - using python-docx
- **Text** (`.txt`, `.md`, `.text`) - native Python

## Project Structure

```
text_processor_hex/
├── main.py                 # FastAPI entry point
├── example_usage.py        # Usage examples
├── requirements.txt        # Dependencies
│
└── src/
    ├── core/               # Business logic (NO external dependencies)
    │   ├── domain/         # Models, exceptions, logic
    │   ├── ports/          # Interface definitions
    │   └── services/       # Orchestration
    │
    ├── adapters/           # External integrations
    │   ├── incoming/       # FastAPI routes
    │   └── outgoing/       # Extractors, chunkers, storage
    │
    ├── shared/             # Utilities
    └── bootstrap.py        # Dependency injection
```

## Common Tasks

### Add a New File Type
1. Create an extractor module in `src/adapters/outgoing/extractors/`
2. Extend `BaseExtractor`
3. Register the new extractor in `bootstrap.py`

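As a rough illustration of the steps above, the sketch below adds a hypothetical CSV extractor. The real `BaseExtractor` interface is not shown in this guide, so the block defines its own stand-in base class. The method names `supports` and `extract`, and the `CsvExtractor` class itself, are assumptions made for the example. Check the actual base class in `src/adapters/outgoing/extractors/` before copying anything.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path


class BaseExtractor(ABC):
    """Stand-in for the project's base class; the real interface may differ."""

    @abstractmethod
    def supports(self, file_path: Path) -> bool:
        """Return True if this extractor can handle the file."""

    @abstractmethod
    def extract(self, file_path: Path) -> str:
        """Return the file's content as plain text."""


class CsvExtractor(BaseExtractor):
    """Hypothetical extractor that flattens a CSV file into plain text."""

    def supports(self, file_path: Path) -> bool:
        return file_path.suffix.lower() == ".csv"

    def extract(self, file_path: Path) -> str:
        # newline="" lets the csv module handle line endings itself
        with file_path.open(newline="", encoding="utf-8") as handle:
            rows = csv.reader(handle)
            return "\n".join(" ".join(cell.strip() for cell in row) for row in rows)
```

How the extractor is registered depends on the wiring in `bootstrap.py`, which is not reproduced here.
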
### Add a New Chunking Strategy
1. Create a chunker module in `src/adapters/outgoing/chunkers/`
2. Extend `BaseChunker`
3. Register the new chunker in `bootstrap.py`

### Change Storage
1. Implement the `IDocumentRepository` interface
2. Swap the implementation in `bootstrap.py`

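A storage swap might look like the in-memory sketch below. The actual `IDocumentRepository` port is not shown in this guide, so the `Protocol` here is a stand-in. Its method names (`save`, `get`, `list_documents`, `delete`) are assumptions loosely modelled on the get, list, and delete endpoints, and documents are represented as plain dictionaries for simplicity.

```python
from typing import Optional, Protocol


class IDocumentRepository(Protocol):
    """Stand-in for the project's port; the method names are assumptions."""

    def save(self, document_id: str, document: dict) -> None: ...
    def get(self, document_id: str) -> Optional[dict]: ...
    def list_documents(self, limit: int = 10, offset: int = 0) -> list[dict]: ...
    def delete(self, document_id: str) -> bool: ...


class InMemoryDocumentRepository:
    """Hypothetical implementation that keeps documents in a dictionary."""

    def __init__(self) -> None:
        self._documents: dict[str, dict] = {}

    def save(self, document_id: str, document: dict) -> None:
        self._documents[document_id] = document

    def get(self, document_id: str) -> Optional[dict]:
        return self._documents.get(document_id)

    def list_documents(self, limit: int = 10, offset: int = 0) -> list[dict]:
        # dicts preserve insertion order, so paging is stable
        return list(self._documents.values())[offset:offset + limit]

    def delete(self, document_id: str) -> bool:
        # True if something was removed, False if the ID was unknown
        return self._documents.pop(document_id, None) is not None
```

Because the core depends only on the interface, a database-backed class with the same methods could replace this one without touching the services.
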
## Testing

```bash
# Run the example script
python example_usage.py

# Check the health endpoint (the server must be running)
curl http://localhost:8000/health

# Browse the API docs
# Visit: http://localhost:8000/docs
```

## Troubleshooting

### Import Errors
```bash
# Make sure you're in the right directory
cd text_processor_hex

# Activate virtual environment
source venv/bin/activate
```

### Missing Dependencies
```bash
pip install -r requirements.txt
```

### File Not Found Errors
Use absolute paths for `file_path` in API requests. The server opens the path itself, so a relative path would be resolved against the server's working directory:
```json
{
  "file_path": "/absolute/path/to/file.pdf"
}
```

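When the path comes from user input, it can be converted to an absolute path on the client before the request is built. This is a small sketch using `pathlib`. The helper name `to_absolute` is hypothetical, and it assumes the client and the server see the same filesystem.

```python
from pathlib import Path


def to_absolute(file_path: str) -> str:
    """Expand `~` and resolve the path against the current working directory."""
    return str(Path(file_path).expanduser().resolve())


payload = {"file_path": to_absolute("documents/report.pdf")}
```

`resolve()` does not require the file to exist, so a missing file is still reported by the server rather than by this helper.
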
## Architecture Highlights

**Hexagonal Architecture:**
- Core business logic is isolated
- Easy to test without infrastructure
- Easy to swap implementations

**Design Patterns:**
- Factory: `ExtractorFactory` selects extractor by file type
- Strategy: `ChunkingContext` selects chunking strategy
- Repository: Abstract data storage
- Dependency Injection: All dependencies injected via bootstrap

**SOLID Principles:**
- Single Responsibility: Each class does one thing
- Open/Closed: Add features without modifying core
- Dependency Inversion: Core depends on abstractions

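The Factory pattern listed above, selecting an extractor by file type, can be illustrated with a small registry keyed by file extension. This sketch does not reproduce the project's `ExtractorFactory`: the `register` and `for_file` methods and the `extract_plain_text` helper are assumptions for the example, and only the plain-text extensions are registered, to keep it free of third-party dependencies.

```python
from pathlib import Path
from typing import Callable

Extractor = Callable[[Path], str]


def extract_plain_text(file_path: Path) -> str:
    """Read a text file as UTF-8."""
    return file_path.read_text(encoding="utf-8")


class ExtractorFactory:
    """Illustrative factory; the project's real ExtractorFactory may differ."""

    def __init__(self) -> None:
        self._registry: dict[str, Extractor] = {}

    def register(self, suffix: str, extractor: Extractor) -> None:
        self._registry[suffix.lower()] = extractor

    def for_file(self, file_path: Path) -> Extractor:
        suffix = file_path.suffix.lower()
        try:
            return self._registry[suffix]
        except KeyError:
            raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


factory = ExtractorFactory()
for suffix in (".txt", ".md", ".text"):
    factory.register(suffix, extract_plain_text)
```

Supporting another file type then only requires one more `register` call, which is the Open/Closed principle from the list in practice.
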
## Next Steps

1. Read `README.md` for detailed documentation
2. Read `ARCHITECTURE.md` for architecture details
3. Run `example_usage.py` to see it in action
4. Explore the code starting from `bootstrap.py`
5. Try the API using the Swagger docs at `/docs`

## Need Help?

- Check `README.md` for detailed docs
- Check `ARCHITECTURE.md` for architecture diagrams
- Check `PROJECT_SUMMARY.md` for a complete overview
- Look at `example_usage.py` for usage patterns