TEXT PROCESSOR - HEXAGONAL ARCHITECTURE
Complete Directory Structure

text_processor_hex/
│
├── 📄 README.md                                 Project documentation and overview
├── 📄 QUICK_START.md                            Quick start guide for users
├── 📄 ARCHITECTURE.md                           Detailed architecture documentation
├── 📄 PROJECT_SUMMARY.md                        Complete project summary
├── 📄 DIRECTORY_TREE.txt                        This file
│
├── 📄 requirements.txt                          Python dependencies
├── 🚀 main.py                                   FastAPI application entry point
├── 📝 example_usage.py                          Programmatic usage examples
│
└── 📁 src/
    ├── 📄 __init__.py
    ├── 🔧 bootstrap.py                          ⚙️ DEPENDENCY INJECTION CONTAINER
    │
    ├── 📁 core/                                 ⭐ DOMAIN LAYER (Pure Business Logic)
    │   ├── 📄 __init__.py
    │   │
    │   ├── 📁 domain/                           Domain Models & Logic
    │   │   ├── 📄 __init__.py
    │   │   ├── 📦 models.py                     Rich Pydantic v2 Entities
    │   │   │                                   - Document
    │   │   │                                   - DocumentMetadata
    │   │   │                                   - Chunk
    │   │   │                                   - ChunkingStrategy
    │   │   ├── ⚠️ exceptions.py                 Domain Exceptions
    │   │   │                                   - ExtractionError
    │   │   │                                   - ChunkingError
    │   │   │                                   - ProcessingError
    │   │   │                                   - ValidationError
    │   │   │                                   - RepositoryError
    │   │   └── 🔨 logic_utils.py                Pure Functions
    │   │                                       - normalize_whitespace()
    │   │                                       - clean_text()
    │   │                                       - split_into_paragraphs()
    │   │                                       - truncate_to_word_boundary()
    │   │
    │   ├── 📁 ports/                            Port Interfaces (Abstractions)
    │   │   ├── 📄 __init__.py
    │   │   │
    │   │   ├── 📁 incoming/                     Service Interfaces (Use Cases)
    │   │   │   ├── 📄 __init__.py
    │   │   │   └── 🔌 text_processor.py         ITextProcessor
    │   │   │                                   - process_document()
    │   │   │                                   - extract_and_chunk()
    │   │   │                                   - get_document()
    │   │   │                                   - list_documents()
    │   │   │
    │   │   └── 📁 outgoing/                     SPIs (Service Provider Interfaces)
    │   │       ├── 📄 __init__.py
    │   │       ├── 🔌 extractor.py              IExtractor
    │   │       │                               - extract()
    │   │       │                               - supports_file_type()
    │   │       ├── 🔌 chunker.py                IChunker
    │   │       │                               - chunk()
    │   │       │                               - supports_strategy()
    │   │       └── 🔌 repository.py             IDocumentRepository
    │   │                                       - save()
    │   │                                       - find_by_id()
    │   │                                       - delete()
    │   │
    │   └── 📁 services/                         Business Logic Orchestration
    │       ├── 📄 __init__.py
    │       └── ⚙️ document_processor_service.py
    │                                           DocumentProcessorService
    │                                           Implements: ITextProcessor
    │                                           Workflow: Extract → Clean → Chunk → Save
    │
    ├── 📁 adapters/                             🔌 ADAPTER LAYER (External Concerns)
    │   ├── 📄 __init__.py
    │   │
    │   ├── 📁 incoming/                         Driving Adapters (Primary)
    │   │   ├── 📄 __init__.py
    │   │   ├── 🌐 api_routes.py                 FastAPI Routes (HTTP Adapter)
    │   │   │                                   - POST /process
    │   │   │                                   - POST /extract-and-chunk
    │   │   │                                   - GET /documents/{id}
    │   │   │                                   - GET /documents
    │   │   │                                   - DELETE /documents/{id}
    │   │   └── 📋 api_schemas.py                Pydantic Request/Response Models
    │   │                                       - ProcessDocumentRequest
    │   │                                       - DocumentResponse
    │   │                                       - ChunkResponse
    │   │
    │   └── 📁 outgoing/                         Driven Adapters (Secondary)
    │       ├── 📄 __init__.py
    │       │
    │       ├── 📁 extractors/                   Text Extraction Adapters
    │       │   ├── 📄 __init__.py
    │       │   ├── 📑 base.py                   BaseExtractor (Template Method)
    │       │   ├── 📕 pdf_extractor.py          PDFExtractor
    │       │   │                               Uses: PyPDF2
    │       │   │                               Supports: .pdf
    │       │   ├── 📘 docx_extractor.py         DocxExtractor
    │       │   │                               Uses: python-docx
    │       │   │                               Supports: .docx
    │       │   ├── 📄 txt_extractor.py          TxtExtractor
    │       │   │                               Uses: built-in
    │       │   │                               Supports: .txt, .md
    │       │   └── 🏭 factory.py                ExtractorFactory (Factory Pattern)
    │       │                                   - create_extractor()
    │       │                                   - register_extractor()
    │       │
    │       ├── 📁 chunkers/                     Text Chunking Adapters
    │       │   ├── 📄 __init__.py
    │       │   ├── 📑 base.py                   BaseChunker (Template Method)
    │       │   ├── ✂️ fixed_size_chunker.py     FixedSizeChunker
    │       │   │                               Strategy: Fixed-size chunks
    │       │   │                               Features: Overlap, boundaries
    │       │   ├── 📝 paragraph_chunker.py      ParagraphChunker
    │       │   │                               Strategy: Paragraph-based
    │       │   │                               Features: Respect paragraphs
    │       │   └── 🎯 context.py                ChunkingContext (Strategy Pattern)
    │       │                                   - set_strategy()
    │       │                                   - execute_chunking()
    │       │
    │       └── 📁 persistence/                  Data Persistence Adapters
    │           ├── 📄 __init__.py
    │           └── 💾 in_memory_repository.py
    │                                           InMemoryDocumentRepository
    │                                           Features: Thread-safe, Dict storage
    │
    └── 📁 shared/                               🛠️ SHARED LAYER (Cross-Cutting)
        ├── 📄 __init__.py
        ├── 🎛️ constants.py                      Application Constants
        │                                       - File types
        │                                       - Chunk sizes
        │                                       - API config
        └── 📋 logging_config.py                 Logging Configuration
                                                - setup_logging()
                                                - get_logger()

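The tree lists DocumentProcessorService as implementing the workflow Extract → Clean → Chunk → Save against the outgoing ports. The sketch below shows one way that orchestration could look. It is an illustration under assumptions, not the project's code: the method signatures are guesses (the tree gives only method names), Document is a plain dataclass standing in for the Pydantic entity, and StubExtractor, WordChunker and DictRepository are hypothetical throwaway adapters.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Document:
    """Simplified stand-in for the Pydantic Document entity (assumption)."""
    id: str
    source: str
    text: str
    chunks: list[str] = field(default_factory=list)


# Outgoing ports: the core depends on these abstractions only.
class IExtractor(Protocol):
    def extract(self, path: str) -> str: ...


class IChunker(Protocol):
    def chunk(self, text: str) -> list[str]: ...


class IDocumentRepository(Protocol):
    def save(self, document: Document) -> None: ...
    def find_by_id(self, document_id: str) -> Document | None: ...


class DocumentProcessorService:
    """Orchestrates Extract -> Clean -> Chunk -> Save through the ports."""

    def __init__(self, extractor: IExtractor, chunker: IChunker,
                 repository: IDocumentRepository) -> None:
        self._extractor = extractor
        self._chunker = chunker
        self._repository = repository

    def process_document(self, path: str) -> Document:
        raw = self._extractor.extract(path)        # Extract
        cleaned = " ".join(raw.split())            # Clean: collapse whitespace
        chunks = self._chunker.chunk(cleaned)      # Chunk
        document = Document(id=str(uuid.uuid4()), source=path,
                            text=cleaned, chunks=chunks)
        self._repository.save(document)            # Save
        return document


# Hypothetical adapters, present only to exercise the service.
class StubExtractor:
    def extract(self, path: str) -> str:
        return "Hello   hexagonal\n\nworld"


class WordChunker:
    def chunk(self, text: str) -> list[str]:
        return text.split()


class DictRepository:
    def __init__(self) -> None:
        self.items: dict[str, Document] = {}

    def save(self, document: Document) -> None:
        self.items[document.id] = document

    def find_by_id(self, document_id: str) -> Document | None:
        return self.items.get(document_id)


# Wiring by hand; in the project this role belongs to bootstrap.py.
repository = DictRepository()
service = DocumentProcessorService(StubExtractor(), WordChunker(), repository)
doc = service.process_document("notes.txt")
print(doc.text, doc.chunks)
```

Because the service only sees the three port types, swapping an adapter means changing the wiring lines at the bottom, not the service.
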
═══════════════════════════════════════════════════════════════════════════

📊 PROJECT STATISTICS
═══════════════════════════════════════════════════════════════════════════

Total Files: 45 (as listed in the tree above)
  - Python files: 39
  - Documentation: 4 (README, ARCHITECTURE, PROJECT_SUMMARY, QUICK_START)
  - Configuration: 1 (requirements.txt)
  - Other: 1 (this tree)

Lines of Code: ~3,800
  - Core Domain: ~1,200 lines
  - Adapters: ~1,400 lines
  - Bootstrap/Main: ~200 lines
  - Documentation: ~1,000 lines

═══════════════════════════════════════════════════════════════════════════

🏗️ ARCHITECTURE LAYERS
═══════════════════════════════════════════════════════════════════════════

1. CORE (Domain Layer)
   - Pure business logic
   - No external dependencies beyond Pydantic v2 (used for the domain models)
   - Rich domain models
   - Pure functions

2. ADAPTERS (Infrastructure Layer)
   - Incoming: FastAPI (HTTP)
   - Outgoing: Extractors, Chunkers, Repository
   - Technology-specific implementations

3. BOOTSTRAP (Wiring Layer)
   - Dependency injection
   - Configuration
   - Application assembly

4. SHARED (Utilities Layer)
   - Cross-cutting concerns
   - Logging, constants
   - No business logic

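To make the "pure functions" point of the core layer concrete, here is a rough sketch of what three of the helpers named in logic_utils.py might do. Only the function names come from the tree; the parameters, return types and exact behavior are assumptions made for illustration.

```python
from __future__ import annotations

import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def split_into_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    parts = re.split(r"\n\s*\n", text)
    return [normalize_whitespace(part) for part in parts if part.strip()]


def truncate_to_word_boundary(text: str, max_length: int) -> str:
    """Cut text to at most max_length characters without splitting a word."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    # If the next character is whitespace, the cut already ends on a word.
    if text[max_length].isspace():
        return cut.rstrip()
    head, separator, _ = cut.rpartition(" ")
    # No space inside the cut: fall back to a hard cut of one long word.
    return head.rstrip() if separator else cut


print(normalize_whitespace("  a \n b\t c "))
print(split_into_paragraphs("first\n\nsecond  part\n\n\n third"))
print(truncate_to_word_boundary("hello wonderful world", 10))
```

None of these functions touches files, the network or global state, so they can be tested with plain assertions and reused by any adapter.
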
═══════════════════════════════════════════════════════════════════════════

🎨 DESIGN PATTERNS
═══════════════════════════════════════════════════════════════════════════

✓ Hexagonal Architecture (Ports & Adapters)
✓ Factory Pattern (ExtractorFactory)
✓ Strategy Pattern (ChunkingContext)
✓ Repository Pattern (IDocumentRepository)
✓ Template Method Pattern (BaseExtractor, BaseChunker)
✓ Dependency Injection (ApplicationContainer)

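As an illustration of how the Factory and Template Method entries above could fit together, the sketch below pairs a BaseExtractor whose extract() fixes the steps with an ExtractorFactory that maps file extensions to extractor classes. The names register_extractor() and create_extractor() come from the tree; their signatures, the extensions attribute and the _read() hook are assumptions.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class BaseExtractor(ABC):
    """Template Method: extract() fixes the steps, subclasses supply _read()."""

    extensions: tuple[str, ...] = ()  # hypothetical class attribute

    def supports_file_type(self, path: str) -> bool:
        return Path(path).suffix.lower() in self.extensions

    def extract(self, path: str) -> str:
        if not self.supports_file_type(path):
            raise ValueError(f"{type(self).__name__} cannot read {path!r}")
        return self._read(path).strip()

    @abstractmethod
    def _read(self, path: str) -> str:
        """Format-specific step implemented by each concrete extractor."""


class TxtExtractor(BaseExtractor):
    extensions = (".txt", ".md")

    def _read(self, path: str) -> str:
        return Path(path).read_text(encoding="utf-8")


class ExtractorFactory:
    """Factory: picks an extractor class from the file extension."""

    def __init__(self) -> None:
        self._registry: dict[str, type[BaseExtractor]] = {}

    def register_extractor(self, extractor_cls: type[BaseExtractor]) -> None:
        for extension in extractor_cls.extensions:
            self._registry[extension] = extractor_cls

    def create_extractor(self, path: str) -> BaseExtractor:
        extension = Path(path).suffix.lower()
        try:
            return self._registry[extension]()
        except KeyError:
            raise ValueError(f"No extractor registered for {extension!r}") from None


factory = ExtractorFactory()
factory.register_extractor(TxtExtractor)
extractor = factory.create_extractor("notes.MD")
print(type(extractor).__name__)
```

Supporting a new format in this sketch means writing one subclass and one register_extractor() call; the factory and the callers stay unchanged.
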
═══════════════════════════════════════════════════════════════════════════

💎 SOLID PRINCIPLES
═══════════════════════════════════════════════════════════════════════════

✓ Single Responsibility: Each class has one job
✓ Open/Closed: Extend via interfaces, not modification
✓ Liskov Substitution: All implementations are interchangeable
✓ Interface Segregation: Small, focused interfaces
✓ Dependency Inversion: Depend on abstractions, not concretions

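The Open/Closed and Liskov points are easiest to see in the chunking code. The following sketch assumes a simplified IChunker with a single chunk() method returning strings (the tree also lists supports_strategy(), omitted here) and shows ChunkingContext swapping strategies at runtime. Constructor parameters and chunk boundaries are illustrative guesses.

```python
from __future__ import annotations

from typing import Protocol


class IChunker(Protocol):
    def chunk(self, text: str) -> list[str]: ...


class FixedSizeChunker:
    """Fixed-size character windows that overlap by `overlap` characters."""

    def __init__(self, size: int = 200, overlap: int = 20) -> None:
        if size <= 0 or not 0 <= overlap < size:
            raise ValueError("require size > 0 and 0 <= overlap < size")
        self.size = size
        self.overlap = overlap

    def chunk(self, text: str) -> list[str]:
        chunks: list[str] = []
        step = self.size - self.overlap
        start = 0
        while start < len(text):
            chunks.append(text[start:start + self.size])
            if start + self.size >= len(text):
                break  # the last window already reached the end of the text
            start += step
        return chunks


class ParagraphChunker:
    """One chunk per paragraph, where paragraphs are separated by blank lines."""

    def chunk(self, text: str) -> list[str]:
        return [part.strip() for part in text.split("\n\n") if part.strip()]


class ChunkingContext:
    """Strategy pattern: delegates to whichever chunker is currently set."""

    def __init__(self, strategy: IChunker) -> None:
        self._strategy = strategy

    def set_strategy(self, strategy: IChunker) -> None:
        self._strategy = strategy

    def execute_chunking(self, text: str) -> list[str]:
        return self._strategy.chunk(text)


context = ChunkingContext(FixedSizeChunker(size=4, overlap=2))
print(context.execute_chunking("abcdefghij"))

context.set_strategy(ParagraphChunker())
print(context.execute_chunking("one\n\ntwo\n\n"))
```

A third strategy, such as a sentence-based chunker, could be added as a new class without editing ChunkingContext or the existing chunkers.
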
═══════════════════════════════════════════════════════════════════════════

🎯 KEY FEATURES
═══════════════════════════════════════════════════════════════════════════

✓ Multiple file types (PDF, DOCX, TXT)
✓ Multiple chunking strategies (Fixed, Paragraph)
✓ Rich domain models with validation
✓ Comprehensive error handling
✓ RESTful API with FastAPI
✓ Thread-safe repository
✓ 100% type hints
✓ Google-style docstrings
✓ Complete documentation

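For the "thread-safe repository" feature, a dict guarded by a lock is the usual minimal approach. The sketch below is one possible shape for InMemoryDocumentRepository, assuming the save(), find_by_id() and delete() methods listed for IDocumentRepository take and return the types shown; the Document dataclass is a simplified stand-in for the real entity.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Simplified stand-in for the domain entity (assumption)."""
    id: str
    text: str


class InMemoryDocumentRepository:
    """Dict storage; every access to the dict happens under one lock."""

    def __init__(self) -> None:
        self._documents: dict[str, Document] = {}
        self._lock = threading.Lock()

    def save(self, document: Document) -> None:
        with self._lock:
            self._documents[document.id] = document

    def find_by_id(self, document_id: str) -> Document | None:
        with self._lock:
            return self._documents.get(document_id)

    def delete(self, document_id: str) -> bool:
        """Return True if a document was removed, False if the id was unknown."""
        with self._lock:
            return self._documents.pop(document_id, None) is not None


repository = InMemoryDocumentRepository()
repository.save(Document(id="doc-1", text="hello"))
print(repository.find_by_id("doc-1"))
print(repository.delete("doc-1"), repository.delete("doc-1"))
```

Storing immutable (frozen) documents keeps callers from mutating stored state outside the lock, which is a design choice of this sketch rather than something the tree specifies.
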
═══════════════════════════════════════════════════════════════════════════

📚 DOCUMENTATION FILES
═══════════════════════════════════════════════════════════════════════════

README.md           - Project overview and installation
QUICK_START.md      - Quick start guide for users
ARCHITECTURE.md     - Detailed architecture documentation with diagrams
PROJECT_SUMMARY.md  - Complete project summary and statistics
DIRECTORY_TREE.txt  - This file

═══════════════════════════════════════════════════════════════════════════