# File Processing Pipeline Documentation

## Overview

The pipeline accepts document uploads for booth agents, runs them through a Python document processor and OpenAI, and stores both the raw and the processed files in a per-booth directory tree.

## Directory Structure

```
cdn.webcomtower.ir/
└── exhibitions/
    └── {exhibition_id}/
        └── boughten_booths/
            └── {booth_id}/
                └── agents/
                    ├── raw_files/      (initial uploaded files)
                    └── processed_files/ (files after AI processing)
```
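
As an illustration of how these paths could be assembled, here is a minimal sketch. The function name `agentFileDirs` and the shape of its return value are hypothetical; only the directory names are taken from the tree above.

```javascript
// Hypothetical helper: builds the raw and processed directories for one booth.
// Only the path segments come from the tree above; the function is illustrative.
const path = require('path');

function agentFileDirs(root, exhibitionId, boothId) {
  // path.posix keeps forward slashes on every platform, which suits CDN paths.
  const agentsDir = path.posix.join(
    root,
    'exhibitions',
    String(exhibitionId),
    'boughten_booths',
    String(boothId),
    'agents'
  );
  return {
    raw: path.posix.join(agentsDir, 'raw_files'),
    processed: path.posix.join(agentsDir, 'processed_files'),
  };
}

const dirs = agentFileDirs('cdn.webcomtower.ir', 42, 7);
console.log(dirs.raw);
// cdn.webcomtower.ir/exhibitions/42/boughten_booths/7/agents/raw_files
```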

## Component Breakdown

### 1. Upload Configuration (src/config/upload.js)
- Uses Multer for handling file uploads
- Enforces a 100 MB file size limit
- Creates unique filenames using timestamp and random numbers
- Stores files initially in a temporary directory
- Supports different naming conventions based on reference type
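
The naming scheme can be sketched as follows. The exact filename layout used by `src/config/upload.js` is not documented here, so the `timestamp-random.ext` format, the function name `uniqueFilename`, and the constant name are assumptions. The reference-type naming conventions are left out because their rules are not stated.

```javascript
// Sketch of a timestamp-plus-random filename scheme (layout is an assumption).
const path = require('path');

const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024; // the 100 MB limit listed above

// `now` and `rand` are injectable so the function can be tested deterministically.
function uniqueFilename(originalName, now = Date.now(), rand = Math.random()) {
  const ext = path.extname(originalName).toLowerCase(); // keeps ".pdf", ".docx", ...
  const suffix = Math.floor(rand * 1e9); // random integer below one billion
  return `${now}-${suffix}${ext}`;
}

console.log(uniqueFilename('Catalog.PDF', 1700000000000, 0.5));
// 1700000000000-500000000.pdf
```

With Multer, a function like this would typically be called from the `filename` callback of `multer.diskStorage`, and the size constant passed as `limits.fileSize`.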

### 2. File Controller (src/controllers/fileController.js)
- Validates upload requests and the uploaded files
- Creates upload tracking entries
- Manages file movement from temporary to final location
- Builds proper directory structure
- Generates file URLs
- Handles error scenarios and cleanup
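
The move-and-cleanup responsibility might look roughly like the sketch below. It is not the controller's actual code: the function name `moveToFinal` and its parameters are hypothetical, and synchronous `fs` calls are used only to keep the example short.

```javascript
// Sketch: move an upload from the temporary directory to its final directory,
// removing the temp file if anything fails. All names are hypothetical.
const fs = require('fs');
const path = require('path');

function moveToFinal(tempPath, finalDir, filename) {
  const finalPath = path.join(finalDir, filename);
  try {
    fs.mkdirSync(finalDir, { recursive: true }); // build missing directories
    try {
      fs.renameSync(tempPath, finalPath); // fast path: same filesystem
    } catch (err) {
      if (err.code !== 'EXDEV') throw err;
      // rename cannot cross filesystems, so fall back to copy + delete
      fs.copyFileSync(tempPath, finalPath);
      fs.unlinkSync(tempPath);
    }
    return finalPath;
  } catch (err) {
    // cleanup: do not leave the temporary file behind after a failure
    fs.rmSync(tempPath, { force: true });
    throw err;
  }
}
```

A request handler would normally use the promise-based `fs.promises` equivalents so that file I/O does not block the event loop.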

### 3. Document Processing Controller (src/AI-PipeLine/controllers/documentProcessing.js)
- Manages the document processing queue
- Calculates estimated processing times
- Spawns Python document processor
- Integrates with OpenAI for AI processing
- Updates processing status and progress
- Creates AI assistants and vector stores
- Maintains processing summaries
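
Spawning the Python processor from Node.js could be done along these lines. This is a sketch using the standard `child_process.spawn` API; the wrapper name `runProcessor` is hypothetical, and the script's real command-line arguments are not documented here.

```javascript
// Sketch: run an external processor and collect its output as a promise.
const { spawn } = require('child_process');

function runProcessor(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = '';
    let stderr = '';
    child.stdout.setEncoding('utf8');
    child.stderr.setEncoding('utf8');
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.stderr.on('data', (chunk) => { stderr += chunk; });
    child.on('error', reject); // e.g. the executable was not found
    child.on('close', (code) => {
      if (code === 0) resolve(stdout);
      else reject(new Error(`processor exited with code ${code}: ${stderr}`));
    });
  });
}

// Assumed invocation (the argument list is a guess, not the script's documented CLI):
// runProcessor('python3', ['src/AI-PipeLine/utils/document_processor.py', '/path/to/file.pdf'])
//   .then((output) => console.log(output));
```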

### 4. Queue Status Model (src/AI-PipeLine/models/queueStatus.js)
- Tracks processing queue state in MongoDB
- Fields include:
  * agent_id, exhibition_id, boughten_booth_id
  * queue_position and progress
  * estimated_completion_time
  * status (queued/processing/completed/failed)
  * timestamps and error information
- Auto-removes completed/failed entries after 1 hour
- Maintains efficient indexes for queue operations
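
To make the entry lifecycle concrete, here is a plain-object sketch of a queue entry and the one-hour retention rule. It is not the MongoDB model itself: the helper names and the `created_at`, `finished_at`, and `error` field names are assumptions standing in for the "timestamps and error information" mentioned above.

```javascript
// Plain-object sketch of a queue entry; the real model lives in MongoDB.
const RETENTION_MS = 60 * 60 * 1000; // finished entries are kept for 1 hour

function createQueueEntry({ agentId, exhibitionId, boothId, position, estimatedMs }, now = Date.now()) {
  return {
    agent_id: agentId,
    exhibition_id: exhibitionId,
    boughten_booth_id: boothId,
    queue_position: position,
    progress: 0,
    estimated_completion_time: new Date(now + estimatedMs),
    status: 'queued',
    created_at: new Date(now), // hypothetical field name
    finished_at: null,         // hypothetical field name
    error: null,               // hypothetical field name
  };
}

// Marks an entry as finished; only the two terminal statuses are accepted.
function finish(entry, status, error = null, now = Date.now()) {
  if (status !== 'completed' && status !== 'failed') {
    throw new Error(`not a terminal status: ${status}`);
  }
  const progress = status === 'completed' ? 100 : entry.progress;
  return { ...entry, status, error, progress, finished_at: new Date(now) };
}

// True once a completed or failed entry is older than the retention window.
function isExpired(entry, now = Date.now()) {
  const done = entry.status === 'completed' || entry.status === 'failed';
  return done && entry.finished_at !== null &&
    now - entry.finished_at.getTime() >= RETENTION_MS;
}
```

In MongoDB, this kind of expiry is commonly implemented with a TTL index on a date field; whether the model does so is not stated here.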

### 5. Document Processor (src/AI-PipeLine/utils/document_processor.py)
- Handles multiple document formats:
  * PDF (.pdf)
  * Word (.doc, .docx)
  * Spreadsheets (.xls, .xlsx, .csv)
  * Text (.txt, .md)
  * Data files (.json, .yaml)
- Features:
  * Text extraction and normalization
  * Table detection and processing
  * Language detection (Persian/English)
  * Sentence tokenization
  * Bidirectional text handling
  * Persian text processing using Hazm
  * Detailed error handling and logging
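
Two of these ideas, format dispatch and Persian/English detection, can be illustrated with a short sketch. The actual processor is written in Python and uses Hazm; the JavaScript below only shows the general approach. The category names, function names, and the character-counting heuristic are assumptions, not the processor's real logic.

```javascript
// Maps the extensions listed above to a format category (category names are illustrative).
const FORMAT_BY_EXTENSION = {
  '.pdf': 'pdf',
  '.doc': 'word', '.docx': 'word',
  '.xls': 'spreadsheet', '.xlsx': 'spreadsheet', '.csv': 'spreadsheet',
  '.txt': 'text', '.md': 'text',
  '.json': 'data', '.yaml': 'data',
};

function formatOf(filename) {
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot).toLowerCase();
  return FORMAT_BY_EXTENSION[ext] || null; // null means unsupported
}

// Heuristic: compare the count of Arabic-script characters (the script Persian
// is written in) with the count of Latin letters.
function detectLanguage(text) {
  const persian = (text.match(/[\u0600-\u06FF]/g) || []).length;
  const latin = (text.match(/[A-Za-z]/g) || []).length;
  if (persian === 0 && latin === 0) return 'unknown';
  return persian >= latin ? 'fa' : 'en';
}

console.log(formatOf('Report.DOCX'));       // word
console.log(detectLanguage('سلام دنیا'));    // fa
console.log(detectLanguage('hello world')); // en
```

A script-based check like this cannot tell Persian apart from Arabic, so it is only suitable where those are the two expected languages, as stated above.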

## Processing Flow

1. Initial Upload:
   - File received through Multer
   - Stored in temporary directory
   - Basic validation performed

2. Queue Entry:
   - Create queue status entry
   - Calculate estimated processing time
   - Assign queue position

3. Directory Creation:
   - Build final directory path
   - Create necessary subdirectories
   - Move file from temp to raw_files

4. Document Processing:
   - Python processor analyzes document
   - Extracts text and tables
   - Handles language-specific processing
   - Normalizes content
   - Generates processing summary

5. AI Integration:
   - Upload processed files to OpenAI
   - Create an AI assistant
   - Set up a vector store for embeddings
   - Calculate token usage

6. Completion:
   - Store results in processed_files
   - Update queue status
   - Clean up temporary files
   - Generate final summary
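
The flow above can be pictured as a sequence of stages that update a status entry as they go. The runner below is a hypothetical sketch, not the project's code: `runPipeline`, the stage objects, and the progress formula are all assumptions chosen to show ordering, failure marking, and cleanup.

```javascript
// Hypothetical stage runner: executes stages in order and records the outcome.
async function runPipeline(stages, entry, cleanup) {
  entry.status = 'processing';
  try {
    for (let i = 0; i < stages.length; i++) {
      await stages[i].run(entry);
      // Assumed progress formula: share of stages finished, as a percentage.
      entry.progress = Math.round(((i + 1) / stages.length) * 100);
    }
    entry.status = 'completed';
  } catch (err) {
    entry.status = 'failed';
    entry.error = err.message;
  } finally {
    await cleanup(entry); // temporary files are removed on success and on failure
  }
  return entry;
}

// Usage with placeholder stages that do nothing:
const stages = [
  { name: 'document processing', run: async () => {} },
  { name: 'AI integration', run: async () => {} },
];
runPipeline(stages, { status: 'queued', progress: 0, error: null }, async () => {})
  .then((entry) => console.log(entry.status, entry.progress));
// completed 100
```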

## Error Handling

The pipeline accounts for the following failure cases:

- File validation errors
- Processing failures
- Queue management issues
- AI integration problems

When a failure occurs, temporary files are cleaned up and the error is logged in detail.

## Performance Considerations

- Queue management for parallel processing
- Estimated time calculations
- Progress tracking
- Automatic cleanup of old entries
- Efficient file movement
- Language-specific optimizations

## Monitoring and Status

- Real-time progress updates
- Queue position tracking
- Estimated completion times
- Error reporting
- Processing summaries
- AI integration status