# Document Processing System Setup

## Overview

This system processes documents in several formats, extracts their sentences and tables in English and Persian, and combines all results into a single consolidated JSON file.

## Supported File Types

- Documents: PDF, DOC, DOCX
- Spreadsheets: CSV, XLS, XLSX
- Text: TXT, MD
- Data: JSON, YAML/YML
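
As an illustration of how the list above could drive file-type validation, here is a minimal sketch that maps a file name to one of the four categories by extension. The names `SUPPORTED_EXTENSIONS` and `file_category` are hypothetical and not taken from the pipeline's code; the real validation may also inspect file content.

```python
from pathlib import Path

# Extensions taken from the list above, grouped by category.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".doc", ".docx"},
    "spreadsheets": {".csv", ".xls", ".xlsx"},
    "text": {".txt", ".md"},
    "data": {".json", ".yaml", ".yml"},
}


def file_category(filename):
    """Return the category of a supported file name, or None if unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

A caller could skip any file for which `file_category` returns `None`.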

## Directory Structure

```
exhibitions/
└── [exhibition_id]/
    └── boughten_booths/
        └── [boughten_booths_id]/
            └── agents/
                ├── raw_files/        # Original uploaded files
                └── processed_files/  # Contains processed_results.json
```
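
The layout above can be expressed as a small path helper. This is a sketch only: `agent_dirs` and its `base_dir` parameter are hypothetical names, and the directory names are copied from the tree as shown.

```python
from pathlib import Path


def agent_dirs(base_dir, exhibition_id, booth_id):
    """Return the (raw_files, processed_files) paths for one booth."""
    agents = (
        Path(base_dir)
        / "exhibitions"
        / str(exhibition_id)
        / "boughten_booths"
        / str(booth_id)
        / "agents"
    )
    return agents / "raw_files", agents / "processed_files"
```

For example, `agent_dirs("data", 12, 34)` yields the two directories under `data/exhibitions/12/boughten_booths/34/agents/`.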

## Installation

1. Ensure Python 3.8+ and Node.js 14+ are installed

2. Install Python dependencies:

```bash
cd src/AI-PipeLine
pip install -r requirements.txt
```

3. Install Java (required for PDF processing):

```bash
# On Ubuntu/Debian
sudo apt-get install default-jre

# On Windows
# Download and install from https://www.java.com/en/download/
```

## How It Works

1. **File Upload**
   - Files are uploaded to the `raw_files` directory
   - System validates file types and content

2. **Language Detection**
   - Automatically detects whether text is English or Persian
   - Uses the appropriate tokenizer for each language:
     - NLTK for English
     - Hazm for Persian

3. **Automatic Processing**
   - Processing starts automatically after successful upload
   - Each file is processed based on its type
   - All results are combined into a single JSON file
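
The language detection step could look roughly like the sketch below. It assumes a simple script-based heuristic: text counts as Persian when characters from the Arabic Unicode block (U+0600 to U+06FF) outnumber ASCII letters. The system's actual detector is not described here, and `detect_language` and `split_sentences` are hypothetical names. `split_sentences` relies on the `sent_tokenize` functions that NLTK and Hazm each provide; NLTK also needs its Punkt tokenizer data to be downloaded first.

```python
def detect_language(text):
    """Return "fa" if Arabic-block characters outnumber ASCII letters, else "en"."""
    arabic_block = 0
    latin = 0
    for ch in text:
        if "\u0600" <= ch <= "\u06FF":
            arabic_block += 1
        elif ch.isascii() and ch.isalpha():
            latin += 1
    return "fa" if arabic_block > latin else "en"


def split_sentences(text):
    """Split text into sentences with the tokenizer matching its language."""
    # Imports are local so that only the needed library is loaded.
    if detect_language(text) == "fa":
        from hazm import sent_tokenize  # Persian
    else:
        from nltk.tokenize import sent_tokenize  # English, needs Punkt data
    return sent_tokenize(text)
```

Note that this heuristic cannot tell Persian apart from other languages written in Arabic script, and empty text falls back to `"en"`.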

## Output Format

The system generates a single `processed_results.json` containing:

```json
{
  "timestamp": "ISO timestamp",
  "input_directory": "path/to/input",
  "output_directory": "path/to/output",
  "files": [
    {
      "file_id": "unique_file_id",
      "original_name": "original_filename",
      "result": {
        "file_info": {
          "original_path": "path/to/original/file",
          "processed_path": "file_id",
          "mime_type": "file/mimetype",
          "processed_at": "timestamp"
        },
        "content": {
          "language": "en|fa",
          "sentences": ["Sentence 1", "Sentence 2", ...],
          "tables": [
            {
              "headers": ["Column1", "Column2", ...],
              "data": [["Row1Col1", "Row1Col2"], ...]
            }
          ]
        },
        "display": {
          "normalized_text": "normalized content",
          "display_text": "RTL-formatted text if applicable"
        }
      }
    }
  ],
  "consolidated": {
    "english": {
      "sentences": [
        {
          "file": "file_id",
          "sentence": "English sentence"
        }
      ],
      "tables": [
        {
          "file": "file_id",
          "table": {
            "headers": ["Column1", "Column2"],
            "data": [["Row1Col1", "Row1Col2"]]
          }
        }
      ]
    },
    "persian": {
      "sentences": [
        {
          "file": "file_id",
          "sentence": "Persian sentence"
        }
      ],
      "tables": [
        {
          "file": "file_id",
          "table": {
            "headers": ["Column1", "Column2"],
            "data": [["Row1Col1", "Row1Col2"]]
          }
        }
      ]
    }
  }
}
```
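
To show how the `consolidated` section relates to the per-file entries, here is a minimal sketch that derives one from the other. It assumes each file has a single detected language and that a file's tables are grouped under that language; `consolidate` is a hypothetical name, not the pipeline's own function.

```python
def consolidate(files):
    """Build the "consolidated" section from a list of per-file entries."""
    consolidated = {
        "english": {"sentences": [], "tables": []},
        "persian": {"sentences": [], "tables": []},
    }
    bucket_for = {"en": "english", "fa": "persian"}
    for entry in files:
        content = entry["result"]["content"]
        bucket_name = bucket_for.get(content.get("language"))
        if bucket_name is None:
            continue  # assumption: entries with an unknown language are skipped
        bucket = consolidated[bucket_name]
        file_id = entry["file_id"]
        for sentence in content.get("sentences", []):
            bucket["sentences"].append({"file": file_id, "sentence": sentence})
        for table in content.get("tables", []):
            bucket["tables"].append({"file": file_id, "table": table})
    return consolidated
```

Every consolidated item keeps the `file_id` of its source, so a sentence or table can be traced back to its entry in `files`.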

## Features

1. **Language Support**
   - English sentence tokenization using NLTK
   - Persian sentence tokenization using Hazm
   - Automatic language detection
   - RTL text handling for Persian

2. **Content Processing**
   - Text extraction from multiple file formats
   - Table extraction from PDFs and spreadsheets
   - Text normalization for Persian
   - Original file structure preservation

3. **Output Organization**
   - All results in a single JSON file
   - Separated English and Persian content
   - File-level and consolidated results
   - Maintains source file references

## Monitoring

- Check processing status using the `/status` endpoint
- View consolidated results in `processed_files/processed_results.json`
- Each file's results are tracked individually and in consolidated format
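
For a quick check of the results file, a script along these lines could report how much was extracted. It is a sketch that assumes the JSON layout shown under Output Format; `summarize_results` is a hypothetical helper and does not call the `/status` endpoint.

```python
import json
from pathlib import Path


def summarize_results(results_path):
    """Count files and consolidated sentences in a processed_results.json file."""
    data = json.loads(Path(results_path).read_text(encoding="utf-8"))
    consolidated = data.get("consolidated", {})
    return {
        "files": len(data.get("files", [])),
        "english_sentences": len(consolidated.get("english", {}).get("sentences", [])),
        "persian_sentences": len(consolidated.get("persian", {}).get("sentences", [])),
    }
```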

## Error Handling

- Failed files don't stop the batch process
- Errors are logged in the results JSON
- Invalid file types are skipped
- Processing errors don't affect file upload status
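
The first two points describe per-file error isolation, which can be sketched as a loop that records a failure and moves on. The `error` key and the `process_batch` and `process_file` names are assumptions for illustration; the actual field the system writes for a failed file is not documented here.

```python
def process_batch(paths, process_file):
    """Process every path, recording failures without stopping the batch."""
    results = []
    for path in paths:
        try:
            results.append({"original_name": path, "result": process_file(path)})
        except Exception as exc:
            # A failed file is recorded alongside the successes, then skipped.
            results.append({"original_name": path, "error": str(exc)})
    return results
```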

## Maintenance

To clean up all files for a specific exhibition/booth:

- Use the delete endpoint to remove both raw and processed files
- The system handles directory cleanup automatically
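
What the cleanup might do on disk can be sketched as follows, assuming the directory layout shown under Directory Structure. `clean_booth` is a hypothetical helper, not the delete endpoint itself: it removes `raw_files` and `processed_files` for one booth, then prunes any directories left empty, stopping at `exhibitions/`.

```python
import shutil
from pathlib import Path


def clean_booth(base_dir, exhibition_id, booth_id):
    """Delete one booth's raw and processed files, then prune empty parents."""
    root = Path(base_dir) / "exhibitions"
    agents = root / str(exhibition_id) / "boughten_booths" / str(booth_id) / "agents"
    for name in ("raw_files", "processed_files"):
        shutil.rmtree(agents / name, ignore_errors=True)
    # Walk upward, removing directories that are now empty.
    current = agents
    while current != root and current.is_dir() and not any(current.iterdir()):
        current.rmdir()
        current = current.parent
```

Directories that still hold other booths or exhibitions are left in place, because pruning stops at the first non-empty parent.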
