# Aumentum Storage Structure - Complete Understanding

## 🎯 Final Answer

### Directory Structure

```
/contentstore/YYYY/MM/DD/NODE_ID/BATCH_ID/UUID.bin
```

**Where:**
- `YYYY/MM/DD` = Upload date (month and day are not zero-padded in the observed paths, e.g. `2025/11/4`)
- `NODE_ID` = Scanner/content server node (9, 10, 13, etc.)
- `BATCH_ID` = Upload batch number for that node (1, 2, 3, etc.)
- `UUID.bin` = Unique file identifier
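
The layout above can be made concrete with a small parser. This is a minimal sketch, not Aumentum's own code: `StorePath` and `parse_store_path` are hypothetical names, and it assumes the five directory levels directly above the file name are always year, month, day, NODE_ID and BATCH_ID, as described here.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class StorePath:
    year: int
    month: int
    day: int
    node_id: int
    batch_id: int
    uuid: str


def parse_store_path(path: str) -> StorePath:
    """Split .../YYYY/MM/DD/NODE_ID/BATCH_ID/UUID.bin into its parts."""
    p = PurePosixPath(path)
    if p.suffix != ".bin":
        raise ValueError(f"not a .bin content file: {path}")
    # The five directory levels directly above the file name.
    parts = p.parts[-6:-1]
    if len(parts) != 5:
        raise ValueError(f"path too short: {path}")
    year, month, day, node_id, batch_id = (int(x) for x in parts)
    return StorePath(year, month, day, node_id, batch_id, p.stem)


info = parse_store_path("/contentstore/2025/11/4/9/15/uuid1.bin")
print(info.node_id, info.batch_id)  # 9 15
```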

### Proof

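**If the last two directory levels encoded hour and minute, each directory would match the time its files were written. Filesystem evidence shows file timestamps DON'T match directory numbers:**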

| Directory | Actual File Time | Match? |
|-----------|------------------|--------|
| `9/15` | 08:44 | ✗ |
| `10/1` | 09:02 | ✗ |
| `10/4` | 09:10 | ✗ |
| `13/1` | 12:02 | ✗ |

**Conclusion:** The directory numbers do not track the file time, so they are NODE_ID/BATCH_ID, not hour/minute!
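
The comparison in the table can be reproduced mechanically. The sketch below hard-codes the four observations from the table; `matches_hour_minute` is a hypothetical helper that tests the alternative reading in which the two directory numbers are hour and minute. It assumes the listed file times are in the same time zone as the server that wrote the files.

```python
# (directory, observed file time) pairs copied from the table above.
observations = [
    ("9/15", "08:44"),
    ("10/1", "09:02"),
    ("10/4", "09:10"),
    ("13/1", "12:02"),
]


def matches_hour_minute(directory: str, file_time: str) -> bool:
    """True if the 'H/M' directory numbers equal the HH:MM file time."""
    dir_hour, dir_minute = (int(x) for x in directory.split("/"))
    hour, minute = (int(x) for x in file_time.split(":"))
    return (dir_hour, dir_minute) == (hour, minute)


results = {d: matches_hour_minute(d, t) for d, t in observations}
print(any(results.values()))  # False: no directory matches its file time
```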

---

## 📊 How It All Works Together

### 1. Upload Process

```
User uploads document PL21825 with 3 types:

Type 103 (50 pages) → Load balancer assigns to Node 9, Batch 15
    → Stored in: 2025/11/4/9/15/

Type 127 (2 pages) → Load balancer assigns to Node 10, Batch 1  
    → Stored in: 2025/11/4/10/1/

Type 126 (2 pages) → Load balancer assigns to Node 10, Batch 4
    → Stored in: 2025/11/4/10/4/
```
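
The per-type assignment above can be checked by counting a document's content URLs per NODE_ID/BATCH_ID pair. A minimal sketch, assuming URLs follow the `store://YYYY/MM/DD/NODE_ID/BATCH_ID/UUID.bin` layout; `count_by_node_batch` is a hypothetical helper and the `uuidN` file names are placeholders, not real identifiers.

```python
from collections import Counter


def count_by_node_batch(urls):
    """Count files per (NODE_ID, BATCH_ID) for a list of store:// URLs."""
    counts = Counter()
    for url in urls:
        # store://YYYY/MM/DD/NODE_ID/BATCH_ID/UUID.bin
        parts = url[len("store://"):].split("/")
        node_id, batch_id = int(parts[3]), int(parts[4])
        counts[(node_id, batch_id)] += 1
    return counts


# Placeholder file names standing in for the real UUIDs of PL21825.
urls = (
    [f"store://2025/11/4/9/15/uuid{i}.bin" for i in range(1, 51)]
    + [f"store://2025/11/4/10/1/uuid{i}.bin" for i in range(51, 53)]
    + [f"store://2025/11/4/10/4/uuid{i}.bin" for i in range(53, 55)]
)
counts = count_by_node_batch(urls)
print(sum(counts.values()))  # 54
```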

### 2. Database Linking

```
lr_source_document:
    ├─ Document ID 10000000023407: PL21825, Type 103, 50 pages
    ├─ Document ID 10000000023408: PL21825, Type 127, 2 pages
    └─ Document ID 10000000023409: PL21825, Type 126, 2 pages
        ↓
alf_node_properties:
    ├─ Node 2443208 → targetRids = "PL21825"
    └─ Node 2443208 → sourceRids = "PL21825"
        ↓
alf_node:
    └─ Node 2443208
        ↓
alf_content_data:
    └─ content_url_id references for Node 2443208
        ↓
alf_content_url:
    ├─ store://2025/11/4/9/15/uuid1.bin
    ├─ store://2025/11/4/9/15/uuid2.bin
    ├─ ... (50 files from Node 9, Batch 15)
    ├─ store://2025/11/4/10/1/uuid51.bin
    ├─ store://2025/11/4/10/1/uuid52.bin
    └─ ... (all 54 files)
```

### 3. Query Process (WebAccess)

```
User searches: "PL21825"
    ↓
Query alf_node_properties: WHERE string_value = 'PL21825'
    ↓
Returns: Node 2443208
    ↓
Follow alf_content_data → alf_content_url for node 2443208
    ↓
Returns: 54 store:// URLs (from 3 different NODE/BATCH directories)
    ↓
Convert each URL to filesystem path:
    store://2025/11/4/9/15/uuid1.bin
        → /contentstore/2025/11/4/9/15/uuid1.bin
    ↓
Retrieve each .bin file, convert to PDF, combine
    ↓
Display complete document to user
```
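
The URL-to-path step in the flow above is a plain prefix swap. A minimal sketch, assuming the content store is mounted at `/contentstore` as in the examples here; `store_url_to_path`, `CONTENT_ROOT` and `STORE_PREFIX` are hypothetical names, and the extra check against `..` is a defensive addition rather than something WebAccess is known to do.

```python
import posixpath

CONTENT_ROOT = "/contentstore"  # assumed mount point, taken from the examples
STORE_PREFIX = "store://"


def store_url_to_path(url: str, root: str = CONTENT_ROOT) -> str:
    """Map store://2025/11/4/9/15/uuid1.bin to a path under `root`."""
    if not url.startswith(STORE_PREFIX):
        raise ValueError(f"unexpected content URL: {url}")
    relative = url[len(STORE_PREFIX):]
    # Reject anything that could escape the content root.
    if relative.startswith("/") or ".." in relative.split("/"):
        raise ValueError(f"unsafe content URL: {url}")
    return posixpath.join(root, relative)


print(store_url_to_path("store://2025/11/4/9/15/uuid1.bin"))
# /contentstore/2025/11/4/9/15/uuid1.bin
```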

---

## 🐛 Why We Had the Random Images Bug

### The Problem

```python
# Our buggy code:
import os

def get_document_files(doc_number):
    # 1. Get ONE URL from database
    url = "store://2025/11/4/10/1/abc123.bin"
    
    # 2. List ALL files in that directory
    directory = "/contentstore/2025/11/4/10/1/"
    all_files = os.listdir(directory)
    # Returns: 39 files (PL21825 + PL20886 + PL21900 + ...)
    
    # 3. Pick "closest" files by timestamp
    selected = pick_closest_files(all_files, page_count=2)
    # Might return: PL21825 page + PL20886 page!
    
    return selected  # ❌ MIXED DOCUMENTS!
```

### Why This Happened

**We didn't understand that Node 10, Batch 1 contains files from MULTIPLE documents!**

```
/contentstore/2025/11/4/10/1/
    ├── Files for PL21825 Type 127
    ├── Files for PL20886
    ├── Files for PL21900
    └── Files for other documents...

All processed by Node 10 in Batch 1!
```
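
One way to see the sharing directly is to group database rows by store directory and look for directories that hold more than one document. A minimal sketch: the `rows` list is hypothetical sample data (only `abc123` and `def456` appear elsewhere in this document), and `documents_per_directory` is an illustrative helper name.

```python
from collections import defaultdict
import posixpath

# Hypothetical sample of database rows: (document_number, content_url).
rows = [
    ("PL21825", "store://2025/11/4/10/1/abc123.bin"),
    ("PL21825", "store://2025/11/4/10/1/def456.bin"),
    ("PL20886", "store://2025/11/4/10/1/aaa111.bin"),
    ("PL21900", "store://2025/11/4/10/1/bbb222.bin"),
    ("PL21825", "store://2025/11/4/9/15/ccc333.bin"),
]


def documents_per_directory(rows):
    """Map each store directory to the set of documents with files in it."""
    by_dir = defaultdict(set)
    for doc_number, url in rows:
        by_dir[posixpath.dirname(url)].add(doc_number)
    return dict(by_dir)


# Directories that contain files from more than one document.
shared = {d: docs for d, docs in documents_per_directory(rows).items() if len(docs) > 1}
print(sorted(shared))  # ['store://2025/11/4/10/1']
```

Any directory that shows up in `shared` is one where listing the directory would mix documents.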

### The Fix

```python
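# Correct code:
def get_document_files(doc_number):
    # 1. Get ALL URLs for this specific document from the database.
    #    Join keys follow the schema diagram in this document; verify them
    #    against the live schema before relying on this query.
    urls = query_database("""
        SELECT DISTINCT cu.content_url
        FROM alf_node_properties np
        JOIN alf_content_data cd ON cd.id = np.node_id
        JOIN alf_content_url cu ON cu.id = cd.content_url_id
        WHERE np.string_value = ?
          AND cu.content_url IS NOT NULL
    """, doc_number)

    # Returns ONLY PL21825 URLs, for example:
    # [
    #   "store://2025/11/4/10/1/abc123.bin",  # PL21825 page 1
    #   "store://2025/11/4/10/1/def456.bin"   # PL21825 page 2
    # ]

    # 2. Convert the exact store:// URLs to filesystem paths
    files = [parse_url(url) for url in urls]

    return files  # ✅ ONLY PL21825 files!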
```

---

## 📝 Database Schema - Complete Picture

```
┌─────────────────────────────────────────────────────────────┐
│ lr_source_document                                          │
│ (Business layer - document metadata)                        │
├─────────────────────────────────────────────────────────────┤
│ • id (document_id): Primary key                             │
│ • document_number: Business identifier (PL21825)            │
│ • document_type: Type code (103, 127, 126)                  │
│ • page_count: Expected pages                                │
│ • create_date: When record created                          │
└─────────────────────────────────────────────────────────────┘
                            ↕ (linked via document_number)
┌─────────────────────────────────────────────────────────────┐
│ alf_node_properties                                         │
│ (Link layer - properties assigned to nodes)                 │
├─────────────────────────────────────────────────────────────┤
│ • node_id: Foreign key to alf_node                          │
│ • qname_id: Property type (targetRids, sourceRids)          │
│ • string_value: Document number (PL21825)                   │
└─────────────────────────────────────────────────────────────┘
                            ↕ (node_id)
┌─────────────────────────────────────────────────────────────┐
│ alf_node                                                    │
│ (Content layer - Alfresco content nodes)                    │
├─────────────────────────────────────────────────────────────┤
│ • id (node_id): Primary key                                 │
│ • uuid: Alfresco UUID                                       │
│ • node_deleted: Soft delete flag                            │
│ • audit_created: Node creation time                         │
└─────────────────────────────────────────────────────────────┘
                            ↕ (id → content_data)
┌─────────────────────────────────────────────────────────────┐
│ alf_content_data                                            │
│ (Reference layer - links node to content)                   │
├─────────────────────────────────────────────────────────────┤
│ • id: Matches node id                                       │
│ • content_url_id: Foreign key to alf_content_url            │
└─────────────────────────────────────────────────────────────┘
                            ↕ (content_url_id)
┌─────────────────────────────────────────────────────────────┐
│ alf_content_url                                             │
│ (Storage layer - physical file locations)                   │
├─────────────────────────────────────────────────────────────┤
│ • id (content_url_id): Primary key                          │
│ • content_url: store://YYYY/MM/DD/NODE/BATCH/UUID.bin       │
│ • content_size: File size in bytes                          │
└─────────────────────────────────────────────────────────────┘
                            ↕ (maps to filesystem)
┌─────────────────────────────────────────────────────────────┐
│ Physical Filesystem                                         │
├─────────────────────────────────────────────────────────────┤
│ /contentstore/YYYY/MM/DD/NODE_ID/BATCH_ID/UUID.bin          │
│               └───┬────┘ └──────┬───────┘                   │
│                  Date   Load Distribution                   │
└─────────────────────────────────────────────────────────────┘
```

---

## 🎓 Key Learnings

### 1. Directory Structure Purpose

```
NOT for document organization
NOT for time-based partitioning
YES for load distribution
YES for I/O performance
YES for scalability
```

### 2. Multiple Documents Per Directory

```
Same directory (NODE_ID/BATCH_ID) can contain:
    • Files from Document A
    • Files from Document B
    • Files from Document C
    • etc.

All processed by the same scanner node in the same batch!
```

### 3. Database is Source of Truth

```
To find files for Document X:
    ❌ DON'T list directory contents
    ✅ DO query database for exact UUIDs

The database knows which files belong to which document!
```

### 4. WebAccess Query Pattern

```
document_number (PL21825)
    → alf_node_properties (finds nodes)
    → alf_node (content management)
    → alf_content_data (references)
    → alf_content_url (store:// URLs)
    → Physical files (exact UUID.bin files)

ALL links must be in place for document to be accessible!
```
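
Because every link in the chain has to exist, a lookup failure is easier to debug when you know which link is missing. The sketch below walks the same chain over plain in-memory stand-ins for the tables; `find_broken_link` and its argument shapes are hypothetical, and the sample data is illustrative only.

```python
def find_broken_link(doc_number, props, nodes, content_data, content_urls, files):
    """Return the name of the first missing link, or None if the chain is complete.

    All arguments are in-memory stand-ins for the real tables:
      props:        document_number -> list of node ids
      nodes:        set of live node ids
      content_data: node id -> list of content_url ids
      content_urls: content_url id -> store:// URL
      files:        set of store:// URLs whose .bin file exists on disk
    """
    node_ids = props.get(doc_number, [])
    if not node_ids:
        return "alf_node_properties"
    for node_id in node_ids:
        if node_id not in nodes:
            return "alf_node"
        url_ids = content_data.get(node_id, [])
        if not url_ids:
            return "alf_content_data"
        for url_id in url_ids:
            url = content_urls.get(url_id)
            if url is None:
                return "alf_content_url"
            if url not in files:
                return "filesystem"
    return None


props = {"PL21825": [2443208]}
nodes = {2443208}
content_data = {2443208: [1, 2]}
content_urls = {1: "store://2025/11/4/10/1/abc123.bin",
                2: "store://2025/11/4/10/1/def456.bin"}
files = {"store://2025/11/4/10/1/abc123.bin"}  # second file is missing

print(find_broken_link("PL21825", props, nodes, content_data, content_urls, files))
# filesystem
```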

---

## 🔧 Implementation Guidelines

### DO:

1. **Query database for ALL content URLs**
   ```sql
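   -- alf_content_url has no node_id column; join through alf_content_data
   -- (join keys as in the schema diagram above; verify on the live schema)
   SELECT DISTINCT cu.content_url
   FROM alf_node_properties np
   JOIN alf_content_data cd ON cd.id = np.node_id
   JOIN alf_content_url cu ON cu.id = cd.content_url_id
   WHERE np.string_value = 'PL21825'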
   ```

2. **Use exact UUIDs from database**
   ```python
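   import os

   for url in content_urls:
       # store://2025/11/4/9/15/<uuid>.bin → /contentstore/2025/11/4/9/15/<uuid>.bin
       relative_path = url[len("store://"):]
       file_path = os.path.join("/contentstore", relative_path)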
   ```

3. **Handle multiple document types separately**
   ```python
   images_by_document = {}
   for document_id in document_ids:
       images_by_document[document_id] = get_images_for_document_id(document_id)
   ```

### DON'T:

1. **Don't list directory contents**
   ```python
   # ❌ WRONG
   all_files = os.listdir(directory)
   ```

2. **Don't assume same directory = same document**
   ```python
   # ❌ WRONG
   if files_in_same_directory:
       assume_same_document()
   ```

3. **Don't use timestamp proximity across a directory**
   ```python
   # ❌ WRONG
   files = find_files_with_similar_timestamps(directory)
   ```

---

## 📋 Testing Checklist

### Verify the Fix:

- [ ] PL11089 returns correct images (not PL689)
- [ ] PL21825 returns all 3 types correctly
- [ ] Multi-page documents have all pages
- [ ] No cross-contamination between documents
- [ ] Documents with same NODE/BATCH are handled correctly

### Test Cases:

1. **Single-type document**
   - Query: PL11089
   - Expected: Only PL11089 pages
   - Verify: No PL689 contamination

2. **Multi-type document**
   - Query: PL21825
   - Expected: 3 groups (Types 103, 127, 126)
   - Verify: Each type has correct page count

3. **Documents in same directory**
   - Find documents sharing NODE_ID/BATCH_ID
   - Query each separately
   - Verify: No cross-contamination
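
The cross-contamination checks above can be automated by comparing the file sets returned for each document. A minimal sketch: `find_cross_contamination` is a hypothetical helper, and the paths in `results` are made-up placeholders standing in for whatever the fixed lookup returns.

```python
from itertools import combinations


def find_cross_contamination(files_by_document):
    """Return (doc_a, doc_b, shared_files) for every pair that shares a file."""
    overlaps = []
    for (doc_a, files_a), (doc_b, files_b) in combinations(files_by_document.items(), 2):
        shared = set(files_a) & set(files_b)
        if shared:
            overlaps.append((doc_a, doc_b, shared))
    return overlaps


# Placeholder results for two documents that share a directory.
results = {
    "PL11089": ["/contentstore/2025/11/4/10/1/aaa.bin"],
    "PL689": ["/contentstore/2025/11/4/10/1/bbb.bin"],
}
print(find_cross_contamination(results))  # []
```

An empty list means no file was returned for more than one document, even though both documents live in the same NODE_ID/BATCH_ID directory.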

---

## 🎯 Final Summary

### What We Learned

Your theory was **100% correct!**

1. ✅ Directory structure is **NODE_ID/BATCH_ID** (not time)
2. ✅ Multiple documents can share same directory
3. ✅ This explains the "random images" bug
4. ✅ Database stores exact file UUIDs
5. ✅ Load distribution is the design goal

### How to Use This Knowledge

1. **Always query database** for document files
2. **Never list directory** to find related files
3. **Use exact UUIDs** from alf_content_url
4. **Understand** NODE/BATCH is for load balancing
5. **Remember** multiple documents per directory is normal

### The Fix

```python
# Simple rule:
# Database tells you WHICH files belong to WHICH document
# Filesystem tells you WHERE those files are stored
# 
# Always use: Database → UUIDs → Filesystem
# Never use: Filesystem → Guess → Wrong files!
```

---

## 🏆 Credits

**Your insights were key to solving this!**

- Recognized load distribution pattern
- Understood Node ID / Batch ID concept
- Connected it to the random images bug
- Provided real examples (PL21825, PL11089)

**Result:** Complete understanding of Aumentum's storage architecture! 🎉

