# Duplicate Record Handling in Legacy Imports
## Overview
Legacy CSV files may contain duplicate IDs due to the way the original database system exported or maintained data. The import system now handles these duplicates gracefully.
## Problem
When importing rolodex data, duplicate IDs can cause:
```
UNIQUE constraint failed: rolodex.id
```
This error would cascade, causing all subsequent rows in the batch to fail with:
```
This Session's transaction has been rolled back due to a previous exception
```
## Solution
The import system now implements multiple layers of duplicate protection:
### 1. In-Memory Duplicate Tracking
```python
seen_in_import = set()
```
Tracks IDs encountered during the current import session. If an ID is seen twice in the same file, only the first occurrence is imported.
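A minimal sketch of how this fits into the import loop (`csv_rows`, `batch`, and `result` are illustrative names, not the actual importer's):
```python
seen_in_import = set()

for row in csv_rows:
    rolodex_id = row['id']
    # First occurrence wins; repeats within the same file are skipped.
    if rolodex_id in seen_in_import:
        result['skipped'] += 1
        continue
    seen_in_import.add(rolodex_id)
    batch.append(Rolodex(**row))
```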
### 2. Database Existence Check
Before importing each record, checks if it already exists:
```python
if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
    result['skipped'] += 1
    continue
```
### 3. Graceful Batch Failure Handling
If a bulk insert fails due to duplicates, the importer recovers as follows (see the sketch after this list):
- Transaction is rolled back
- Falls back to row-by-row insertion
- Silently skips duplicates
- Continues with remaining records
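A sketch of that recovery path, assuming SQLAlchemy's `IntegrityError` and the illustrative `batch`/`result` names from above:
```python
from sqlalchemy.exc import IntegrityError

try:
    db.add_all(batch)
    db.commit()
    result['success'] += len(batch)
except IntegrityError:
    # The bulk insert hit a duplicate: roll back and retry row by row.
    db.rollback()
    for record in batch:
        try:
            db.add(record)
            db.commit()
            result['success'] += 1
        except IntegrityError:
            # Silently skip the duplicate and keep going.
            db.rollback()
            result['skipped'] += 1
```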
## Import Results
The import result now includes a `skipped` count:
```python
{
    'success': 10000,     # Records successfully imported
    'errors': [],         # Critical errors (empty if successful)
    'total_rows': 52100,  # Total rows in CSV
    'skipped': 42094      # Duplicates or existing records skipped
}
```
## Understanding Skip Counts
High skip counts are **normal and expected** for legacy data:
### Why Records Are Skipped
1. **Duplicate IDs in CSV** - Same ID appears multiple times in file
2. **Re-importing** - Records already exist from previous import
3. **Data quality issues** - Legacy exports may have had duplicates
### Example: Rolodex Import
- Total rows: 52,100
- Successfully imported: ~10,000 (unique IDs)
- Skipped: ~42,000 (duplicates + existing)
This is **not an error** - it means the system is protecting data integrity.
## Which Tables Have Duplicate Protection?
Currently implemented for:
- `rolodex` (primary key: `id`)
- `filetype` (primary key: `file_type`)

Other tables should receive the same protection if similar issues surface.
## Re-importing Data
You can safely re-import the same file multiple times:
- Already imported records are detected and skipped
- Only new records are added
- No duplicate errors
- Idempotent operation
## Performance Considerations
### Database Checks
For each row, the importer runs a lookup equivalent to:
```sql
SELECT * FROM rolodex WHERE id = ?
```
This adds overhead but ensures data integrity. For 52k rows:
- With duplicates: ~5-10 minutes
- Without duplicates: ~2-5 minutes
### Optimization Notes
- Queries are indexed on primary key (fast)
- Batch size: 500 records per commit (see the sketch below)
- Only checks before adding to batch (not on commit)
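For reference, a sketch of that commit cadence (the batch size and variable names are illustrative):
```python
BATCH_SIZE = 500

batch = []
for row in csv_rows:
    # ... duplicate checks from the sections above ...
    batch.append(Rolodex(**row))
    if len(batch) >= BATCH_SIZE:
        db.add_all(batch)
        db.commit()
        batch = []

# Flush whatever is left after the loop.
if batch:
    db.add_all(batch)
    db.commit()
```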
## Troubleshooting
### Import seems slow
**Normal behavior**: Database checks add time, especially with many duplicates.
**Monitoring**:
```bash
# Watch import progress in logs
docker-compose logs -f delphi-db | grep -i "rolodex\|import"
```
### All records skipped
**Possible causes**:
1. Data already imported - check database: `SELECT COUNT(*) FROM rolodex`
2. CSV has no valid IDs - check CSV format
3. Database already populated - safe to ignore if data looks correct
### Want to re-import from scratch
```bash
# Clear rolodex table (be careful!)
docker-compose exec -T delphi-db python3 << EOF
from app.database import SessionLocal
from app.models import Rolodex
db = SessionLocal()
db.query(Rolodex).delete()
db.commit()
print("Rolodex table cleared")
EOF
# Or delete entire database and restart
rm delphi.db
docker-compose restart
```
## Data Quality Insights
The skip count provides insights into legacy data quality:
### High Skip Rate (>50%)
Indicates:
- Significant duplicates in legacy system
- Multiple exports merged together
- Poor data normalization in original system
### Low Skip Rate (<10%)
Indicates:
- Clean legacy data
- Proper unique constraints in original system
- First-time import
### Example from Real Data
From the rolodex file (52,100 rows):
- Unique IDs: ~10,000
- Duplicates: ~42,000
- **Duplication rate: 80%**
This suggests the legacy export included:
- Historical snapshots
- Multiple versions of same record
- Merged data from different time periods
## Future Improvements
Potential enhancements:
1. **Update existing records** instead of skipping
2. **Merge duplicate records** based on timestamp or version
3. **Report duplicate details** in import log
4. **Configurable behavior** - skip vs update vs error
5. **Batch optimization** - single query to check all IDs at once (sketched below)
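As an example, enhancement 5 could replace the per-row existence query with one `IN` query per batch. A sketch of the idea, not current behavior (`batch_rows` and `result` are illustrative names):
```python
# Collect the IDs for the whole batch, then check them in one round trip.
batch_ids = [row['id'] for row in batch_rows]
existing = {
    rid for (rid,) in
    db.query(Rolodex.id).filter(Rolodex.id.in_(batch_ids)).all()
}
new_rows = [row for row in batch_rows if row['id'] not in existing]
result['skipped'] += len(batch_rows) - len(new_rows)
```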