docs: Add comprehensive guide on duplicate record handling
# Duplicate Record Handling in Legacy Imports

## Overview

Legacy CSV files may contain duplicate IDs due to the way the original database system exported or maintained data. The import system now handles these duplicates gracefully.

## Problem

When importing rolodex data, duplicate IDs can cause:
```
UNIQUE constraint failed: rolodex.id
```

This error would cascade, causing all subsequent rows in the batch to fail with:
```
This Session's transaction has been rolled back due to a previous exception
```

## Solution

The import system now implements multiple layers of duplicate protection:

### 1. In-Memory Duplicate Tracking
```python
seen_in_import = set()
```
Tracks IDs encountered during the current import session. If an ID is seen twice in the same file, only the first occurrence is imported.

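A minimal sketch of how this set fits into the import loop; the `rows` iterable and the `result` dict are illustrative names rather than the exact implementation:

```python
seen_in_import = set()

for row in rows:  # rows parsed from the legacy CSV
    rolodex_id = row['id']
    if rolodex_id in seen_in_import:
        # Duplicate within this file: count it and move on
        result['skipped'] += 1
        continue
    seen_in_import.add(rolodex_id)
    # ...the record continues to the database existence check below...
```
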
### 2. Database Existence Check
Before importing each record, the importer checks whether it already exists:
```python
if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
    result['skipped'] += 1
    continue
```

### 3. Graceful Batch Failure Handling
If a bulk insert fails due to duplicates (see the sketch below):
- Transaction is rolled back
- Falls back to row-by-row insertion
- Silently skips duplicates
- Continues with remaining records

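A rough sketch of this fallback path, assuming the duplicate key violation surfaces as SQLAlchemy's `IntegrityError`; the `batch` list and `result` dict are illustrative names, not the exact implementation:

```python
from sqlalchemy.exc import IntegrityError

try:
    db.bulk_save_objects(batch)
    db.commit()
except IntegrityError:
    # The bulk insert hit a duplicate: roll back and retry row by row
    db.rollback()
    for record in batch:
        try:
            db.add(record)
            db.commit()
        except IntegrityError:
            db.rollback()  # skip the duplicate silently
            result['skipped'] += 1
```
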
## Import Results

The import result now includes a `skipped` count:
```python
{
    'success': 10000,     # Records successfully imported
    'errors': [],         # Critical errors (empty if successful)
    'total_rows': 52100,  # Total rows in CSV
    'skipped': 42094      # Duplicates or existing records skipped
}
```

## Understanding Skip Counts

High skip counts are **normal and expected** for legacy data:

### Why Records Are Skipped
1. **Duplicate IDs in CSV** - Same ID appears multiple times in file
2. **Re-importing** - Records already exist from previous import
3. **Data quality issues** - Legacy exports may have had duplicates

### Example: Rolodex Import
- Total rows: 52,100
- Successfully imported: ~10,000 (unique IDs)
- Skipped: ~42,000 (duplicates + existing)

This is **not an error** - it means the system is protecting data integrity.

## Which Tables Have Duplicate Protection?

Currently implemented for:
- ✅ `rolodex` (primary key: id)
- ✅ `filetype` (primary key: file_type)

Other tables should be updated if they encounter similar issues.

## Re-importing Data

You can safely re-import the same file multiple times:
- Already imported records are detected and skipped
- Only new records are added
- No duplicate errors
- Idempotent operation

## Performance Considerations

### Database Checks
For each row, we query:
```sql
SELECT * FROM rolodex WHERE id = ?
```

This adds overhead but ensures data integrity. For 52k rows:
- With duplicates: ~5-10 minutes
- Without duplicates: ~2-5 minutes

### Optimization Notes
- Queries are indexed on primary key (fast)
- Batch size: 500 records per commit (see the sketch below)
- Only checks before adding to batch (not on commit)

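A rough sketch of the batching pattern described above, assuming a 500-record batch size and the per-row checks from the Solution section; the loop variables and the `Rolodex(**row)` construction are illustrative, not the exact implementation:

```python
BATCH_SIZE = 500
batch = []

for row in rows:
    rolodex_id = row['id']
    # Per-row duplicate checks happen here, before the record joins the batch
    if rolodex_id in seen_in_import or \
            db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
        result['skipped'] += 1
        continue
    seen_in_import.add(rolodex_id)
    batch.append(Rolodex(**row))

    if len(batch) >= BATCH_SIZE:
        db.bulk_save_objects(batch)
        db.commit()
        result['success'] += len(batch)
        batch = []

# Flush whatever is left over after the loop
if batch:
    db.bulk_save_objects(batch)
    db.commit()
    result['success'] += len(batch)
```
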
## Troubleshooting

### Import seems slow
**Normal behavior**: Database checks add time, especially with many duplicates.

**Monitoring**:
```bash
# Watch import progress in logs
docker-compose logs -f delphi-db | grep -i "rolodex\|import"
```

### All records skipped
**Possible causes**:
1. Data already imported - check database: `SELECT COUNT(*) FROM rolodex` (see the check below)
2. CSV has no valid IDs - check CSV format
3. Database already populated - safe to ignore if data looks correct

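One quick way to run that count inside the container, following the same `docker-compose exec delphi-db python3` pattern used in the next subsection (module paths assumed to match your setup):

```python
from app.database import SessionLocal
from app.models import Rolodex

db = SessionLocal()
print("rolodex rows:", db.query(Rolodex).count())
```
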
### Want to re-import from scratch
```bash
# Clear rolodex table (be careful!)
docker-compose exec delphi-db python3 << EOF
from app.database import SessionLocal
from app.models import Rolodex
db = SessionLocal()
db.query(Rolodex).delete()
db.commit()
print("Rolodex table cleared")
EOF

# Or delete entire database and restart
rm delphi.db
docker-compose restart
```

## Data Quality Insights

The skip count provides insights into legacy data quality:

### High Skip Rate (>50%)
Indicates:
- Significant duplicates in legacy system
- Multiple exports merged together
- Poor data normalization in original system

### Low Skip Rate (<10%)
Indicates:
- Clean legacy data
- Proper unique constraints in original system
- First-time import

### Example from Real Data
From the rolodex file (52,100 rows):
- Unique IDs: ~10,000
- Duplicates: ~42,000
- **Duplication rate: 80%** (the sketch below shows one way to reproduce these counts)

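A small sketch for reproducing these counts directly from the raw export, assuming the file is named `rolodex.csv` and has an `id` column header (adjust both to the actual CSV):

```python
import csv
from collections import Counter

with open('rolodex.csv', newline='') as f:
    ids = [row['id'] for row in csv.DictReader(f)]

counts = Counter(ids)
unique = len(counts)
duplicates = len(ids) - unique

print(f"total rows: {len(ids)}")
print(f"unique IDs: {unique}")
print(f"duplicate rows: {duplicates} ({duplicates / len(ids):.0%})")
```
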
This suggests the legacy export included:
- Historical snapshots
- Multiple versions of same record
- Merged data from different time periods

## Future Improvements

Potential enhancements:
1. **Update existing records** instead of skipping
2. **Merge duplicate records** based on timestamp or version
3. **Report duplicate details** in import log
4. **Configurable behavior** - skip vs update vs error
5. **Batch optimization** - single query to check all IDs at once (see the sketch below)

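For the batch optimization idea, a rough sketch of checking a whole batch of IDs in a single query instead of one query per row (not the current implementation; names are illustrative):

```python
# Collect the candidate IDs for the whole batch first
candidate_ids = [row['id'] for row in rows]

# One query instead of one per row
existing_ids = {
    rid for (rid,) in
    db.query(Rolodex.id).filter(Rolodex.id.in_(candidate_ids)).all()
}

new_rows = [row for row in rows if row['id'] not in existing_ids]
```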