# Duplicate Record Handling in Legacy Imports

## Overview

Legacy CSV files may contain duplicate IDs due to the way the original database system exported or maintained data. The import system now handles these duplicates gracefully.

## Problem

When importing rolodex data, duplicate IDs can cause:

```
UNIQUE constraint failed: rolodex.id
```

This error would cascade, causing all subsequent rows in the batch to fail with:

```
This Session's transaction has been rolled back due to a previous exception
```

## Solution

The import system now implements multiple layers of duplicate protection:

### 1. In-Memory Duplicate Tracking

```python
# For single primary key
seen_in_import = set()

# For composite primary key (e.g., file_no + version)
seen_in_import = set()  # stores tuples like (file_no, version)
composite_key = (file_no, version)
```

The importer tracks IDs or composite-key combinations encountered during the current import session. If a key is seen twice in the same file, only the first occurrence is imported.

### 2. Database Existence Check

Before importing each record, the importer checks whether it already exists:

```python
# For single primary key (e.g., rolodex)
if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
    result['skipped'] += 1
    continue

# For composite primary key (e.g., pensions with file_no + version)
if db.query(Pensions).filter(
    Pensions.file_no == file_no,
    Pensions.version == version
).first():
    result['skipped'] += 1
    continue
```

### 3. Graceful Batch Failure Handling

If a bulk insert fails due to duplicates:

- The transaction is rolled back
- The importer falls back to row-by-row insertion
- Duplicates are silently skipped
- The import continues with the remaining records

## Import Results

The import result now includes a `skipped` count:

```python
{
    'success': 10000,      # Records successfully imported
    'errors': [],          # Critical errors (empty if successful)
    'total_rows': 52100,   # Total rows in CSV
    'skipped': 42094       # Duplicates or existing records skipped
}
```

## Understanding Skip Counts

High skip counts are **normal and expected** for legacy data:

### Why Records Are Skipped

1. **Duplicate IDs in CSV** - The same ID appears multiple times in the file
2. **Re-importing** - Records already exist from a previous import
3. **Data quality issues** - Legacy exports may have contained duplicates

### Example: Rolodex Import

- Total rows: 52,100
- Successfully imported: ~10,000 (unique IDs)
- Skipped: ~42,000 (duplicates + existing)

This is **not an error** - it means the system is protecting data integrity.

## Which Tables Have Duplicate Protection?

Currently implemented for:

- ✅ `rolodex` (primary key: id)
- ✅ `filetype` (primary key: file_type)
- ✅ `pensions` (composite primary key: file_no, version)
- ✅ `pension_death` (composite primary key: file_no, version)
- ✅ `pension_separate` (composite primary key: file_no, version)
- ✅ `pension_results` (composite primary key: file_no, version)

Other tables should be updated if they encounter similar issues.

## Re-importing Data

You can safely re-import the same file multiple times:

- Already imported records are detected and skipped
- Only new records are added
- No duplicate errors
- The operation is idempotent

## Performance Considerations

### Database Checks

For each row, the importer queries:

```sql
SELECT * FROM rolodex WHERE id = ?
```

This adds overhead but ensures data integrity.
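The per-row lookup is the main source of that overhead. As a rough sketch of the batch optimization listed under Future Improvements below (not the current behavior), the existence check for a whole batch could be collapsed into a single `IN` query. The `batch_rows` structure and the `filter_new_rows` helper are illustrative assumptions:

```python
from app.models import Rolodex

def filter_new_rows(db, batch_rows):
    """Sketch: drop rows whose IDs already exist, using one query per batch.

    batch_rows is assumed to be a list of dicts parsed from the CSV,
    each with an 'id' key. This is not the current per-row implementation.
    """
    ids = [row['id'] for row in batch_rows]
    # One SELECT ... WHERE id IN (...) instead of one lookup per row
    existing = {
        rid for (rid,) in db.query(Rolodex.id).filter(Rolodex.id.in_(ids)).all()
    }
    return [row for row in batch_rows if row['id'] not in existing]
```

With a 500-row batch, this would replace 500 individual primary-key lookups with one indexed query per batch.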
With the current per-row checks, rough timings for a 52k-row file are:

- With duplicates: ~5-10 minutes
- Without duplicates: ~2-5 minutes

### Optimization Notes

- Lookups use the primary-key index (fast)
- Batch size: 500 records per commit
- Checks happen only before adding to the batch (not on commit)

## Troubleshooting

### Import seems slow

**Normal behavior**: Database checks add time, especially with many duplicates.

**Monitoring**:

```bash
# Watch import progress in logs
docker-compose logs -f delphi-db | grep -i "rolodex\|import"
```

### All records skipped

**Possible causes**:

1. Data already imported - check the database: `SELECT COUNT(*) FROM rolodex`
2. CSV has no valid IDs - check the CSV format
3. Database already populated - safe to ignore if the data looks correct

### Want to re-import from scratch

```bash
# Clear the rolodex table (be careful!)
docker-compose exec delphi-db python3 << EOF
from app.database import SessionLocal
from app.models import Rolodex
db = SessionLocal()
db.query(Rolodex).delete()
db.commit()
print("Rolodex table cleared")
EOF

# Or delete the entire database and restart
rm delphi.db
docker-compose restart
```

## Data Quality Insights

The skip count provides insight into legacy data quality:

### High Skip Rate (>50%)

Indicates:

- Significant duplicates in the legacy system
- Multiple exports merged together
- Poor data normalization in the original system

### Low Skip Rate (<10%)

Indicates:

- Clean legacy data
- Proper unique constraints in the original system
- First-time import

### Example from Real Data

From the rolodex file (52,100 rows):

- Unique IDs: ~10,000
- Duplicates: ~42,000
- **Duplication rate: ~80%**

This suggests the legacy export included:

- Historical snapshots
- Multiple versions of the same record
- Merged data from different time periods

## Future Improvements

Potential enhancements:

1. **Update existing records** instead of skipping
2. **Merge duplicate records** based on timestamp or version
3. **Report duplicate details** in the import log
4. **Configurable behavior** - skip vs update vs error (a sketch follows this list)
5. **Batch optimization** - a single query to check all IDs at once
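For items 1 and 4, one possible shape is a per-row strategy switch. This is a hypothetical sketch, not part of the current importer; the `on_duplicate` parameter and the `import_row` helper are illustrative only:

```python
from app.models import Rolodex

def import_row(db, row, on_duplicate="skip"):
    """Hypothetical sketch of a configurable duplicate strategy.

    on_duplicate: "skip"   -> keep the existing record (current behavior),
                  "update" -> overwrite existing fields from the CSV row,
                  "error"  -> raise so the caller can log or abort.
    row is assumed to be a dict whose keys match the Rolodex columns.
    """
    existing = db.query(Rolodex).filter(Rolodex.id == row["id"]).first()
    if existing is None:
        db.add(Rolodex(**row))
        return "inserted"
    if on_duplicate == "skip":
        return "skipped"
    if on_duplicate == "update":
        for field, value in row.items():
            setattr(existing, field, value)
        return "updated"
    raise ValueError(f"Duplicate id {row['id']} (on_duplicate='error')")
```

The caller would pass `on_duplicate` through from the import API and increment the matching counter (`success`, `skipped`, or a new `updated` key) in the result dictionary.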