# Duplicate Record Handling in Legacy Imports
## Overview
Legacy CSV files may contain duplicate IDs due to the way the original database system exported or maintained data. The import system now handles these duplicates gracefully.
## Problem

When importing rolodex data, duplicate IDs can cause:

```
UNIQUE constraint failed: rolodex.id
```

This error would cascade, causing all subsequent rows in the batch to fail with:

```
This Session's transaction has been rolled back due to a previous exception
```
## Solution

The import system now implements multiple layers of duplicate protection:

### 1. In-Memory Duplicate Tracking
```python
# For a single primary key
seen_in_import = set()

# For a composite primary key (e.g., file_no + version)
seen_in_import = set()  # stores tuples like (file_no, version)
composite_key = (file_no, version)
```
Tracks IDs or composite key combinations encountered during the current import session. If a key is seen twice in the same file, only the first occurrence is imported.
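As a minimal, self-contained sketch (the `dedupe_rows` helper and the plain-dict rows are illustrative, not the actual importer code), the in-file check looks roughly like this:

```python
def dedupe_rows(csv_rows):
    """Keep only the first occurrence of each (file_no, version) pair.

    Illustrative sketch only; the real importer also checks the database
    and records the skip in the import result.
    """
    seen_in_import = set()  # stores (file_no, version) tuples
    unique_rows = []
    skipped = 0

    for row in csv_rows:
        composite_key = (row['file_no'], row['version'])
        if composite_key in seen_in_import:
            skipped += 1  # duplicate within the same CSV file
            continue
        seen_in_import.add(composite_key)
        unique_rows.append(row)

    return unique_rows, skipped
```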
### 2. Database Existence Check

Before importing each record, the importer checks whether it already exists:
```python
# For single primary key (e.g., rolodex)
if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
    result['skipped'] += 1
    continue

# For composite primary key (e.g., pensions with file_no + version)
if db.query(Pensions).filter(
    Pensions.file_no == file_no,
    Pensions.version == version
).first():
    result['skipped'] += 1
    continue
```
### 3. Graceful Batch Failure Handling

If a bulk insert fails due to duplicates (see the sketch below):
- Transaction is rolled back
- Falls back to row-by-row insertion
- Silently skips duplicates
- Continues with remaining records
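A rough sketch of this fallback, assuming SQLAlchemy sessions and `IntegrityError` being raised on duplicate keys (the `insert_batch_with_fallback` helper and the `batch` and `result` names are illustrative):

```python
from sqlalchemy.exc import IntegrityError

def insert_batch_with_fallback(db, batch, result):
    """Try a bulk insert; on failure, retry row by row and skip duplicates."""
    try:
        db.bulk_save_objects(batch)
        db.commit()
        result['success'] += len(batch)
    except IntegrityError:
        db.rollback()  # undo the failed bulk insert
        for record in batch:
            try:
                db.add(record)
                db.commit()
                result['success'] += 1
            except IntegrityError:
                db.rollback()  # silently skip the duplicate
                result['skipped'] += 1
```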
## Import Results

The import result now includes a `skipped` count:

```python
{
    'success': 10000,     # Records successfully imported
    'errors': [],         # Critical errors (empty if successful)
    'total_rows': 52100,  # Total rows in CSV
    'skipped': 42094      # Duplicates or existing records skipped
}
```
## Understanding Skip Counts

High skip counts are normal and expected for legacy data.

### Why Records Are Skipped

- **Duplicate IDs in CSV** - the same ID appears multiple times in the file
- **Re-importing** - records already exist from a previous import
- **Data quality issues** - legacy exports may have had duplicates

### Example: Rolodex Import

- Total rows: 52,100
- Successfully imported: ~10,000 (unique IDs)
- Skipped: ~42,000 (duplicates + existing)
This is not an error - it means the system is protecting data integrity.
## Which Tables Have Duplicate Protection?

Currently implemented for:

- ✅ `rolodex` (primary key: `id`)
- ✅ `filetype` (primary key: `file_type`)
- ✅ `pensions` (composite primary key: `file_no`, `version`)
- ✅ `pension_death` (composite primary key: `file_no`, `version`)
- ✅ `pension_separate` (composite primary key: `file_no`, `version`)
- ✅ `pension_results` (composite primary key: `file_no`, `version`)
Other tables should be updated if they encounter similar issues.
## Re-importing Data
You can safely re-import the same file multiple times:
- Already imported records are detected and skipped
- Only new records are added
- No duplicate errors
- Idempotent operation
## Performance Considerations

### Database Checks

For each row, we query:

```sql
SELECT * FROM rolodex WHERE id = ?
```

This adds overhead but ensures data integrity. For 52k rows:

- With duplicates: ~5-10 minutes
- Without duplicates: ~2-5 minutes

### Optimization Notes

- Lookups use the primary key index (fast)
- Batch size: 500 records per commit
- Existence is checked before a record is added to the batch, not at commit time (see the sketch below)
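Putting those notes together, a simplified version of the commit loop might look like the sketch below. The `import_rolodex` function is illustrative and assumes the `Rolodex` model and a SQLAlchemy session; only the per-row existence check, the 500-record batch size, and the commit-per-batch pattern come from the notes above.

```python
BATCH_SIZE = 500  # records per commit, per the notes above

def import_rolodex(db, rows, result):
    """Illustrative batched import loop; not the actual importer code."""
    batch = []
    for row in rows:
        # Existence check happens before the record joins the batch,
        # not at commit time.
        if db.query(Rolodex).filter(Rolodex.id == row['id']).first():
            result['skipped'] += 1
            continue
        batch.append(Rolodex(**row))

        if len(batch) >= BATCH_SIZE:
            db.bulk_save_objects(batch)
            db.commit()
            result['success'] += len(batch)
            batch = []

    if batch:  # flush the final partial batch
        db.bulk_save_objects(batch)
        db.commit()
        result['success'] += len(batch)
```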
## Troubleshooting

### Import seems slow

**Normal behavior:** Database checks add time, especially with many duplicates.

**Monitoring:**

```bash
# Watch import progress in logs
docker-compose logs -f delphi-db | grep -i "rolodex\|import"
```
### All records skipped

Possible causes:

- **Data already imported** - check the database: `SELECT COUNT(*) FROM rolodex`
- **CSV has no valid IDs** - check the CSV format
- **Database already populated** - safe to ignore if the data looks correct
### Want to re-import from scratch

```bash
# Clear rolodex table (be careful!)
docker-compose exec delphi-db python3 << EOF
from app.database import SessionLocal
from app.models import Rolodex
db = SessionLocal()
db.query(Rolodex).delete()
db.commit()
print("Rolodex table cleared")
EOF

# Or delete entire database and restart
rm delphi.db
docker-compose restart
```
## Data Quality Insights
The skip count provides insights into legacy data quality:
### High Skip Rate (>50%)
Indicates:
- Significant duplicates in legacy system
- Multiple exports merged together
- Poor data normalization in original system
### Low Skip Rate (<10%)
Indicates:
- Clean legacy data
- Proper unique constraints in original system
- First-time import
### Example from Real Data
From the rolodex file (52,100 rows):
- Unique IDs: ~10,000
- Duplicates: ~42,000
- Duplication rate: 80%
This suggests the legacy export included:
- Historical snapshots
- Multiple versions of same record
- Merged data from different time periods
## Future Improvements

Potential enhancements:

- **Update existing records** instead of skipping
- **Merge duplicate records** based on timestamp or version
- **Report duplicate details** in the import log
- **Configurable behavior** - skip vs update vs error
- **Batch optimization** - a single query to check all IDs at once (see the sketch below)
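As a rough illustration of that last idea, the existing IDs for a whole batch could be fetched with a single query before inserting. The `filter_existing` helper below is hypothetical, not part of the current importer:

```python
def filter_existing(db, batch_rows):
    """Return only the rows whose IDs are not already in the rolodex table.

    Hypothetical sketch of the 'single query to check all IDs at once'
    enhancement; assumes the SQLAlchemy Rolodex model is available.
    """
    ids = [row['id'] for row in batch_rows]

    # One query per batch instead of one query per row.
    existing = {
        r.id for r in db.query(Rolodex.id).filter(Rolodex.id.in_(ids)).all()
    }

    return [row for row in batch_rows if row['id'] not in existing]
```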