# Duplicate Record Handling in Legacy Imports
## Overview
Legacy CSV files may contain duplicate IDs due to the way the original database system exported or maintained data. The import system now handles these duplicates gracefully.
## Problem

When importing rolodex data, duplicate IDs can cause:

```
UNIQUE constraint failed: rolodex.id
```

This error would cascade, causing all subsequent rows in the batch to fail with:

```
This Session's transaction has been rolled back due to a previous exception
```
## Solution

The import system now implements multiple layers of duplicate protection:

### 1. In-Memory Duplicate Tracking
```python
# For a single primary key
seen_in_import = set()

# For a composite primary key (e.g., file_no + version)
seen_in_import = set()  # stores tuples like (file_no, version)
composite_key = (file_no, version)
```
Tracks IDs or composite key combinations encountered during the current import session. If a key is seen twice in the same file, only the first occurrence is imported.
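As a minimal, self-contained sketch (the `dedupe_rows` helper and the plain-dict rows are illustrative, not the actual importer code), the in-file check looks roughly like this:

```python
def dedupe_rows(csv_rows):
    """Keep only the first occurrence of each (file_no, version) pair.

    Illustrative sketch only; the real importer also checks the database
    and records the skip in the import result.
    """
    seen_in_import = set()  # stores (file_no, version) tuples
    unique_rows = []
    skipped = 0

    for row in csv_rows:
        composite_key = (row['file_no'], row['version'])
        if composite_key in seen_in_import:
            skipped += 1  # duplicate within the same CSV file
            continue
        seen_in_import.add(composite_key)
        unique_rows.append(row)

    return unique_rows, skipped
```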
### 2. Database Existence Check

Before importing each record, the importer checks whether it already exists:
```python
# For single primary key (e.g., rolodex)
if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
    result['skipped'] += 1
    continue

# For composite primary key (e.g., pensions with file_no + version)
if db.query(Pensions).filter(
    Pensions.file_no == file_no,
    Pensions.version == version
).first():
    result['skipped'] += 1
    continue
```
### 3. Graceful Batch Failure Handling

If a bulk insert fails due to duplicates (see the sketch below):
- Transaction is rolled back
- Falls back to row-by-row insertion
- Silently skips duplicates
- Continues with remaining records
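A rough sketch of this fallback, assuming SQLAlchemy sessions and `IntegrityError` being raised on duplicate keys (the `insert_batch_with_fallback` helper and the `batch` and `result` names are illustrative):

```python
from sqlalchemy.exc import IntegrityError

def insert_batch_with_fallback(db, batch, result):
    """Try a bulk insert; on failure, retry row by row and skip duplicates."""
    try:
        db.bulk_save_objects(batch)
        db.commit()
        result['success'] += len(batch)
    except IntegrityError:
        db.rollback()  # undo the failed bulk insert
        for record in batch:
            try:
                db.add(record)
                db.commit()
                result['success'] += 1
            except IntegrityError:
                db.rollback()  # silently skip the duplicate
                result['skipped'] += 1
```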
## Import Results

The import result now includes a `skipped` count:

```python
{
    'success': 10000,     # Records successfully imported
    'errors': [],         # Critical errors (empty if successful)
    'total_rows': 52100,  # Total rows in CSV
    'skipped': 42094      # Duplicates or existing records skipped
}
```
## Understanding Skip Counts

High skip counts are normal and expected for legacy data.

### Why Records Are Skipped

- **Duplicate IDs in CSV** - the same ID appears multiple times in the file
- **Re-importing** - records already exist from a previous import
- **Data quality issues** - legacy exports may have had duplicates

### Example: Rolodex Import

- Total rows: 52,100
- Successfully imported: ~10,000 (unique IDs)
- Skipped: ~42,000 (duplicates + existing)
This is not an error - it means the system is protecting data integrity.
## Which Tables Have Duplicate Protection?

Currently implemented for:

- ✅ `rolodex` (primary key: `id`)
- ✅ `filetype` (primary key: `file_type`)
- ✅ `pensions` (composite primary key: `file_no`, `version`)
- ✅ `pension_death` (composite primary key: `file_no`, `version`)
- ✅ `pension_separate` (composite primary key: `file_no`, `version`)
- ✅ `pension_results` (composite primary key: `file_no`, `version`)
Other tables should be updated if they encounter similar issues.
## Re-importing Data
You can safely re-import the same file multiple times:
- Already imported records are detected and skipped
- Only new records are added
- No duplicate errors
- Idempotent operation
## Performance Considerations

### Database Checks

For each row, we query:

```sql
SELECT * FROM rolodex WHERE id = ?
```

This adds overhead but ensures data integrity. For 52k rows:

- With duplicates: ~5-10 minutes
- Without duplicates: ~2-5 minutes

### Optimization Notes

- Lookups use the primary key index (fast)
- Batch size: 500 records per commit
- Existence is checked before a record is added to the batch, not at commit time (see the sketch below)
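Putting those notes together, a simplified version of the commit loop might look like the sketch below. The `import_rolodex` function is illustrative and assumes the `Rolodex` model and a SQLAlchemy session; only the per-row existence check, the 500-record batch size, and the commit-per-batch pattern come from the notes above.

```python
BATCH_SIZE = 500  # records per commit, per the notes above

def import_rolodex(db, rows, result):
    """Illustrative batched import loop; not the actual importer code."""
    batch = []
    for row in rows:
        # Existence check happens before the record joins the batch,
        # not at commit time.
        if db.query(Rolodex).filter(Rolodex.id == row['id']).first():
            result['skipped'] += 1
            continue
        batch.append(Rolodex(**row))

        if len(batch) >= BATCH_SIZE:
            db.bulk_save_objects(batch)
            db.commit()
            result['success'] += len(batch)
            batch = []

    if batch:  # flush the final partial batch
        db.bulk_save_objects(batch)
        db.commit()
        result['success'] += len(batch)
```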
## Troubleshooting

### Import seems slow

**Normal behavior:** Database checks add time, especially with many duplicates.

**Monitoring:**

```bash
# Watch import progress in logs
docker-compose logs -f delphi-db | grep -i "rolodex\|import"
```
### All records skipped

Possible causes:

- **Data already imported** - check the database: `SELECT COUNT(*) FROM rolodex`
- **CSV has no valid IDs** - check the CSV format
- **Database already populated** - safe to ignore if the data looks correct
### Want to re-import from scratch

```bash
# Clear rolodex table (be careful!)
docker-compose exec delphi-db python3 << EOF
from app.database import SessionLocal
from app.models import Rolodex
db = SessionLocal()
db.query(Rolodex).delete()
db.commit()
print("Rolodex table cleared")
EOF

# Or delete entire database and restart
rm delphi.db
docker-compose restart
```
## Data Quality Insights
The skip count provides insights into legacy data quality:
### High Skip Rate (>50%)
Indicates:
- Significant duplicates in legacy system
- Multiple exports merged together
- Poor data normalization in original system
### Low Skip Rate (<10%)
Indicates:
- Clean legacy data
- Proper unique constraints in original system
- First-time import
### Example from Real Data
From the rolodex file (52,100 rows):
- Unique IDs: ~10,000
- Duplicates: ~42,000
- Duplication rate: 80%
This suggests the legacy export included:
- Historical snapshots
- Multiple versions of same record
- Merged data from different time periods
## Future Improvements

Potential enhancements:

- **Update existing records** instead of skipping
- **Merge duplicate records** based on timestamp or version
- **Report duplicate details** in the import log
- **Configurable behavior** - skip vs update vs error
- **Batch optimization** - a single query to check all IDs at once (see the sketch below)
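As a rough illustration of that last idea, the existing IDs for a whole batch could be fetched with a single query before inserting. The `filter_existing` helper below is hypothetical, not part of the current importer:

```python
def filter_existing(db, batch_rows):
    """Return only the rows whose IDs are not already in the rolodex table.

    Hypothetical sketch of the 'single query to check all IDs at once'
    enhancement; assumes the SQLAlchemy Rolodex model is available.
    """
    ids = [row['id'] for row in batch_rows]

    # One query per batch instead of one query per row.
    existing = {
        r.id for r in db.query(Rolodex.id).filter(Rolodex.id.in_(ids)).all()
    }

    return [row for row in batch_rows if row['id'] not in existing]
```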