# Duplicate Record Handling in Legacy Imports

## Overview

Legacy CSV files may contain duplicate IDs due to the way the original database system exported or maintained data. The import system now handles these duplicates gracefully.

## Problem

When importing rolodex data, duplicate IDs can cause:

```
UNIQUE constraint failed: rolodex.id
```

This error would cascade, causing all subsequent rows in the batch to fail with:

```
This Session's transaction has been rolled back due to a previous exception
```

## Solution

The import system now implements multiple layers of duplicate protection:

### 1. In-Memory Duplicate Tracking

```python
# For single primary key
seen_in_import = set()

# For composite primary key (e.g., file_no + version)
seen_in_import = set()  # stores tuples like (file_no, version)
composite_key = (file_no, version)
```

This set tracks the IDs or composite-key combinations encountered during the current import session. If a key is seen twice in the same file, only the first occurrence is imported.

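A minimal sketch of how this tracking fits into the row loop (the `csv_rows`, `row`, and `result` names are illustrative, not the actual importer code):

```python
seen_in_import = set()

for row in csv_rows:
    # Use (file_no, version) for composite-key tables, or just the ID otherwise.
    key = (row['file_no'], row['version'])
    if key in seen_in_import:
        result['skipped'] += 1   # duplicate within the same CSV file
        continue
    seen_in_import.add(key)
    # ... database existence check and insert follow (see section 2) ...
```
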
### 2. Database Existence Check

Before importing each record, the importer checks whether it already exists:

```python
# For single primary key (e.g., rolodex)
if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
    result['skipped'] += 1
    continue

# For composite primary key (e.g., pensions with file_no + version)
if db.query(Pensions).filter(
    Pensions.file_no == file_no,
    Pensions.version == version
).first():
    result['skipped'] += 1
    continue
```

### 3. Graceful Batch Failure Handling

If a bulk insert fails due to duplicates, the importer:

- Rolls back the transaction
- Falls back to row-by-row insertion
- Silently skips duplicates
- Continues with the remaining records

A sketch of this fallback pattern is shown below.

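This is a minimal sketch, assuming a SQLAlchemy session `db`, a list of model instances `batch`, and the `result` dict described later; it is not the exact importer code:

```python
from sqlalchemy.exc import IntegrityError

def flush_batch(db, batch, result):
    """Try a bulk insert; on duplicate failure, retry row by row."""
    try:
        db.add_all(batch)
        db.commit()
        result['success'] += len(batch)
    except IntegrityError:
        db.rollback()                      # clear the failed transaction
        for record in batch:
            try:
                db.add(record)
                db.commit()
                result['success'] += 1
            except IntegrityError:
                db.rollback()
                result['skipped'] += 1     # duplicate: skip silently and continue
```
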
## Import Results

The import result now includes a `skipped` count:

```python
{
    'success': 10000,      # Records successfully imported
    'errors': [],          # Critical errors (empty if successful)
    'total_rows': 52100,   # Total rows in CSV
    'skipped': 42094       # Duplicates or existing records skipped
}
```

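For example, calling code might summarize the result like this (a hypothetical helper, not part of the import system):

```python
def summarize(result):
    total = result['total_rows']
    skip_rate = result['skipped'] / total if total else 0
    print(f"Imported {result['success']} of {total} rows "
          f"({result['skipped']} skipped, {skip_rate:.0%}); "
          f"{len(result['errors'])} errors")
```

With the values above this prints roughly `Imported 10000 of 52100 rows (42094 skipped, 81%); 0 errors`.
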
## Understanding Skip Counts

High skip counts are **normal and expected** for legacy data:

### Why Records Are Skipped

1. **Duplicate IDs in CSV** - Same ID appears multiple times in the file
2. **Re-importing** - Records already exist from a previous import
3. **Data quality issues** - Legacy exports may have contained duplicates

### Example: Rolodex Import

- Total rows: 52,100
- Successfully imported: ~10,000 (unique IDs)
- Skipped: ~42,000 (duplicates + existing)

This is **not an error** - it means the system is protecting data integrity.

## Which Tables Have Duplicate Protection?

Currently implemented for:

- ✅ `rolodex` (primary key: id)
- ✅ `filetype` (primary key: file_type)
- ✅ `pensions` (composite primary key: file_no, version)
- ✅ `pension_death` (composite primary key: file_no, version)
- ✅ `pension_separate` (composite primary key: file_no, version)
- ✅ `pension_results` (composite primary key: file_no, version)

Importers for other tables should be updated in the same way if they run into similar issues.

## Re-importing Data

You can safely re-import the same file multiple times:

- Already imported records are detected and skipped
- Only new records are added
- No duplicate errors
- The operation is idempotent (see the sketch below)

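For illustration only: a re-import sketch assuming a hypothetical `import_rolodex_csv(db, path)` entry point that returns the result dict shown earlier (the real function name and signature may differ):

```python
from app.database import SessionLocal
from app.imports import import_rolodex_csv  # hypothetical module and function name

db = SessionLocal()
first = import_rolodex_csv(db, "rolodex.csv")
second = import_rolodex_csv(db, "rolodex.csv")  # same file again

# The second run adds nothing new; rows imported the first time are skipped.
print(first['success'], first['skipped'])
print(second['success'], second['skipped'])
```
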
## Performance Considerations

### Database Checks

For each row, we query:

```sql
SELECT * FROM rolodex WHERE id = ?
```

This adds overhead but ensures data integrity. For 52k rows:

- With duplicates: ~5-10 minutes
- Without duplicates: ~2-5 minutes

### Optimization Notes

- Queries are indexed on the primary key (fast)
- Batch size: 500 records per commit
- Existence is checked before a record is added to the batch, not at commit time

## Troubleshooting

### Import seems slow

**Normal behavior**: Database checks add time, especially when there are many duplicates.

**Monitoring**:

```bash
# Watch import progress in logs
docker-compose logs -f delphi-db | grep -i "rolodex\|import"
```

### All records skipped

**Possible causes**:

1. Data already imported - check the database: `SELECT COUNT(*) FROM rolodex` (a quick way to run this check is sketched below)
2. CSV has no valid IDs - check the CSV format
3. Database already populated - safe to ignore if the data looks correct

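One way to run that count from inside the container, reusing the `SessionLocal` and `Rolodex` imports that appear in the next section:

```python
# Run via: docker-compose exec delphi-db python3
from app.database import SessionLocal
from app.models import Rolodex

db = SessionLocal()
print("rolodex rows:", db.query(Rolodex).count())
```
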
### Want to re-import from scratch

```bash
# Clear rolodex table (be careful!)
docker-compose exec delphi-db python3 << EOF
from app.database import SessionLocal
from app.models import Rolodex
db = SessionLocal()
db.query(Rolodex).delete()
db.commit()
print("Rolodex table cleared")
EOF

# Or delete entire database and restart
rm delphi.db
docker-compose restart
```

## Data Quality Insights

The skip count provides insights into legacy data quality:

### High Skip Rate (>50%)

Indicates:

- Significant duplicates in the legacy system
- Multiple exports merged together
- Poor data normalization in the original system

### Low Skip Rate (<10%)

Indicates:

- Clean legacy data
- Proper unique constraints in the original system
- First-time import

### Example from Real Data

From the rolodex file (52,100 rows):

- Unique IDs: ~10,000
- Duplicates: ~42,000
- **Duplication rate: ~80%**

This suggests the legacy export included:

- Historical snapshots
- Multiple versions of the same record
- Merged data from different time periods

## Future Improvements
|
|
|
|
Potential enhancements:
|
|
1. **Update existing records** instead of skipping
|
|
2. **Merge duplicate records** based on timestamp or version
|
|
3. **Report duplicate details** in import log
|
|
4. **Configurable behavior** - skip vs update vs error
|
|
5. **Batch optimization** - single query to check all IDs at once
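As an illustration of item 5, a minimal sketch that prefetches existing IDs for a whole batch with one query (the `batch_rows` name and the column mapping are assumptions, not the current implementation):

```python
# Check one batch of CSV rows against the database with a single query.
csv_ids = [row['id'] for row in batch_rows]
existing = {
    rid for (rid,) in db.query(Rolodex.id).filter(Rolodex.id.in_(csv_ids)).all()
}

for row in batch_rows:
    if row['id'] in existing or row['id'] in seen_in_import:
        result['skipped'] += 1
        continue
    seen_in_import.add(row['id'])
    db.add(Rolodex(**row))   # assumes CSV columns map directly onto model fields
```
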
|
|