# Duplicate Record Handling in Legacy Imports
## Overview
Legacy CSV files may contain duplicate IDs due to the way the original database system exported or maintained data. The import system now handles these duplicates gracefully.
## Problem
When importing rolodex data, duplicate IDs can cause:
```
UNIQUE constraint failed: rolodex.id
```
This error would cascade, causing all subsequent rows in the batch to fail with:
```
This Session's transaction has been rolled back due to a previous exception
```
## Solution
The import system now implements multiple layers of duplicate protection:
### 1. In-Memory Duplicate Tracking
```python
seen_in_import = set()
```
Tracks IDs encountered during the current import session. If an ID is seen twice in the same file, only the first occurrence is imported.
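A minimal sketch of how this fits into the import loop (`csv_rows`, `batch`, and `result` are illustrative names, not the actual importer's):
```python
seen_in_import = set()

for row in csv_rows:
    rolodex_id = row['id']
    # First occurrence wins; repeats within the same file are skipped.
    if rolodex_id in seen_in_import:
        result['skipped'] += 1
        continue
    seen_in_import.add(rolodex_id)
    batch.append(Rolodex(**row))
```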
### 2. Database Existence Check
Before importing each record, checks if it already exists:
```python
if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
    result['skipped'] += 1
    continue
```
### 3. Graceful Batch Failure Handling
If a bulk insert fails due to duplicates, the importer recovers as follows (see the sketch after this list):
- Transaction is rolled back
- Falls back to row-by-row insertion
- Silently skips duplicates
- Continues with remaining records
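A sketch of that recovery path, assuming SQLAlchemy's `IntegrityError` and the illustrative `batch`/`result` names from above:
```python
from sqlalchemy.exc import IntegrityError

try:
    db.add_all(batch)
    db.commit()
    result['success'] += len(batch)
except IntegrityError:
    # The bulk insert hit a duplicate: roll back and retry row by row.
    db.rollback()
    for record in batch:
        try:
            db.add(record)
            db.commit()
            result['success'] += 1
        except IntegrityError:
            # Silently skip the duplicate and keep going.
            db.rollback()
            result['skipped'] += 1
```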
## Import Results
The import result now includes a `skipped` count:
```python
{
    'success': 10000,     # Records successfully imported
    'errors': [],         # Critical errors (empty if successful)
    'total_rows': 52100,  # Total rows in CSV
    'skipped': 42094      # Duplicates or existing records skipped
}
```
## Understanding Skip Counts
High skip counts are **normal and expected** for legacy data:
### Why Records Are Skipped
1. **Duplicate IDs in CSV** - Same ID appears multiple times in file
2. **Re-importing** - Records already exist from previous import
3. **Data quality issues** - Legacy exports may have had duplicates
### Example: Rolodex Import
- Total rows: 52,100
- Successfully imported: ~10,000 (unique IDs)
- Skipped: ~42,000 (duplicates + existing)
This is **not an error** - it means the system is protecting data integrity.
## Which Tables Have Duplicate Protection?
Currently implemented for:
- `rolodex` (primary key: `id`)
- `filetype` (primary key: `file_type`)

Other tables should receive the same protection if similar issues surface.
## Re-importing Data
You can safely re-import the same file multiple times:
- Already imported records are detected and skipped
- Only new records are added
- No duplicate errors
- Idempotent operation
## Performance Considerations
### Database Checks
For each row, the importer runs a lookup equivalent to:
```sql
SELECT * FROM rolodex WHERE id = ?
```
This adds overhead but ensures data integrity. For 52k rows:
- With duplicates: ~5-10 minutes
- Without duplicates: ~2-5 minutes
### Optimization Notes
- Queries are indexed on primary key (fast)
- Batch size: 500 records per commit (see the sketch below)
- Only checks before adding to batch (not on commit)
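For reference, a sketch of that commit cadence (the batch size and variable names are illustrative):
```python
BATCH_SIZE = 500

batch = []
for row in csv_rows:
    # ... duplicate checks from the sections above ...
    batch.append(Rolodex(**row))
    if len(batch) >= BATCH_SIZE:
        db.add_all(batch)
        db.commit()
        batch = []

# Flush whatever is left after the loop.
if batch:
    db.add_all(batch)
    db.commit()
```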
## Troubleshooting
### Import seems slow
**Normal behavior**: Database checks add time, especially with many duplicates.
**Monitoring**:
```bash
# Watch import progress in logs
docker-compose logs -f delphi-db | grep -i "rolodex\|import"
```
### All records skipped
**Possible causes**:
1. Data already imported - check database: `SELECT COUNT(*) FROM rolodex`
2. CSV has no valid IDs - check CSV format
3. Database already populated - safe to ignore if data looks correct
### Want to re-import from scratch
```bash
# Clear rolodex table (be careful!)
docker-compose exec -T delphi-db python3 << EOF
from app.database import SessionLocal
from app.models import Rolodex
db = SessionLocal()
db.query(Rolodex).delete()
db.commit()
print("Rolodex table cleared")
EOF
# Or delete entire database and restart
rm delphi.db
docker-compose restart
```
## Data Quality Insights
The skip count provides insights into legacy data quality:
### High Skip Rate (>50%)
Indicates:
- Significant duplicates in legacy system
- Multiple exports merged together
- Poor data normalization in original system
### Low Skip Rate (<10%)
Indicates:
- Clean legacy data
- Proper unique constraints in original system
- First-time import
### Example from Real Data
From the rolodex file (52,100 rows):
- Unique IDs: ~10,000
- Duplicates: ~42,000
- **Duplication rate: 80%**
This suggests the legacy export included:
- Historical snapshots
- Multiple versions of same record
- Merged data from different time periods
## Future Improvements
Potential enhancements:
1. **Update existing records** instead of skipping
2. **Merge duplicate records** based on timestamp or version
3. **Report duplicate details** in import log
4. **Configurable behavior** - skip vs update vs error
5. **Batch optimization** - single query to check all IDs at once (sketched below)
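As an example, enhancement 5 could replace the per-row existence query with one `IN` query per batch. A sketch of the idea, not current behavior (`batch_rows` and `result` are illustrative names):
```python
# Collect the IDs for the whole batch, then check them in one round trip.
batch_ids = [row['id'] for row in batch_rows]
existing = {
    rid for (rid,) in
    db.query(Rolodex.id).filter(Rolodex.id.in_(batch_ids)).all()
}
new_rows = [row for row in batch_rows if row['id'] not in existing]
result['skipped'] += len(batch_rows) - len(new_rows)
```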