diff --git a/docs/DUPLICATE_HANDLING.md b/docs/DUPLICATE_HANDLING.md
new file mode 100644
index 0000000..7cb7afc
--- /dev/null
+++ b/docs/DUPLICATE_HANDLING.md
@@ -0,0 +1,174 @@
+# Duplicate Record Handling in Legacy Imports
+
+## Overview
+
+Legacy CSV files may contain duplicate IDs due to the way the original database system exported or maintained data. The import system now handles these duplicates gracefully.
+
+## Problem
+
+When importing rolodex data, duplicate IDs can cause:
+```
+UNIQUE constraint failed: rolodex.id
+```
+
+This error would cascade, causing all subsequent rows in the batch to fail with:
+```
+This Session's transaction has been rolled back due to a previous exception
+```
+
+## Solution
+
+The import system now implements multiple layers of duplicate protection:
+
+### 1. In-Memory Duplicate Tracking
+```python
+seen_in_import = set()
+```
+Tracks IDs encountered during the current import session. If an ID appears twice in the same file, only the first occurrence is imported.
+
+### 2. Database Existence Check
+Before importing each record, the importer checks whether it already exists:
+```python
+if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
+    result['skipped'] += 1
+    continue
+```
+
+### 3. Graceful Batch Failure Handling
+If a bulk insert fails due to duplicates:
+- The transaction is rolled back
+- The importer falls back to row-by-row insertion
+- Duplicates are silently skipped
+- Remaining records continue to import
+
+## Import Results
+
+The import result now includes a `skipped` count:
+```python
+{
+    'success': 10000,      # Records successfully imported
+    'errors': [],          # Critical errors (empty if successful)
+    'total_rows': 52100,   # Total rows in CSV
+    'skipped': 42094       # Duplicates or existing records skipped
+}
+```
+
+## Understanding Skip Counts
+
+High skip counts are **normal and expected** for legacy data:
+
+### Why Records Are Skipped
+1. **Duplicate IDs in CSV** - Same ID appears multiple times in the file
+2. **Re-importing** - Records already exist from a previous import
+3. **Data quality issues** - Legacy exports may have contained duplicates
+
+### Example: Rolodex Import
+- Total rows: 52,100
+- Successfully imported: ~10,000 (unique IDs)
+- Skipped: ~42,000 (duplicates + existing)
+
+This is **not an error** - it means the system is protecting data integrity.
+
+## Which Tables Have Duplicate Protection?
+
+Currently implemented for:
+- ✅ `rolodex` (primary key: id)
+- ✅ `filetype` (primary key: file_type)
+
+Other tables should be updated if they encounter similar issues.
+
+## Re-importing Data
+
+You can safely re-import the same file multiple times:
+- Already imported records are detected and skipped
+- Only new records are added
+- No duplicate errors
+- Idempotent operation
+
+## Performance Considerations
+
+### Database Checks
+For each row, the importer queries:
+```sql
+SELECT * FROM rolodex WHERE id = ?
+```
+
+This adds overhead but ensures data integrity. For 52k rows:
+- With duplicates: ~5-10 minutes
+- Without duplicates: ~2-5 minutes
+
+### Optimization Notes
+- Lookups use the indexed primary key (fast)
+- Batch size: 500 records per commit
+- Existence is checked before adding to the batch (not on commit)
+
+## Troubleshooting
+
+### Import seems slow
+**Normal behavior**: Database checks add time, especially with many duplicates.
+
+**Monitoring**:
+```bash
+# Watch import progress in logs
+docker-compose logs -f delphi-db | grep -i "rolodex\|import"
+```
+
+### All records skipped
+**Possible causes**:
+1. Data already imported - check the database: `SELECT COUNT(*) FROM rolodex` (see the sketch after this list)
+2. CSV has no valid IDs - check the CSV format
+3. Database already populated - safe to ignore if the data looks correct
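+
+To confirm cause 1 before doing anything destructive, the following is a minimal sketch that runs the same count check from inside the container, reusing the `app.database` / `app.models` imports shown in the re-import snippet below; adjust the module paths if your project layout differs.
+
+```python
+# Minimal sketch: see how many rolodex rows already exist.
+from app.database import SessionLocal
+from app.models import Rolodex
+
+db = SessionLocal()
+try:
+    count = db.query(Rolodex).count()
+    print(f"rolodex rows already in database: {count}")
+    # A non-zero count means earlier imports populated the table,
+    # so high skip counts are expected rather than an error.
+finally:
+    db.close()
+```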
+
+### Want to re-import from scratch
+```bash
+# Clear rolodex table (be careful!)
+docker-compose exec delphi-db python3 << EOF
+from app.database import SessionLocal
+from app.models import Rolodex
+db = SessionLocal()
+db.query(Rolodex).delete()
+db.commit()
+print("Rolodex table cleared")
+EOF
+
+# Or delete the entire database and restart
+rm delphi.db
+docker-compose restart
+```
+
+## Data Quality Insights
+
+The skip count provides insight into legacy data quality:
+
+### High Skip Rate (>50%)
+Indicates:
+- Significant duplicates in the legacy system
+- Multiple exports merged together
+- Poor data normalization in the original system
+
+### Low Skip Rate (<10%)
+Indicates:
+- Clean legacy data
+- Proper unique constraints in the original system
+- First-time import
+
+### Example from Real Data
+From the rolodex file (52,100 rows):
+- Unique IDs: ~10,000
+- Duplicates: ~42,000
+- **Duplication rate: ~80%**
+
+This suggests the legacy export included:
+- Historical snapshots
+- Multiple versions of the same record
+- Merged data from different time periods
+
+## Future Improvements
+
+Potential enhancements:
+1. **Update existing records** instead of skipping
+2. **Merge duplicate records** based on timestamp or version
+3. **Report duplicate details** in the import log
+4. **Configurable behavior** - skip vs update vs error
+5. **Batch optimization** - single query to check all IDs at once (see the sketch below)
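+
+Of these, item 5 is easy to sketch. The helper below is illustrative only, not the current importer's code: `find_existing_ids` is a hypothetical name, and if the backing store is SQLite (as the `delphi.db` file suggests), very large ID lists would need to be chunked to stay under the bound-parameter limit.
+
+```python
+# Illustrative sketch of future improvement 5: one query per batch instead of
+# one SELECT per row. find_existing_ids is a hypothetical helper, not the
+# importer's current API.
+from app.database import SessionLocal
+from app.models import Rolodex
+
+def find_existing_ids(db, candidate_ids):
+    """Return the subset of candidate_ids already present in rolodex."""
+    if not candidate_ids:
+        return set()
+    rows = db.query(Rolodex.id).filter(Rolodex.id.in_(candidate_ids)).all()
+    return {row.id for row in rows}
+
+# Usage inside the import loop (batch size 500, as noted above):
+#   db = SessionLocal()
+#   existing = find_existing_ids(db, [row['id'] for row in batch])
+#   new_rows = [row for row in batch
+#               if row['id'] not in existing and row['id'] not in seen_in_import]
+```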