diff --git a/docs/DUPLICATE_HANDLING.md b/docs/DUPLICATE_HANDLING.md
index 7cb7afc..5b6e377 100644
--- a/docs/DUPLICATE_HANDLING.md
+++ b/docs/DUPLICATE_HANDLING.md
@@ -22,16 +22,30 @@ The import system now implements multiple layers of duplicate protection:
 
 ### 1. In-Memory Duplicate Tracking
 
 ```python
+# For single primary key
 seen_in_import = set()
+
+# For composite primary key (e.g., file_no + version)
+seen_in_import = set()  # stores tuples like (file_no, version)
+composite_key = (file_no, version)
 ```
 
-Tracks IDs encountered during the current import session. If an ID is seen twice in the same file, only the first occurrence is imported.
+Tracks IDs or composite key combinations encountered during the current import session. If a key is seen twice in the same file, only the first occurrence is imported.
 
 ### 2. Database Existence Check
 
 Before importing each record, checks if it already exists:
 
 ```python
+# For single primary key (e.g., rolodex)
 if db.query(Rolodex).filter(Rolodex.id == rolodex_id).first():
     result['skipped'] += 1
     continue
+
+# For composite primary key (e.g., pensions with file_no + version)
+if db.query(Pensions).filter(
+    Pensions.file_no == file_no,
+    Pensions.version == version
+).first():
+    result['skipped'] += 1
+    continue
 ```
 
 ### 3. Graceful Batch Failure Handling
@@ -74,6 +88,10 @@ This is **not an error** - it means the system is protecting data integrity.
 
 Currently implemented for:
 - ✅ `rolodex` (primary key: id)
 - ✅ `filetype` (primary key: file_type)
+- ✅ `pensions` (composite primary key: file_no, version)
+- ✅ `pension_death` (composite primary key: file_no, version)
+- ✅ `pension_separate` (composite primary key: file_no, version)
+- ✅ `pension_results` (composite primary key: file_no, version)
 
 Other tables should be updated if they encounter similar issues.
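
Note for reviewers: the in-memory duplicate tracking the patch documents can be sketched as a self-contained function. The `dedupe_by_composite_key` name and the sample `records` data are illustrative only (not part of the actual import code); the `(file_no, version)` key shape follows the patch's examples:

```python
# Sketch of in-memory duplicate tracking for a composite primary key.
# First occurrence of each (file_no, version) pair wins; later duplicates
# within the same import file are dropped.

def dedupe_by_composite_key(records):
    """Keep only the first record seen for each (file_no, version) pair."""
    seen_in_import = set()  # stores tuples like (file_no, version)
    kept = []
    for rec in records:
        composite_key = (rec["file_no"], rec["version"])
        if composite_key in seen_in_import:
            continue  # duplicate within this import: skip it
        seen_in_import.add(composite_key)
        kept.append(rec)
    return kept

records = [
    {"file_no": "A1", "version": 1, "amount": 100},
    {"file_no": "A1", "version": 2, "amount": 110},
    {"file_no": "A1", "version": 1, "amount": 999},  # duplicate key
]
print(len(dedupe_by_composite_key(records)))  # prints 2
```

Using a set of tuples keeps the membership test O(1) per record, so the dedup pass stays linear in the size of the import file.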
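
The database existence check can likewise be exercised end to end against an in-memory SQLite database. The `Pensions` model below is a stripped-down stand-in (only the composite-key columns from the patch; the real model's other columns are unknown here), and it assumes SQLAlchemy 1.4+:

```python
# Sketch of the "check before insert" pattern for a composite primary key,
# using a throwaway SQLite database. Model and helper names are illustrative.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Pensions(Base):
    __tablename__ = "pensions"
    file_no = Column(String, primary_key=True)   # composite primary key,
    version = Column(Integer, primary_key=True)  # part 1 and part 2

engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

def import_records(db, records, result):
    for rec in records:
        # Skip any record whose (file_no, version) pair already exists.
        exists = db.query(Pensions).filter(
            Pensions.file_no == rec["file_no"],
            Pensions.version == rec["version"],
        ).first()
        if exists:
            result["skipped"] += 1
            continue
        db.add(Pensions(file_no=rec["file_no"], version=rec["version"]))
        db.commit()
        result["imported"] += 1

with Session(engine) as db:
    result = {"imported": 0, "skipped": 0}
    import_records(db, [{"file_no": "A1", "version": 1},
                        {"file_no": "A1", "version": 1}], result)
    print(result)  # {'imported': 1, 'skipped': 1}
```

Note that check-then-insert is not atomic under concurrent writers; the patch's other layers (per-record commits and graceful batch-failure handling) are what catch the race when two imports run at once.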