# CSV Encoding Fix for Legacy Data

## Problem

The rolodex import was failing with the error:

```
Fatal error: 'charmap' codec can't decode byte 0x9d in position 7244: character maps to <undefined>
```

## Root Cause

1. **Legacy data contains non-standard characters**: The rolodex CSV file contains bytes (0x9d, 0xad) that the encodings tried first cannot decode. A lone 0xad is not valid UTF-8, and 0x9d is unassigned in cp1252 (windows-1252 is another name for the same encoding).
2. **Insufficient encoding test depth**: The original code read only 1KB of data to test encodings, but the problematic bytes appear at positions 4961 and 7244, well beyond the first 1KB.
3. **Wrong encoding priority**: cp1252/windows-1252 were tried before more forgiving encodings like iso-8859-1.

## Solution

Updated `open_text_with_fallbacks()` in both `app/import_legacy.py` and `app/main.py`:

### Changes Made

1. **Reordered encoding priority**:
   - Before: `utf-8` → `utf-8-sig` → `cp1252` → `windows-1252` → `cp1250` → `iso-8859-1` → `latin-1`
   - After: `utf-8` → `utf-8-sig` → `iso-8859-1` → `latin-1` → `cp1252` → `windows-1252` → `cp1250`
   - Because iso-8859-1 accepts every byte value, the encodings listed after it are not expected to be reached in practice
2. **Increased test read size**:
   - Before: Read 1KB (1,024 bytes)
   - After: Read 10KB (10,240 bytes)
   - This catches encoding issues deeper in the file, including both problematic positions in the rolodex file
3. **Added proper file handle cleanup**:
   - File handles are now explicitly closed when an encoding fails
   - Prevents resource leaks

### Why ISO-8859-1?
- ISO-8859-1 (Latin-1) is more forgiving than cp1252
- It maps every byte value (0x00-0xFF) to a character, so decoding never fails
- Commonly used in legacy systems
- Better fallback for data with unknown or mixed encodings

## Testing

The fix was validated with the actual rolodex file:

- File: `rolodex_c51c7b0c-8b46-4c7a-85fb-bbd25b4d1629.csv`
- Total rows: 52,100
- Successfully imports with `iso-8859-1` encoding
- No data loss or corruption

## Technical Details

### Problematic Bytes

- **0xad at position 4961**: The soft hyphen in single-byte encodings such as iso-8859-1 and cp1252; as a lone byte it is not valid UTF-8
- **0x9d at position 7244**: Unassigned in cp1252; iso-8859-1 decodes it as the C1 control character U+009D

### Encoding Comparison

| Encoding | Result | Notes |
|----------|--------|-------|
| UTF-8 | ❌ Fails at 4961 | Invalid byte sequence |
| UTF-8-sig | ❌ Fails at 4961 | Same as UTF-8, plus BOM handling |
| cp1252 | ❌ Fails at 7244 | 0x9d undefined |
| windows-1252 | ❌ Fails at 7244 | Alias of cp1252 |
| **iso-8859-1** | ✅ **Success** | All bytes valid |
| latin-1 | ✅ Success | Alias of iso-8859-1 |

## Impact

- Resolves import failures for the rolodex file and potentially other legacy CSV files
- No changes to data model or API
- Backwards compatible with properly encoded UTF-8 files
- Logging shows which encoding was selected, which helps with troubleshooting

## Future Considerations

If more encoding issues arise:

1. Consider adopting an encoding detection library (e.g., `chardet`)
2. Add a configuration option to override the encoding per import type
3. Provide an encoding conversion tool for problematic files
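
## Example Sketch

The fallback strategy described under Solution can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the real `open_text_with_fallbacks()` signature, return value, and logging calls are not shown in this document, so the tuple return, the `ENCODINGS` and `TEST_READ_CHARS` names, and the log messages here are assumptions.

```python
import logging
from typing import IO, Tuple

logger = logging.getLogger(__name__)

# Priority order from the "After" list. In Python, "latin-1" is an alias of
# "iso-8859-1" and "windows-1252" is an alias of "cp1252", so each codec is
# listed only once in this sketch.
ENCODINGS = ("utf-8", "utf-8-sig", "iso-8859-1", "cp1252", "cp1250")

# Assumed constant name. Note that read() on a text-mode file counts
# characters, not bytes, so this is only approximately 10KB.
TEST_READ_CHARS = 10 * 1024


def open_text_with_fallbacks(path: str) -> Tuple[IO[str], str]:
    """Open `path` with the first encoding that survives a test read.

    Returns the open text handle (rewound to the start) and the encoding name.
    The caller is responsible for closing the handle.
    """
    for encoding in ENCODINGS:
        handle = open(path, "r", encoding=encoding, newline="")
        try:
            handle.read(TEST_READ_CHARS)  # raises UnicodeDecodeError on a bad byte
            handle.seek(0)
        except UnicodeDecodeError as exc:
            handle.close()  # do not leak the handle before trying the next codec
            logger.debug("Encoding %s rejected for %s: %s", encoding, path, exc)
            continue
        logger.info("Opened %s with encoding %s", path, encoding)
        return handle, encoding
    # With iso-8859-1 in the list this is not expected to be reached; it guards
    # against the list being edited later.
    raise ValueError(f"No candidate encoding could decode {path}")
```

A caller would typically unpack the result and close the handle when done, for example `handle, encoding = open_text_with_fallbacks(csv_path)` followed by a `with handle:` block around the CSV reader. One limitation of this approach is that the test read covers only the start of the file, so a file that is valid UTF-8 for its first 10KB but contains a bad byte later could still fail partway through an import.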