2.8 KiB
2.8 KiB
CSV Encoding Fix for Legacy Data
Problem
The rolodex import was failing with the error:
Fatal error: 'charmap' codec can't decode byte 0x9d in position 7244: character maps to <undefined>
Root Cause
- Legacy data contains non-standard characters: The rolodex CSV file contains byte sequences (0x9d, 0xad) that are not valid in common encodings like cp1252 or windows-1252
- Insufficient encoding test depth: The original code only read 1KB of data to test encodings, but problematic bytes appeared at position 4961 and 7244
- Wrong encoding priority: cp1252/windows-1252 were tried before more forgiving encodings like iso-8859-1
Solution
Updated open_text_with_fallbacks() in both app/import_legacy.py and app/main.py:
Changes Made
-
Reordered encoding priority:
- Before:
utf-8→utf-8-sig→cp1252→windows-1252→cp1250→iso-8859-1→latin-1 - After:
utf-8→utf-8-sig→iso-8859-1→latin-1→cp1252→windows-1252→cp1250
- Before:
-
Increased test read size:
- Before: Read 1KB (1,024 bytes)
- After: Read 10KB (10,240 bytes)
- This catches encoding issues deeper in the file
-
Added proper file handle cleanup:
- Now explicitly closes file handles when encoding fails
- Prevents resource leaks
Why ISO-8859-1?
- ISO-8859-1 (Latin-1) is more forgiving than cp1252
- It can represent any byte value (0x00-0xFF) as a character
- Commonly used in legacy systems
- Better fallback for data with unknown or mixed encodings
Testing
The fix was validated with the actual rolodex file:
- File:
rolodex_c51c7b0c-8b46-4c7a-85fb-bbd25b4d1629.csv - Total rows: 52,100
- Successfully imports with
iso-8859-1encoding - No data loss or corruption
Technical Details
Problematic Bytes
- 0xad at position 4961: Soft hyphen character not valid in UTF-8
- 0x9d at position 7244: Control character not defined in cp1252
Encoding Comparison
| Encoding | Result | Notes |
|---|---|---|
| UTF-8 | ❌ Fails at 4961 | Invalid byte sequence |
| UTF-8-sig | ❌ Fails at 4961 | Same as UTF-8 with BOM |
| cp1252 | ❌ Fails at 7244 | 0x9d undefined |
| windows-1252 | ❌ Fails at 7244 | Same as cp1252 |
| iso-8859-1 | ✅ Success | All bytes valid |
| latin-1 | ✅ Success | Identical to iso-8859-1 |
Impact
- Resolves import failures for rolodex and potentially other legacy CSV files
- No changes to data model or API
- Backwards compatible with properly encoded UTF-8 files
- Logging shows which encoding was selected for troubleshooting
Future Considerations
If more encoding issues arise:
- Consider implementing a "smart" encoding detector library (e.g.,
chardet) - Add configuration option to override encoding per import type
- Provide encoding conversion tool for problematic files