docs: Add comprehensive documentation of CSV encoding fix

HotSwapp
2025-10-12 19:19:56 -05:00
parent 7958556613
commit 89ff90a384

docs/ENCODING_FIX.md Normal file

@@ -0,0 +1,79 @@
# CSV Encoding Fix for Legacy Data
## Problem
The rolodex import was failing with the error:
```
Fatal error: 'charmap' codec can't decode byte 0x9d in position 7244: character maps to <undefined>
```
## Root Cause
1. **Legacy data contains non-standard characters**: The rolodex CSV file contains bytes that the encodings tried first cannot decode: 0xad is not valid UTF-8 on its own, and 0x9d is unassigned in cp1252 (also known as windows-1252)
2. **Insufficient encoding test depth**: The original code only read 1KB of data to test encodings, but problematic bytes appeared at position 4961 and 7244
3. **Wrong encoding priority**: cp1252/windows-1252 were tried before more forgiving encodings like iso-8859-1
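The effect of the shallow test read can be reproduced with a small sketch. The `probe()` helper and the synthetic buffer below are illustrative assumptions, not the project's code; only the two byte offsets are taken from this document.

```python
def probe(data: bytes, encoding: str, size: int) -> bool:
    """Return True if the first `size` bytes decode cleanly in `encoding`."""
    try:
        data[:size].decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Synthetic stand-in for the rolodex file: ASCII filler with the two
# problematic bytes placed at the offsets reported in this document.
buf = bytearray(b"a" * 8000)
buf[4961] = 0xAD
buf[7244] = 0x9D
data = bytes(buf)

# A 1 KB probe never reaches the bad bytes, so every encoding appears to work.
shallow_utf8 = probe(data, "utf-8", 1024)
shallow_cp1252 = probe(data, "cp1252", 1024)

# A 10 KB probe covers both offsets and rejects the unsuitable encodings.
deep_utf8 = probe(data, "utf-8", 10240)
deep_cp1252 = probe(data, "cp1252", 10240)

print(shallow_utf8, shallow_cp1252, deep_utf8, deep_cp1252)
```

With the filler being pure ASCII, the slice cannot cut a multi-byte character in half; a probe over real UTF-8 data would need to account for that.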
## Solution
Updated `open_text_with_fallbacks()` in both `app/import_legacy.py` and `app/main.py`:
### Changes Made
1. **Reordered encoding priority**:
- Before: `utf-8` → `utf-8-sig` → `cp1252` → `windows-1252` → `cp1250` → `iso-8859-1` → `latin-1`
- After: `utf-8` → `utf-8-sig` → `iso-8859-1` → `latin-1` → `cp1252` → `windows-1252` → `cp1250`
2. **Increased test read size**:
- Before: Read 1KB (1,024 bytes)
- After: Read 10KB (10,240 bytes)
- This catches encoding issues deeper in the file, including both known problem bytes (positions 4961 and 7244)
3. **Added proper file handle cleanup**:
- Now explicitly closes file handles when encoding fails
- Prevents resource leaks
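The three changes can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code in `app/import_legacy.py` or `app/main.py`: the signature, the `(handle, encoding)` return value, the constant names, and the `_demo()` helper are all hypothetical.

```python
import logging
import os
import tempfile
from typing import TextIO, Tuple

logger = logging.getLogger(__name__)

# Priority order listed under "After" above.
ENCODINGS = ("utf-8", "utf-8-sig", "iso-8859-1", "latin-1",
             "cp1252", "windows-1252", "cp1250")
TEST_READ_SIZE = 10 * 1024  # 10KB test read


def open_text_with_fallbacks(path: str) -> Tuple[TextIO, str]:
    """Open `path` with the first encoding that survives the test read.

    Hypothetical signature: returns (open text handle, encoding name).
    """
    last_error = None
    for encoding in ENCODINGS:
        handle = open(path, "r", encoding=encoding, newline="")
        try:
            # In text mode read() counts characters, so this covers at least
            # TEST_READ_SIZE bytes (or the whole file if it is shorter).
            handle.read(TEST_READ_SIZE)
            handle.seek(0)
        except UnicodeDecodeError as exc:
            handle.close()  # do not leak the handle of a rejected encoding
            last_error = exc
            continue
        except Exception:
            handle.close()
            raise
        logger.info("Opened %s with encoding %s", path, encoding)
        return handle, encoding
    raise ValueError(f"Could not decode {path}") from last_error


def _demo(data: bytes) -> str:
    """Write `data` to a temporary file and report the selected encoding."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as raw:
            raw.write(data)
        handle, encoding = open_text_with_fallbacks(path)
        handle.close()
        return encoding
    finally:
        os.remove(path)


legacy_encoding = _demo(b"Name,Note\r\nSmith,\x9dquoted\x9d\r\n")
utf8_encoding = _demo("Name,Note\r\nJos\u00e9,ok\r\n".encode("utf-8"))
print(legacy_encoding, utf8_encoding)
```

Because `iso-8859-1` accepts every byte, the entries after it in this order act only as a safety net and are not reached in this sketch.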
### Why ISO-8859-1?
- ISO-8859-1 (Latin-1) is more forgiving than cp1252, which leaves some byte values (including 0x9d) unassigned
- It maps every byte value (0x00-0xFF) to a character, so decoding never fails
- Commonly used in legacy systems
- Better fallback for data with unknown or mixed encodings
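The "any byte value" property can be checked directly with Python's built-in codecs. The snippet below is a small illustration and does not depend on the import code.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 maps byte N to code point U+00NN, so decoding cannot fail
# and encoding the result gives back the original bytes.
text = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in text]
round_trip = text.encode("iso-8859-1")

# Collect the byte values that Python's cp1252 codec refuses to decode.
cp1252_undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        cp1252_undefined.append(value)

print([hex(v) for v in cp1252_undefined])
```

A successful decode only means that no byte was rejected. It does not prove that the file was originally written in ISO-8859-1.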
## Testing
The fix was validated with the actual rolodex file:
- File: `rolodex_c51c7b0c-8b46-4c7a-85fb-bbd25b4d1629.csv`
- Total rows: 52,100
- Successfully imports with `iso-8859-1` encoding
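- No data loss or corruption: every byte decodes to a character, although bytes such as 0x9d come through as control characters rather than printable text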
## Technical Details
### Problematic Bytes
- **0xad at position 4961**: Soft hyphen in ISO-8859-1 and cp1252; as a lone byte it is not valid UTF-8
- **0x9d at position 7244**: Unassigned in cp1252; ISO-8859-1 decodes it to the C1 control character U+009D
### Encoding Comparison
| Encoding | Result | Notes |
|----------|--------|-------|
| UTF-8 | ❌ Fails at 4961 | Invalid byte sequence |
| UTF-8-sig | ❌ Fails at 4961 | Same as UTF-8, but skips a leading BOM |
| cp1252 | ❌ Fails at 7244 | 0x9d undefined |
| windows-1252 | ❌ Fails at 7244 | Same as cp1252 |
| **iso-8859-1** | ✅ **Success** | All bytes valid |
| latin-1 | ✅ Success | Identical to iso-8859-1 |
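
A comparison like the one in the table can be reproduced with a short script. The sketch below uses a synthetic buffer rather than the real rolodex file, and `first_failure()` is a hypothetical helper; only the two byte offsets come from this document.

```python
import codecs
from typing import Optional


def first_failure(data: bytes, encoding: str) -> Optional[int]:
    """Return the offset of the first undecodable byte, or None on success."""
    try:
        data.decode(encoding)
        return None
    except UnicodeDecodeError as exc:
        return exc.start


# Synthetic stand-in: ASCII filler plus the two problematic bytes.
buf = bytearray(b"x" * 8000)
buf[4961] = 0xAD
buf[7244] = 0x9D
data = bytes(buf)

encodings = ("utf-8", "utf-8-sig", "cp1252", "windows-1252",
             "iso-8859-1", "latin-1")
failures = {enc: first_failure(data, enc) for enc in encodings}

# codecs.lookup() resolves aliases to the codec that is actually used.
canonical = {enc: codecs.lookup(enc).name for enc in encodings}

for enc in encodings:
    print(f"{enc:<13} codec={canonical[enc]:<10} first failure={failures[enc]}")
```

The alias lookup shows why `windows-1252` and `latin-1` behave the same as `cp1252` and `iso-8859-1`: in Python each pair resolves to a single codec.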
## Impact
- Resolves import failures for rolodex and potentially other legacy CSV files
- No changes to data model or API
- Backwards compatible with properly encoded UTF-8 files
- Logging shows which encoding was selected for troubleshooting
## Future Considerations
If more encoding issues arise:
1. Consider adopting an encoding-detection library (e.g., `chardet`)
2. Add configuration option to override encoding per import type
3. Provide encoding conversion tool for problematic files
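
As a rough idea of what the conversion tool in item 3 could look like, here is a minimal sketch that rewrites a file as UTF-8 once its source encoding is known. The function name `convert_to_utf8`, its parameters, and the sample data are assumptions for illustration; nothing like it exists in the codebase yet.

```python
import os
import tempfile


def convert_to_utf8(src_path: str, dst_path: str, source_encoding: str,
                    chunk_chars: int = 64 * 1024) -> None:
    """Copy src_path to dst_path, re-encoding the text as UTF-8.

    newline="" on both handles keeps the original line endings intact.
    """
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)


# Usage with a small synthetic file (not the real rolodex data).
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "legacy.csv")
    dst_file = os.path.join(tmp, "legacy_utf8.csv")
    with open(src_file, "wb") as raw:
        raw.write(b"Caf\xe9,\x9dx\r\n")
    convert_to_utf8(src_file, dst_file, "iso-8859-1")
    with open(dst_file, "rb") as raw:
        converted = raw.read()

print(converted)
```

The converted file can then be imported through the normal UTF-8 path, although any characters that were misread under the assumed source encoding are carried over as they are.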