Files
delphi-database-v2/docs/ENCODING_FIX.md

2.8 KiB

CSV Encoding Fix for Legacy Data

Problem

The rolodex import was failing with the error:

Fatal error: 'charmap' codec can't decode byte 0x9d in position 7244: character maps to <undefined>

Root Cause

  1. Legacy data contains non-standard characters: The rolodex CSV file contains byte sequences (0x9d, 0xad) that are not valid in common encodings like cp1252 or windows-1252
  2. Insufficient encoding test depth: The original code only read 1KB of data to test encodings, but problematic bytes appeared at position 4961 and 7244
  3. Wrong encoding priority: cp1252/windows-1252 were tried before more forgiving encodings like iso-8859-1

Solution

Updated open_text_with_fallbacks() in both app/import_legacy.py and app/main.py:

Changes Made

  1. Reordered encoding priority:

    • Before: utf-8utf-8-sigcp1252windows-1252cp1250iso-8859-1latin-1
    • After: utf-8utf-8-sigiso-8859-1latin-1cp1252windows-1252cp1250
  2. Increased test read size:

    • Before: Read 1KB (1,024 bytes)
    • After: Read 10KB (10,240 bytes)
    • This catches encoding issues deeper in the file
  3. Added proper file handle cleanup:

    • Now explicitly closes file handles when encoding fails
    • Prevents resource leaks

Why ISO-8859-1?

  • ISO-8859-1 (Latin-1) is more forgiving than cp1252
  • It can represent any byte value (0x00-0xFF) as a character
  • Commonly used in legacy systems
  • Better fallback for data with unknown or mixed encodings

Testing

The fix was validated with the actual rolodex file:

  • File: rolodex_c51c7b0c-8b46-4c7a-85fb-bbd25b4d1629.csv
  • Total rows: 52,100
  • Successfully imports with iso-8859-1 encoding
  • No data loss or corruption

Technical Details

Problematic Bytes

  • 0xad at position 4961: Soft hyphen character not valid in UTF-8
  • 0x9d at position 7244: Control character not defined in cp1252

Encoding Comparison

Encoding Result Notes
UTF-8 Fails at 4961 Invalid byte sequence
UTF-8-sig Fails at 4961 Same as UTF-8 with BOM
cp1252 Fails at 7244 0x9d undefined
windows-1252 Fails at 7244 Same as cp1252
iso-8859-1 Success All bytes valid
latin-1 Success Identical to iso-8859-1

Impact

  • Resolves import failures for rolodex and potentially other legacy CSV files
  • No changes to data model or API
  • Backwards compatible with properly encoded UTF-8 files
  • Logging shows which encoding was selected for troubleshooting

Future Considerations

If more encoding issues arise:

  1. Consider implementing a "smart" encoding detector library (e.g., chardet)
  2. Add configuration option to override encoding per import type
  3. Provide encoding conversion tool for problematic files