I stored the same text file in various encodings with notepad (a fine tool for unicode diagnostics). I read in those files in python, as plain 8-bit streams. No surprises there:
Here is a breakdown of the BOMs (http://en.wikipedia.org/wiki/Byte-order_mark):
UTF-8 |
EF BB BF |
UTF-16 BE ("unicode big endian") |
FE FF |
UTF-16 LE ("unicode") |
FF FE |
decode
works as expected:
Of course it happened again: program works okay, unit-tests suck and must be debugged.
At this point, I have my unicoder (18:36), and the unit-tests show that it works.