unicoder.py

Unicoder's purpose

Visual Studio 2008 can transparently handle Unicode, i.e. developers can edit files with it without being aware of how those files are encoded. Since programming is a collaborative effort and tastes (or requirements) vary, we have very little control over the encoding of the source files we process with embel.py.

After recurring agony and grief over stray Unicode-encoded files (SVN, backups, etc.) we decided to embrace the 21st century instead of fighting it and put graceful Unicode handling into embel.py.

unicoder.py reflects modest ambitions, namely:

before reading in a file, find out how it is encoded
remember how the file is encoded
decode the read-in text
when writing back the file, encode the embellished text in the way it was encoded before (i.e. how you remember it)

The code

First, understand what "BOM"s (byte order marks) are:
unicoding

UnicodeSmartRead receives a path, opens the file behind that path and reads it in. The resulting text is checked for various BOMs to determine the encoding. That encoding is stored in a hash, with the absolute path of the read file as the key and a string denoting the found encoding as its value (ISO-8859-1 if no BOM is found). The global hash is named encodingHash.

A BOM is not considered part of the text.
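
The read side described above can be sketched roughly as follows. This is a minimal illustration in Python 3, not the actual unicoder.py code: the function name smart_read_sketch, the BOMS table and the encoding strings ("utf-8", "utf-16-le", "utf-16-be", "iso-8859-1") are assumptions made for the example. Only the name encodingHash and the ISO-8859-1 fallback are taken from the description.

```python
import codecs
import os

# Global hash as described in the text: absolute path -> encoding name.
encodingHash = {}

# Assumed table of the BOMs we look for; the encoding strings are illustrative.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def smart_read_sketch(path):
    """Read a file, remember its encoding and return the decoded text."""
    key = os.path.abspath(path)
    with open(key, "rb") as handle:
        raw = handle.read()
    for bom, name in BOMS:
        if raw.startswith(bom):
            encodingHash[key] = name
            # The BOM is not part of the text, so strip it before decoding.
            return raw[len(bom):].decode(name)
    # No BOM found: fall back to ISO-8859-1, as the text specifies.
    encodingHash[key] = "iso-8859-1"
    return raw.decode("iso-8859-1")
```

Decoding with the endian-specific codec names (rather than plain "utf-16") keeps the BOM handling explicit, which matches the rule that the BOM is stripped on reading and restored on writing.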

UnicodeSmartWrite receives a path and a text. The text is written to the file at that path using the encoding remembered in encodingHash, and the BOM, if any, is prepended to the encoded text. After writing, the entry for the written file's absolute path is removed from the hash.

Note that UnicodeSmartRead and UnicodeSmartWrite have to be balanced. If UnicodeSmartWrite can't find the absolute path of the file to write in encodingHash, an exception is raised. This check is cheap and has the extra benefit of making it virtually impossible to mix up files (i.e. to write to a file different from the one we read from). In other words, unicoder.py assumes that only files that have been read by unicoder.py are written (back) by unicoder.py.
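
The write side and the balancing rule might look like the following sketch. Again this is an illustration in Python 3 under assumptions: smart_write_sketch, BOM_FOR and the choice of KeyError as the exception type are hypothetical, and the hash is filled by hand here where the real code would rely on a preceding read.

```python
import codecs
import os

# Global hash as described in the text: absolute path -> encoding name.
# In this sketch it is populated manually; normally the read step fills it.
encodingHash = {}

# Assumed mapping from encoding name to the BOM that is prepended on writing.
BOM_FOR = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def smart_write_sketch(path, text):
    """Write text back using the encoding remembered for this path."""
    key = os.path.abspath(path)
    if key not in encodingHash:
        # Unbalanced call: this file was never read through the read step.
        raise KeyError("no remembered encoding for " + key)
    name = encodingHash[key]
    # Encodings without a BOM (e.g. ISO-8859-1) get an empty prefix.
    data = BOM_FOR.get(name, b"") + text.encode(name)
    with open(key, "wb") as handle:
        handle.write(data)
    # Remove the entry only after the write succeeded.
    del encodingHash[key]
```

Because the entry is deleted after a successful write, a second write to the same path without a new read fails in the same way as a write to a file that was never read.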

Unit-tests

The unit-tests check if the encoding is determined correctly for test files, located in the dummy subdirectory:

utf-8.txt
unicode.txt (UTF-16 little endian)
unicode-big-endian.txt (UTF-16 big endian)
ansi.txt (aka ISO-8859-1)

The base names of the files have been taken from the menu items in MS Windows Notepad. We used Notepad to create these files.

The unit-tests also check whether files have been written back correctly by comparing them against prepared samples. The original contents of the test files are kept in the -orig.txt files; the samples against which we compare the results of UnicodeSmartWrite are named -result.txt.

The unit-tests are cheap and could be refactored to squeeze out duplication, even more so if you rename the test files so that each base name is exactly the string denoting its encoding.
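
One way such a refactoring could look is a single loop over the encoding strings. The sketch below is hypothetical and assumes Python 3: the helper failed_round_trips, the ENCODINGS list and the file names of the form <encoding>-orig.txt and <encoding>-result.txt are illustrative, and the read and write functions are passed in as parameters so the helper does not depend on any particular implementation.

```python
import filecmp
import os
import shutil
import tempfile

# Assumed naming scheme: each base name equals the string denoting the encoding.
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "iso-8859-1"]


def failed_round_trips(sample_dir, read, write, encodings=ENCODINGS):
    """Return the encodings whose written-back file differs from its sample."""
    failures = []
    for name in encodings:
        orig = os.path.join(sample_dir, name + "-orig.txt")
        expected = os.path.join(sample_dir, name + "-result.txt")
        with tempfile.TemporaryDirectory() as work_dir:
            work = os.path.join(work_dir, name + ".txt")
            # Work on a copy so the pristine original is never modified.
            shutil.copyfile(orig, work)
            text = read(work)
            write(work, text)
            # Byte-for-byte comparison against the prepared sample.
            if not filecmp.cmp(work, expected, shallow=False):
                failures.append(name)
    return failures
```

A test case would then reduce to asserting that the returned list is empty, with the list contents naming the encodings that need attention when it is not.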