Unicoder's purpose

Visual Studio 2008 handles Unicode transparently, i.e. developers can edit files with it without being aware of how those files are encoded. Since programming is a collaborative effort and tastes (or requirements) vary, we have very little control over the encoding of the source files we process with embel.py.

After recurring agony and grief over stray Unicode-encoded files (SVN, backups, etc.) we decided to embrace the 21st century instead of fighting it and put graceful Unicode handling into embel.py.

unicoder.py reflects modest ambitions, namely:

The code

Understand what "BOM"s are:
unicoding

A BOM (byte order mark) is not considered part of the text.
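To make the point about BOMs concrete, here is a minimal sketch of how a BOM can be recognised and separated from the text that follows it. The helper name split_bom and the fallback encoding are assumptions made for illustration; they are not taken from unicoder.py.

```python
import codecs

# Candidate BOMs, longest first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM, so it has to be tested earlier.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(raw, default="latin-1"):
    """Return (encoding, bom, payload) for the raw bytes of a file.

    Hypothetical helper: 'default' is an assumed fallback for files
    that carry no BOM; the real module may decide differently.
    """
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            # The BOM is kept separately and is not part of the payload.
            return encoding, bom, raw[len(bom):]
    return default, b"", raw
```

The payload can then be decoded with the returned encoding, while the BOM is remembered so that it can be put back when the file is written.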

Note that UnicodeSmartRead and UnicodeSmartWrite have to be balanced. If UnicodeSmartWrite cannot find the (absolute) path of the file to write in the encodingHash, an exception is thrown. This check is cheap and has the added benefit of making it virtually impossible to mix up files (i.e. to write to a file other than the one we read from). In other words, unicoder.py assumes that only files that have been read by unicoder.py are written (back) by unicoder.py.
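The balancing rule can be sketched roughly as follows. This is not the actual unicoder.py code: smart_read, smart_write and encoding_hash are hypothetical stand-ins for UnicodeSmartRead, UnicodeSmartWrite and the encodingHash, and both the exception type and the latin-1 fallback for files without a BOM are assumptions.

```python
import codecs
import os

# Hypothetical stand-in for the encodingHash:
# absolute path -> (encoding, BOM bytes) as seen when the file was read.
encoding_hash = {}

# Longest BOMs first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def smart_read(path):
    """Read a file, remember its encoding, return its text without the BOM."""
    abs_path = os.path.abspath(path)
    with open(abs_path, "rb") as handle:
        raw = handle.read()
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            encoding_hash[abs_path] = (encoding, bom)
            return raw[len(bom):].decode(encoding)
    # Assumption: files without a BOM are treated as latin-1.
    encoding_hash[abs_path] = ("latin-1", b"")
    return raw.decode("latin-1")


def smart_write(path, text):
    """Write text back with the encoding and BOM recorded by smart_read."""
    abs_path = os.path.abspath(path)
    if abs_path not in encoding_hash:
        # Unbalanced call: this file was never read, so refuse to write it.
        raise KeyError("%s was not read before writing" % abs_path)
    encoding, bom = encoding_hash[abs_path]
    with open(abs_path, "wb") as handle:
        handle.write(bom + text.encode(encoding))
```

Keying the table on the absolute path means the same file is recognised regardless of the relative path used to reach it, which is what makes the lookup usable as a guard against writing to the wrong file.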

Unit-tests

The unit-tests check whether the encoding is determined correctly for the test files located in the dummy subdirectory:

The base names of the files are taken from the menu items in MS Windows' Notepad, which we used to create these files.

The unit-tests also check whether files have been written back correctly by comparing them against prepared samples. The original contents of the test files are kept in the -orig.txt files; the samples against which we compare the results of UnicodeSmartWrite are named -result.txt.
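A comparison of this kind can be done byte for byte, so that a missing or altered BOM is caught as well. The following is only a sketch; matches_sample is a hypothetical helper, not a function from the actual test suite.

```python
import filecmp


def matches_sample(written_path, sample_path):
    """Return True if the written file is byte-identical to the sample."""
    # shallow=False forces a comparison of the file contents instead of
    # relying on os.stat() signatures (type, size, modification time).
    return filecmp.cmp(written_path, sample_path, shallow=False)
```

A test would call this with the file produced by UnicodeSmartWrite and the corresponding -result.txt sample, and assert that the result is true.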

The unit-tests are cheap and could be refactored to squeeze out duplication, even more so if you rename the test files so that each base name exactly matches the string denoting its encoding.