25 Sep

UTF-8 with BOM

When localizing DragonScales 3 we ran into a baffling issue with an internal tool whose only job is to replace text in a group of files. Those UTF-8 encoded files contain messages that the game loads right from the start. However, after running the tool, the game started crashing when reading those files. With the help of an old buddy, fc /B, we found out that our tool was “injecting” three extra bytes at the start of the file: EF BB BF. In short, the tool was changing the encoding of the files from UTF-8 to UTF-8 with BOM. That was the cause of the crashes, as our game expects the files to be UTF-8 encoded without a BOM.
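For a quick check that does not need a second file to compare against, a few lines of Python can tell whether a file starts with those three bytes. This is only a sketch, not our internal tool, and has_utf8_bom is a name made up for illustration:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` begins with the bytes EF BB BF."""
    # Open in binary mode so no decoding hides or alters the leading bytes.
    with open(path, "rb") as f:
        # codecs.BOM_UTF8 is the constant b"\xef\xbb\xbf".
        return f.read(3) == codecs.BOM_UTF8
```

Running it over the files before and after the tool touches them should show which step introduces the extra bytes.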

What’s this BOM, anyway? Simply put, it’s a byte order mark: a short sequence of bytes (EF BB BF in UTF-8) placed at the very start of a file to tell readers that the content is UTF-8 encoded. Such a mark can be useful in some specific contexts, with some specific programs. That was not our case, so we had to remove the BOM with a little batch script like this:

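for /r ".\DE\scenes" %%i in (*.*) do (
  rem Work on a temporary copy; quotes keep paths with spaces intact
  copy "%%i" .\tmp.txt /Y
  rem Double quotes stop cmd from treating ^ as its escape character
  sed -i "1s/^\xEF\xBB\xBF//" .\tmp.txt
  rem Clear the read-only attribute, then put the file back
  attrib -R .\tmp.txt
  move /Y .\tmp.txt "%%i"
)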

In this snippet we remove the BOM with sed. The files are those under a fictitious directory, .\DE\scenes. The copy and attrib steps work around some permission problems with the files created by our version of sed on Windows.
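Where sed is not at hand, the same cleanup can be sketched in Python. This is an alternative, not the script we actually ran: strip_utf8_bom and strip_tree are hypothetical names, the whole file is read into memory (fine for small message files), and read-only files are not handled.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file, if present.

    Returns True if the file was rewritten, False if it had no BOM.
    """
    path = Path(path)
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the three BOM bytes, untouched.
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def strip_tree(root):
    """Strip the BOM from every regular file under `root`, recursively.

    Returns the list of paths that were changed.
    """
    changed = []
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and strip_utf8_bom(p):
            changed.append(p)
    return changed
```

Calling strip_tree on the same fictitious directory, DE\scenes, would cover the files the batch loop walks through. Files without a BOM are left alone, so running it twice is harmless.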