txt2tags blog

Python script to use HTMLDOC with UTF-8 files

As you may know, HTMLDOC is a good tool to complement txt2tags, especially for breaking an HTML file into multiple pages.

But the current version of HTMLDOC (1.8.x) has no Unicode support.

When you use it to convert or split a UTF-8 file, all the special (non-ASCII) characters come out wrong in the resulting HTML.

Unicode support will be released in version 1.9, which is still in beta.

If you can't wait for the stable 1.9 release, or are stuck with an old version, and just want a quick fix for your garbled files, try my Python script:

fix-htmldoc-utf8.py

It restores the original UTF-8 characters that HTMLDOC garbled.
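
To give an idea of how this kind of repair can work, here is a minimal sketch. It is not the code of fix-htmldoc-utf8.py. It assumes the damage is the usual one: each byte of a UTF-8 sequence was read as a Latin-1 character and written out either raw or as an HTML entity (for example "á" turning into "&Atilde;&iexcl;"). The name restore_utf8 and the helper functions are made up for this illustration.

```python
import re
from html.entities import name2codepoint

# Named entities (&Atilde;) and decimal numeric entities (&#195;).
_ENTITY = re.compile(r"&(#\d+|[A-Za-z][A-Za-z0-9]*);")
# Runs of characters in the Latin-1 high range, i.e. candidate UTF-8 bytes.
_HIGH_RUN = re.compile(r"[\x80-\xff]+")


def _entity_to_char(match):
    """Expand an entity only if it stands for a single byte 0x80-0xFF."""
    body = match.group(1)
    if body.startswith("#"):
        code = int(body[1:])
    else:
        code = name2codepoint.get(body, -1)
    if 0x80 <= code <= 0xFF:
        return chr(code)
    # Leave &lt;, &amp;, unknown entities and so on untouched.
    return match.group(0)


def _decode_run(match):
    """Reinterpret a run of Latin-1 characters as UTF-8 bytes, if valid."""
    run = match.group(0)
    try:
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not a valid UTF-8 sequence: assume it was real Latin-1 text.
        return run


def restore_utf8(garbled):
    """Hypothetical helper: undo a Latin-1 misreading of UTF-8 text."""
    expanded = _ENTITY.sub(_entity_to_char, garbled)
    return _HIGH_RUN.sub(_decode_run, expanded)
```

With this sketch, restore_utf8("ol&Atilde;&iexcl;") gives "olá". Note one side effect: a legitimate Latin-1 entity such as "&eacute;" is left as the plain character "é" instead of the entity.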

You can use it as a filter (it reads STDIN and writes the results to STDOUT):

cat myfile.html | fix-htmldoc-utf8 > myfile-ok.html

You can pass the file name as an argument and get the results on STDOUT:

fix-htmldoc-utf8 myfile.html > myfile-ok.html

Or you can use the -w option to fix the file in place:

fix-htmldoc-utf8 -w myfile.html
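
If you are curious how a script can offer these three modes, here is a rough sketch using Python's argparse. It is not the actual source of the script. The fix_text function is only a placeholder for the repair step, and reading and writing the file as UTF-8 is an assumption.

```python
import argparse
import sys


def fix_text(text):
    """Placeholder repair step (hypothetical): undo a Latin-1 misreading."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def main(argv=None, stdin=None, stdout=None):
    # The streams are parameters only so the function is easy to try out.
    if stdin is None:
        stdin = sys.stdin
    if stdout is None:
        stdout = sys.stdout

    parser = argparse.ArgumentParser(description="Restore garbled UTF-8 text.")
    parser.add_argument("file", nargs="?", help="input file (default: STDIN)")
    parser.add_argument("-w", dest="write", action="store_true",
                        help="write the result back to the input file")
    args = parser.parse_args(argv)

    if args.write and not args.file:
        parser.error("-w needs a file name")

    # Mode 1: filter (no file given). Modes 2 and 3: read the named file.
    if args.file:
        with open(args.file, encoding="utf-8") as handle:
            text = handle.read()
    else:
        text = stdin.read()

    fixed = fix_text(text)

    # Mode 3: overwrite the file in place. Otherwise send to STDOUT.
    if args.write:
        with open(args.file, "w", encoding="utf-8") as handle:
            handle.write(fixed)
    else:
        stdout.write(fixed)
    return 0
```

To turn the sketch into a command, you would end the file with a call such as sys.exit(main()).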

Enjoy!

Posted Fri 27 June 2008 by aurelio in Tools