Python script to use HTMLDOC with UTF-8 files
You know, HTMLDOC is a good tool to complement txt2tags, especially for breaking an HTML file into multiple pages.
But the current version of HTMLDOC (1.8.x) has no Unicode support.
When you try to use it to convert or split a UTF-8 file, all the special (non-ASCII) characters come out wrong in the resulting HTML.
Unicode support is due in version 1.9, which is still in beta.
If you can't wait for the stable 1.9 release, or are stuck with an old version and just want a quick fix for your mangled files, try my Python script:
It restores the original UTF-8 characters that HTMLDOC has mangled.
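To give an idea of what such a repair involves, here is a minimal sketch. It is not the script itself, and it rests on an assumption about the damage: that HTMLDOC read each byte of a UTF-8 sequence as a separate Latin-1 character and wrote it back either raw or as an HTML entity (so "é" ends up as "&Atilde;&copy;"). The name restore_utf8 and the helpers are made up for this illustration.

```python
import re
from html.entities import name2codepoint

# Named, decimal or hexadecimal HTML entity.
_ENTITY = re.compile(r"&(#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Latin-1 characters that, taken as bytes, have the shape of one
# multi-byte UTF-8 sequence (lead byte plus continuation bytes).
_UTF8_SEQ = re.compile(
    r"[\xc2-\xdf][\x80-\xbf]"
    r"|[\xe0-\xef][\x80-\xbf]{2}"
    r"|[\xf0-\xf4][\x80-\xbf]{3}"
)


def _entity_to_char(match):
    """Expand an entity only if it stands for a single byte 128..255."""
    body = match.group(1)
    if body[:2] in ("#x", "#X"):
        code = int(body[2:], 16)
    elif body.startswith("#"):
        code = int(body[1:])
    else:
        code = name2codepoint.get(body, -1)
    if 128 <= code <= 255:
        return chr(code)
    return match.group(0)  # leave &lt;, &amp;, &#8364; and so on alone


def _decode_sequence(match):
    """Turn a run of Latin-1 characters back into one UTF-8 character."""
    run = match.group(0)
    try:
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        return run  # right shape but not valid UTF-8: keep as is


def restore_utf8(text):
    """Assumed repair: entities -> bytes -> UTF-8 characters."""
    expanded = _ENTITY.sub(_entity_to_char, text)
    return _UTF8_SEQ.sub(_decode_sequence, expanded)


print(restore_utf8("caf&Atilde;&copy; &amp; cr&#195;&#168;me"))
```

Note one side effect of this approach: a stray entity in the 128..255 range that is not part of a UTF-8 sequence (a lone &nbsp;, for example) is left in the text as the literal character instead of the entity.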
You can use it as a filter (it reads STDIN and writes the result to STDOUT):
cat myfile.html | fix-htmldoc-utf8 > myfile-ok.html
You can pass the file name as an argument and get the result on STDOUT:
fix-htmldoc-utf8 myfile.html > myfile-ok.html
Or you can use the -w option to fix the file in place:
fix-htmldoc-utf8 -w myfile.html
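If you are curious how one small script can offer these three modes, here is a rough sketch of the command-line handling using argparse. It only mirrors the behavior described above and is not the script's real code: main, fix_text and the fix parameter are hypothetical names, and fix_text is a placeholder that returns its input unchanged where the actual repair would go.

```python
import argparse
import sys


def fix_text(text):
    """Placeholder for the repair step (hypothetical); returns text as is."""
    return text


def main(argv=None, fix=fix_text, stdin=None, stdout=None):
    """Filter STDIN, print a fixed file, or rewrite the file with -w."""
    if stdin is None:
        stdin = sys.stdin
    if stdout is None:
        stdout = sys.stdout

    parser = argparse.ArgumentParser(prog="fix-htmldoc-utf8")
    parser.add_argument("-w", dest="in_place", action="store_true",
                        help="write the result back to the file")
    parser.add_argument("file", nargs="?",
                        help="HTML file to fix (default: read STDIN)")
    args = parser.parse_args(argv)

    if args.file is None:
        if args.in_place:
            parser.error("-w needs a file name")
        # Filter mode: STDIN in, STDOUT out.
        stdout.write(fix(stdin.read()))
        return 0

    # newline="" keeps the file's line endings untouched.
    with open(args.file, encoding="utf-8", newline="") as handle:
        fixed = fix(handle.read())

    if args.in_place:
        with open(args.file, "w", encoding="utf-8", newline="") as handle:
            handle.write(fixed)
    else:
        stdout.write(fixed)
    return 0
```

A real script would end with a call such as sys.exit(main()). Reading the whole file into memory before opening it for writing is what makes the in-place mode safe here.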