DocToHtml - Doc To HTML Converter
The easy way to batch convert your Word docs to clean HTML/XHTML

DocToHtml - Sample HTML Code

We have invested a great deal of time and effort in improving the algorithms that optimize the output HTML code. As a result, in most cases DocToHtml produces the smallest possible HTML code, several times smaller than the code produced by MS Word itself. More importantly, this compact code is easy to edit in your favorite HTML editor!

Below is a fragment of the sample document "BUSINESS REQUIREMENTS SPECIFICATION", converted with DocToHtml.

This fragment on screen in MS Word (downscaled)

Code by MS Word 2003

Code by DocToHtml 2.50

Code by DocToHtml 2.50, without indentation

The code produced by MS Word is 4 318 bytes, whereas the code from DocToHtml (in the mode without indentation) is only 614 bytes. You can also see for yourself how hard and painful it is to edit the MS Word code manually.

Note also that DocToHtml takes advantage of the fact that all cells in this table have the same font size and vertical-align properties: instead of being specified for every cell, they are specified only once, at the level of the whole table. The same is done for the background color and font-weight properties of the first table row. Also note that column widths are specified in percentages, not in points. This lets you use the generated table in a fluid ("rubber") layout with ease, without worrying about the actual table width in pixels.
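The idea of specifying shared cell properties only once can be sketched in Python. This is a minimal illustration, not DocToHtml's actual algorithm: the function name `hoist_common_styles` and the representation of cell styles as dictionaries are assumptions made for the example. How the shared properties are then written to the output (for example, as one rule that applies to every cell) is left out.

```python
def hoist_common_styles(cells):
    """Split cell styles into a shared part and per-cell leftovers.

    `cells` is a list of dicts mapping property names to values
    (a hypothetical representation, chosen for illustration only).
    Returns (shared_style, per_cell_styles).
    """
    if not cells:
        return {}, []
    # A property is shared only if every cell has it with the same value.
    shared = {
        prop: value
        for prop, value in cells[0].items()
        if all(cell.get(prop) == value for cell in cells)
    }
    # Each cell keeps only the properties that were not shared.
    remaining = [
        {p: v for p, v in cell.items() if p not in shared}
        for cell in cells
    ]
    return shared, remaining


cells = [
    {"font-size": "10pt", "vertical-align": "top", "font-weight": "bold"},
    {"font-size": "10pt", "vertical-align": "top"},
    {"font-size": "10pt", "vertical-align": "top"},
]
table_style, cell_styles = hoist_common_styles(cells)
print(table_style)   # {'font-size': '10pt', 'vertical-align': 'top'}
print(cell_styles)   # [{'font-weight': 'bold'}, {}, {}]
```

The same function could be applied to the cells of a single row to find properties, such as background color, that are shared by that row only.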

All these optimizations are possible because DocToHtml generates the resulting HTML entirely by itself, without using the internal MS Word export function (although it does use MS Word to gain access to the original document). This approach allows us to generate highly optimized HTML code.

One might object that there is a free tool from Microsoft, called Office HTML Filter, which claims to delete all Office-specific tags and leave only plain HTML. In reality, Office HTML Filter isn't very useful. It does perform some clean-up, but much of the messy code remains even with the strictest cleaning options. In fact, it is very hard to clean the HTML code produced by the built-in Office converter to the point where it is free of redundancy and can be integrated into your website painlessly.

The same fragment cleaned with Office HTML Filter

As you can see, the code is still heavily bloated and practically uneditable. Office HTML Filter strips out only the totally useless SPAN tags with the LANG attribute and some Office-specific markup, but a lot of garbage remains. Compare this to the clean output of DocToHtml!

And, of course, with DocToHtml you can fully customize the output HTML code, down to the level of every formatting attribute, which is not possible with Office HTML Filter. Moreover, with DocToHtml you can specify low-level characteristics of the HTML code, such as the letter case of tag names, whether to use optional end tags, whether to use optional quotes around attribute values, and so on.

Screenshots of the DocToHtml Conversion Options dialog will give you an idea of how finely you can control the output HTML code.
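To make two of these low-level characteristics concrete, here is a small Python sketch of a start-tag writer with a letter-case switch and an optional-quotes switch. The function `render_start_tag` and its parameters `uppercase` and `quote_all` are hypothetical names invented for this illustration; they are not DocToHtml options. The rule for when quotes may be dropped is deliberately conservative (ASCII letters, digits, hyphens and periods only).

```python
import html


def render_start_tag(name, attrs, uppercase=False, quote_all=True):
    """Render an HTML start tag under two hypothetical output options."""
    def fix_case(text):
        return text.upper() if uppercase else text.lower()

    parts = [fix_case(name)]
    for key, value in attrs.items():
        # Quotes are dropped only for non-empty values made of
        # ASCII letters, digits, hyphens and periods.
        safe = value != "" and all(
            c.isascii() and (c.isalnum() or c in "-.") for c in value
        )
        if quote_all or not safe:
            parts.append(f'{fix_case(key)}="{html.escape(value, quote=True)}"')
        else:
            parts.append(f"{fix_case(key)}={value}")
    return "<" + " ".join(parts) + ">"


attrs = {"width": "50%", "align": "left"}
print(render_start_tag("td", attrs))
# <td width="50%" align="left">
print(render_start_tag("td", attrs, uppercase=True, quote_all=False))
# <TD WIDTH="50%" ALIGN=left>
```

Dropping quotes saves two bytes per attribute, which is why such a setting matters when the goal is the most compact output.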

When you don't need all formatting

As mentioned above, with DocToHtml you can selectively omit certain formatting attributes. Office HTML Filter also has the checkboxes "Remove all STYLE elements" and "Remove standard CSS". Let's compare the output of DocToHtml with all checkboxes for formatting attributes turned OFF against the output of Office HTML Filter with these two checkboxes turned ON.

Office HTML Filter without STYLE and CSS formatting

DocToHtml without output formatting for fonts and paragraphs, without indentation

In this mode, the document will not look exactly the same in a browser as the original document does in MS Word. Note that you can't completely omit formatting with Office HTML Filter - for example, the valign attribute and <b> tags will always be present in the output document. With DocToHtml, by contrast, you can omit formatting selectively: for fonts, paragraphs, tables, or the BODY tag, for none of them, or for only certain attributes. In the last example, all formatting for fonts and paragraphs was stripped, while the options for tables were set ON.

Note that the Office HTML Filter output contains a redundant "width" attribute on each and every table cell. But in HTML, all cells belonging to a given table column share the same width. So there is no need to specify the width for cells that are not in the first row. MS Word thinks differently, however, and Office HTML Filter can do nothing about it. Compare that to DocToHtml.
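The clean-up described here, keeping "width" only on first-row cells, can be sketched in Python. This is an illustration under stated assumptions, not DocToHtml's implementation: the function name `drop_redundant_widths` and the list-of-dicts table model are hypothetical, and the sketch assumes a simple grid in which the first row has one cell per column (no colspan).

```python
def drop_redundant_widths(rows):
    """Keep the `width` attribute only on cells of the first row.

    `rows` is a list of rows; each row is a list of attribute dicts,
    one dict per cell (a hypothetical model for illustration).
    Assumes the first row has one cell per column (no colspan).
    """
    cleaned = []
    for row_index, row in enumerate(rows):
        if row_index == 0:
            # First-row cells define the column widths, so copy them as-is.
            cleaned.append([dict(cell) for cell in row])
        else:
            # Later rows inherit column widths; drop their `width` attribute.
            cleaned.append(
                [{k: v for k, v in cell.items() if k != "width"} for cell in row]
            )
    return cleaned


rows = [
    [{"width": "30%"}, {"width": "70%"}],
    [{"width": "30%", "valign": "top"}, {"width": "70%"}],
]
result = drop_redundant_widths(rows)
print(result[0])  # [{'width': '30%'}, {'width': '70%'}]
print(result[1])  # [{'valign': 'top'}, {}]
```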

Size of selected fragment

Conversion method | MS Word 2003 | DocToHtml | Office HTML Filter | Office HTML Filter without formatting | DocToHtml without formatting
Size, bytes | 4 318 | 614 | 2 835 | 919 | 549
Ratio compared to DocToHtml | 7:1 | - | 4.6:1 | 1.7:1 | -
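
The ratios in the table can be reproduced from the byte counts with a few lines of Python. The sizes are the ones listed above; the names `SIZES` and `ratio` are only for this example. Note that the 1.7:1 figure matches a comparison against DocToHtml without formatting (549 bytes), while the other two ratios use the formatted DocToHtml output (614 bytes) as the baseline.

```python
# Fragment sizes in bytes, as listed in the table.
SIZES = {
    "MS Word 2003": 4318,
    "DocToHtml": 614,
    "Office HTML Filter": 2835,
    "Office HTML Filter without formatting": 919,
    "DocToHtml without formatting": 549,
}


def ratio(size, baseline):
    """Format size/baseline as 'N.N:1'."""
    return f"{size / baseline:.1f}:1"


word_vs_dth = ratio(SIZES["MS Word 2003"], SIZES["DocToHtml"])
filter_vs_dth = ratio(SIZES["Office HTML Filter"], SIZES["DocToHtml"])
plain_filter_vs_plain_dth = ratio(
    SIZES["Office HTML Filter without formatting"],
    SIZES["DocToHtml without formatting"],
)

print(word_vs_dth)                # 7.0:1
print(filter_vs_dth)              # 4.6:1
print(plain_filter_vs_plain_dth)  # 1.7:1
```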

These figures can be extrapolated to the whole document. So, if you want to preserve formatting, DocToHtml will produce code several times more compact than that of MS Word, even after Office HTML Filter has been applied. This significantly reduces bandwidth usage and improves download time. And, often much more importantly, you can edit the resulting HTML code without any trouble.

See for yourself everything we are talking about here - download the free 30-day trial

Download Now (7 MB)

All mentioned trademarks are property of their respective owners