DocToHtml - Doc To HTML Converter
The easy way to batch convert your Word docs to clean HTML/XHTML


DocToHtml and ePubs - Articles

What real users think and say about our product:

DocToHTML is a saver

I make e-books as a hobby for myself and others. Most e-book formats use a type of HTML as their core. That means that the book source must be written in HTML. This can be difficult, as most writers are not knowledgeable about HTML, let alone style sheets and variants like XHTML. Writers tend to stick to what they know and do best. That is writing books in a word processor; the most popular is still Microsoft Word. That poses a problem. You can save a Word document as an HTML page without a problem. However, the HTML code produced is a monster. It is bloated with all kinds of additional code. The reason the code is there is simple: it enables Word to reload the HTML document and reconstruct the original document. However, for an e-book this is not important. Of course, you can also save the document as filtered HTML. The result is better, but still a big mess. It takes considerable time to clean out the garbage.

My first 'solution' was to clean it up manually and/or via some (RegEx) Search&Replace commands. Although this saves time, it is still quite laborious. The next phase was creating a macro that converts the document into HTML and saves it. That also gave me the opportunity to save style names so I could reference them in a style sheet. For most books it works fine, retaining things like italics, bold, centering and the like. Unfortunately, I did not manage to convert everything to my liking, and some parts were not converted at all. It saves time, but not enough.
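As an illustration of the RegEx search-and-replace approach mentioned above, here is a minimal Python sketch. The function name `strip_word_markup` and the patterns are assumptions chosen for this example, not the reviewer's actual commands. They cover only a few common kinds of Word clutter, which is the limitation the review describes.

```python
import re

def strip_word_markup(html: str) -> str:
    """Remove a few typical Word-specific leftovers from saved HTML.

    Hypothetical helper for illustration; real Word output needs many more rules.
    """
    # Drop conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.DOTALL)
    # Drop Office namespace tags such as <o:p> and </o:p>
    html = re.sub(r"</?o:[^>]*>", "", html)
    # Drop class attributes that Word generates (class=MsoNormal, class="MsoTitle", ...)
    html = re.sub(r'\s+class="?Mso[A-Za-z]*"?', "", html)
    # Drop inline style attributes
    html = re.sub(r'\s+style="[^"]*"', "", html)
    return html

sample = '<p class=MsoNormal style="margin:0cm">Hello <i>world</i><o:p></o:p></p>'
cleaned = strip_word_markup(sample)
print(cleaned)
```

Each rule targets a single pattern, so the list of rules grows with every new document quirk, which is why this approach stays laborious.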

Then I tried out the 30-day trial of DocToHtml. It saved me even more time. It can produce output in the version of HTML I want (even XHTML!), split the document for me, create a style sheet as a separate file, even enriched with my own style sheets, and much, much more. My macro pales in comparison. This is really a time saver, since I can tune the output to the document instead of relying on 'one-macro-fits-all'.

The style sheet function in particular is something I haven't seen before. Most converters ignore styles and therefore lose formatting, but DocToHtml not only converts them, it can also enrich them. The handling of tables is unsurpassed and the handling of text boxes is something else I have never seen.

So, if you are in the business of creating e-books, take a very good look at DocToHtml for converting those Word documents to sensible (x)HTML code! It will save time and help to shape the e-book.

Sander van der Linden
toxaris.nl

Many thanks to Sander van der Linden for the time taken to write this review.

#2

I really need DocToHTML for tables and field codes. The rest of the features are extremely useful, but are less urgent for me.

As I am automating my conversion from MS Word files to XML/XHTML files, I have tried a lot of ways to do so. There are plenty of things that I can cover with my macros in MS Word or with other software. But I did not find a solution that could satisfy all my needs.

One of the major problems I encounter in the conversion process is tables. MS Word can't give you usable HTML code for them. It gives you every attribute, even irrelevant ones, like: this cell is not yellow, nor is it ...

And my own macros in Word cannot recognize merged cells. This value is not registered in a form that Word VBA can recognize, so the colspan and rowspan values are always lost and have to be inserted manually afterwards. Apparently, Word does not write these values in a proper way. Try to reconstruct this in Word and you will see: it doesn't (always) correctly store the logical information on rows, columns and merging. It is as if it uses values per cell. A cell is not spanned over two columns, but seems to be just wider than normal, and nobody misses the merged cell, so Word pretends that it has never been there.

BUT DocToHTML gives all this information: great job. I got the neat code below from DocToHTML. No messy Word attributes for all the things that you didn't put in. I fed it a Word document with a simple table: in the Word file the figures stand for {row,column}. If you load this code into an HTML reader you will find the {row,column} figures in the right cells, as the picture shows.

<table>
<tr>
<td colspan="2">Row 1,1+2 Two in one</td>
<td>Row 1 column 3</td>
<td rowspan="2">
1+2,4<br />
Span two rows</td></tr>
<tr>
<td>Row 2,1</td>
<td>2,2</td>
<td>2,3</td></tr>
<tr>
<td>3,1</td>
<td colspan="2">3,2+3</td>
<td>3,4</td></tr>
</table>

This gives exactly the same result as my Word table, without any manual intervention. The screenshot is below:

The same applies to field codes: perfect.
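The colspan and rowspan values in the table markup above can also be checked without a browser. The following Python sketch uses only the standard library's `html.parser`. The class `TableGrid` and the function `expand` are hypothetical names made up for this example and are not part of DocToHTML. It expands every spanned cell into the grid positions it covers, so each {row,column} label can be compared with its actual position.

```python
from html.parser import HTMLParser

TABLE = """<table>
<tr><td colspan="2">Row 1,1+2 Two in one</td><td>Row 1 column 3</td>
<td rowspan="2">
1+2,4<br />
Span two rows</td></tr>
<tr><td>Row 2,1</td><td>2,2</td><td>2,3</td></tr>
<tr><td>3,1</td><td colspan="2">3,2+3</td><td>3,4</td></tr>
</table>"""

class TableGrid(HTMLParser):
    """Collect the <td> cells of each row with their colspan and rowspan."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list per <tr>; each cell is [text, colspan, rowspan]
        self._cell = None   # the cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            a = dict(attrs)
            self._cell = ["", int(a.get("colspan", 1)), int(a.get("rowspan", 1))]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            # Join text fragments (for example around <br />) with single spaces
            self._cell[0] = (self._cell[0] + " " + data.strip()).strip()

def expand(rows):
    """Map every (row, column) position to the text of the cell covering it."""
    grid = {}
    for r, cells in enumerate(rows):
        c = 0
        for text, colspan, rowspan in cells:
            while (r, c) in grid:   # position already taken by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    return grid

parser = TableGrid()
parser.feed(TABLE)
parser.close()
grid = expand(parser.rows)

# Print the table as a plain 3 x 4 grid (positions are 0-based here)
for r in range(3):
    print(" | ".join(grid[(r, c)] for c in range(4)))
```

A merged cell appears at every position it covers, so the cell spanning rows 1 and 2 in column 4 shows up in both of those rows.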

I truly recommend DocToHTML as the converter program you need to go from Word to (X)HTML. For instance, when working towards an ePublication: there is no software that produces clean ePub from scratch. But if you understand what ePub needs, you can create the necessary code from here.

I don't use the CSS generator, because I use my own CSS. But the fact that you can gives everybody a head start. The same goes for everything you can do with splitting, meta tags, and choosing among all kinds of different ways to do your conversion.

Jaap Prummel

Many thanks to Jaap Prummel for the time taken to write this review.
