XML & Charset & PG Tab

XML & Charset Tab of DocToHtml Converter Options Dialog

Make XHTML code HTML-compatible: this option applies several adjustments that keep the generated XHTML code backward-compatible with browsers that understand only HTML markup. The adjustments are: inserting a space just before /> in empty-element tags; duplicating each "id" attribute with a "name" attribute that has the same value; and specifying the document encoding not only in the XML declaration at the beginning of the file but also in an HTML META tag in the HEAD section. It is advisable to keep this compatibility mode enabled at all times. This checkbox is duplicated on the General Tab.
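As a rough illustration of the first two adjustments, the following Python sketch post-processes an XHTML fragment. It is not DocToHtml's own code: the function name make_html_compatible is hypothetical, and the sketch assumes that only anchor tags receive a duplicated "name" attribute, which the text above does not specify.

```python
import re

def make_html_compatible(xhtml: str) -> str:
    """Apply two illustrative compatibility tweaks to an XHTML fragment."""
    # 1. Insert a space before "/>" where one is missing: <br/> -> <br />
    xhtml = re.sub(r"(?<!\s)/>", " />", xhtml)

    # 2. Duplicate id="x" as name="x" on anchor tags that have no name yet.
    #    (Restricting this to <a> tags is an assumption made for the sketch.)
    def add_name(match):
        tag = match.group(0)
        if re.search(r"\sname\s*=", tag):
            return tag  # already has a name attribute; leave it alone
        return re.sub(r'(\s)id="([^"]*)"', r'\1id="\2" name="\2"', tag, count=1)

    return re.sub(r"<a\b[^>]*>", add_name, xhtml)

print(make_html_compatible('<a id="top"></a><br/>'))
```

Tags that already contain a space before /> or already carry a "name" attribute pass through unchanged.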

Do not insert XML declaration at the beginning: this option suppresses the insertion of the <?xml version="1.0" encoding="windows-1252"?> declaration (the actual encoding may differ). The XHTML standard strongly recommends the XML declaration, and requires it when the encoding is other than UTF-8 or UTF-16 and is not specified by a higher-level protocol, but some old HTML-only browsers may have a problem processing such documents. For example, Internet Explorer 6 switches to quirks mode when it encounters such a declaration. So if compatibility with older browsers is required, omit the declaration by enabling this option.

Don't enclose content of <STYLE>, <SCRIPT> tags into CDATA marks: this option suppresses the insertion of CDATA marks before and after the content of the STYLE and SCRIPT tags. CDATA marks are normally required so that an XML parser treats the < and & characters within embedded styles and scripts as literal text. However, it is better not to use CDATA marks in HTML documents at all, because some older browsers may be unable to display such documents properly.
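To show what the CDATA marks look like, here is a small Python sketch that builds a STYLE element with or without them. The function name style_element is hypothetical, and the comment-guarded form of the marks (/*<![CDATA[*/ ... /*]]>*/) is a common XHTML convention assumed here; the exact marks that DocToHtml emits are not shown in this help page.

```python
def style_element(css: str, use_cdata: bool = True) -> str:
    """Return a <style> element, optionally guarding its content with CDATA marks."""
    if use_cdata:
        # CSS comments hide the marks from HTML-only parsers,
        # while an XML parser still sees a regular CDATA section.
        body = "/*<![CDATA[*/\n" + css + "\n/*]]>*/"
    else:
        body = css
    return '<style type="text/css">\n' + body + "\n</style>"

print(style_element("p > b { color: red; }"))
print(style_element("p > b { color: red; }", use_cdata=False))
```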

Output Encoding lets you specify the codepage for the output (X)HTML and CSS files. You can use any encoding installed on your system, as well as the Unicode pseudo-codepages UTF-8 and UTF-16. However, using UTF-7 is strongly discouraged: this format was developed as a workaround for older e-mail systems that could handle only ASCII characters in the message content, and it is now rarely used. Besides, the word wrapping functionality will not work with this encoding.

According to some surveys, UTF-8 is currently the most used encoding for websites. It is a variable-length encoding scheme that can represent all Unicode code points. A character occupies from 1 byte (for ASCII characters, which are represented by their 1-byte ASCII codes, so pure ASCII text has exactly the same representation in UTF-8 as in ASCII) to 4 bytes (for characters with Unicode code points in the range from U+10000 to U+10FFFF).
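The byte lengths can be checked with a short Python sketch. The sample characters and the helper name utf8_length are chosen here purely for illustration.

```python
# One sample character for each UTF-8 length: 1, 2, 3 and 4 bytes.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_length(ch: str) -> int:
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for ch in SAMPLES:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# Pure ASCII text has the same bytes in UTF-8 as in ASCII.
print("plain text".encode("utf-8") == "plain text".encode("ascii"))
```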

Another popular variable-length encoding scheme is UTF-16. Its character size is 2 bytes for most characters, and 4 bytes (special sequences called “surrogate pairs”) for symbols whose codes do not fit into one 2-byte unit. UTF-16(MSB) means that the Most Significant Byte goes first; this encoding is also known as BE, or Big Endian. UTF-16(LSB) means that the Least Significant Byte goes first; this encoding is also known as LE, or Little Endian. Windows uses LSB, or Little Endian, for the internal representation of all Unicode strings, so it is more efficient to use this option if the document will mostly be used on the Windows platform. To prevent any misinterpretation of the encoding, if UTF-16 is used, DocToHtml always puts the 2-byte BOM (Byte Order Mark) sequence at the beginning of the output file.
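The difference between the two byte orders, and the role of the BOM, can be seen in the following Python sketch. The helper name encode_with_bom is hypothetical; the sketch only mimics the behavior described above and is not DocToHtml's implementation.

```python
import codecs

def encode_with_bom(text: str, little_endian: bool = True) -> bytes:
    """Encode text as UTF-16 in the chosen byte order, prefixed with the BOM."""
    if little_endian:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

text = "A\U0001F600"  # one 2-byte character and one surrogate pair
print(encode_with_bom(text, little_endian=True).hex(" "))
print(encode_with_bom(text, little_endian=False).hex(" "))

# A decoder that honors the BOM recovers the same text from either byte order.
print(encode_with_bom(text, True).decode("utf-16") == text)
print(encode_with_bom(text, False).decode("utf-16") == text)
```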

For multilingual documents, a Unicode encoding is recommended. When a character from the input document cannot be represented in the output encoding, DocToHtml inserts a numeric character reference with its Unicode code point; for example, &#161; (displayed as the ¡ character). Such references, although correctly recognized by all browsers, increase the output document size and reduce the readability of the (X)HTML code.
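The same fallback can be reproduced in Python with the standard "xmlcharrefreplace" error handler. This is only an analogy to the behavior described above; the helper name encode_for_output is hypothetical.

```python
def encode_for_output(text: str, encoding: str) -> bytes:
    """Encode text, replacing unrepresentable characters with &#NNN; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")

# U+00A1 is not an ASCII character, so it becomes a numeric reference.
print(encode_for_output("\u00a1Hola", "ascii"))
# The same character exists in windows-1252, so it is stored as a single byte.
print(encode_for_output("\u00a1Hola", "cp1252"))
```

With a Unicode output encoding such as UTF-8, no character needs a reference, which is why Unicode is the better choice for multilingual documents.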

The word wrapping function relies on whitespace characters to determine potential break positions, and therefore will not work correctly with languages whose text is normally broken without regard to spaces, such as Chinese, Japanese, or Korean.
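The limitation is easy to reproduce with any whitespace-based wrapper. The sketch below uses Python's textwrap module as a stand-in for DocToHtml's own wrapping routine, which is an assumption made for illustration only.

```python
import textwrap

def wrap_on_whitespace(text: str, width: int) -> list:
    """Wrap text at whitespace only, never splitting inside a run of characters."""
    return textwrap.wrap(text, width=width, break_long_words=False)

# English text offers a break position at every space.
print(wrap_on_whitespace("the quick brown fox", 10))

# Chinese text without spaces offers no break position, so it stays on one line.
print(wrap_on_whitespace("这是一个没有空格的句子", 5))
```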

If you do not see a desired encoding in this list, it means that it is not installed on your system. For instructions on how to install a codepage, please read the MSDN article “How to install a code page” on the Microsoft website. DocToHtml reads the list of available codepages only at launch, so remember to restart it after adding a new codepage to your system.

Windows does not provide any way to obtain the standard codepage name, which could be used as a value for the “charset” HTML property or the “encoding” XML property. So DocToHtml has a built-in list of codepages with their corresponding standard names. Currently it consists of 161 items. If your codepage is not on the list, you will see an appropriate warning. Please inform us about that, so that we can add a new codepage–name pair to the list.
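Conceptually, such a built-in list is a lookup table from Windows codepage numbers to standard charset names. The Python sketch below shows the idea with a handful of well-known pairs; the table, the function name charset_name and the warning text are hypothetical and do not reproduce DocToHtml's actual list.

```python
import warnings

# A few well-known Windows codepage identifiers and their standard charset names.
CHARSET_NAMES = {
    932: "shift_jis",
    1251: "windows-1251",
    1252: "windows-1252",
    20866: "koi8-r",
    28591: "iso-8859-1",
    65001: "utf-8",
}

def charset_name(codepage: int):
    """Return the standard name for a codepage, or None (with a warning) if unknown."""
    name = CHARSET_NAMES.get(codepage)
    if name is None:
        warnings.warn(f"No standard charset name known for codepage {codepage}")
    return name

print(charset_name(1252))
print(charset_name(65001))
```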

Use the controls at the bottom of this Tab to set parameters for the <RUBY> tags. “PG” stands for Phonetic Guide, an MS Word term for the small characters placed above the regular text. They can be used, for example, to create Furigana or Pinyin.