XML & Charset & PG Tab
Make XHTML code HTML-compatible—this
option enables certain special actions to make the generated XHTML
code backward-compatible with the browsers that understand only
HTML markup. Those actions include inserting a space just before
/> in single tags; duplicating the
"id" attribute with "name" having the same value; and specifying the
document encoding not only in the XML declaration at the
beginning but also in the HTML META tag in the HEAD section. It is
advisable to use this compatibility mode at all times. This
checkbox is duplicated on the General Tab.
Do not insert XML declaration at the
beginning—this option suppresses the insertion of the
<?xml version="1.0" encoding="windows-1252"?> declaration (the actual
encoding may differ). Normally the XML declaration is recommended by
the standard, but some old HTML-only browsers may have a problem
processing such documents. For example, Internet Explorer 6 will
use the quirks mode when encountering such a declaration. So if
compatibility with older browsers is required, just omit the
declaration by enabling this option.
Don't enclose content of <STYLE>,
<SCRIPT> tags into CDATA marks—this option suppresses
the insertion of CDATA marks before and after the content of the
STYLE and SCRIPT tags. CDATA marks are normally required to let an
XML parser handle the <, > and & characters within included
styles and scripts. But it is better not to use CDATA marks in HTML
documents at all, otherwise some older browsers may be unable to
display them properly.
Encoding is an option that lets you specify the codepage for
the output (X)HTML and CSS files. You can use any encoding
installed on your system, as well as use the Unicode
pseudo-codepages UTF-8 and UTF-16. However, using UTF-7 is strongly
discouraged: this format was developed as a workaround for older
e-mail systems that could handle only 7-bit ASCII characters in
message content, and it is now rarely used. Besides, the word
wrapping functionality will not work with this encoding.
According to some surveys, currently the most used encoding for
websites is UTF-8. It is a variable-character-length encoding
scheme that can represent all Unicode code points. The character
size is from 1 byte (for ASCII characters, which are represented by
their 1-byte ASCII codes; pure ASCII text encoded in UTF-8 has
exactly the same representation as when encoded in ASCII) to 4
bytes (for characters with Unicode code points in the range from
U+10000 to U+10FFFF).
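
The byte lengths described above can be checked with a short Python sketch. It uses Python's built-in UTF-8 codec, not DocToHtml itself; the sample characters and the helper name utf8_length are chosen here purely for illustration.

```python
# Sample characters, one for each possible UTF-8 length (illustrative choice).
samples = {
    "A": "U+0041, ASCII",
    "\u00a1": "U+00A1, inverted exclamation mark",
    "\u20ac": "U+20AC, euro sign",
    "\U0001f600": "U+1F600, outside the Basic Multilingual Plane",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, note in samples.items():
    print(f"{note}: {utf8_length(ch)} byte(s)")

# Pure ASCII text produces identical bytes in ASCII and in UTF-8.
assert "plain text".encode("utf-8") == "plain text".encode("ascii")
```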
Another popular variable-character-length encoding scheme is
UTF-16. Its character size is 2 bytes for most characters, and 4
bytes (special sequences called “surrogate pairs”) for symbols
whose codes do not fit into one 2-byte character. UTF-16(MSB) means
that the Most Significant Byte goes first; this encoding is also
known as BE, or Big Endian. UTF-16(LSB) means that the Least
Significant Byte goes first; this encoding is also known as LE, or
Little Endian. Windows uses LSB, or Little Endian, for the internal
representation of all Unicode strings. So it is more efficient to
use this option if the document will mostly be used on the Windows
platform. To prevent misinterpretation of the encoding, if UTF-16
is used, DocToHtml always puts the BOM (Byte Order Mark) 2-byte
sequence at the beginning of the output file.
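
To make the byte order and the BOM concrete, here is a minimal Python sketch. It relies on Python's standard codecs only; the helper with_bom_le is a hypothetical name introduced for this example and says nothing about how DocToHtml writes its files internally.

```python
import codecs

text = "A"  # U+0041

big_endian = text.encode("utf-16-be")     # most significant byte first
little_endian = text.encode("utf-16-le")  # least significant byte first

print(big_endian.hex())     # 0041
print(little_endian.hex())  # 4100

# A character outside the Basic Multilingual Plane needs a surrogate pair.
emoji = "\U0001f600"
print(len(emoji.encode("utf-16-le")))  # 4


def with_bom_le(s: str) -> bytes:
    """Encode as UTF-16 little endian, prefixed with the 2-byte BOM."""
    return codecs.BOM_UTF16_LE + s.encode("utf-16-le")


print(with_bom_le(text).hex())  # fffe4100
```

A reader that honours the BOM can recover the text without being told the byte order: in Python, with_bom_le("A").decode("utf-16") returns "A".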
For multilingual documents, it is recommended to use a Unicode
encoding. When a character from the input document is not
representable in the output encoding, DocToHtml will insert a
numeric character reference with its Unicode code; for example,
&#161; (the ¡ character will be displayed). Such entities,
although correctly recognized by all browsers, add to the output
document size and decrease the readability of the (X)HTML code.
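
The same fallback can be reproduced with Python's standard "xmlcharrefreplace" error handler. This is only an illustration of the technique, under the assumption that decimal references are used; the function name encode_with_entities is hypothetical and is not part of DocToHtml.

```python
def encode_with_entities(text: str, encoding: str) -> bytes:
    """Encode text; unrepresentable characters become &#NNN; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# ASCII cannot represent U+00A1, so a numeric reference is emitted.
print(encode_with_entities("\u00a1Hola!", "ascii"))         # b'&#161;Hola!'

# windows-1252 can represent it directly as the single byte 0xA1.
print(encode_with_entities("\u00a1Hola!", "windows-1252"))  # b'\xa1Hola!'
```

The first result is five bytes longer than the second, which is the size cost mentioned above.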
The word wrapping function
relies on whitespace characters to determine potential break
positions, and therefore will not work correctly with CJK
languages (Chinese, Japanese, or Korean) when the text is written
without spaces between words.
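
The limitation is typical of any whitespace-based wrapper. The sketch below uses Python's textwrap module as a stand-in (it is not DocToHtml's algorithm) with long-word breaking disabled, so that lines are split only at whitespace; the sample strings are arbitrary.

```python
import textwrap

english = "The quick brown fox"
chinese = "这是一个没有空格的很长的句子"  # a sentence with no whitespace

# Text with spaces is split at the whitespace positions.
wrapped_english = textwrap.wrap(english, width=10, break_long_words=False)
print(wrapped_english)  # ['The quick', 'brown fox']

# Without whitespace there is no break position, so the line stays whole
# even though it is longer than the requested width.
wrapped_chinese = textwrap.wrap(chinese, width=5, break_long_words=False)
print(len(wrapped_chinese))  # 1
```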
If you do not see a desired encoding in this list, it
means that it is not installed on your system. For instructions on
how to install a codepage, please read the MSDN article “How to install a code
page” on the Microsoft website. DocToHtml reads the list of
available codepages only at launch, so remember to restart it after
adding a new codepage to your system.
Windows does not provide any way to obtain the standard codepage
name, which could be used as a value for the “charset” HTML
property or the “encoding” XML property. So DocToHtml has a
built-in list of codepages with their corresponding standard names.
Currently it consists of 161 items. If your codepage is not on the
list, you will see an appropriate warning. Please inform us about
that, so that we can add a new codepage–name pair to the list.
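
Such a built-in list can be pictured as a simple lookup table. The Python sketch below is an assumption-laden illustration: the table name, the function name, and the handful of entries are chosen here for the example (the entries pair Windows codepage numbers with their registered charset names), and they do not reproduce DocToHtml's actual list.

```python
from typing import Optional

# Hypothetical excerpt of a codepage-to-name table; the real list used by
# DocToHtml is not shown in this documentation.
CHARSET_NAMES = {
    932: "shift_jis",
    1250: "windows-1250",
    1251: "windows-1251",
    1252: "windows-1252",
    28591: "iso-8859-1",
    65001: "utf-8",
}


def charset_name(codepage: int) -> Optional[str]:
    """Return the standard charset name for a codepage, or None if unknown."""
    return CHARSET_NAMES.get(codepage)


for cp in (1252, 65001, 99999):
    name = charset_name(cp)
    if name is None:
        print(f"Warning: no standard name known for codepage {cp}")
    else:
        print(f"Codepage {cp}: {name}")
```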
Use the controls at the bottom of this Tab to set parameters for the
<RUBY> tags. “PG” stands for
Phonetic Guide, an MS Word term to denote small
characters above the regular text. They can be used, for example,
to create Furigana or Pinyin.