Parsing tHTML
The goal of tHTML parsing is to make parsing largely independent of the serialization, and the serialization independent of the vocabulary. The serialization handles character encoding, namespace assignment, serialization rules, and parsing rules, while the HTML vocabulary is handled by a separate recommendation.
By evolving the tHTML parsing rules, we can produce a new parsing specification that remains compatible with existing content while making the serialization better able to handle future changes. tHTML parsing currently differs from other HTML parsing options: many existing algorithms make many (often incorrect) assumptions about the content they parse, so future vocabulary changes are difficult to make. For example, adding tree tables to traditionally parsed HTML may not be possible without breaking existing content.
The HTML Document Stack
| Layer | Description | HTML4All Initiatives |
| --- | --- | --- |
| Presentation | Presentation is left to CSS, as the reference presentational layer, as much as possible. In other words, the default presentation of HTML documents is almost entirely expressible in CSS (HTML4All will separately propose enhancements to provide all default HTML presentational needs through CSS). | CSS Enhancements |
| HTML Infoset | Defines the set of object graphs representable by the HTML vocabulary, along with the document object model interfaces that allow dynamic mutation of an object graph and the events fired when an interactive UA processes an HTML document. This draft also includes HTML Data Types, HTML QNameTypes, HTML QName Vocabulary, and HTML Events. | HTML DOM and Namespace enhancements |
| Parsing / Serializing | Algorithms that define how to encode an HTML Object Graph into a specific serialization and decode a serialization into an HTML Object Graph | tHTML parsing (this draft) |
| Serialization | cHTML (Canonical HTML), XML, EXI, HTML 4.01 | cHTML |
tHTML parsing should be aware of character encoding and namespace attributes in particular, for example:
- the charset attribute on the meta element
- a charset attribute added to the root element for future improvements, so that authors can eventually place the charset declaration in a more suitable place
- a hard-coded xmlns namespace: tHTML processing applications should behave as if the root element carried the attribute xmlns:xmlns="http://www.w3.org/2000/xmlns/", as a way to bootstrap namespaces
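The three points above can be illustrated with a minimal Python sketch. It is not part of this draft: the function names, the 1024-byte pre-scan window, the UTF-8 fallback, and the choice to let a root-element charset take precedence over the meta element are all assumptions made for illustration.

```python
import re

# Namespace bound to the "xmlns" prefix, taken from the text above.
XMLNS_NAMESPACE = "http://www.w3.org/2000/xmlns/"

# Hypothetical patterns: a charset declaration inside the root start tag,
# and one inside a meta start tag (this also catches the legacy
# content="text/html; charset=..." form).
ROOT_CHARSET = re.compile(
    rb'<html\b[^>]*?\bcharset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.I)
META_CHARSET = re.compile(
    rb'<meta\b[^>]*?\bcharset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.I)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the declared encoding label, lowercased, or the default.

    Assumption: the root element is checked before the meta element.
    """
    head = raw[:1024]  # assumed pre-scan window
    for pattern in (ROOT_CHARSET, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return default


def initial_namespace_bindings() -> dict:
    """Bindings a processor starts with, as if the root element carried
    xmlns:xmlns="http://www.w3.org/2000/xmlns/"."""
    return {"xmlns": XMLNS_NAMESPACE}
```

A processor built this way would call sniff_charset on the raw bytes before decoding, and would seed its namespace scope from initial_namespace_bindings before reading any attributes.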
While a tHTML-compatible serialization is not necessarily the same as SGML, it has its roots in SGML.
- SGML
  - tHTML (SGML with hard-coded tags, elements and content models, and elaborate element and tag relocation algorithms for error recovery)
  - XML
    - XIF (the infoset/object graph of XML but with a compressed binary serialized representation)
tHTML is therefore, by definition, without a DTD, whereas its SGML counterpart expects a DTD declaration, as XML's does. Among DTDs written to match tHTML, this XML DTD declaration is much closer to accurate than the DTDs used by the W3C for its validator (e.g., using this declaration, the validator will not mistakenly treat the "/" as a NET-enabling start-tag, which no tHTML implementation does and which most, if not all, HTML authors do not expect from their tHTML serializations).
The parsing algorithm should handle the idiosyncrasies of existing (often errant) content while also being forward compatible (especially considering tHTML's inability to use schema definitions).
This algorithm shares certain traits with current browser parsers and the HTML5 parsing algorithm. For example, this algorithm will:
- parse existing elements the way existing tHTML parsers already do
In comparison to the limited HTML5 parsing algorithm, this algorithm:
- allows newly introduced elements to use the self-closing void syntax ("/>") in opening tags
- allows any unknown elements to be appended to the document head
- allows foreign namespaces to be processed in a manner similar to Namespaces in XML.
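The self-closing syntax and the namespace handling listed above can be sketched in Python as follows. This is an illustration, not the algorithm of this draft: parse_start_tag, its regular expressions, the returned dictionary, and the use of http://www.w3.org/1999/xhtml as the default namespace for unprefixed elements are all assumptions. The sketch only reports whether "/>" was used; it does not decide which elements count as newly introduced, and it performs no error recovery.

```python
import re

# Assumed default namespace for unprefixed elements (not stated in the draft).
HTML_NAMESPACE = "http://www.w3.org/1999/xhtml"

# One start tag: qualified name, double-quoted or valueless attributes,
# and an optional "/" before the closing ">".
START_TAG = re.compile(
    r'<([A-Za-z][\w.-]*(?::[\w.-]+)?)'
    r'((?:\s+[^\s/>=]+(?:\s*=\s*"[^"]*")?)*)'
    r'\s*(/?)>')
ATTRIBUTE = re.compile(r'([^\s/>=]+)(?:\s*=\s*"([^"]*)")?')


def parse_start_tag(tag: str, bindings: dict) -> dict:
    """Resolve one start tag against inherited prefix bindings,
    in a manner similar to Namespaces in XML."""
    match = START_TAG.fullmatch(tag)
    if match is None:
        raise ValueError("not a start tag this sketch understands: " + tag)
    qname, attr_text, slash = match.groups()

    # Declarations on this element extend a copy of the inherited scope.
    scope = dict(bindings)
    for name, value in ATTRIBUTE.findall(attr_text):
        if name == "xmlns":
            scope[""] = value
        elif name.startswith("xmlns:"):
            scope[name[len("xmlns:"):]] = value

    prefix, _, local_name = qname.rpartition(":")
    if prefix:
        namespace = scope.get(prefix)  # None when the prefix is undeclared
    else:
        namespace = scope.get("", HTML_NAMESPACE)

    return {
        "namespace": namespace,
        "local_name": local_name,
        "self_closing": slash == "/",
        "scope": scope,
    }
```

For instance, passing '<svg:circle xmlns:svg="http://www.w3.org/2000/svg" r="4"/>' with empty bindings yields the SVG namespace, the local name circle, and a set self_closing flag; the returned scope can then be handed to child elements so that they inherit the svg prefix.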
| | HTML 4.1 | HTML 4.01 | XHTML1 |
| --- | --- | --- | --- |
| tHTML | ✓ | ✓ | |
| SGML | ✓ | ✓ | |
| XML | ✓ | ✓ | |
| EXI | ✓ | ✓ | |
Note: for tHTML, SGML, and XML, the canonical HTML source serialization is identical for HTML 4.1, so documents need not be transformed for use in one parser or another.
The HTML4All tHTML parsing algorithm

