[html4all] pronunciation, homophones, and homographs

Sun May 25 10:26:08 PDT 2008

Hello 4all,

[apologies for the long email]

I've been doing some thinking about aural pronunciation of HTML and  
homophones and homographs for some time now. My thinking on this is  
much rawer and less refined than the other issues I've been compiling  
so I thought I'd solicit some direct feedback from this group and  
generate some discussion.

First, I want to address the issue of markup for pronunciation of  
unusual or newly introduced terms and abbreviations. Second, I want to  
address the possibility of markup for distinguishing homographs  
especially when those homographs are pronounced differently. So here  
are some of my thoughts on the topic:

1) My first thought is a bit off the topic of HTML5, but I think it's
important nonetheless. I have long wondered whether Unicode should
have a separate block of characters just for phonemes (independent of
the Latin script, the International Phonetic Alphabet, or any
other phonetic alphabet). Right now, the International Phonetic
Alphabet dominates Unicode. Despite its name, the IPA is a collection
of graphemes, largely from the Latin script, that correspond one-to-one
to each of the phonemes linguists have identified. Not only is the IPA
Latin-centered in terms of script, but it's really centered on the
English language as well (in that the mnemonics employed are
based on the most common English phonetic usage of these letters of
the alphabet).

Instead, I was thinking we might introduce a separate block of phoneme  
characters that were glyph (even grapheme) independent. For example, a
voiceless bilabial plosive is represented in the IPA by a 'p' and in
Unicode as "Latin Small Letter P" (U+0070). What I'm suggesting is that
this phoneme should have its own code assignment separate from
"Latin Small Letter P". Among other things this would facilitate the
development of other truly international phonetic alphabets (where the  
mnemonic glyph representing the phoneme is drawn from the language of  
the authors and readers of the phonetic alphabet). This would  
represent a departure from the usual Unicode practice of having a  
representative glyph for every character. For these phonemes, the  
glyph would depend entirely on the specification of another standard  
(e.g., IPA, Americanist phonetic notation, Hebrew Phonetic Alphabet).

Among other things, I think this would represent a major shift in
mindset to raise awareness of the importance of aurally centered
character encoding. Space is quickly running out in the Basic
Multilingual Plane (BMP), whose characters can be stored in only
16 bits each, so these phoneme characters would probably be assigned
outside it and take up 32 bits of storage per character in UTF-8 or
UTF-16. However, I don't think it's a major issue because these
phoneme characters would not be stored as much nor as often as other
characters.
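
As a rough illustration of the storage point, here is a minimal Python
sketch that compares the encoded size of a BMP character with a
character outside the BMP. Since no phoneme block exists, U+1D400 is
only a stand-in for "some character assigned outside the BMP", and the
helper name encoded_bits is my own invention.

```python
# Compare storage for a BMP character and a supplementary-plane character.
# U+1D400 is only a stand-in: no Unicode phoneme block exists today.

def encoded_bits(ch: str) -> dict:
    """Return how many bits `ch` occupies in common Unicode encodings."""
    return {
        "utf-8": len(ch.encode("utf-8")) * 8,
        "utf-16": len(ch.encode("utf-16-le")) * 8,  # "-le" avoids counting a BOM
        "utf-32": len(ch.encode("utf-32-le")) * 8,
    }

bmp_char = "\u0070"          # LATIN SMALL LETTER P, inside the BMP
non_bmp_char = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A, outside the BMP

print(encoded_bits(bmp_char))
print(encoded_bits(non_bmp_char))
```

The second character needs a surrogate pair in UTF-16 and four bytes in
UTF-8, which is the extra cost I am referring to.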

2) Second, I think it's important to provide authors of semantic HTML
documents the ability to add pronunciation information for abbreviated  
forms, newly introduced terms, unusual terms, and homographs. While  
CSS might be an appropriate place to control the level of chatter and  
verbosity involving pronunciation (i.e., expanded pronunciation of  
abbreviations or short-form pronunciations of abbreviations), the  
semantically relevant pronunciation belongs in the HTML document itself.

a) one way to do this would be to introduce a phoneme attribute —  
especially for the ABBR, DFN, VAR elements (and the proposed PN and  
TERM elements). This attribute would accept IPA or eventually the  
newly introduced Unicode phoneme script characters. This would be a  
very flexible and powerful approach offering authors complete  
facilities to specify pronunciations. However, one drawback is that  
knowledge of IPA and other phonetic alphabets is somewhat specialized  
and many authors may not be equipped to use this approach.

b) to address this, another approach would be to add a homophone  
attribute. This way authors could specify a homophone for a term in a  
way that didn't involve knowledge of a phonetic alphabet. For example:

<abbr type='initialism' expressed-as='word' homophone='sequel'>SQL</abbr>

or

<abbr type='initialism' expressed-as='word' pronounced='sequel'>SQL</abbr>

This permits authors to add rough pronunciations to their documents
without intimate knowledge of IPA or other phonetic alphabets.
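
To make the idea concrete, here is a minimal Python sketch of how an
aural user agent might consume such an attribute. Everything here is an
assumption for illustration: the homophone and pronounced attributes
are only proposals, and the SpokenTextParser class and spoken_text
function are hypothetical names. It uses the standard library's
html.parser module.

```python
from html.parser import HTMLParser


class SpokenTextParser(HTMLParser):
    """Collect text, speaking an ABBR's proposed hint instead of its content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._hint = None  # set while inside an <abbr> that carries a hint

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            attributes = dict(attrs)
            # 'homophone' and 'pronounced' are the proposed (non-standard) names.
            self._hint = attributes.get("homophone") or attributes.get("pronounced")
            if self._hint:
                self.parts.append(self._hint)

    def handle_endtag(self, tag):
        if tag == "abbr":
            self._hint = None

    def handle_data(self, data):
        # Skip the written form when a spoken hint has already been emitted.
        if not self._hint:
            self.parts.append(data)


def spoken_text(markup: str) -> str:
    parser = SpokenTextParser()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


markup = ("We store it in <abbr type='initialism' expressed-as='word' "
          "homophone='sequel'>SQL</abbr> tables.")
print(spoken_text(markup))
```

An ABBR without either attribute simply falls through to its written
content, so existing documents would be read as they are today.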

3) This approach got me thinking a bit about homographs and their  
pronunciation (and other machine processing) as well. I think some  
authors (especially for archival documents or those who really want to  
provide complete aural support) may want to provide further semantic  
encoding of homographs that could facilitate pronunciation and other  
machine processing. So adding to the homophone example in the previous  
thought, I was thinking we could add a homograph attribute to  
distinguish among homographs from various languages. This would be
especially useful for those homographs that are the same part of
speech and those where pronunciation matters (read, lead, does,
wind). I haven't been able to think of pronunciation-dependent
homographs that are also the same part of speech (e.g., both nouns).

My thought for this, then, is to have a homograph or hg attribute that
accepts namespaced values to differentiate homographs. Perhaps  
something like:

<p>Looking at the herd of deer across the prairie, Joe said: I said
<q>how <span hg='oxford:does1'>does</span> that buck do that with
those <span hg='oxford:does2'>does</span></q></p>

This would require identifying and building an inventory of homographs  
that could be identified with this attribute. Perhaps a new initiative
could get a private dictionary to contribute its copyrighted data, or
perhaps it could draw on something from the public domain. My example
supposes that a snapshot of the Oxford American Dictionary's homograph
rankings, as of a certain date, is made available for use as a value
for this attribute. In addition, HTML5 could automatically set the
namespace for the HTML5 scope as xmlns:oxford='<some uri representing  
this snapshot of the Oxford American Dictionary>'. This may be a  
difficult part of the proposal to achieve. It also needs to facilitate
more internationalization than this example suggests. However, the
actual ordering of the homographs does not matter so much as
being able to identify specifically which of the several homographs is
being used at that place in the document.
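
Here is a rough Python sketch of what looking up such a namespaced
value could involve. The HOMOGRAPHS inventory, its keys, the glosses,
and the resolve_homograph function are all hypothetical; no such
published listing exists, and the IPA strings are only my own
illustrative transcriptions.

```python
# Hypothetical inventory keyed by namespace prefix, then by homograph id.
# A real listing would come from whatever dictionary snapshot is agreed on.
HOMOGRAPHS = {
    "oxford": {
        "does1": {"gloss": "verb, third person singular of 'do'", "ipa": "dʌz"},
        "does2": {"gloss": "noun, plural of 'doe'", "ipa": "doʊz"},
    }
}


def resolve_homograph(value: str) -> dict:
    """Split a 'prefix:id' value and return the matching inventory entry."""
    prefix, sep, key = value.partition(":")
    if not sep or not prefix or not key:
        raise ValueError(f"expected 'prefix:id', got {value!r}")
    try:
        return HOMOGRAPHS[prefix][key]
    except KeyError:
        raise LookupError(f"unknown homograph {value!r}") from None


for hg in ("oxford:does1", "oxford:does2"):
    entry = resolve_homograph(hg)
    print(hg, entry["ipa"], entry["gloss"])
```

The point is only that the value identifies one specific entry; how the
entries happen to be numbered is irrelevant to the lookup.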

If HTML had such an associated listing of ranked homographs, then this  
same listing could be used for the homophone attribute as well. For  
example:

<abbr type='initialism' expressed-as='word' homophone='oxford:does1'>DUZ</abbr>

Any thoughts? Also, I'd appreciate any other examples of
pronunciation-dependent homographs (especially where they are the same
part of speech, because that makes machine processing very difficult). This
may be a fairly specialized use case, but it strikes me as something  
worthwhile: especially for HTML documents for archival purposes. In  
addition, I think authoring tools could assist authors in identifying  
key situations where homograph markup would be important to include  
even for casual everyday HTML like blogs and the like.

Take care,
Rob