[html4all] pronunciation, homophones, and homographs
Robert J Burns
rob at robburns.com
Sun May 25 10:26:08 PDT 2008
Hello 4all,
[apologies for the long email]
I've been doing some thinking about aural pronunciation of HTML and
homophones and homographs for some time now. My thinking on this is
much rawer and less refined than the other issues I've been compiling,
so I thought I'd solicit some direct feedback from this group and
generate some discussion.
First, I want to address the issue of markup for pronunciation of
unusual or newly introduced terms and abbreviations. Second, I want to
address the possibility of markup for distinguishing homographs,
especially when those homographs are pronounced differently. So here
are some of my thoughts on the topic:
1) My first thought is a bit off the topic of HTML5, but I think it's
important nonetheless. I have long wondered whether Unicode should
have a separate block of characters just for phonemes (independent of
the Latin script, the International Phonetic Alphabet, or any
other phonetic alphabet). Right now, the International Phonetic
Alphabet dominates phonetic notation in Unicode. Despite its name,
the IPA is a collection of graphemes, largely from the Latin script,
that correspond one-to-one to the phonemes identified by linguists.
Not only is the IPA Latin-centered in terms of script, but it's really
an English-language-centered phonetic alphabet as well (in that the
mnemonics employed are based on the most common English phonetic usage
of these letters of the alphabet).
Instead, I was thinking we might introduce a separate block of phoneme
characters that were glyph (even grapheme) independent. For example, a
voiceless bilabial plosive is represented in the IPA by a 'p' and in
Unicode as the "Latin Small Letter P" (U+0070). What I'm suggesting is
that this phoneme should have its own code assignment separate from a
"Latin Small Letter P". Among other things, this would facilitate the
development of other truly international phonetic alphabets (where the
mnemonic glyph representing the phoneme is drawn from the language of
the authors and readers of the phonetic alphabet). This would
represent a departure from the usual Unicode practice of having a
representative glyph for every character. For these phonemes, the
glyph would depend entirely on the specification of another standard
(e.g., IPA, Americanist phonetic notation, Hebrew Phonetic Alphabet).
Among other things, I think this would represent a major shift in
mindset to raise awareness of the importance of aurally centered
character encoding. Space is quickly running out in the Basic
Multilingual Plane (BMP), the only plane whose characters can be
stored in 16 bits each, so these characters would probably be assigned
outside the BMP and take up 32 bits of storage per character in UTF-16
or UTF-8. However, I don't think it's a major issue, because these
phoneme characters would not be stored as much nor as often as other
characters.
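To make the storage point concrete, here is a small Python sketch that measures how many bits a single character occupies in the common Unicode encodings. No phoneme block exists, so an existing supplementary-plane character (U+1D11E) stands in for a hypothetical phoneme character; the function name storage_bits is my own.

```python
def storage_bits(ch):
    """Return the storage size in bits of one character per encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: 8 * len(ch.encode(enc)) for enc in encodings}

latin_p = "\u0070"       # LATIN SMALL LETTER P (BMP, ASCII range)
ipa_esh = "\u0283"       # LATIN SMALL LETTER ESH (BMP, IPA Extensions)
stand_in = "\U0001D11E"  # supplementary-plane stand-in for a phoneme character

for ch in (latin_p, ipa_esh, stand_in):
    print(f"U+{ord(ch):04X}", storage_bits(ch))
```

In UTF-16, the BMP characters fit in 16 bits while the supplementary-plane character needs a 32-bit surrogate pair.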
2) Second, I think it's important to provide authors of semantic HTML
documents the ability to add pronunciation information for abbreviated
forms, newly introduced terms, unusual terms, and homographs. While
CSS might be an appropriate place to control the level of chatter and
verbosity involving pronunciation (i.e., expanded pronunciation of
abbreviations or short-form pronunciations of abbreviations), the
semantically relevant pronunciation belongs in the HTML document itself.
a) One way to do this would be to introduce a phoneme attribute,
especially for the ABBR, DFN, and VAR elements (and the proposed PN and
TERM elements). This attribute would accept IPA or, eventually, the
newly introduced Unicode phoneme script characters. This would be a
very flexible and powerful approach offering authors complete
facilities to specify pronunciations. However, one drawback is that
knowledge of IPA and other phonetic alphabets is somewhat specialized,
and many authors may not be equipped to use this approach.
b) to address this, another approach would be to add a homophone
attribute. This way authors could specify a homophone for a term in a
way that didn't involve knowledge of a phonetic alphabet. For example:
<abbr type='initialism' expressed-as='word' homophone='sequel'>SQL</abbr>
or
<abbr type='initialism' expressed-as='word' pronounced='sequel'>SQL</abbr>
This permits authors to add rough pronunciations to their documents
without intimate knowledge of IPA or other phonetic alphabets.
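As a rough sketch of how a speech-oriented user agent might consume such an attribute, the Python below substitutes the author-supplied homophone for the content of an ABBR element when building the text to be spoken. The homophone and pronounced attributes are only the proposal above, not part of any HTML specification, and the class and function names are my own.

```python
from html.parser import HTMLParser

class SpokenTextParser(HTMLParser):
    """Collects text, replacing <abbr> content with a supplied homophone."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False  # True while inside an <abbr> that has an override

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            values = dict(attrs)
            # Both attribute names are hypothetical (from the proposal).
            override = values.get("homophone") or values.get("pronounced")
            if override:
                self.parts.append(override)
                self._skip = True

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "abbr":
            self._skip = False

def spoken_text(markup):
    parser = SpokenTextParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

sample = ("Query it with <abbr type='initialism' expressed-as='word' "
          "homophone='sequel'>SQL</abbr> today.")
print(spoken_text(sample))  # prints: Query it with sequel today.
```

An ABBR element without either attribute simply falls through to its written content.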
3) This approach got me thinking a bit about homographs and their
pronunciation (and other machine processing) as well. I think some
authors (especially for archival documents or those who really want to
provide complete aural support) may want to provide further semantic
encoding of homographs that could facilitate pronunciation and other
machine processing. So adding to the homophone example in the previous
thought, I was thinking we could add a homograph attribute to
distinguish among homographs from various languages. This would be
especially useful for those homographs that were the same part of
speech and those where pronunciation mattered (read, lead, does,
wind). I haven't been able to think of pronunciation-dependent
homographs that are also the same part of speech (e.g., both nouns).
My thought for this, then, is to have a homograph or hg attribute that
accepts namespaced values to differentiate homographs. Perhaps
something like:
<p>Looking at the herd of deer across the prairie, Joe said:
<q>How <span hg='oxford:does1'>does</span> that buck do that with
those <span hg='oxford:does2'>does</span>?</q></p>
This would require identifying and building an inventory of homographs
that could be identified with this attribute. Perhaps a new initiative
could get a private dictionary publisher to contribute its copyrighted
data, or perhaps use something from the public domain. My example
supposes that a snapshot of the Oxford American Dictionary’s homograph
rankings, as of a certain date, is made available for use as a value
for this attribute. In addition, HTML5 could automatically set the
namespace for the HTML5 scope as xmlns:oxford='<some uri representing
this snapshot of the Oxford American Dictionary>'. This may be a
difficult part of the proposal to achieve. It also needs to support
more internationalization than this example suggests. However, the
actual ordering of the homographs matters less than being able to
identify specifically which of the several homographs is being used
at that place in the document.
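To illustrate how a processor might resolve such values, here is a minimal Python sketch that splits an hg value into its dictionary prefix and homograph identifier and looks the pair up in an inventory. The INVENTORY table, its IPA entries, and the function names are hypothetical stand-ins of my own; no such dictionary snapshot exists yet.

```python
from html.parser import HTMLParser

# Hypothetical inventory: (dictionary prefix, homograph id) -> IPA.
INVENTORY = {
    ("oxford", "does1"): "dʌz",   # verb, form of "do"
    ("oxford", "does2"): "doʊz",  # noun, plural of "doe"
}

def parse_hg(value):
    """Split 'prefix:id' into its two parts, rejecting malformed values."""
    prefix, sep, ident = value.partition(":")
    if not (sep and prefix and ident):
        raise ValueError(f"malformed hg value: {value!r}")
    return prefix, ident

class HomographCollector(HTMLParser):
    """Records every hg attribute value in document order."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        hg = dict(attrs).get("hg")
        if hg:
            self.found.append(parse_hg(hg))

def pronunciations(markup):
    collector = HomographCollector()
    collector.feed(markup)
    collector.close()
    # Unknown homographs map to None rather than raising.
    return [INVENTORY.get(key) for key in collector.found]

sample = ("<q>How <span hg='oxford:does1'>does</span> that buck do that "
          "with those <span hg='oxford:does2'>does</span>?</q>")
print(pronunciations(sample))
```

The lookup only needs each identifier to be stable, which matches the point that the ordering of the homographs matters less than being able to name one unambiguously.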
If HTML had such an associated listing of ranked homographs, then this
same listing could be used for the homophone attribute as well. For
example:
<abbr type='initialism' expressed-as='word' homophone='oxford:does1'>DUZ</abbr>
Any thoughts? Also, I'd appreciate any other examples of
pronunciation-dependent homographs (especially where they are the same
part of speech, because that makes machine processing very difficult). This
may be a fairly specialized use case, but it strikes me as something
worthwhile: especially for HTML documents for archival purposes. In
addition, I think authoring tools could assist authors in identifying
key situations where homograph markup would be important to include
even for casual everyday HTML like blogs and the like.
Take care,
Rob