[html4all] pronunciation, homophones, and homographs
Charles McCathieNevile
chaals at opera.com
Mon May 26 04:21:03 PDT 2008
On Sun, 25 May 2008 19:26:08 +0200, Robert J Burns <rob at robburns.com>
wrote:
> 1) My first thought is a bit off the topic of HTML5, but I think it's
> important nonetheless. I have long wondered whether Unicode should
> have a separate block of characters just for phonemes (independent of
> the Latin script, the International Phonetic Alphabet, or any
> other phonetic alphabet)....
> Among other things, I think this would represent a major shift in
> mindset to raise awareness of the importance of aurally centered
> character encoding.
Not to put too fine a point on it :)
In principle it sounds like it could be interesting. In practice I am not
sure of the major benefit it offers over using IPA with a custom set of
glyphs - I understand that IPA is kind of messy, but I am somewhat
unconvinced that "voiceless bilabial plosive" is more intuitive than "p" to
*anybody*. There is of course the problem that most languages simply don't
have a good written representation for most sounds. Languages that are
written highly phonetically like Spanish or Japanese don't have a way to
express many sounds ("v" - is that a voiced fricative? - isn't distinct
from "b" for most Spanish speakers, although those who know something of
other languages recognise that it is meant to signify something different
sometimes).
I would suggest that the Unicode Consortium are more likely to take this
on than W3C. After all, W3C just uses Unicode, in general, and I don't
think there is any interest in them defining anything different - where
they have input they tend to take it directly to the Unicode Consortium
themselves.
> 2) Second, I think it's important to provide authors of semantic HTML
> documents the ability to add pronunciation information for abbreviated
> forms, newly introduced terms, unusual terms, and homographs. While
> CSS might be an appropriate place to control the level of chatter and
> verbosity involving pronunciation (i.e., expanded pronunciation of
> abbreviations or short-form pronunciations of abbreviations), the
> semantically relevant pronunciation belongs in the HTML document itself.
The SSML phoneme [1] and sub [2] elements are designed to serve this
purpose. It is also possible to do this using ruby markup (which, I note
from the WHATWG, Ian has added to the spec - I wish he were a bit more
careful about ensuring the W3C working group were kept up to date), and it
is also possible to do it using CSS speech properties, in principle.
One of the problems that this introduces is that pronunciation is often
not that helpful. I have enormous difficulty knowing if Americans are
saying "can" or "can't" due to their habit of pronouncing the latter like
a Norwegian would say "Kant" instead of how one would say "cahnt". And so
on and so forth (Roald Dahl wrote a very nice story about someone's
American Aunt once, playing on this).
[1] http://www.w3.org/TR/speech-synthesis/#edef_phoneme
[2] http://www.w3.org/TR/speech-synthesis/#edef_sub
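As a rough sketch of the two SSML elements cited above (the example words
and the IPA transcription are my own illustrative assumptions, not
something from this thread), markup along these lines carries the
pronunciation in the document itself:

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-GB">
  <!-- phoneme: "ph" holds the pronunciation, "alphabet" names the notation -->
  I <phoneme alphabet="ipa" ph="kɑːnt">can't</phoneme> attend.
  <!-- sub: "alias" is the text to be spoken in place of the element content -->
  <sub alias="World Wide Web Consortium">W3C</sub>
</speak>
```

Neither element is part of HTML, so this only helps where the content is
handed to an SSML-aware speech processor.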
> 3) This approach got me thinking a bit about homographs and their
> pronunciation (and other machine processing) as well...
See also SSML's say-as element [3,4]
> If HTML had such an associated listing of ranked homographs, then this
> same listing could be used for the homophone attribute as well. For
> example:
>
> <abbr type='initialism' expressed-as='word' homophone='oxford:does1'
> >DUZ</abbr>
I don't think HTML is going to include a list of homographs. I am certain
that we are not going to ship one by default, since for our most important
languages this would imply a large amount of extra footprint for a
minuscule amount of gain. Shipping 2000 MathML entities (very short things
that can be compressed easily) is already a hassle that we wish we didn't
have to deal with.
Hoping to get commercially published text given to HTML for free strikes
me as something close to wishful thinking, which rules it out for English.
Many other languages (French, Spanish, Icelandic, and probably
non-European languages, although I am not sure) have an official
definition, so one may be available. Whether the official definition is useful
on the Web is a moot point. Valencianos will point out that they speak a
language, although the Spanish government does not recognise it and the
Library of Congress (which manages a list of languages for ISO) apparently
has insufficient data - they are known to have collected very poor data
for a number of indigenous languages from at least Australia and South
America which clearly have sufficient texts to warrant more detailed
classification than "aboriginal language of [somewhere]".
I would suggest here that you look at the use of Ruby and XHTML+Voice
before going further down the path of inventing stuff - these things have
been dealt with before. (It turns out that the market for them is fairly
restricted, so people don't necessarily know of the solutions that have
already been implemented).
cheers
Chaals
--
Charles McCathieNevile Opera Software, Standards Group
je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals Try Opera 9.5: http://snapshot.opera.com