[html4all] pronunciation, homophones, and homographs

Mon May 26 04:21:03 PDT 2008

On Sun, 25 May 2008 19:26:08 +0200, Robert J Burns <rob at robburns.com>  
wrote:

> 1) My first thought is a bit off the topic of HTML5, but I think it's
> important nonetheless. I have long wondered whether Unicode should
> have a separate block of characters just for phonemes (independent of
> the Latin script, the International Phonetic Alphabet, or any
> other phonetic alphabet)....

> Among other things, I think this would represent a major shift in
> mindset to raise awareness of the importance of aurally centered
> character encoding.

Not to put too fine a point on it :)

In principle it sounds like it could be interesting. In practice I am not  
sure of the major benefit it offers over using IPA with a custom set of  
glyphs - I understand that IPA is kind of messy, but I am somewhat  
unconvinced that "voiceless bilabial plosive" is more intuitive than "p" to
*anybody*. There is of course the problem that most languages simply don't  
have a good written representation for most sounds. Languages that are  
written highly phonetically like Spanish or Japanese don't have a way to  
express many sounds ("v" - is that a voiced fricative? - isn't distinct
from "b" for most Spanish speakers, although those who know something of
other languages recognise that it is meant to signify something different  
sometimes).

I would suggest that the Unicode Consortium is more likely to take this
on than the W3C. After all, W3C just uses Unicode, in general, and I don't
think there is any interest in them defining anything different - where  
they have input they tend to take it directly to the Unicode Consortium  
themselves.

> 2) Second, I think it's important to provide authors of semantic HTML
> documents the ability to add pronunciation information for abbreviated
> forms, newly introduced terms, unusual terms, and homographs. While
> CSS might be an appropriate place to control the level of chatter and
> verbosity involving pronunciation (i.e., expanded pronunciation of
> abbreviations or short-form pronunciations of abbreviations), the
> semantically relevant pronunciation belongs in the HTML document itself.

The SSML phoneme [1] and sub [2] elements are designed to serve this
purpose. It is also possible to do this using ruby markup (which I note
from the WHATWG list that Ian has added to the spec - I wish he were a bit
more careful about keeping the W3C working group up to date), and, in
principle, using CSS speech properties.
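
As a rough sketch of what the two SSML elements look like, the Python below
builds one phoneme and one sub element with the standard library. The helper
names, the IPA string and the expansion text are illustrative assumptions,
not examples taken from the spec.

```python
import xml.etree.ElementTree as ET


def phoneme(text, ipa):
    """Build an SSML <phoneme> element giving an IPA pronunciation for text."""
    el = ET.Element("phoneme", {"alphabet": "ipa", "ph": ipa})
    el.text = text
    return el


def sub(text, alias):
    """Build an SSML <sub> element: 'alias' is spoken in place of the text."""
    el = ET.Element("sub", {"alias": alias})
    el.text = text
    return el


def to_ssml(el):
    """Serialise an element to a string, keeping non-ASCII characters as-is."""
    return ET.tostring(el, encoding="unicode")


# Assumed values: a British-style pronunciation of "can't" and an expansion
# of the abbreviation "W3C".
print(to_ssml(phoneme("can't", "kɑːnt")))
print(to_ssml(sub("W3C", "World Wide Web Consortium")))
```

The resulting fragments would sit inside an SSML speak element; nothing
here says how a given synthesiser actually renders them.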

One of the problems that this introduces is that pronunciation is often  
not that helpful. I have enormous difficulty knowing if Americans are
saying "can" or "can't" due to their habit of pronouncing the latter like  
a Norwegian would say "Kant" instead of how one would say "cahnt". And so  
on and so forth (Roald Dahl wrote a very nice story about someone's  
American Aunt once, playing on this).

[1] http://www.w3.org/TR/speech-synthesis/#edef_phoneme
[2] http://www.w3.org/TR/speech-synthesis/#edef_sub

> 3) This approach got me thinking a bit about homographs and their
> pronunciation (and other machine processing) as well...

See also SSML's say-as element [3,4].
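
For illustration only, a say-as element could be generated as below. The
helper name is hypothetical, and the interpret-as value "characters" is an
assumption on my part: SSML 1.0 itself does not fix the set of values, so
check what your synthesiser supports.

```python
import xml.etree.ElementTree as ET


def say_as(text, interpret_as):
    """Return an SSML <say-as> fragment hinting how the text should be read."""
    el = ET.Element("say-as", {"interpret-as": interpret_as})
    el.text = text
    return ET.tostring(el, encoding="unicode")


# Assumed value: ask for the abbreviation to be spelled out letter by letter.
print(say_as("DUZ", "characters"))
```

This only hints at how the text is to be read; it does not pick between
homographs by itself.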

> If HTML had such an associated listing of ranked homographs, then this
> same listing could be used for the homophone attribute as well. For
> example:
>
> <abbr type='initialism' expressed-as='word' homophone='oxford:does1'
>  >DUZ</abbr>

I don't think HTML is going to include a list of homographs. I am certain  
that we are not going to ship one by default, since for our most important  
languages this would imply a large amount of extra footprint for a  
minuscule amount of gain. Shipping 2000 MathML entities (very short things
that can be compressed easily) is already a hassle that we wish we didn't  
have to deal with.

Hoping to get commercially published text given to HTML for free strikes
me as something close to wishful thinking, which rules it out for English.
Many other languages (French, Spanish, Icelandic, and probably
non-European languages, although I am not sure) have an official
definition, so one may be available. Whether the official definition is useful
on the Web is a moot point. Valencianos will point out that they speak a  
language, although the Spanish government does not recognise it and the  
Library of Congress (which manages a list of languages for ISO) apparently  
has insufficient data - they are known to have collected very poor data  
for a number of indigenous languages from at least Australia and South  
America which clearly have sufficient texts to warrant more detailed  
classification than "aboriginal language of [somewhere]".

I would suggest here that you look at the use of Ruby and XHTML+Voice  
before going further down the path of inventing stuff - these things have  
been dealt with before. (It turns out that the market for them is fairly  
restricted, so people don't necessarily know of the solutions that have  
already been implemented).

cheers

Chaals

-- 
Charles McCathieNevile  Opera Software, Standards Group
     je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals   Try Opera 9.5: http://snapshot.opera.com