[html4all] pronunciation, homophones, and homographs

Mon May 26 09:00:38 PDT 2008

Hi Charles,

Thanks for the feedback.

On May 26, 2008, at 11:21 AM, Charles McCathieNevile wrote:

> On Sun, 25 May 2008 19:26:08 +0200, Robert J Burns <rob at robburns.com>
> wrote:
>
>> 1) My first thought is a bit off the topic of HTML5, but I think it's
>> important nonetheless. I have long wondered whether Unicode should
>> have a separate block of characters just for phonemes (independent of
>> the Latin script, the International Phonetic Alphabet, or any
>> other phonetic alphabet)....
>
>> Among other things, I think this would represent a major shift in
>> mindset to raise awareness of the importance of aurally centered
>> character encoding.
>
> Not to put too fine a point on it :)
>
> In principle it sounds like it could be interesting. In practice I am
> not sure of the major benefit it offers over using IPA with a custom
> set of glyphs - I understand that IPA is kind of messy, but I am
> somewhat unconvinced that "voiceless bilabial plosive" is more
> intuitive than "p" to *anybody*.

That is precisely the reason for the proposal. It is intended to
facilitate phonetic alphabets (or, more broadly, phonetic writing
systems) where the user and author need neither an academic’s
knowledge of phonetics and phonemes nor a strong knowledge of English
or other Latin-script languages (which are the only ones where “p”
will have any intuitive association with a phoneme).

> There is of course the problem that most languages simply don't
> have a good written representation for most sounds. Languages that are
> written highly phonetically like Spanish or Japanese don't have a way
> to express many sounds ("v" - is that a voiced fricative? - isn't
> distinct from "b" for most Spanish speakers, although those who know
> something of other languages recognise that it is meant to signify
> something different sometimes).

Actually, it is not true that most languages don't have a good written
representation of most sounds. Most non-ideographic writing systems
are phonetically based (most of them more so than English, which is
such an agglomeration that its phonetic associations are blurred).
Rather, each language differs in which phonemes it uses in speech and
represents in writing. There are phonemes that English, too, does not
use, and the phonetic alphabet therefore turns to specialized
graphemes to represent them. The same is true for Spanish, which would
not use "v" and "b" in the way that the English-centric IPA uses them.
Again, the aim is to facilitate internationalization/localization of
phoneme characters and the use of Unicode for a plurality of phonetic
alphabets.

Keep in mind also that, though this is a significant change for
Unicode (encoding phonemes rather than graphemes), it is also quite
consistent with Unicode's project on the whole, in which Unicode tries
to provide implementation developers with algorithms, data
repositories, and methods to enhance
localization/internationalization. By using one set of phonemes for
all phonetic writing systems (meaning those that share the same
phoneme classification, e.g., not Kana), interchange of such text data
is greatly enhanced. A system can easily draw glyphs from the
appropriate font given one user’s preferences, while another user sees
glyphs drawn from a different phonetic writing system for the same
text document (likewise for input systems).
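
To make the idea a little more concrete, here is a minimal sketch in
Python of one stored phoneme sequence being rendered with different
glyph sets according to a reader's preference. Everything in it is an
assumption for illustration only: the phoneme identifiers, the two
glyph tables, and the render function are hypothetical, and no such
Unicode block exists today.

```python
# Hypothetical illustration: the phoneme identifiers and glyph tables
# below are invented for this sketch; they are not real Unicode data.

# One abstract phoneme sequence, stored independently of any script.
TEXT = ["voiceless_bilabial_plosive", "open_front_vowel", "alveolar_nasal"]

# Each phonetic writing system maps the shared phoneme identifiers
# to its own glyphs (the role a font would play in the proposal).
GLYPH_TABLES = {
    "ipa": {
        "voiceless_bilabial_plosive": "p",
        "open_front_vowel": "a",
        "alveolar_nasal": "n",
    },
    # An imagined phonetic alphabet that borrows Cyrillic letter shapes.
    "cyrillic_style": {
        "voiceless_bilabial_plosive": "п",
        "open_front_vowel": "а",
        "alveolar_nasal": "н",
    },
}


def render(phonemes, writing_system):
    """Draw the same phoneme sequence with the reader's preferred glyphs."""
    table = GLYPH_TABLES[writing_system]
    return "".join(table[phoneme] for phoneme in phonemes)


if __name__ == "__main__":
    # The stored text is identical; only the glyph table differs.
    print(render(TEXT, "ipa"))
    print(render(TEXT, "cyrillic_style"))
```

The point of the sketch is only that the interchanged data never
changes; the choice of glyphs is left to each reader's system.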

> I would suggest that the Unicode consortium are more likely to take
> this on than W3C. After all, W3C just uses Unicode, in general, and I
> don't think there is any interest in them defining anything
> different - where they have input they tend to take it directly to
> the Unicode Consortium themselves.

Certainly. As I said, this first thought (1) was not a topic for the
HTML WG, but rather related to the other thoughts (2 and 3). Still, a
liaison from the HTML WG would certainly lend significant weight to
the idea.

>> 2) Second, I think it's important to provide authors of semantic HTML
>> documents the ability to add pronunciation information for
>> abbreviated forms, newly introduced terms, unusual terms, and
>> homographs. While CSS might be an appropriate place to control the
>> level of chatter and verbosity involving pronunciation (i.e.,
>> expanded pronunciation of abbreviations or short-form pronunciations
>> of abbreviations), the semantically relevant pronunciation belongs in
>> the HTML document itself.
>
> The SSML phoneme [1] and sub [2] elements are designed to serve this
> purpose. It is also possible to do this using ruby markup (which I
> note on the what-wg that Ian has added to the spec - I wish he were a
> bit more careful about ensuring the W3C working group were kept up to
> date), and it is also possible to do it using CSS speech properties,
> in principle.
>
> One of the problems that this introduces is that pronunciation is
> often not that helpful. I have enormous difficulty knowing if
> Americans are saying "can" or "can't" due to their habit of
> pronouncing the latter like a Norwegian would say "Kant" instead of
> how one would say "cahnt". And so on and so forth (Roald Dahl wrote a
> very nice story about someone's American Aunt once, playing on this).
>
> [1] http://www.w3.org/TR/speech-synthesis/#edef_phoneme
> [2] http://www.w3.org/TR/speech-synthesis/#edef_sub

There would certainly be some issues here with
localization/internationalization. However, this part of the proposal
was focused much more on accessibility and machine processing than on
internationalization, unlike my Unicode suggestion. I'm sorry for
causing that confusion.

>> 3) This approach got me thinking a bit about homographs and their
>> pronunciation (and other machine processing) as well...
>
> See also SSML's say-as attribute [3,4]

Yes, I think that's a good example. Perhaps “say-as” would be a more  
appropriate attribute name for an HTML attribute as well.

>
>
>> If HTML had such an associated listing of ranked homographs, then
>> this same listing could be used for the homophone attribute as well.
>> For example:
>>
>> <abbr type='initialism' expressed-as='word'
>>       homophone='oxford:does1'>DUZ</abbr>
>
> I don't think HTML is going to include a list of homographs. I am
> certain that we are not going to ship one by default, since for our
> most important languages this would imply a large amount of extra
> footprint for a minuscule amount of gain. Shipping 2000 MathML
> entities (very short things that can be compressed easily) is already
> a hassle that we wish we didn't have to deal with.

Yes, I understand. I didn't mean to say that HTML5 would compile the
list of homographs or that browsing UAs would need to bundle a
dictionary of homographs. These listings are meant simply for authors
and authoring tools, and for users of applications that perform
machine processing of semantics (including text-to-speech
applications).
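
As a rough sketch of how an authoring tool or text-to-speech
application might consume such a value, the Python below assumes the
"dictionary:entry" form of the homophone='oxford:does1' example quoted
above. The lookup table, its respellings, the "does2" entry, and both
function names are hypothetical; no real dictionary data or API is
implied.

```python
# Hypothetical sketch: the table contents and function names are
# invented for illustration and do not come from any real dictionary.

# Ranked homograph entries keyed by (dictionary, entry id).
HOMOGRAPHS = {
    ("oxford", "does1"): "duz",   # verb form, as in the DUZ example
    ("oxford", "does2"): "dohz",  # assumed second entry: plural of "doe"
}


def parse_reference(value):
    """Split a 'dictionary:entry' attribute value into its two parts."""
    dictionary, separator, entry = value.partition(":")
    if not separator or not dictionary or not entry:
        raise ValueError(f"expected 'dictionary:entry', got {value!r}")
    return dictionary, entry


def pronunciation_for(value):
    """Return the pronunciation hint, or None if the entry is unknown."""
    return HOMOGRAPHS.get(parse_reference(value))


if __name__ == "__main__":
    print(pronunciation_for("oxford:does1"))
```

A general browsing UA would never need to carry such a table; only a
tool that chooses to resolve the references would.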

> Hoping to get commercially published text given to HTML for free
> strikes me as something close to wishful thinking, which rules it out
> for English.

Yeah, I probably shouldn't have used those words. If I had a team of
lawyers behind me, they'd be scolding me right now for even saying
that. What I meant was that a published dictionary could be used by
way of reference, frozen at a specific edition. If I were the
publisher of a dictionary, it would be my wildest dream to have the
international format for writing documents reference my homographs in
its attribute values.

>
> Many other languages (French, Spanish, Icelandic, and probably
> non-European languages, although I am not sure) have an official
> definition, so may be available. Whether the official definition is
> useful on the Web is a moot point. Valencianos will point out that
> they speak a language, although the Spanish government does not
> recognise it and the Library of Congress (which manages a list of
> languages for ISO) apparently has insufficient data - they are known
> to have collected very poor data for a number of indigenous languages
> from at least Australia and South America which clearly have
> sufficient texts to warrant more detailed classification than
> "aboriginal language of [somewhere]".

I recently read a quote, usually attributed to the linguist Max
Weinreich, that I found profound. It went something like "The
difference between a language and a dialect is that the language has a
navy behind it and the dialect does not." :-) It's not important for
every last dialect to have a codified rank designation for homographs,
but where distinctions exist, the more dialects that have one the
better.

> I would suggest here that you look at the use of Ruby and XHTML+Voice
> before going further down the path of inventing stuff - these things
> have been dealt with before. (It turns out that the market for them
> is fairly restricted, so people don't necessarily know of the
> solutions that have already been implemented).

Thanks for those suggestions. However, Ruby is definitely not what I'm
thinking about with these thoughts. XHTML+Voice also appears much more
heavyweight than what I was looking for. Again, the suggestions I'm
making (or the thoughts I'm having) relate to a way to facilitate
differentiating homographs in HTML for machine processing and
pronunciation purposes. The attributes I'm proposing would likely not
often have significant visual rendering (unlike Ruby) nor even a major
role in text-to-speech (as in X+V) or audible web applications.
Instead, these suggestions simply provide a few easy facilities for
authors of documents (typically not web application documents) to
differentiate homographs and specify pronunciation where desired. For
general browsing UAs, like Opera, no additional implementation norms
would be involved. Improvements to implementations would only be
necessary for those UAs doing the work of differentiating homographs
(though perhaps even those UAs can make most of these distinctions
without markup; that's what I'm trying to determine).

Again, keep in mind that this proposal is something to benefit users
and authors. In general it costs implementations nothing, except for
those specialized implementations that want to make use of this extra
information. The only downside I can see is that adding these few
attributes might, together with other additions, put too many
facilities into HTML and overwhelm authors, especially those reading
the spec and trying to learn it properly. To me the only way to deal
with that concern is to build a draft of the spec with all of these
things and then weigh whether anything can or should be trimmed.
Obviously the other issues would be whether the attributes should have
different names; whether elements would suit the semantics better
than attributes; and whether these facilities should be handled by a
complementary technology such as CSS (the separation of concerns).

Take care,
Rob