[html4all] pronunciation, homophones, and homographs
Robert J Burns
rob at robburns.com
Mon May 26 09:00:38 PDT 2008
Hi Charles,
Thanks for the feedback.
On May 26, 2008, at 11:21 AM, Charles McCathieNevile wrote:
> On Sun, 25 May 2008 19:26:08 +0200, Robert J Burns <rob at robburns.com>
> wrote:
>
>> 1) My first thought is a bit off the topic of HTML5, but I think it's
>> important nonetheless. I have long wondered whether Unicode should
>> have a separate block of characters just for phonemes (independent of
>> the Latin script, the International Phonetic Alphabet, or any
>> other phonetic alphabet)....
>
>> Among other things, I think this would represent a major shift in
>> mindset to raise awareness of the importance of aurally centered
>> character encoding.
>
> Not to put too fine a point on it :)
>
> In principle it sounds like it could be interesting. In practice I am
> not sure of the major benefit it offers over using IPA with a custom
> set of glyphs - I understand that IPA is kind of messy, but I am
> somewhat unconvinced that "voiced bilabial plosive" is more intuitive
> than "p" to *anybody*.
That is precisely the reason for the proposal. It is intended to
facilitate phonetic alphabets (or, more broadly, phonetic writing
systems) where neither the user nor the author needs an academic’s
knowledge of phonetics and phonemes, or a strong knowledge of English
or other languages written in the Latin script (which are the only
ones where “p” will have any intuitive association with a phoneme).
> There is of course the problem that most languages simply don't
> have a good written representation for most sounds. Languages that are
> written highly phonetically like Spanish or Japanese don't have a way
> to express many sounds ("v" - is that a voiced fricative? - isn't
> distinct from "b" for most Spanish speakers, although those who know
> something of other languages recognise that it is meant to signify
> something different sometimes).
Actually, it is not true that most languages lack a good written
representation of most sounds. Most non-ideographic writing systems
are phonetically based (most of them more so than English, which is
such an agglomeration that its phonetic associations are blurred).
Rather, each language differs in which phonemes it uses in speech and
represents in writing. There are phonemes that English, too, does not
use, and it therefore turns to specialized graphemes to represent them
in the phonetic alphabet. The same is true for Spanish, in that it
would not use "v" and "b" in the same way that the English-centric IPA
uses "v" and "b". Again, the aim is to facilitate
internationalization/localization of phoneme characters and the use of
Unicode for a plurality of phonetic alphabets.
Keep in mind also that, though this is a significant change in Unicode
(encoding phonemes rather than graphemes), it is also quite consistent
with Unicode's project on the whole, where Unicode tries to provide
implementation developers with algorithms, data repositories and
methods to enhance localization/internationalization. Using one set
of phonemes for all phonetic writing systems (meaning those that share
the same phoneme classification, e.g., not Kana) greatly enhances the
interchange of such text data. Systems can easily draw glyphs from the
appropriate font given a user’s preferences, while another user sees
glyphs drawn from a different phonetic writing system for the same
text document (likewise for input systems).
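To make the idea concrete, here is a minimal Python sketch of one stored
phoneme sequence drawn with two different glyph sets. Everything in it
is an assumption for illustration: no such Unicode block exists, the
phoneme identifiers are made-up names, and the "cyrillic-demo" table is
an invented phonetic alphabet standing in for a second font.

```python
# Hypothetical abstract phoneme identifiers. No such Unicode block exists;
# these names stand in for the proposed phoneme code points.
VOICELESS_BILABIAL_PLOSIVE = "voiceless-bilabial-plosive"
VOICED_BILABIAL_PLOSIVE = "voiced-bilabial-plosive"
OPEN_FRONT_UNROUNDED_VOWEL = "open-front-unrounded-vowel"

# One glyph table per phonetic writing system (playing the role of a font).
GLYPH_TABLES = {
    # The standard IPA letters for these three phonemes.
    "ipa": {
        VOICELESS_BILABIAL_PLOSIVE: "p",
        VOICED_BILABIAL_PLOSIVE: "b",
        OPEN_FRONT_UNROUNDED_VOWEL: "a",
    },
    # An invented Cyrillic-based phonetic alphabet, for illustration only.
    "cyrillic-demo": {
        VOICELESS_BILABIAL_PLOSIVE: "п",
        VOICED_BILABIAL_PLOSIVE: "б",
        OPEN_FRONT_UNROUNDED_VOWEL: "а",
    },
}


def render(phonemes, system):
    """Draw a stored phoneme sequence with the glyphs of one writing system."""
    table = GLYPH_TABLES[system]
    return "".join(table[phoneme] for phoneme in phonemes)


# The interchanged text stores phonemes, not glyphs.
stored_text = [VOICELESS_BILABIAL_PLOSIVE, OPEN_FRONT_UNROUNDED_VOWEL]

print(render(stored_text, "ipa"))            # one user's preference
print(render(stored_text, "cyrillic-demo"))  # another user's preference
```

The stored text never changes; only the table chosen at display time
does, which is the interchange benefit I have in mind.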
> I would suggest that the Unicode consortium are more likely to take
> this on than W3C. After all, W3C just uses Unicode, in general, and I
> don't think there is any interest in them defining anything different -
> where they have input they tend to take it directly to the Unicode
> Consortium themselves.
Certainly. As I said, this first thought (1) was not a topic for the
HTML WG, but rather related to the other thoughts (2 and 3). A liaison
from the HTML WG would, though, lend significant weight to the idea.
>> 2) Second, I think it's important to provide authors of semantic HTML
>> documents the ability to add pronunciation information for
>> abbreviated forms, newly introduced terms, unusual terms, and
>> homographs. While CSS might be an appropriate place to control the
>> level of chatter and verbosity involving pronunciation (i.e., expanded
>> pronunciation of abbreviations or short-form pronunciations of
>> abbreviations), the semantically relevant pronunciation belongs in the
>> HTML document itself.
>
> The SSML phoneme [1] and sub [2] elements are designed to serve this
> purpose. It is also possible to do this using ruby markup (which I
> note on the what-wg that Ian has added to the spec - I wish he were a
> bit more careful about ensuring the W3C working group were kept up to
> date), and it is also possible to do it using CSS speech properties,
> in principle.
>
> One of the problems that this introduces is that pronunciation is
> often not that helpful. I have enormous difficulty knowing if Americans
> are saying "can" or "can't" due to their habit of pronouncing the
> latter like a Norwegian would say "Kant" instead of how one would say
> "cahnt". And so on and so forth (Roald Dahl wrote a very nice story
> about someone's American Aunt once, playing on this).
>
> [1] http://www.w3.org/TR/speech-synthesis/#edef_phoneme
> [2] http://www.w3.org/TR/speech-synthesis/#edef_sub
There would certainly be some issues here with localization/
internationalization. However, this part of the proposal was focused
much more on accessibility and machine processing than on
internationalization, which was the concern of my Unicode suggestion.
I'm sorry for causing that confusion.
>> 3) This approach got me thinking a bit about homographs and their
>> pronunciation (and other machine processing) as well...
>
> See also SSML's say-as attribute [3,4]
Yes, I think that's a good example. Perhaps “say-as” would be a more
appropriate name for an HTML attribute as well.
>
>
>> If HTML had such an associated listing of ranked homographs, then
>> this
>> same listing could be used for the homophone attribute as well. For
>> example:
>>
>> <abbr type='initialism' expressed-as='word' homophone='oxford:does1'>
>> DUZ</abbr>
>
> I don't think HTML is going to include a list of homographs. I am
> certain that we are not going to ship one by default, since for our
> most important languages this would imply a large amount of extra
> footprint for a minuscule amount of gain. Shipping 2000 MathML entities
> (very short things that can be compressed easily) is already a hassle
> that we wish we didn't have to deal with.
Yes, I understand. I didn't mean to say that HTML5 would compile the
list of homographs or that browsing UAs would need to bundle a
dictionary of homographs. The references are simply meant for authors
and authoring tools, so that applications that perform machine
processing of semantics (including text-to-speech applications), and
their users, could make use of them.
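As a rough illustration of how little a consuming tool would need, here
is a small Python sketch that splits a value such as 'oxford:does1' into
its parts. The "source:headword+rank" syntax is only my reading of the
example attribute value quoted above; the pattern, the class and the
function name are all hypothetical, and nothing here is defined by any
spec or dictionary.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HomographRef:
    """One reference to a ranked homograph in an external dictionary."""
    source: str    # which dictionary edition the author refers to
    headword: str  # the spelling shared by the homographs
    rank: int      # which homograph under that headword is meant


# Assumed syntax: a source name, a colon, then a headword ending in a rank.
_REF_PATTERN = re.compile(r"([A-Za-z][A-Za-z0-9-]*):(.+?)([0-9]+)")


def parse_homograph_ref(value):
    """Split a value like 'oxford:does1' into source, headword and rank."""
    match = _REF_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a homograph reference: {value!r}")
    source, headword, rank = match.groups()
    return HomographRef(source, headword, int(rank))


print(parse_homograph_ref("oxford:does1"))
```

A browsing UA can ignore the attribute entirely; only a tool that has
the referenced dictionary at hand would look the parsed reference up.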
> Hoping to get commercially published text given to HTML for free
> strikes me as something close to wishful thinking, which rules it out
> for English.
Yeah, I probably shouldn't have used those words. If I had a team of
lawyers behind me, they'd be scolding me right now for even saying
that. What I meant was that a published dictionary could be used by
way of reference, though frozen at a specific edition. If I were the
publisher of a dictionary, it would be my wildest dream to have the
international format for writing documents reference my homographs in
its attribute values.
>
> Many other languages (French, Spanish, Icelandic, and probably
> non-European languages, although I am not sure) have an official
> definition, so may be available. Whether the official definition is
> useful on the Web is a moot point. Valencianos will point out that they
> speak a language, although the Spanish government does not recognise it
> and the Library of Congress (which manages a list of languages for ISO)
> apparently has insufficient data - they are known to have collected
> very poor data for a number of indigenous languages from at least
> Australia and South America which clearly have sufficient texts to
> warrant more detailed classification than "aboriginal language of
> [somewhere]".
I recently read a quote, attributed to Stalin, that was nevertheless
profound (the line is more usually credited to the linguist Max
Weinreich). It went something like "The difference between a language
and a dialect is that the language has a navy behind it and the
dialect does not." :-) It's not important for every last dialect to
have a codified rank designation for homographs, but where
distinctions exist, the more dialects that have one the better.
> I would suggest here that you look at the use of Ruby and XHTML+Voice
> before going further down the path of inventing stuff - these things
> have been dealt with before. (It turns out that the market for them is
> fairly restricted, so people don't necessarily know of the solutions
> that have already been implemented).
Thanks for those suggestions. However, Ruby is definitely not what I'm
thinking about with these thoughts. XHTML+Voice also appears much more
heavyweight than I was looking for. Again, the suggestions I'm making
(or the thoughts I'm having) relate to a way to facilitate
differentiating homographs in HTML for machine processing and
pronunciation purposes. The attributes I'm proposing would likely
not often have significant visual rendering (unlike Ruby), nor even a
major role in text-to-speech (as in X+V) or audible web applications.
Instead, these suggestions simply provide a few simple and easy
facilities for authors of documents (typically not web application
documents) to differentiate homographs and specify pronunciation
where desired. For general browsing UAs, like Opera, no additional
implementation norms would be involved. Improvements to
implementations would only be necessary for those UAs doing the work
of differentiating homographs (though perhaps even those UAs can make
most of these distinctions without markup; that's what I'm trying to
determine).
Again, keep in mind that this proposal is something to benefit users
and authors. In general it is costless for implementations, except for
those specialized implementations that want to make use of this extra
information. The only downside I can see is that adding these few
attributes might, together with other additions, put too many
facilities into HTML and overwhelm authors, especially those authors
reading the spec and trying to learn it properly. To me the only way
to deal with that concern is to build a draft of the spec with all of
these things and then weigh whether anything can or should be
trimmed. The other obvious issues are whether the attributes should
have different names; whether elements would suit the semantics
better than attributes; and whether these facilities should be
handled by a complementary technology such as CSS (the separation of
concerns).
Take care,
Rob