[html4all] Content-type negotiation as an alternative to ALT, LONGDESC and fallback
Robert Burns
rob at robburns.com
Mon Aug 27 20:43:06 PDT 2007
On Aug 27, 2007, at 7:51 PM, Leif Halvard Silli wrote:
> Hi Rob, [a long letter again]
>
> 2007-08-27 01:38:15 +0200 Robert Burns:
>>> Could you expand on what you mean by «within HTML»? And do you
>>> see this
>>> as a method for bridging the online/offline gap for authors?
>
>> [...] but the file format itself should be treated as authoritative
>> (this contrasts with the HTTP specification that wants to treat
>> whatever the server says as authoritative).
>
> So, if the content says "XHTML", you could not serve it as text by
> setting extensions to .txt?
Well, yes I think that would be a better approach. The current
practice collapses two separate things into one property of the file:
one of which is not a property of the file. That is, an XHTML file is
just that: an XHTML file. Authors and users may want to handle that
file as something else (HTML or text, for example). However, it's
still an XHTML file. What I'm suggesting is that we separate those
two things. Servers can send MIME type headers to reduce network
traffic, but those headers should reflect the content type of the
file and not the way the author or the user would like to handle the
file. That could be handled either through HTML markup (e.g., <a
href='http://www.example.com/afile' handlehrefas='text/plain' >View
the source of this XML file</a>). Or it could be handled through URLs
(e.g., http://www.example.com/afile&handleas=text/plain).
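To make the separation concrete, here is a minimal sketch in Python of keeping a file's type apart from its requested handling. It is only an illustration: the handleas parameter is the hypothetical one from my example above (written here with a conventional "?" query separator), and the function name is made up for this message.

```python
from urllib.parse import urlsplit, parse_qs

def split_type_and_handling(url, content_type):
    """Return (actual_type, handling_type) for a requested URL.

    content_type is what the file really is. The hypothetical
    'handleas' query parameter only changes how the file is handled,
    never what it is.
    """
    query = parse_qs(urlsplit(url).query)
    # Fall back to the file's own type when no handling is requested.
    handling = query.get("handleas", [content_type])[0]
    return content_type, handling

actual, handling = split_type_and_handling(
    "http://www.example.com/afile?handleas=text/plain",
    "application/xhtml+xml",
)
```

The point of returning both values is that the first one never changes, whatever the author or user asks for in the second.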
In addition to separating these properties (a file's type and the
file's type handling), I'm saying that the authoritative information
about a file should, as much as possible, be stored within the file
itself. This includes the file's type, its language(s), its
character encoding, and any other relevant metadata. Most formats
support this internal metadata.
The advantage of extracting the metadata and storing it as file
system attributes (including filename extensions) or in SQL databases
or sent as HTTP headers is that it can make finding files and
negotiating and querying content faster and more efficient. However,
I think the file's content itself should always be treated as
authoritative. Your example of an MP3 file that no longer has its
extension is an example of what I'm talking about. Losing a filename
extension should not render the file useless. An application that
handles MP3s should be able to recognize it as an MP3 even if the OS
and the filesystem do not provide that information. Renaming it with
an .mp3 filename extension will help locate and index the file, but
it shouldn't be required for an application to handle the file
(including an HTTP server).
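As a rough sketch of what recognizing a file from its content can look like, the Python below checks a few well-known leading byte signatures. The table is deliberately tiny and the function name is made up for illustration. An MP3 that does not start with an ID3v2 tag would need frame-level inspection, which this sketch does not attempt.

```python
# Leading byte signatures for a few formats (far from exhaustive).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"ID3", "audio/mpeg"),  # an MP3 that begins with an ID3v2 tag
]

def sniff_type(data):
    """Guess a MIME type from the first bytes of a file, or return None."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None
```

Nothing here looks at the filename, so renaming the file to gibberish does not change the answer.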
Apache and IIS both already have the ability to sniff the file's type
from the file's contents. Filename extensions or other metadata can
make that quicker and simpler. However, they cannot both be
authoritative. If there's a conflict there should be a way of
specifying how that conflict should be resolved and the only thing
that makes sense is for the file's content to determine its type (or
its language, etc.). The example you raised about charset encoding
again underscores what I'm talking about. The character encoding is
determined by how the bytes map to characters. As with the UTF
encodings, it should be determined from the bytes themselves and not
through some extracted piece of metadata written as charset='UTF-8'. That's
another example of the separation I'm talking about. The charset
attribute is a separate piece of metadata about a file that can get
out of sync.
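For the UTF encodings, one way the bytes can speak for themselves is a byte order mark. The Python sketch below is an assumption-laden illustration, not a full detector: it only looks for a BOM, and it falls back to the declared charset (the extracted metadata) when the bytes give no answer. The function name is made up.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM begins
# with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def encoding_from_bytes(data, declared=None):
    """Prefer a byte order mark over a separately declared charset."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No BOM present: the declared value is only a fallback hint.
    return declared
```

When a BOM and a declared charset disagree, the bytes win, which is the resolution rule I am arguing for.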
>> So in moving alternate equivalent fallback handling from within
>> HTML markup
>> (using elements and attributes) to the HTTP server through content
>> negotiation is a complication (perhaps needless complexity). So my
>> inclination is to try to make everything work with a single file
>> (with no
>> server overhead).
>
> I suppose you want to link, via elements and attributes, to the
> alternate content. And not keep it inside the file.
>
>> Then the server can provide header information and content
>> negotiation that reduces network traffic, but it shouldn't be
>> treated as
>> authoritative.
>
> Above you said the file extension should not be authoritative - for
> the server, I guess. Here you say that the header info from the
> server should not be authoritative - by the UA then?
>
>> Consider this example.
>>
>> 1) [...] the content declares its type [for the server] [...]
>>
>> 2) [...] browser [...] users environment to negotiate [...]
>>
>> Using server headers in this situation is very efficient. It
>> allows the UA
>> to query the server to get at the exact files it needs. If it
>> could not
>> perform these queries through content negotiation, the UA would
>> need to
>> discover and download one file after the other to see what files
>> it needed
>> to satisfy the user's needs. That leads to much more network
>> traffic.
>
> Is the only difference - in theory - from today's situation, that
> the server does not look at the extensions, but at the head section
> of the files instead?
>
> Possible cons: How do you preserve cool URIs in this situation?
> URIs which are just as cool whichever version of the content the
> user is served?
A Mac OS-like approach would arrange it more like this:
MyHTMLFile (actually a folder but presented as a file)
  |
  +-- ru    (an HTML file in Russian)
  +-- en    (an HTML file in English)
  +-- no    (an HTML file in Norwegian)
  +-- fr    (an HTML file in French)
  +-- en-us (an HTML file in the US variant of English)
  +-- etc.
Again, the filename extension is just a piece of extracted metadata.
It could instead be stored in a separate type attribute on the HFS+
filesystem. Most modern filesystems now support the addition of
arbitrary filesystem attributes. Filename extensions may be easier for
users than MIME types, but localized strings are even easier.
Eliminating filename extensions in favor of a separate type attribute
allows the system to present the type in the user's native language.
I'm not saying we should eliminate filename extensions; I'm just
trying to promote a broader way of thinking about the problem.
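With the folder layout above, the URI stays the same and the server picks a variant by language. Here is a simplified Python sketch of that selection, assuming the variant names are the language tags from my example. It handles only plain tags and q values from an Accept-Language header, ignores wildcards, and the function name is made up.

```python
def pick_variant(accept_language, available, default="en"):
    """Choose a language variant for an Accept-Language header value."""
    prefs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((q, tag.strip().lower()))
    # Highest weight first; the sort is stable, so header order breaks ties.
    prefs.sort(key=lambda pref: -pref[0])
    names = {name.lower(): name for name in available}
    for q, tag in prefs:
        if q <= 0:
            continue
        if tag in names:
            return names[tag]
        primary = tag.split("-")[0]  # e.g. en-gb falls back to en
        if primary in names:
            return names[primary]
    return default

variants = ["ru", "en", "no", "fr", "en-us"]
chosen = pick_variant("en-US,en;q=0.9", variants)
```

The URI the user sees is still just MyHTMLFile, whichever variant is served.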
> I ask because: Since we both know Mac OS 9, we know about free
> form file names there. But this freedom comes at a cost. The user
> must invent those extensions instead. Of course, the user/author
> can choose to follow those conventions anyhow, but if there is no
> benefit, other than the private order for the author, then we
> will end up having "book.mp3" as alternative content for
> "article.mov" - instead of "article.mp3" for "article.mov". E.g.
> when the users/readers look at the filename, they will wonder if it
> is really meant to be the same thing. The same goes if "book.ru" is
> supposed to be replacement for "article.en".
>
>> So to me the server's ability to negotiate content and provide
>> informative
>> headers is great. However, the insistence that these headers must be
>> authoritative is a problem because it's easy to misconfigure a
>> server or
>
> As a consequence of the headers not needing to be authoritative,
> what would we get? The headers say this file is Russian, but your
> browser knows better? How? When it has loaded the file?
The idea is to try to make sure the headers are accurate, but not to
expect those headers to achieve the impossible (to know better about
a file than the file itself). It's the same thing with filesystem
attributes. Since I can change the filename extension on any file on
my system, we should not treat filename extensions as authoritative.
Rather, an extension should merely be thought of as a convenient
extraction of the file's actual type.
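The precedence I have in mind can be sketched in a few lines of Python. This assumes some content sniffer has already run and produced a type (or None); the function name is made up, and mimetypes.guess_type from the standard library stands in for the extension lookup.

```python
import mimetypes

def effective_type(filename, sniffed):
    """Let the sniffed content type override the filename extension."""
    guessed, _ = mimetypes.guess_type(filename)
    if sniffed is not None:
        return sniffed  # the file's content is authoritative
    return guessed      # the extension is only a convenient hint

# An MP3 that someone renamed to .txt is still audio.
kind = effective_type("song.txt", "audio/mpeg")
```

The extension is still useful when sniffing finds nothing, which is why it is kept as the fallback rather than thrown away.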
>> otherwise make mistakes in doing so. Also for an author that
>> specifically
>> wants a file treated as another type (like treating an HTML file
>> as raw
>> text), such authors may not have control over the server or even
>> know where
>> the document will be hosted. To me this suggests it is better, in
>> handling the separation of concerns properly, to place as much as
>> possible within the document itself. The XML norm for an XML
>> declaration
>> that includes character encoding is one example. The existence — in
>> formats such as PDF and PNG and many others — of a few bytes at the
>> beginning of the binary file that show up as 'pdf' or 'png' is
>> another
>> example.
>
> Forgetting the content negotiation issues for a while (they mostly
> aren't so relevant when it comes to choice between encodings): So
> the server picks this from the file. Good, mostly, except when the
> author wants to overrule this info. (For many file formats, the
> author has no practical way to edit this information - unless file
> extension method is available, which is probably one of the
> reasons extensions are usually preferred.)
Again, I think this relates to the type handling of the file rather
than the type of the file. Locally, this could be handled through a
contextual menu or through an info panel (or through a menu within the
application itself). Another filesystem attribute could be about the
type handling of a file as opposed to its actual type. As I suggested
above this could be handled through markup and URLs for HTML and
over the network respectively.
>> Using the file format itself to specify its own metadata is a much
>> safer
>> path. It ensures that there's no separation between what a file
>> is and the
>> metadata that describes it.
>
> The «safer» argument is only true if the author has difficulties in
> affecting this info. It is not difficult providing wrong charset
> info in the META element. Its syntax is also difficult to remember
> - unlike the extensions. And why is safer an issue? Perhaps simple
> (to change) is more important than safe (from changing)? Editors/
> Authors want it to be simple to edit.
I don't think this approach limits or eliminates any capabilities for
authors and users. Again, you have to think separately about a file's
type and a file's type handling.
> The «freedom» of iTunes is that the name is meaningless, except to
> the machine itself - which only needs it for identifying the item
> as an independent «thing».
>
>> For example, if you use iTunes you can actually
>> move all of your files from one computer to another. This can be
>> through a
>> path that has no server or filesystem metadata. You can rename
>> the files to
>> pure gibberish. However, when you drag those files on iTunes,
>> every piece of
>> metadata about the file will be added to the new iTunes library
>> (this
>> doesn't count the filename itself, which is usually neither meaningful
>> nor
>> intelligible; it also doesn't include the playlists and perhaps
>> album
>> art). This is how I think it should be. The file is the
>> authoritative source
>> of the file's metadata. Any extraction of that metadata,
>> including the
>> setting of the filename extension to '.ac4', should be derivative.
>
> Well, the meta info about the charset/encoding, is derivative
> whether it is given here or there. The same goes for other meta
> info. If iTunes is a pattern, then I suppose that you have bad luck
> if you change the extension from .ac4 to .mp3 or to .txt - it will
> then be read the wrong way. At least outside iTunes.
Right, but that's because applications are not following the
appropriate way to deal with this metadata. They're relying on a
derivative filename extension rather than on the file's inherent type.
Take care,
Rob