[html4all] Content-type negotiation as an alternative to ALT, LONGDESC and fallback

Mon Aug 27 20:43:06 PDT 2007

On Aug 27, 2007, at 7:51 PM, Leif Halvard Silli wrote:

> Hi Rob, [a long letter again]
>
> 2007-08-27 01:38:15 +0200 Robert Burns:
>>> Could you expand on what you mean by «within HTML»? And do you  
>>> see  this
>>> as a method for bridging the online/offline gap for authors?
>
>> [...] but the file format itself should be treated as authoritative
>> (this contrasts with the HTTP specification that wants to treat
>> whatever the server says as authoritative).
>
> So, if the content says "XHTML", you could not serve it as text by
> setting extensions to .txt?

Well, yes, I think that would be a better approach. The current
practice collapses two separate things into one property of the file,
one of which is not a property of the file at all. That is, an XHTML
file is just that: an XHTML file. Authors and users may want to handle
that file as something else (HTML or text, for example). However, it's
still an XHTML file. What I'm suggesting is that we separate those
two things. Servers can send MIME type headers to reduce network
traffic, but those headers should reflect the content type of the
file and not the way the author or the user would like to handle the
file. That could be handled either through HTML markup (e.g., <a
href='http://www.example.com/afile' handlehrefas='text/plain'>View
the source of this XML file</a>). Or it could be handled through URLs
(e.g., http://www.example.com/afile?handleas=text/plain).
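The URL variant could be sketched roughly as follows. This is only an illustration: the "handleas" parameter is the hypothetical one from the example URL (written here as an ordinary query string), and the function name split_handling is made up. The point is that the resource and the requested handling travel as two separate values.

```python
from urllib.parse import urlsplit, parse_qs

def split_handling(url):
    """Separate the resource path from a requested type handling.

    'handleas' is a hypothetical query parameter, not part of any
    existing specification.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # parse_qs maps each key to a list of values; take the first one.
    handle_as = query.get("handleas", [None])[0]
    return parts.path, handle_as

path, handle_as = split_handling(
    "http://www.example.com/afile?handleas=text/plain")
print(path, handle_as)
```

The file's own type is not touched here at all; only the way the receiver is asked to handle it changes.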

In addition to separating these properties (a file's type and the
file's type handling), I'm saying that the authoritative information
about a file should, as much as possible, be stored within the file
itself. This includes the file's type, its language(s), its
character encoding, and any other relevant metadata. Most formats
support this internal metadata.

The advantage of extracting the metadata and storing it as file  
system attributes (including filename extensions) or in SQL databases  
or sent as HTTP headers is that it can make finding files and  
negotiating and querying content faster and more efficient. However,  
I think the file's content itself should always be treated as  
authoritative. Your example of an MP3 file that no longer has its
extension is an example of what I'm talking about. Losing a filename  
extension should not render the file useless. An application that  
handles MP3s should be able to recognize it as an MP3 even if the OS  
and the filesystem do not provide that information. Renaming it with  
an .mp3 filename extension will help locate and index the file, but  
it shouldn't be required for an application to handle the file  
(including an HTTP server).
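As a minimal sketch of what recognizing a file from its content might look like, assuming only a handful of formats: the byte prefixes below are the well-known leading bytes of PNG, PDF and ID3v2-tagged MP3 files, while the table and the function name sniff_type are invented for illustration. An MP3 without an ID3v2 tag would need a closer look at the audio frames, which this sketch does not attempt.

```python
# Leading bytes of a few formats. The table itself is illustrative,
# not an exhaustive or official registry.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"ID3", "audio/mpeg"),  # MP3 that starts with an ID3v2 tag
]

def sniff_type(data, default="application/octet-stream"):
    """Guess a MIME type from the first bytes of a file's content."""
    for prefix, mime in SIGNATURES:
        if data.startswith(prefix):
            return mime
    return default

# The filename plays no part: renamed gibberish is still recognized.
print(sniff_type(b"ID3\x03\x00\x00\x00\x00\x00\x00"))
```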

Apache and IIS both already have the ability to sniff the file's type
from the file's contents. Filename extensions or other metadata can
make that quicker and simpler. However, they cannot both be
authoritative. If there's a conflict, there should be a way of
specifying how that conflict should be resolved, and the only thing
that makes sense is for the file's content to determine its type (or
its language, etc.). The example you raised about charset encoding
again underscores what I'm talking about. The character encoding is
determined by how the bytes map to characters. As with the UTF
encodings, it should be determined from that byte encoding and not
through some extracted piece of metadata written as charset='UTF-8'.
That's another example of the separation I'm talking about. The
charset attribute is a separate piece of metadata about a file that
can get out of sync.
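One reading of "determined in the byte encoding" for the UTF family is the byte order mark. The sketch below assumes that reading and uses the BOM constants from Python's standard codecs module; the function name encoding_from_bom is made up. Content without a BOM is simply reported as undetermined rather than guessed.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE mark begins
# with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def encoding_from_bom(data):
    """Return the encoding named by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(encoding_from_bom(codecs.BOM_UTF8 + b"hello"))
```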

>> So in moving alternate equivalent fallback handling from within  
>> HTML  markup
>> (using elements and attributes) to the HTTP server through  content
>> negotiation is a complication (perhaps needless complexity).  So my
>> inclination is to try to make everything work with a single  file  
>> (with no
>> server overhead).
>
> I suppose you want to link, via elements and attributes, to the  
> alternate content. And not keep it inside the file.
>
>> Then the server can provide header  information and content
>> negotiation that reduces network traffic, but  it shouldn't be  
>> treated as
>> authoritative.
>
> Above you said the file extension should not be authoritative - for
> the server, I guess. Here you say that the header info from the
> server should not be authoritative - by the UA then?
>
>> 'Consider this example.
>>
>> 1) [...] the  content declares its type [for the server] [...]
>>
>> 2) [...] browser [...] users environment to negotiate [...]
>>
>> Using server headers in this situation is very efficient. It  
>> allows  the UA
>> to query the server to get at the exact files it needs. If it   
>> could not
>> perform these queries through content negotiation, the UA  would  
>> need to
>> discover and download one file after the other to see  what files  
>> it needed
>> to satisfy the user's needs. That  leads to much  more network  
>> traffic.
>
> Is the only difference - in theory - from today's situation, that
> the server does not look at the extensions, but at the head section  
> of the files instead?
>
> Possible cons: How do you preserve cool URIs in this situation?  
> URIs which are just as cool whichever version of the content the  
> user is served?

A Mac OS-like approach would arrange it more like this:

MyHTMLFile (actually a folder but presented as a file)
  |
  ---
     ru (an html file in russian)
     en (an html file in english)
     no (an html file in norwegian)
     fr (an html file in french)
     en-us (an html file in the US variant of english)
     etc.
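Picking a variant out of such a bundle could look something like the sketch below. It assumes the variant names are language tags as in the listing above and that the user's preferences arrive as an ordered list of tags; the function name pick_variant and the fallback to "en" are arbitrary choices for illustration, not part of any existing system.

```python
def pick_variant(available, preferred, default="en"):
    """Choose a language variant from a bundle.

    Tries each preferred tag in order: first an exact match
    (case-insensitive), then the primary subtag alone, so "fr-CA"
    can fall back to "fr".
    """
    lowered = {name.lower(): name for name in available}
    for tag in preferred:
        tag = tag.lower()
        if tag in lowered:
            return lowered[tag]
        primary = tag.split("-")[0]
        if primary in lowered:
            return lowered[primary]
    return default if default in available else None

bundle = ["ru", "en", "no", "fr", "en-us"]
print(pick_variant(bundle, ["en-US", "fr"]))
```

The URI names only the bundle; which variant is served depends on the preferences, so the same address works for every reader.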

Again, the filename extension is just a piece of extracted metadata.
It could instead be stored in a separate type filesystem attribute on
the HFS+ filesystem. Most modern filesystems now support the
addition of arbitrary filesystem attributes. Filename extensions may
be easier for users than MIME types, but localized strings are even
easier. Eliminating filename extensions in favor of a separate type
attribute allows the system to present the type in the user's native
language. I'm not saying we should eliminate filename extensions; I'm
just trying to promote a broader way of thinking about the problem.

> I ask because: Since we both know Mac OS 9, we know about free
> form file names there. But this freedom comes at a cost. The user
> must invent those extensions instead. Of course, the user/author
> can choose to follow those conventions anyhow, but if there is no
> benefit, other than the private order for the author, then we
> will end up having "book.mp3" as alternative content for
> "article.mov" - instead of "article.mp3" for "article.mov". E.g.
> when the users/readers look at the filename, they will wonder if it
> is really meant to be the same thing. The same goes if "book.ru" is
> supposed to be a replacement for "article.en".
>
>> So to me the server's ability to negotiate content and provide   
>> informative
>> headers is great. However, the insistence that these  headers must be
>> authoritative is a problem because it's easy to misconfigure a
>> server or
>
> As a consequence of the headers not needing to be authoritative,
> what would we get? The headers say this file is Russian, but your
> browser knows better? How? When it has loaded the file?

The idea is to try to make sure the headers are accurate, but not to  
expect those headers to achieve the impossible (to know better about  
a file than the file itself). It's the same thing with filesystem
attributes. Since I can change the filename extension on any file on
my system, we should not treat the filename extension as authoritative.
Rather, it should merely be thought of as a convenient extraction of
the file's actual type.
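Put as a rule, external metadata is a hint and the content decides. A small sketch of that precedence follows, with the assumption that "sniffed" is whatever some content inspection produced (or None when it could not tell) and "declared" is what a header or filename extension claimed. The function name resolve_type is hypothetical.

```python
def resolve_type(declared, sniffed,
                 default="application/octet-stream"):
    """Resolve a file's type when metadata and content may disagree.

    The type found in the content wins. The declared type (header,
    extension) is used only when the content gave no answer.
    """
    if sniffed is not None:
        return sniffed
    if declared is not None:
        return declared
    return default

# A server says text/plain, but the bytes say PNG: the content wins.
print(resolve_type("text/plain", "image/png"))
```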

>> otherwise make mistakes in doing so. Also for  an author that  
>> specifically
>> wants a file treated as another type  (like treating an HTML file  
>> as raw
>> text), such authors may not have  control over the server or even  
>> know where
>> the document will be hosted. To me this suggests it is better, in
>> handling the separation of concerns properly, to place as much as
>> possible  within the document itself. The XML norm for an  XML  
>> declaration
>> that  includes character encoding is one example. The existence — in
>> formats such as PDF and PNG and many others — of a few bytes at the
>> beginning of the binary file that show up as 'pdf' or 'png' is   
>> another
>> example.
>
> Forgetting the content negotiation issues for a while (they mostly  
> aren't so relevant when it comes to choice between encodings): So
> the server picks this from the file. Good, mostly, except when the  
> author wants to overrule this info. (For many file formats, the  
> author has no practical way to edit this information - unless file  
> extension method is available, which is probably one of the
> reasons extensions are usually preferred.)

Again, I think this relates to the type handling of the file rather
than the type of the file. Locally, this could be handled through a
contextual menu or through an info panel (or through a menu within the
application itself). Another filesystem attribute could be about the
type handling of a file as opposed to its actual type. As I suggested
above, this could be handled through markup and URLs, for HTML and
over the network respectively.

>> Using the file format itself to specify its own metadata is a much
>> safer
>> path. It ensures that there's no separation between what a file   
>> is and the
>> metadata that describes it.
>
> The «safer» argument is only true if the author has difficulties in  
> affecting this info. It is not difficult providing wrong charset  
> info in the META element. Its syntax is also difficult to remember
> - unlike the extensions. And why is safer an issue? Perhaps simple  
> (to change) is more important than safe (from changing)? Editors/ 
> Authors want it to be simple to edit.

I don't think this approach limits or eliminates any capabilities for  
authors and users. Again, you have to think separately about a file's  
type and a file's type handling.

> The «freedom» of iTunes is that the name is meaningless, except to  
> the machine itself - which only needs it for identifying the item  
> as an independent «thing».
>
>> For example, if you use iTunes  you can actually
>> move all of your files from one computer to another.  This can be  
>> through a
>> path that has no server or filesystem metadata.  You can rename  
>> the files to
>> pure gibberish. However, when you drag  those files on iTunes,  
>> every piece of
>> metadata about the file will be  added to the new iTunes library  
>> (this
>> doesn't count the filename  itself which is usually not meaningful  
>> and
>> unintelligible; it also  doesn't include the playlists and perhaps  
>> album
>> art). This is how I  think it should be. The file is the  
>> authoritative source
>> of the file's metadata. Any extraction of that metadata,
>> including the
>> setting of the filename extension to '.ac4', should be derivative.
>
> Well, the meta info about the charset/encoding, is derivative  
> whether it is given here or there. The same goes for other meta  
> info. If iTunes is a pattern, then I suppose that you have bad luck  
> if you change the extension from .ac4 to .mp3 or to .txt - it will  
> then be read the wrong way. At least outside iTunes.

Right, but that's because applications are not following the  
appropriate way to deal with this metadata. They're relying on a  
derivative filename extension rather than on the file's inherent type.

Take care,
Rob