User:Mjb/Workspace

From Offset
Revision as of 03:33, 29 June 2005 by Mjb (talk | contribs) (moved quotes)

Important links

  • IETF and the RFC Standards Process, from The Art of Unix Programming by Eric Steven Raymond, emphasizes that Internet RFCs and standards tend to be based more on actual implementation than pie-in-the-sky theory
  • File system info by Chris Giese, in addition to providing various FAT technical details, covers encodings, legal characters, and limitations of FAT12, FAT16, VFAT, FAT32, NTFS, ext2, ISO9660, Joliet, and HFS+
  • Wikipedia:en:Comparison of file systems covers a lot of ground, and links to separate articles about each file system. Check the discussion page as well.
  • This page from IBM's WebSphere CORBA documentation is an example of an implementation expecting to see ":" and "\" in a 'file' URL
  • This Lynx documentation shows how an implementation might treat '~' specially in a 'file' URL

Mailing list posts and notable quotes

  • 'file' URI conventions (13 July 2004) - Mike Brown brings up many issues that complicate the mapping of file system paths to URIs
  • What to do about file: (19 August 2004) - Paul Hoffman points to a now-expired Internet-Draft that was just RFC 1738's 'file' URI section pulled out into a separate document, and asks about courses of action:
    • Publish it as-is (which would accomplish nothing other than hastening the retirement of RFC 1738)
      • "no" — Larry Masinter
    • Prescribe what implementations SHOULD do, knowing that such a prescription is bound to break many/most existing implementations
      • "this would be useful if it were accompanied by documentation of the caveats." — Larry Masinter
    • List many more interpretations that current implementations use, but not say whether or not to do them
      • avoid making gratuitous recommendations, but not ALL recommendations — Larry Masinter
    • Say more about the wide variety of interpretations, but don't list them so as not to confuse readers
      • It's more useful to describe useful. — Larry Masinter

An RFC that says, essentially, "Internet Explorer on post-4.0 versions on Windows platforms does X, while Gecko-based engines on linux platforms do Y, on Windows platforms do Z, while the popular LWP perl library does W, java.net.URI does U…" would feel profoundly weird to me. — Tim Bray [1]

Not all RFCs prescribe standards, and this is information that would be profoundly useful to the Internet community. … It would be excellent to have a single reasonably authoritative place to go, rather to have to run one's own experiments all the time. — John Cowan [2]

File system path to URI

When converting from any file system path to a URI, questions to consider include the following.

For what kind of file system is the path?

  • MS-DOS and Windows: FAT16, VFAT, FAT32, NTFS
  • Unix-like OSes: UFS, UFS2, ext2, ext3, ReiserFS V3, Reiser4
  • Legacy Mac OS: HFS+

There are differences in how these file systems store directory entries, what characters they allow, how paths manifest in internal APIs, and how paths manifest to the end user of the OS.

If the path's file system is not known…

…what should you do?

  • Assume the path is from a default file system for the local OS? Many OSes offer a choice of file systems. How can you be sure you got it right? Is there a "good enough", file system-agnostic fallback?
  • Maybe just reject the path? In other words, require that the file system type be known.

If the path's file system is not recognized…

…what should you do?

  • Reject the path?
  • Use a default algorithm, like just prepending 'file:' and doing whatever percent-encoding is required?

Is the path 'absolute'?

  • If it's a Unix path, I believe the only test is whether it starts with "/".
  • If it's a Windows path, it could be absolute if it matches the regular expression ^(\\|[A-Za-z]:) - that is, it either starts with "\" or a drivespec (an ASCII-range letter followed by ":").
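The two tests above can be sketched as follows. This is only an illustration of the rules as stated in these notes; the function name `is_absolute` and the `os_family` parameter are hypothetical.

```python
import re

# Windows test from the notes: a leading backslash or a drivespec.
WINDOWS_ABSOLUTE = re.compile(r'^(\\|[A-Za-z]:)')

def is_absolute(path: str, os_family: str) -> bool:
    """Return True if `path` looks absolute under the given convention.

    `os_family` is a hypothetical parameter: 'unix' or 'windows'.
    """
    if os_family == 'unix':
        return path.startswith('/')
    if os_family == 'windows':
        return WINDOWS_ABSOLUTE.match(path) is not None
    raise ValueError(f'unrecognized OS family: {os_family!r}')
```

Note the "could be": the Windows pattern also matches a drive-relative path such as 'C:foo', so a match means "possibly absolute" rather than "definitely absolute".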

If the path is not absolute…

…what should you do?

  • Reject it?
  • Create a relative URI reference? ('the/path')
  • Create an RFC 3986-compliant, but RFC 1738-offending, URI like 'file:the/path'?
  • Attempt to make the path absolute by interpreting it to be relative to the local host's 'current working directory', if such a concept exists in the local OS? What if the path is for some other file system?
  • And do you make it absolute according to the file system's conventions first, or do you do an RFC 3986 conformant resolution of a relative URI reference ('the/path') against the base URI that is derived from the current working directory?
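One of the options above, resolving a relative reference against a base URI derived from the working directory, might look like this for Unix-style paths. It is a sketch only: the function name is hypothetical, and the working directory is passed in explicitly (a real caller might use `os.getcwd()`) so that the result is reproducible.

```python
from urllib.parse import quote, urljoin

def relative_path_to_uri(rel_path: str, cwd: str) -> str:
    """Resolve a relative Unix-style path against a base URI built from `cwd`.

    Assumes `cwd` is an absolute Unix path and '/' is the separator.
    """
    # The base URI must end with '/' so the last directory is not dropped
    # during resolution.
    base = 'file://' + quote(cwd.rstrip('/') + '/')
    # Encode each segment separately to form the relative reference.
    ref = '/'.join(quote(seg, safe='') for seg in rel_path.split('/'))
    # urljoin performs the reference resolution, including '..' handling.
    return urljoin(base, ref)
```

Here the URI resolver, not the file system, decides what '..' means. That may differ from the file system's own answer when symbolic links are involved.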

Does the path contain same-directory or parent-directory references ('.' or '..', for example)?

Do you attempt to collapse dot segments (or equivalent) in the path or in the resulting URI? Does it depend on whether the path or URI is absolute? A reason to collapse dot segments in an absolute URI is so that the URI can be suitable for use as a base URI for RFC 3986 conformant resolution.
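For reference, collapsing dot segments on the URI side can follow the remove_dot_segments procedure from RFC 3986, section 5.2.4. The following is a straightforward transcription of that procedure as a sketch; it operates on the path component of a URI, not on a file system path.

```python
def remove_dot_segments(path: str) -> str:
    """Collapse '.' and '..' segments in a URI path (RFC 3986, 5.2.4)."""
    output = []
    while path:
        if path.startswith('../'):
            path = path[3:]
        elif path.startswith('./'):
            path = path[2:]
        elif path.startswith('/./'):
            path = path[2:]            # "/./" becomes "/"
        elif path == '/.':
            path = '/'
        elif path.startswith('/../'):
            path = path[3:]            # "/../" becomes "/"
            if output:
                output.pop()           # drop the previous segment
        elif path == '/..':
            path = '/'
            if output:
                output.pop()
        elif path in ('.', '..'):
            path = ''
        else:
            # Move the first segment (with its leading '/', if any) to output.
            start = 1 if path.startswith('/') else 0
            end = path.find('/', start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return ''.join(output)
```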

Is the mapping between segments in the filesystem path and segments in the path component of the URI well-defined?

On Unix file systems, it should be sufficient to percent-encode all non-unreserved characters. Note that on some file systems '/' may appear *within* a name (HFS+ allows it, for example, although POSIX file names cannot contain '/'), so be sure to apply percent-encoding to each segment individually.
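A minimal sketch of per-segment encoding, assuming the path has already been split into names by whatever rule the file system uses. The function name is hypothetical. Passing `safe=''` to `urllib.parse.quote` means that a '/' inside a name is encoded rather than mistaken for a separator.

```python
from urllib.parse import quote

def segments_to_uri_path(segments):
    """Build an absolute URI path from a list of file or directory names.

    Each name is percent-encoded on its own, so any '/' inside a name
    becomes %2F instead of acting as a segment separator.
    """
    return '/' + '/'.join(quote(seg, safe='') for seg in segments)
```

For example, `segments_to_uri_path(['Reports', '2005/06.txt'])` keeps two segments, with the slash in the second one encoded.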

On Windows, complications abound. (I think I cover these below)

If the path purports to be for a particular OS, but does not match that OS's syntax for a path, e.g. 'C:/autoexec.bat' on Windows…

  • Reject the path?
  • Be as lenient as possible, e.g. replace '/' with '\' for Windows?
  • What about '9:\autoexec.bat' on Windows (bad drivespec)? acceptable?
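One possible answer to these questions, shown only as a sketch, is to be lenient about separators but strict about the drivespec. The function name and the choice to raise `ValueError` are assumptions, not a recommendation.

```python
import re

def lenient_windows_path(path: str) -> str:
    """Normalize '/' to '\\' but reject a non-letter drivespec."""
    # Lenient part: accept '/' as a separator.
    fixed = path.replace('/', '\\')
    # Strict part: if the second character is ':', the first must be
    # an ASCII letter.
    if len(fixed) >= 2 and fixed[1] == ':' and not re.fullmatch(r'[A-Za-z]', fixed[0]):
        raise ValueError(f'bad drivespec in {path!r}')
    return fixed
```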

If the path is provided as a sequence of Unicode characters…

  • Form the URI by leaving unreserved characters as-is, and percent-encoding the rest, using UTF-8 as the basis? (RFC 3986 default)
  • Use some other encoding more appropriate to the path's OS?
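The difference between the two options shows up as soon as a name contains a non-ASCII character. A small sketch using `urllib.parse.quote`, whose `encoding` parameter selects the byte encoding applied before percent-encoding (the wrapper name is hypothetical):

```python
from urllib.parse import quote

def encode_segment(segment: str, encoding: str = 'utf-8') -> str:
    """Percent-encode one path segment given as Unicode text.

    Unreserved characters are left as-is; everything else is converted
    to bytes in `encoding` and each byte is percent-encoded.
    """
    return quote(segment, safe='', encoding=encoding, errors='strict')
```

With the default, 'é' becomes '%C3%A9'; with `encoding='latin-1'` it becomes '%E9'. A consumer of the URI has no way to tell which one was used.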

If the path is provided as a sequence of bytes

(not Unicode characters, with no additional info about encoding)…

  • Reject it because it can't be decoded to Unicode?
  • Assume a default encoding? Based on what? How confident can you be about, say, a file system default encoding? (probably not very, on Unix)
  • Attempt no decode; just form the URI by converting to unreserved characters only those bytes that, when decoded as ASCII, correspond to unreserved characters, and percent-encoding the rest of the bytes individually?
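The last option, attempting no decode at all, can be sketched with `urllib.parse.quote_from_bytes`. It leaves bytes that correspond to unreserved ASCII characters alone and percent-encodes every other byte individually. The wrapper name is hypothetical.

```python
from urllib.parse import quote_from_bytes

def encode_segment_bytes(segment: bytes) -> str:
    """Percent-encode one path segment given as raw bytes.

    No character encoding is assumed; bytes outside the unreserved
    ASCII set are each written as %XX.
    """
    return quote_from_bytes(segment, safe='')
```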

For a Windows path, is it in the form of a local path or a UNC path?

("local" may not be the right term)

  • local, absolute, with drivespec: C:\autoexec.bat
  • local, absolute, no drivespec: \autoexec.bat
  • local, relative: the\path
  • UNC: \\host\share\autoexec.bat
    • Do you map the UNC host name to the authority component? Don't forget to percent-encode.
    • Do you leave the UNC share name as the first segment of the path component, or..? And don't forget to percent-encode.
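The four forms can be told apart mechanically. The sketch below also shows one possible UNC mapping, with the host as the authority and the share as the first path segment. Both function names and the category labels are hypothetical, and the mapping is just one of the choices the questions above leave open.

```python
import re
from urllib.parse import quote

def classify_windows_path(path: str) -> str:
    """Classify a Windows path into one of the four forms listed above."""
    if path.startswith('\\\\'):
        return 'unc'
    if re.match(r'^[A-Za-z]:\\', path):
        return 'local-absolute-drivespec'
    if path.startswith('\\'):
        return 'local-absolute-no-drivespec'
    return 'local-relative'

def unc_to_uri(path: str) -> str:
    """Map \\\\host\\share\\... to file://host/share/... (one possible choice)."""
    # Strip the leading "\\" and split: first part is the host, the rest
    # (share name included) become path segments.
    host, *segments = path[2:].split('\\')
    return ('file://' + quote(host, safe='') + '/'
            + '/'.join(quote(seg, safe='') for seg in segments))
```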

Exceptional UNC paths

Networked instances of Windows do weird things like refer to network printers like this: '\\http://192.168.0.1/printername', and refer to shared drives like this: '\\sharename\$d$\autoexec.bat'. When are these conventions used? I saw the former today, and the latter a few years back on NT4 systems. Are they documented anywhere, and do you want to attempt to deal with them?

Windows case normalization

For a Windows path, do you do any case normalization, e.g. in the drivespec? ('c:' -> 'C:')
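If the answer is yes for the drivespec only, the normalization is small. A sketch, with a hypothetical function name:

```python
import re

def normalize_drivespec(path: str) -> str:
    """Uppercase a leading drivespec letter; leave the rest of the path alone."""
    return re.sub(r'^[a-z]:', lambda m: m.group(0).upper(), path)
```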

Windows and colon characters

Windows uses ":" in the drivespec (and nowhere else, currently). ":" is a reserved character in a URI, but does not need to be percent-encoded in a path segment. Therefore, 'file:///C:/autoexec.bat' is acceptable as a URI, and is equivalent to 'file:///C%3A/autoexec.bat'.

There is a convention of using "|", e.g. 'file:///C|/autoexec.bat', I believe because of the ambiguities that arise when you have situations like 'C:/foo' as a relative URL being resolved against, say, 'file:/autoexec.bat' or 'file:C:/autoexec.bat' and so on - things that appear in the wild and may(?) have been canon at one time, but don't play nicely with any relative resolution algorithms.

I haven't much sympathy for "|" and feel it should be deprecated as much as possible. Resolvers should continue to accept it and treat it as synonymous with a drivespec ":". On that note, though, should they treat all "|" as ":", or just those that appear to be a drivespec?
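The narrower of the two choices, treating "|" as ":" only where it looks like a drivespec, could be sketched like this. The function name is hypothetical, and it works on the path component of the URI, not the whole URI.

```python
import re

def normalize_pipe_drivespec(uri_path: str) -> str:
    """Rewrite a leading 'X|' drivespec in a URI path to 'X:'.

    Only a single ASCII letter at the very start of the path (after an
    optional '/') followed by '|' and then '/' or end-of-path is touched.
    """
    return re.sub(r'^(/?[A-Za-z])\|(?=/|$)', r'\1:', uri_path)
```

Any other "|" in the path is left untouched, so a file name that happens to contain one survives.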

If ":" or "|" ever become legal characters in Windows paths… then what?

Empty path segments

Empty segments in the path: collapse them? Depends on OS?
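If you do collapse them, the OS dependence shows up immediately: on Windows the leading "\\" of a UNC path is an empty segment that must survive. A sketch, assuming the convention in question treats repeated separators as one (the function name and `os_family` parameter are hypothetical):

```python
import re

def collapse_empty_segments(path: str, os_family: str) -> str:
    """Collapse runs of separators, preserving a UNC prefix on Windows."""
    if os_family == 'windows':
        # Keep the leading "\\" that marks a UNC path.
        prefix = '\\\\' if path.startswith('\\\\') else ''
        rest = path[len(prefix):]
        return prefix + re.sub(r'\\{2,}', r'\\', rest)
    return re.sub(r'/{2,}', '/', path)
```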

This gets tricky when round-tripping UNC paths on Windows. I'd have to experiment again to give you some good examples, though. (I decided not to worry about it too much.)