URI/File scheme/Path mapping
When converting from any file system path to a URI, you may wish to consider issues like the following:
- 1 For what kind of file system is the path?
- 2 If the path's file system is not known…
- 3 If the path's file system is not recognized…
- 4 Is the path 'absolute'?
- 5 If the path is not absolute…
- 6 Does the path contain same- or parent- (. or .., for example) references?
- 7 Is the mapping between segments in the filesystem path and segments in the path component of the URI well-defined?
- 8 If the path purports to be for a particular OS, but does not match that OS's syntax for a path, e.g. 'C:/autoexec.bat' on Windows…
- 9 If the path is provided as a sequence of Unicode characters…
- 10 If the path is provided as a sequence of bytes
- 11 For a Windows path, is it in the form of a local path or a UNC path?
- 12 Exceptional UNC paths
- 13 Windows case normalization
- 14 Windows and colon characters
- 15 Empty path segments
For what kind of file system is the path?
- MS-DOS and Windows: FAT16, VFAT, FAT32, NTFS
- Unix-like OSes: UFS, UFS2, ext2, ext3, ReiserFS V3, Reiser4
- Legacy Mac OS: HFS+
There are differences in how these file systems store directory entries, what characters they allow, how paths manifest in internal APIs, and how paths manifest to the end user of the OS.
If the path's file system is not known…
…what should you do?
- Assume the path is from a default file system for the local OS? Many OSes offer a choice of file systems. How can you be sure you got it right? Is there a "good enough", file system-agnostic fallback?
- Maybe just reject the path? In other words, just say that the file system type must be known.
If the path's file system is not recognized…
…what should you do?
- Reject the path?
- Use a default algorithm, like just prepending 'file:' and doing whatever percent-encoding is required?
Is the path 'absolute'?
- If it's a UNIX path, whether it starts with "/" is the only qualification, I believe.
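- If it's a Windows path, it could be absolute if it matches the regular expression ^(\\|[A-Za-z]:) - that is, it either starts with "\" or a drivespec (an ASCII-range letter followed by ":"). This is only a first approximation: 'C:foo' matches but is relative to the current directory on drive C, and '\foo' is relative to the current drive.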
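The two checks above can be sketched in Python as follows. This is only an illustration of the rules as stated; the function names are hypothetical, and the Windows pattern is the regular expression from the text, not a complete test of absoluteness.

```python
import re

# The pattern from the text: a leading backslash, or an ASCII letter plus ":".
_WINDOWS_ABSOLUTE = re.compile(r'^(\\|[A-Za-z]:)')


def looks_absolute_unix(path: str) -> bool:
    """A UNIX path is treated as absolute if it starts with "/"."""
    return path.startswith('/')


def looks_absolute_windows(path: str) -> bool:
    """A Windows path is treated as possibly absolute if it matches the pattern."""
    return _WINDOWS_ABSOLUTE.match(path) is not None


if __name__ == '__main__':
    for candidate in (r'C:\autoexec.bat', r'\autoexec.bat', r'the\path'):
        print(candidate, looks_absolute_windows(candidate))
```

A UNC path such as \\host\share\autoexec.bat also passes the Windows check, because it starts with "\".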
If the path is not absolute…
…what should you do?
- Reject it?
- Create a relative URI reference? ('the/path')
- Create an RFC 3986-compliant, but RFC 1738-offending, URI like 'file:the/path'?
- Attempt to make the path absolute by interpreting it to be relative to the local host's 'current working directory', if such a concept exists in the local OS? What if the path is for some other file system?
- And do you make it absolute according to the file system's conventions first, or do you do an RFC 3986 conformant resolution of a relative URI reference ('the/path') against the base URI that is derived from the current working directory?
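To make the last question concrete, here is a rough Python sketch of both orderings for a POSIX-style path. The function names are hypothetical, the working directory is passed in explicitly rather than read from the OS, and the percent-encoding is deliberately simplistic.

```python
import posixpath
from urllib.parse import quote, urljoin


def via_file_system(cwd: str, rel_path: str) -> str:
    """Join using file system conventions first, then build the URI."""
    absolute = posixpath.join(cwd, rel_path)
    return 'file://' + quote(absolute, safe='/')


def via_uri_resolution(cwd: str, rel_path: str) -> str:
    """Build a base URI from cwd, then resolve a relative reference against it."""
    base = 'file://' + quote(cwd.rstrip('/') + '/', safe='/')
    return urljoin(base, quote(rel_path, safe='/'))


if __name__ == '__main__':
    print(via_file_system('/home/user', '../x'))
    print(via_uri_resolution('/home/user', '../x'))
```

For 'the/path' both routes give the same URI. For '../x' they differ: the first keeps the '..' in the result, while the second collapses it during resolution, which may not match what the file system would do if symbolic links are involved.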
Does the path contain same- or parent- (. or .., for example) references?
Do you attempt to collapse dot segments (or equivalent) in the path or in the resulting URI? Does it depend on whether the path or URI is absolute? A reason to collapse dot segments in an absolute URI is so that the URI can be suitable for use as a base URI for RFC 3986 conformant resolution.
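If you do decide to collapse dot segments in an absolute URI path, a minimal Python sketch might look like this. It is a segment-based simplification written for this page, not a transcription of the RFC 3986 algorithm, and it assumes the path starts with "/".

```python
def collapse_dot_segments(path: str) -> str:
    """Collapse '.' and '..' segments in an absolute URI path."""
    if not path.startswith('/'):
        raise ValueError('expected a path starting with "/"')
    segments = path.split('/')[1:]
    out = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == '.':
            # A trailing "." leaves a trailing slash behind.
            if is_last:
                out.append('')
        elif segment == '..':
            # Never climb above the root.
            if out:
                out.pop()
            if is_last:
                out.append('')
        else:
            out.append(segment)
    return '/' + '/'.join(out)


if __name__ == '__main__':
    print(collapse_dot_segments('/a/b/c/./../../g'))
```

Empty segments are left alone here; whether to collapse those is a separate question.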
Is the mapping between segments in the filesystem path and segments in the path component of the URI well-defined?
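On Unix file systems, it should be sufficient to percent-encode all non-unreserved characters. Note that '/' (along with NUL) cannot appear *within* a file name, so every '/' in the path is a segment separator; almost anything else can appear in a name, including '?', '#' and '%'. Be sure to apply percent-encoding to each segment individually, so that the separators stay literal and everything else is encoded.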
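As a sketch of per-segment encoding in Python (hypothetical function names; assumes the caller has already split the path into segments, or that splitting on "/" is appropriate for the file system in question):

```python
from urllib.parse import quote


def encode_segments(segments) -> str:
    """Percent-encode each segment on its own, then join with literal "/"."""
    # safe='' means only unreserved characters are left as-is.
    return '/'.join(quote(segment, safe='') for segment in segments)


def unix_path_to_file_uri(path: str) -> str:
    """Map an absolute Unix path to a file URI with an empty authority."""
    if not path.startswith('/'):
        raise ValueError('expected an absolute path')
    return 'file://' + encode_segments(path.split('/'))


if __name__ == '__main__':
    print(unix_path_to_file_uri('/home/user/a b#c?.txt'))
```

Encoding everything outside the unreserved set is stricter than RFC 3986 requires for a path segment, but it avoids having to reason about which reserved characters are safe where.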
On Windows, complications abound. (I think I cover these below)
If the path purports to be for a particular OS, but does not match that OS's syntax for a path, e.g. 'C:/autoexec.bat' on Windows…
- Reject the path?
- Be as lenient as possible, e.g. replace '/' with '\' for Windows?
- What about '9:\autoexec.bat' on Windows (bad drivespec)? acceptable?
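One possible lenient policy, sketched in Python: accept "/" as a separator, but still reject a drivespec that is not an ASCII letter. This is just one answer to the questions above, and the function name is hypothetical.

```python
import re


def lenient_windows_path(path: str) -> str:
    """Normalize separators, then reject an obviously bad drivespec."""
    normalized = path.replace('/', '\\')
    # Anything of the form "<char>:" at the start must be an ASCII-letter drivespec.
    if re.match(r'^.:', normalized) and not re.match(r'^[A-Za-z]:', normalized):
        raise ValueError(f'bad drivespec in {path!r}')
    return normalized


if __name__ == '__main__':
    print(lenient_windows_path('C:/autoexec.bat'))
```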
If the path is provided as a sequence of Unicode characters…
- Form the URI by leaving unreserved characters as-is, and percent-encoding the rest, using UTF-8 as the basis? (RFC 3986 default)
- Use some other encoding more appropriate to the path's OS?
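The difference between the two options shows up as soon as a non-ASCII character is involved. A small Python illustration, using Latin-1 purely as an example of "some other encoding":

```python
from urllib.parse import quote


def encode_segment(segment: str, encoding: str = 'utf-8') -> str:
    """Percent-encode one path segment using the given character encoding."""
    return quote(segment, safe='', encoding=encoding)


if __name__ == '__main__':
    print(encode_segment('café'))             # UTF-8 basis
    print(encode_segment('café', 'latin-1'))  # a different basis
```

The same name yields 'caf%C3%A9' in the first case and 'caf%E9' in the second, so whoever decodes the URI has to know which basis was used.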
If the path is provided as a sequence of bytes
(not Unicode characters, with no additional info about encoding)…
- Reject it because it can't be decoded to Unicode?
- Assume a default encoding? Based on what? How confident can you be about, say, a file system default encoding? (Probably not very, on Unix.)
- Attempt no decode; just form the URI by converting to unreserved characters only those bytes that, when decoded as ASCII, correspond to unreserved characters, and percent-encoding the rest of the bytes individually?
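The last option, encoding the bytes without ever decoding them, could look like this in Python. The function name is hypothetical, and splitting on the byte "/" assumes a Unix-style path.

```python
from urllib.parse import quote_from_bytes


def bytes_path_to_uri_path(raw: bytes) -> str:
    """Percent-encode a byte-string path segment by segment, with no decoding."""
    return '/'.join(
        quote_from_bytes(segment, safe='') for segment in raw.split(b'/')
    )


if __name__ == '__main__':
    print(bytes_path_to_uri_path(b'/tmp/caf\xe9 menu.txt'))
```

Bytes that correspond to unreserved ASCII characters pass through; every other byte becomes its own %XX triplet, so the original byte sequence can be recovered exactly.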
For a Windows path, is it in the form of a local path or a UNC path?
("local" may not be the right term)
- local, absolute, with drivespec: C:\autoexec.bat
- local, absolute, no drivespec: \autoexec.bat
- local, relative: the\path
- UNC: \\host\share\autoexec.bat
- Do you map the UNC host name to the authority component? Don't forget to percent-encode.
- Do you leave the UNC share name as the first segment of the path component, or..? And don't forget to percent-encode.
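One way to answer both UNC questions, sketched in Python: the host goes into the authority component and the share name becomes the first path segment. This is an illustration of that choice only; the function name is hypothetical and no validation of host names is attempted.

```python
from urllib.parse import quote


def unc_to_file_uri(path: str) -> str:
    """Map \\\\host\\share\\... to file://host/share/..."""
    if not path.startswith('\\\\'):
        raise ValueError('not a UNC path')
    host, *segments = path[2:].split('\\')
    if not host or not segments or not segments[0]:
        raise ValueError('UNC path needs both a host and a share')
    encoded = '/'.join(quote(segment, safe='') for segment in segments)
    return 'file://' + quote(host, safe='') + '/' + encoded


if __name__ == '__main__':
    print(unc_to_file_uri(r'\\host\share\autoexec.bat'))
```

With safe='' a share name like 'd$' comes out as 'd%24'. RFC 3986 allows "$" unencoded in a path segment, so that is stricter than necessary.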
Exceptional UNC paths
Networked instances of Windows do weird things, such as referring to network printers like this: '\\http://192.168.0.1/printername', and to shared drives like this: '\\server\d$\autoexec.bat'.
The format of a UNC path is \\server\share\path\filename. A "$" at the end of a share name is a naming convention that hides the share from network browsers and 'net view'.
- Does this present any problems?
Windows case normalization
For a Windows path, do you do any case normalization, e.g. in the drivespec? ('c:' -> 'C:')
Windows and colon characters
Windows uses ":" in the drivespec (and, on NTFS, to separate a file name from the name of an alternate data stream, as in 'file.txt:stream'). ":" is a reserved character in a URI, but does not need to be percent-encoded in a path segment. Therefore, 'file:///C:/autoexec.bat' is acceptable as a URI, and is equivalent to 'file:///C%3A/autoexec.bat'.
There is a convention of using "|" in place of the colon, e.g. 'file:///C|/autoexec.bat'. I believe it arose because of the ambiguities in situations like 'C:/foo' appearing as a relative URL to be resolved against, say, 'file:/autoexec.bat' or 'file:C:/autoexec.bat', since a leading 'C:' can be read as a URI scheme. Such forms appear in the wild and may(?) have been canon at one time, but they don't play nicely with any relative resolution algorithm.
I haven't much sympathy for "|" and feel it should be deprecated as much as possible. Resolvers should continue to accept it and treat it as synonymous with a drivespec ":". On that note, though, should they treat all "|" as ":", or just those that appear to be a drivespec?
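As a sketch of the narrower answer, here is a Python function that rewrites "|" only where it appears to be a drivespec, i.e. directly after a single ASCII letter in the first path segment of a file URI. The name and the exact pattern are assumptions made for illustration.

```python
import re

# "file:", an optional authority, "/", one ASCII letter, "|", then "/" or end.
_PIPE_DRIVESPEC = re.compile(r'^(file:(?://[^/]*)?/)([A-Za-z])\|(?=/|$)')


def normalize_pipe_drivespec(uri: str) -> str:
    """Rewrite a drivespec "|" as ":"; leave any other "|" untouched."""
    return _PIPE_DRIVESPEC.sub(r'\1\2:', uri)


if __name__ == '__main__':
    print(normalize_pipe_drivespec('file:///C|/autoexec.bat'))
    print(normalize_pipe_drivespec('file:///home/a|b/c'))
```

The first URI becomes 'file:///C:/autoexec.bat'; the second is returned unchanged because its "|" is not in drivespec position.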
If ":" (outside the drivespec) or "|" ever becomes a legal character in Windows file names… then what?
Empty path segments
Empty segments in the path: collapse them? Depends on OS?
This gets tricky round-tripping on Windows with UNC paths.
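To illustrate why UNC paths are the tricky case, here is a Python sketch of one collapsing policy for Windows paths. It is only one possible choice, and the function name is hypothetical. A naive collapse of every run of backslashes would turn \\host\share into \host\share, which is no longer a UNC path, so the leading pair is set aside first.

```python
import re


def collapse_empty_segments_windows(path: str) -> str:
    """Collapse runs of backslashes, but keep a leading UNC "\\\\" intact."""
    prefix = ''
    if path.startswith('\\\\'):
        prefix, path = '\\\\', path[2:]
    # A function replacement avoids backslash-escape rules in the repl string.
    return prefix + re.sub(r'\\{2,}', lambda match: '\\', path)


if __name__ == '__main__':
    print(collapse_empty_segments_windows(r'\\host\share\\dir\file'))
```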