Closed Bug 132008 Opened 18 years ago Closed 4 years ago
Enhancement of URL scheme for compressed directories
IDEA

Some time ago, Mozilla introduced a special protocol named `jar://', which was used as a decompressing encapsulation layer over other protocols (such as `http://') or implicitly over the `file://' protocol. The syntax of the `jar://' protocol was something like:

jar://<scheme>://<host>/<path>/<zipfile>!<file>

However, this special syntax is not trivial for the user and not compliant with the relevant RFCs. I guess this is one of the reasons the protocol was removed and isn't working in recent Mozilla browsers. In this enhancement proposal, I would like to introduce a different approach to browsing through compressed files, one that is compliant with RFC 2396 and more transparent to the user. As this proposal is about a feature that is now hidden, I hope the capability to browse a file through a compressed directory will be enabled again, and won't be limited to ZIP archives only.

Adding the option to browse through an archive file, transparently to the user, would have two main benefits. First, the user doesn't have to bother with uncompressing the archive and wasting space on the uncompressed files. Second, in the case of remote requests, the transmitted IP datagrams are compressed, reducing network overhead compared to standard HTTP requests.

URL SYNTAX

The standard URL syntax, according to RFC 1808, is:

<scheme>://<host>/<path>;<params>?<query>#<fragment>

The fragment component identifies the file section the browser should focus on. Until now, this component was used only for HTML files. For any other file type, such as binary or compressed files, the fragment part had no use at all, and was consequently ignored by both the browser and the server. The concept of a fragment can be extended to support focusing on a file within an archive. Therefore, I would like to propose an enhancement to the usage of the URL's fragment component.
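As a minimal sketch of the proposed reading of such URLs: the fragment of an archive URL names a file inside the archive, while any other URL keeps its usual fragment meaning. The function name and the set of "archive" extensions below are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical set of extensions the browser would treat as archives.
ARCHIVE_EXTENSIONS = (".zip", ".jar", ".tar.gz", ".tgz")

def split_archive_url(url):
    """Return (archive_url, inner_path) if the URL points into an
    archive, else (url, None) leaving the URL untouched."""
    parts = urlsplit(url)
    if parts.path.lower().endswith(ARCHIVE_EXTENSIONS) and parts.fragment:
        # Strip the fragment: that is what the server would see anyway.
        return parts._replace(fragment="").geturl(), parts.fragment
    return url, None

print(split_archive_url("http://domain.com/path/site.zip#index.html"))
# → ('http://domain.com/path/site.zip', 'index.html')
```

A non-archive URL such as `http://domain.com/page.html#top` passes through unchanged, so ordinary fragment navigation is unaffected.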
My proposal is to use this field as a marker for the browser, indicating which file to extract and display. For example, let's assume we get the following URL as an HTTP request:

http://domain.com/path/site.zip#index.html

Assuming the file site.zip exists in http://domain.com/path/, we should get it as the response to the request, since the server ignores the fragment part of the URL. If the proposal is implemented, the browser should extract the file index.html from the archive and display it like any normal HTML file. Furthermore, the base path of index.html is the root of the archive. Therefore, linked objects, such as images, should be extracted from the archive too, exactly as those objects would have been requested if index.html were a regular file on the server.

The proposed enhancement makes the fragment component behave as an additional path to the destination file. As a result, a few more considerations must be taken care of when implementing this enhancement. Every relative path in the destination file should regard the archive file itself as a subdirectory in the path. This is quite similar to the way a few file managers (e.g., MC) scan through an archive file. If a file /site.zip#pages/index.html has a link to an image at "../pics/background.gif", the image is expected to be found at /site.zip#pics/background.gif. Note that in this example, the archive file itself has an internal directory structure. There are no restrictions on a relative path, and it may go beyond the scope of the current archive file. Also, there should be no restrictions on nested archives. Theoretically, one may address the file /site.zip#archives/yesterday.zip#log.html#09_00. Note that the first two hash signs are interpreted as files within archives, while the third one is interpreted as a section marker inside the log.html HTML file.
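The "archive as subdirectory" rule above can be sketched by flattening the hash signs into path separators, resolving the relative link normally, and then restoring a hash after each archive component. This is only an illustration under the assumptions of the proposal; the helper name and the extension list are hypothetical.

```python
import posixpath
import re

# Hypothetical archive extensions; a real browser would use its
# registered archive types here.
ARCHIVE_RE = re.compile(r"(\.zip|\.jar|\.tgz)(/)", re.IGNORECASE)

def resolve_in_archive(base, relative):
    """Resolve a relative link against a path in which hash signs
    mark archive boundaries, treating each archive as a directory."""
    flat = base.replace("#", "/")          # /site.zip/pages/index.html
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(flat), relative))
    # Restore the hash sign after each archive component.
    return ARCHIVE_RE.sub(r"\1#", resolved)

print(resolve_in_archive("/site.zip#pages/index.html",
                         "../pics/background.gif"))
# → /site.zip#pics/background.gif
```

Because `..` is resolved on the flattened path, a relative link can also climb out of the archive entirely, matching the "no restrictions on relative path" point above; nested archives fall out of the same flattening.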
IMPROVEMENTS AND OPTIMIZATIONS

When an archive is accessed, finding and extracting a specific file may require a full scan through the archive. However, for a standard archive file, the full scan may be performed only once, or even less. Some archives include an index of all the files within the archive. This way, accessing a compressed file may be done by reading or requesting only a small fragment of the archive. Multiple-volume ZIP files always have an index structure at the end of the last volume of the archive. The index is sometimes added to a single-volume ZIP file as well, though usually it is not. Other archive files might not have an index. In this case, the browser should scan through the archive once, when the archive is first accessed or requested. During the scan, it will collect the information for each file inside the archive and store the assembled index in the internal cache. Accessing a file inside the archive will then be much faster, as the file's exact location is known.

Requesting a large archive from a remote server only to browse a single file inside it is not efficient. However, if the archive has an index structure at its end, the browser may avoid downloading the whole archive. To avoid a full download, the browser should first ask for the archive size, and then use a partial file read (supported by most FTP and HTTP servers) to ask for the last bytes of the archive. Usually, for an archive with an index, those bytes contain a pointer to the location of the index structure inside the archive. If this is the case, the browser should send a second partial read and request the index structure of the archive. Now the browser can calculate the exact location of the file inside the archive, initiate a third partial read request, and uncompress the file.
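For the ZIP case, the "pointer in the last bytes" is the end-of-central-directory record, which stores the size and offset of the central directory (the index the text refers to). A sketch of the first step, assuming the tail bytes were already fetched with an HTTP Range request; the helper name is hypothetical:

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # ZIP end-of-central-directory signature

def locate_central_directory(tail):
    """Given the last bytes of a ZIP file, return (offset, size) of
    the central directory, or None if no EOCD signature is present.
    The offset is absolute within the archive, so it can be used
    directly as the range for the second partial read."""
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        return None
    # EOCD layout: signature (4), disk numbers (2+2), entry counts
    # (2+2), central-directory size (4), central-directory offset (4).
    cd_size, cd_offset = struct.unpack_from("<II", tail, pos + 12)
    return cd_offset, cd_size
```

With the returned offset and size, the browser would issue the second Range request for the central directory itself, and a third for the compressed file data.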
Note that this optimization may not be applicable under some conditions, and may worsen performance if the archive is rather small, so careful consideration is needed before implementing it. Note that if this optimization is not carried out, the browser will simply request the whole archive, as it does anyway now.

LIMITATIONS

Reintroducing the `jar://' protocol (with or without the above extended form) has two limitations or weak points. As mentioned in the previous chapter, downloading a large archive from a remote server is not advised, as the actual information needed is only a fraction of the archive file itself. The solution was to request the archive index first, and then request the actual file inside the archive. However, not all archives carry a full index in their tail. For indexless archives, the browser will have no option but to download the full archive file and create its own archive index in the internal cache.

The second limitation is more problematic and concerns the support of compression methods other than ZIP. In a normal archive like ZIP, ARJ and other `normal' compression methods, each file is compressed separately. However, many of the newer compression algorithms, such as RAR, ACE and TGZ (TARred GZIP), may create a `solid archive'. A solid archive is an archive that was created by compressing all the files as one big concatenated file. Unfortunately, uncompressing a single file from a solid archive requires uncompressing all the files residing before it in the archive. When this is the case, all the indexing techniques discussed in the previous chapter are not relevant. Furthermore, the browser must now either uncompress all the files the first time, or repeat a full uncompress procedure for each batch of requests. In the first case, the browser would require an unbounded amount of space for the uncompressed files. In the second case, we get a huge overhead from the uncompressing process.
As this limitation does not have a simple solution, I think supporting solid archives should be left aside for a future version, if supported at all.

COMPATIBILITY WITH CURRENT STANDARDS (RFC 1808/2396)

According to RFC 1808, everything beyond the first hash sign is the fragment section of the URL, and the URL itself may not include any other hash signs. If the URL is compliant with the RFC standard, everything beyond the hash sign is ignored, and therefore the suggested extension doesn't compromise the proper parsing of the standard URL format. With the exception of a few exotic browsers [translation -- the MSIE family], which try to determine the file type according to its internal structure, normal browsers determine the file type according to its extension. Therefore, a normal browser may safely interpret the fragment component according to the file extension. For any file other than the supported archive types, the browser should interpret the fragment component as usual. For archive files, the browser previously usually asked to save the file, disregarding the fragment component anyway.

REFERENCES

RFC 1738 - Uniform Resource Locators (URL) [Draft Standard]
RFC 1808 - Relative Uniform Resource Locators [Proposed Standard]
RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax [Proposed Standard]
Bugzilla bug #64286 - The jar protocol
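The extension-based dispatch described above can be sketched in a few lines; the function name and the supported-type set are hypothetical, purely to make the rule concrete.

```python
# Hypothetical set of archive types the browser supports.
ARCHIVE_TYPES = {".zip", ".jar"}

def fragment_meaning(path, fragment):
    """Classify a URL fragment by the file extension of the path:
    for supported archives it names a member file, otherwise it is
    the usual in-document section anchor."""
    ext = path[path.rfind("."):].lower() if "." in path else ""
    if ext in ARCHIVE_TYPES:
        return ("archive-member", fragment)
    return ("section-anchor", fragment)
```

Under this rule, `/site.zip#index.html` is an archive-member reference, while `/page.html#top` keeps its standard RFC 1808 meaning, so existing pages are unaffected.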
QA futuring of various bugs to focus on remaining bugs. If you think your bug needs immediate attention for Mozilla 1.0, set the milestone to "--". I will review these bugs (again) later.
Target Milestone: --- → Future
I believe this enhancement request should at least be thoughtfully considered, for three reasons.

A. The basic feature does not require much work, as the mechanism is already available in the browser. Also, the demand has been raised a few times in the past, and will probably arise again in the future.
B. It extends the capabilities of the browser at no cost.
C. As this suggestion defines a variant of the URL scheme, if there is a chance of implementing it in the future, I think it would be wise to implement it before a major release.
Target Milestone: Future → ---
+helpwanted - some engineers can decide if they want to future this. If they do future it, please do not move the milestone back unless you find someone to own and code the feature.
Assignee: new-network-bugs → dveditz
Target Milestone: --- → Future
bug 64286: web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar) extends the compressed protocol scheme to single-file saved pages, specifically including write support.
Well, it's been a long time since I've looked at this bug and bug 64286. When I wrote my comments on bug 64286, I hadn't discovered this one yet, so it's funny that Zvi and I arrived at pretty much the same conclusions.

I just want to add that I think the handling of the "fragment" should depend on the MIME type. This can be tricky sometimes because some web servers aren't properly configured and won't give proper MIME types, so in cases where this flaw is obvious, Mozilla can perhaps take countermeasures and attempt to treat the file as the apparent compressed file type, unless the magic (i.e., file header) check fails. I agree that solid archive support should be left until later.

I'm quite behind on HTTP changes and a lot has changed in Mozilla. Firefox now has an "Offline Storage" which appears to be separate from the cache. This would be the perfect mechanism for storing such files, which ideally wouldn't change often. I am not familiar with how this is communicated (I presume some HTTP header?) or what the rules are around it.

Finally, Zvi proposed downloading only the portions of a file that are needed; an excellent idea that I hadn't considered previously. This would require both a data file and a contents or hash file that tells which portions of the data are downloaded and which aren't -- or, more ideally, hashes for each segment of the file, so that whether each segment is downloaded (and correct) can be determined at runtime, behaving a bit more like a torrent peer (when downloading). As a programmer, this could be quite a nice feature for online API specifications (javadocs, doxygen, posix spec, etc.).
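The per-segment hash idea in this comment can be sketched as follows. The segment size and the use of SHA-1 are assumptions for illustration only; nothing in the thread specifies either.

```python
import hashlib

SEGMENT_SIZE = 64 * 1024   # hypothetical segment size

def verify_segments(data, expected_hashes):
    """Return a per-segment list of booleans: whether each segment's
    SHA-1 digest matches the expected one -- the torrent-like check
    the comment suggests for deciding which ranges need refetching."""
    ok = []
    for i, digest in enumerate(expected_hashes):
        chunk = data[i * SEGMENT_SIZE:(i + 1) * SEGMENT_SIZE]
        ok.append(hashlib.sha1(chunk).hexdigest() == digest)
    return ok
```

Segments that come back False would be the ones to request again with a Range request, without re-downloading the rest of the archive.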
(In reply to comment #7) > As a programmer, this could be quite a nice feature for online API > specifications (javadocs, doxygen, posix spec, etc.). I'm not sure what you're getting at, but FYI Mozilla's jar: file support already works for accessing compressed resources. If you save a web site as a .zip file you can browse jar:file:///path/to/saved/archive.zip!/some/page.html and links and images work; if the web server serves the right mime type or you set network.jar.open-unsafe-types in about:config to true, you can browse the remote ZIP file without saving it locally. This new scheme does seem to be an improvement.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX