Closed Bug 132008 Opened 22 years ago Closed 8 years ago

Enhancement of URL scheme for compressed directories

Categories

(Core :: Networking, enhancement)


Tracking


RESOLVED WONTFIX
Future

People

(Reporter: zdevir, Unassigned)

References

Details

(Keywords: helpwanted)

IDEA

Some time ago, Mozilla introduced a special protocol named `jar://', which was used as a decompression layer encapsulating other protocols (such as `http://'), or implicitly over the `file://' protocol. The syntax of the `jar://' protocol was something like:
	jar://<scheme>://<host>/<path>/<zipfile>!<file>

However, this special syntax is not trivial for the user, and not compliant with the relevant RFCs. I guess this is one of the reasons the protocol was removed and no longer works in recent Mozilla browsers.

In this enhancement proposal, I would like to introduce a different approach to supporting browsing through compressed files, one which is compliant with RFC 2396 and more transparent to the user. As this proposal concerns a feature that is now hidden, I hope the capability to browse a file inside a compressed directory will be enabled again, and won't be limited to ZIP archives only.

Adding the option to browse through an archive file, transparently to the user, would have two main benefits. First, the user won't need to uncompress the archive and waste space on the uncompressed files. Second, in the case of remote requests, the transmitted IP datagrams carry compressed data, reducing network overhead compared to standard HTTP requests.


URL SYNTAX

The standard URL syntax, according to RFC 1808, is:
	<scheme>://<host>/<path>;<params>?<query>#<fragment>

The fragment component identifies the section of the file the browser should focus on. Until now, this component was used only for HTML files. For any other file type, such as binary files or compressed files, the fragment part had no use at all, and was consequently ignored both by the browser and by the server.

The concept of a fragment can itself be extended to support focusing on a file within an archive. Therefore, I would like to propose an enhancement to the usage of the URL's fragment component. My proposal is to use this field as a mark for the browser, indicating which file to extract and display properly.
For example, let's assume we got the following URL as an HTTP request:
	http://domain.com/path/site.zip#index.html

Assuming the file site.zip exists in http://domain.com/path/, we should get it as the response to the request, since the server ignores the fragment part of the URL. If the proposal is implemented, the browser should extract the file index.html from the archive and display it as any normal HTML file. Furthermore, the base path of index.html is the root of the archive; therefore linked objects, such as images, should be extracted from the archive too, exactly as those objects would have been requested if index.html were a regular file on the server.

The proposed enhancement makes the fragment component behave as an additional path to the destination file. As a result, a few more considerations must be taken care of when implementing this enhancement. Every relative path in the destination file should regard the archive file itself as a subdirectory in the path. This is quite similar to the way some file managers (e.g., MC) scan through an archive file. If a file /site.zip#pages/index.html has a link to an image at "../pics/background.gif", the image is expected to be found at /site.zip#pics/background.gif. Note that in this example, the archive file itself has an internal directory structure. There are no restrictions on relative paths, and they may go beyond the scope of the current archive file.

Also, there should be no restrictions on nested archives. Theoretically, one may address the file /site.zip#archives/yesterday.zip#log.html#09_00. Note that the first two hash signs are interpreted as files within archives, whereas the third one is interpreted as a section marker inside the log.html HTML file.
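The splitting rule described above can be sketched in a few lines of Python (not part of the proposal; the fixed list of archive extensions is an assumption for illustration -- a real implementation would presumably decide by MIME type as well):

```python
# Illustrative sketch of the proposed hash-splitting rule.
# Assumption: a segment ending in a known archive extension means the
# next '#' introduces a file inside that archive; any remaining hash
# is an ordinary section marker.
ARCHIVE_EXTS = (".zip", ".jar", ".tgz")

def split_archive_path(path):
    """Split e.g. 'site.zip#a/b.zip#log.html#09_00' into
    (['site.zip', 'a/b.zip'], 'log.html', '09_00')."""
    parts = path.split("#")
    archives = []
    i = 0
    # Consume leading segments as long as they name an archive file.
    while i < len(parts) - 1 and parts[i].lower().endswith(ARCHIVE_EXTS):
        archives.append(parts[i])
        i += 1
    inner = parts[i]
    fragment = "#".join(parts[i + 1:]) or None  # leftover hash = section marker
    return archives, inner, fragment
```

On the nested example above, this yields the two archive levels, the inner file log.html, and the section marker 09_00; on an ordinary URL like page.html#section it degrades to the standard fragment behaviour.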


IMPROVEMENTS AND OPTIMIZATIONS

When an archive is accessed, finding and extracting a specific file may require a full scan through the archive. However, for a standard archive file, the full scan may need to be performed only once, or not at all. Some archive formats include an index of all the files within the archive. This way, accessing a compressed file may be done by reading or requesting only a small fragment of the archive. Multi-volume ZIP files always have an index structure at the end of the last volume of the archive file. The index is sometimes added to a single-volume ZIP file as well, though usually not.

Other archive files might not have an index. In this case, the browser should scan through the archive once, when the archive is first accessed or requested. During the scan, it will collect the information for each file inside the archive and store the assembled index in the internal cache. Accessing a file inside the archive would then be much faster, as the file's exact location is known.
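For ZIP specifically, the index (the central directory) already maps each member name to its byte offset, which is exactly the lookup table the proposal suggests caching. A minimal sketch using Python's standard zipfile module, on an in-memory archive standing in for a downloaded one:

```python
import io
import zipfile

# Build a small in-memory ZIP standing in for a fetched archive, then
# read its index (the central directory) without decompressing any
# member data -- the per-archive lookup table the proposal would cache.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("index.html", "<html>hello</html>")
    zf.writestr("pics/background.gif", b"GIF89a")

with zipfile.ZipFile(buf) as zf:
    # infolist() is served from the central directory at the end of the
    # file; each entry records the member's offset and sizes.
    index = {info.filename: info.header_offset for info in zf.infolist()}
```

With such an index cached, resolving /site.zip#pics/background.gif is a dictionary lookup plus one decompression, rather than a rescan of the archive.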

Requesting a large archive from a remote server only to browse a single file inside it is not efficient. However, if the archive has an index structure at its end, the browser may avoid downloading the whole archive. To avoid a full download, the browser should first ask for the archive size, and then use a partial file read (supported by most FTP and HTTP servers) to request the last bytes of the archive. Usually, for an archive with an index, those bytes contain a pointer to the location of the index structure inside the archive. If this is the case, the browser should send a second partial read to request the index structure of the archive. The browser can then calculate the exact location of the file inside the archive, initiate a third partial-read request, and uncompress the file.

Note that this optimization may not be applicable under some conditions, and may worsen performance if the archive is rather small, so careful consideration is needed before implementing it. Note also that if this optimization is not carried out, the browser simply requests the whole archive, as it does anyway now.
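The three-step lookup can be sketched for ZIP: the last 22 bytes of an archive without a trailing comment are the End of Central Directory record, whose fields point at the central directory. The sketch below parses those tail bytes (as a real client would obtain them with an HTTP "Range: bytes=-22" request); an in-memory archive stands in for the remote file, and the no-comment assumption is a simplification:

```python
import io
import struct
import zipfile

EOCD_SIG = 0x06054b50  # End of Central Directory record signature

def locate_central_directory(tail):
    """Given the last 22 bytes of a comment-less ZIP (e.g. fetched with
    an HTTP Range request), return (offset, size) of the central directory."""
    sig, _, _, _, _, cd_size, cd_offset, _ = struct.unpack("<IHHHHIIH", tail[-22:])
    if sig != EOCD_SIG:
        raise ValueError("no End of Central Directory record found")
    return cd_offset, cd_size

# Demo: an in-memory archive standing in for the remote file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("index.html", "<html></html>")
data = buf.getvalue()

offset, size = locate_central_directory(data[-22:])
# A second ranged request would now fetch data[offset:offset+size] (the
# index), and a third would fetch just the wanted member's bytes.
```

This is why the proposal needs only three small ranged reads for an indexed archive, instead of transferring the whole file.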


LIMITATIONS

Reintroducing the `jar://' protocol (with or without the above extended form) has two limitations, or weak points.

As mentioned in the previous chapter, downloading a large archive from a remote server is not advised, as the information actually needed is only a fraction of the archive file itself. The solution was to request the archive index first, and then request the actual file inside the archive. However, not all archives carry a full index in their tail. For index-less archives, the browser will have no option but to download the full archive file and create its own archive index in the internal cache.

The second limitation is more problematic and concerns support for compression methods other than ZIP. In a normal archive format like ZIP, ARJ, and other `normal' compression methods, each file is compressed separately. However, many of the newer compression formats, such as RAR, ACE, and TGZ (a tarred GZIP), may create a `solid archive'. A solid archive is an archive created by compressing all the files as one big concatenated stream. Unfortunately, uncompressing a single file from a solid archive requires uncompressing all the files residing before it in the archive. When this is the case, all the indexing techniques discussed in the previous chapter are irrelevant. Furthermore, the browser must either uncompress all the files the first time, or repeat a full uncompress procedure for each batch of requests. In the first case, the browser would require an unbounded amount of space for the uncompressed files. In the second case, we get a huge overhead from the uncompressing process. As this limitation does not have a simple solution, I think supporting solid archives should be left aside for a future version, if supported at all.


COMPATIBILITY WITH CURRENT STANDARDS (RFC 1808/2396)

According to RFC 1808, everything beyond the first hash sign is the fragment component of the URL, and the URL itself may not include any hash signs. If the URL is compliant with the RFC standard, everything beyond the hash sign is ignored, and therefore the suggested extension doesn't compromise the proper parsing of the standard URL format.

With the exception of a few exotic browsers [translation: the MSIE family], which try to infer the file type from its internal structure, normal browsers determine the file type according to its extension. Therefore, a normal browser may safely interpret the fragment component according to the file extension. For any file other than the supported archive types, the browser should interpret the fragment component as usual. For archive files, the browser previously used to ask to save the file, disregarding the fragment component anyway.


REFERENCES

RFC 1738 - Uniform Resource Locators (URL) [Draft Standard]
RFC 1808 - Relative Uniform Resource Locators [Proposed Standard]
RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax [Proposed Standard]
Bugzilla bug #64286 - The jar protocol
QA futuring of various bugs to focus on remaining bugs. If you think your bug needs immediate attention for Mozilla 1.0, set the milestone to "--". I will review these bugs (again) later.
Target Milestone: --- → Future
Status: UNCONFIRMED → NEW
Ever confirmed: true
I believe this enhancement request should at least be thoughtfully considered, for three reasons.
A. The basic feature does not require much work, as the mechanism is already available in the browser. Also, the demand has been raised a few times in the past, and will probably arise again in the future.
B. It extends the capabilities of the browser at no cost.
C. As this suggestion defines a variant of the URL scheme, if there is a chance it will be implemented in the future, I think it would be wise to implement it before a major release.
Target Milestone: Future → ---
+helpwanted - some engineers can decide if they want to future this. If they do
future it, please do not move the milestone back unless you find someone to own
and code the feature.
Keywords: helpwanted
--> dan
Assignee: new-network-bugs → dveditz
Target Milestone: --- → Future
bug 64286: web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar)

extends the compressed protocol scheme to single-file saved pages, 
specifically including write support.
Blocks: 64286
Assignee: dveditz → nobody
Well, it's been a long time since I've looked at this bug and bug 64286.  When I wrote my comments on bug 64286, I hadn't discovered this one yet, so it's funny that Zvi and I arrived at pretty much the same conclusions.  I just want to add that I think the handling of the "fragment" should depend on the MIME type.  This can be tricky sometimes because some web servers aren't properly configured and won't give proper MIME types, so in cases where this flaw is obvious, Mozilla can perhaps take countermeasures and attempt to treat it as the apparent compressed file type, unless the magic (i.e., file header) check fails.

I agree that solid archive support should be left until later.  I'm quite behind on HTTP changes and a lot has changed in Mozilla. Firefox now has an "Offline Storage" which appears to be separate from the cache.  This would be the perfect mechanism for storing such files, which ideally wouldn't change often.  I am not familiar with how this is communicated (I presume some HTTP header?) or what the rules are around it.

Finally, Zvi proposed downloading only the portions of a file that are needed; an excellent idea that I hadn't considered previously.  This would then require both a data file and a contents or hash file recording which portions of the data have been downloaded and which haven't; or, more ideally, with hashes for each segment of the file, whether or not a segment has been downloaded (and is correct) can be determined at runtime, behaving a bit more like a torrent peer (when downloading).

As a programmer, this could be quite a nice feature for online API specifications (javadocs, doxygen, posix spec, etc.).
(In reply to comment #7)
> As a programmer, this could be quite a nice feature for online API
> specifications (javadocs, doxygen, posix spec, etc.).

I'm not sure what you're getting at, but FYI Mozilla's jar: file support already works for accessing compressed resources.  If you save a web site as a .zip file you can browse jar:file:///path/to/saved/archive.zip!/some/page.html and links and images work; if the web server serves the right mime type or you set network.jar.open-unsafe-types in about:config to true, you can browse the remote ZIP file without saving it locally.

This new scheme does seem to be an improvement.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX