Bug 132008 (Closed) - Enhancement of URL scheme for compressed directories
Opened 23 years ago; closed 9 years ago
Categories: Core :: Networking (enhancement)
Status: RESOLVED WONTFIX
Target Milestone: Future
Reporter: zdevir; Assignee: Unassigned
Keywords: helpwanted
IDEA
Some time ago, Mozilla introduced a special protocol named `jar://', which acted as a decompression layer on top of other protocols (such as `http://') or, implicitly, on top of the `file://' protocol. The syntax of the `jar://' protocol was something like:

    jar://<scheme>://<host>/<path>/<zipfile>!<file>

However, this special syntax is not trivial for the user and is not compliant with the relevant RFCs. I guess this is one of the reasons the protocol was removed and no longer works in recent Mozilla browsers.
In this enhancement proposal, I would like to introduce a different approach to browsing through compressed files, one that is compliant with RFC 2396 and more transparent to the user. Since this proposal concerns a capability that is currently hidden, I hope that browsing files inside a compressed directory will be enabled again, and will not be limited to ZIP archives only.
Adding the option to browse through an archive file, transparently to the user, has two main benefits. First, the user does not have to uncompress the archive and waste space on the uncompressed files. Second, for remote requests, the transmitted data is compressed, which reduces network overhead compared with standard HTTP requests.
URL SYNTAX
The standard URL syntax, according to RFC 1808, is:

    <scheme>://<host>/<path>;<params>?<query>#<fragment>

The fragment component identifies the section of the file the browser should focus on. Until now, this component was used only for HTML files. For any other file type, such as binary or compressed files, the fragment part had no use at all, and was consequently ignored by both the browser and the server.
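To make the baseline concrete, here is how a standard parser already separates the fragment from the rest of the URL (Python's urllib.parse is used purely for illustration; the proposal itself is implementation-agnostic):

    from urllib.parse import urlsplit

    parts = urlsplit("http://domain.com/path/site.zip#index.html")
    print(parts.path)      # /path/site.zip  -- the part the server sees
    print(parts.fragment)  # index.html      -- kept client-side, never sent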
The concept of a fragment can itself be extended to support focusing on a file within an archive. Therefore, I would like to propose an enhancement to the usage of the URL's fragment component: use this field as a marker telling the browser which file to extract from the archive and display.
For example, let's assume we got the following URL as an HTTP request:

    http://domain.com/path/site.zip#index.html

Assuming the file site.zip exists in http://domain.com/path/, we should get it as the response to the request, since the server ignores the fragment part of the URL. If the proposal is implemented, the browser should then extract the file index.html from the archive and display it like any normal HTML file. Furthermore, the base path of index.html is the root of the archive; therefore linked objects, such as images, should be extracted from the archive too, exactly as they would have been requested if index.html were a regular file on the server.
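A minimal sketch of the proposed behaviour, assuming the archive is a ZIP and the URL is reachable; fetch_archive_member is a hypothetical helper, not an existing browser API:

    import io
    import zipfile
    from urllib.parse import urlsplit
    from urllib.request import urlopen

    def fetch_archive_member(url):
        parts = urlsplit(url)
        # The server ignores the fragment, so only scheme://host/path is requested.
        archive_url = parts._replace(fragment="").geturl()
        with urlopen(archive_url) as resp:
            data = resp.read()
        # The client then extracts the member named by the fragment.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.read(parts.fragment)

    html = fetch_archive_member("http://domain.com/path/site.zip#index.html")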
The proposed enhancement makes the fragment component behave as an additional path component for the destination file. As a result, a few more considerations must be taken care of when implementing it. Every relative path in the destination file should treat the archive file itself as a subdirectory in the path. This is quite similar to the way some file managers (e.g., MC) browse through an archive file. If the file /site.zip#pages/index.html has a link to an image at "../pics/background.gif", the image is expected to be found at /site.zip#pics/background.gif. Note that in this example, the archive file itself has an internal directory structure.
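A sketch of this relative-path rule, assuming POSIX path semantics inside the archive; resolve_in_archive is a hypothetical helper:

    import posixpath

    def resolve_in_archive(base, relative):
        """base is e.g. '/site.zip#pages/index.html'."""
        archive, _, member = base.partition("#")
        # The archive behaves like a directory, so ordinary path rules apply
        # to the member path after the hash sign.
        target = posixpath.normpath(
            posixpath.join(posixpath.dirname(member), relative))
        return archive + "#" + target

    print(resolve_in_archive("/site.zip#pages/index.html",
                             "../pics/background.gif"))
    # -> /site.zip#pics/background.gif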
There are no restrictions on relative paths, and they may go beyond the scope of the current archive file. There should also be no restrictions on nested archives. In theory, one may address the file /site.zip#archives/yesterday.zip#log.html#09_00. Note that the first two hash signs are interpreted as files within archives, while the third one is interpreted as a section marker inside the log.html HTML file.
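One way a browser might disambiguate the hash signs is sketched below: a segment is treated as a file-within-archive only while the path before it ends in a known archive extension; the first non-archive segment ends the chain, and anything after it is an ordinary fragment. The extension list and helper name are illustrative assumptions, not part of the proposal:

    ARCHIVE_EXTS = (".zip", ".jar", ".tar.gz", ".tgz")

    def split_nested(url):
        segments = url.split("#")
        chain, i = [segments[0]], 1
        # Keep descending while the current tail looks like an archive.
        while i < len(segments) and chain[-1].lower().endswith(ARCHIVE_EXTS):
            chain.append(segments[i])
            i += 1
        fragment = "#".join(segments[i:]) or None
        return chain, fragment

    print(split_nested("/site.zip#archives/yesterday.zip#log.html#09_00"))
    # -> (['/site.zip', 'archives/yesterday.zip', 'log.html'], '09_00')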
IMPROVEMENTS AND OPTIMIZATIONS
When an archive is accessed, finding and extracting a specific file may require a full scan through the archive. However, for a standard archive file, the full scan need be performed only once, or not at all.
Some archive formats include an index of all the files within the archive. With such an index, accessing a compressed file can be done by reading or requesting only a small fragment of the archive. ZIP files store such an index (the central directory) at the end of the archive; for multiple-volume ZIP archives, it sits at the end of the last volume.
Other archive formats do not have an index. In this case, the browser should scan through the archive once, when it is first accessed or requested. During the scan, it collects the information for each file inside the archive and stores the assembled index in its internal cache. Accessing a file inside the archive is then much faster, as the file's exact location is known.
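A sketch of this scan-once-and-cache approach, using Python's tarfile over a plain (uncompressed) tar as a stand-in for an indexless format; the in-memory dict is a placeholder for the browser's cache:

    import tarfile

    _index_cache = {}  # archive path -> {member name: (data offset, size)}

    def build_index(archive_path):
        # One full scan on first access; later lookups hit the cached index.
        if archive_path not in _index_cache:
            index = {}
            with tarfile.open(archive_path) as tf:
                for member in tf.getmembers():
                    index[member.name] = (member.offset_data, member.size)
            _index_cache[archive_path] = index
        return _index_cache[archive_path]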
Requesting a large archive from a remote server only to browse a single file inside it is not efficient. However, if the archive has an index structure at its end, the browser can avoid downloading the whole archive. To avoid a full download, the browser should first ask for the archive size, and then use a partial read (supported by most FTP and HTTP servers) to request the last bytes of the archive. For an archive with an index, those bytes usually contain a pointer to the location of the index structure inside the archive. If so, the browser sends a second partial read to request the index structure itself. The browser can then calculate the exact location of the wanted file inside the archive, issue a third partial read, and uncompress the file.
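For ZIP specifically, the partial reads can be sketched as follows, assuming the server honours HTTP Range requests and the archive carries no trailing comment; all helper names are illustrative:

    import struct
    from urllib.request import Request, urlopen

    def content_length(url):
        # Step 0: ask for the archive size without downloading the body.
        with urlopen(Request(url, method="HEAD")) as resp:
            return int(resp.headers["Content-Length"])

    def read_range(url, start, end):
        # Partial read via an HTTP Range request (byte range is inclusive).
        req = Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
        with urlopen(req) as resp:
            return resp.read()

    def fetch_central_directory(url):
        size = content_length(url)
        # Step 1: the last 22 bytes hold the End-Of-Central-Directory record.
        eocd = read_range(url, size - 22, size - 1)
        if eocd[:4] != b"PK\x05\x06":
            raise ValueError("no EOCD at tail; fall back to a full download")
        cd_size, cd_offset = struct.unpack("<II", eocd[12:20])
        # Step 2: fetch only the central directory, which records every
        # member's exact offset; a third Range request then fetches the
        # compressed bytes of the one member we want.
        return read_range(url, cd_offset, cd_offset + cd_size - 1)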
Note that this optimization may not be applicable in some conditions, and may even hurt performance when the archive is small, so careful consideration is needed before implementing it. If the optimization is not carried out, the browser simply requests the whole archive, as it does now anyway.
LIMITATIONS
Reintroducing the `jar://' protocol (with or without the above extended form) has two limitations, or weak points. As mentioned in the previous chapter, downloading a large archive from a remote server is not advised, as the information actually needed is only a fraction of the archive file itself. The solution was to request the archive index first, and then request the actual file inside the archive. However, not all archives carry a full index at their tail. For indexless archives, the browser has no option but to download the full archive file and build its own index in the internal cache.
The second limitation is more problematic and concerns the support of compression methods other than ZIP. In a `normal' archive format, such as ZIP or ARJ, each file is compressed separately. However, several newer formats, such as RAR, ACE and TGZ (tarred gzip), can create a `solid archive': an archive produced by compressing all the files as one big concatenated stream. Unfortunately, uncompressing a single file from a solid archive requires uncompressing all the files that reside before it in the archive.
When this is the case, none of the indexing techniques discussed in the previous chapter apply. Furthermore, the browser must either uncompress all the files the first time, or repeat a full uncompress procedure for each batch of requests. In the first case, the browser would require an unbounded amount of space for the uncompressed files; in the second case, we get a huge overhead from the repeated uncompression.
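The cost is visible with a .tar.gz archive; a sketch using Python's streaming tarfile mode, where extracting one member necessarily decompresses everything stored before it:

    import tarfile

    def extract_one(archive_path, wanted):
        # Streaming mode ("r|gz") makes the cost explicit: members can only
        # be visited in order, decompressing every byte before the target.
        with tarfile.open(archive_path, mode="r|gz") as tf:
            for member in tf:
                if member.name == wanted:
                    return tf.extractfile(member).read()
        raise KeyError(wanted)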
As this limitation does not have a simple solution, I think supporting solid archives should be left aside for a future version, if supported at all.
COMPATIBILITY WITH CURRENT STANDARDS (RFC 1808/2396)
According to RFC 1808, everything beyond the first hash sign is the fragment section of the URL, and the URL itself may not include any other hash signs. Since a compliant client ignores everything beyond the hash sign, the suggested extension does not compromise the proper parsing of the standard URL format.
With the exception of a few exotic browsers [translation: the MSIE family], which try to infer the file type from its internal structure, normal browsers determine the file type from its extension. A browser may therefore safely interpret the fragment component according to the file extension. For any file other than the supported archive types, the browser interprets the fragment as usual; for archive files, the browser previously just offered to save the file, disregarding the fragment component anyway.
REFERENCES
RFC 1738 - Uniform Resource Locators (URL) [Draft Standard]
RFC 1808 - Relative Uniform Resource Locators [Proposed Standard]
RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax [Proposed Standard]
Bugzilla bug #64286 - The jar protocol
QA futuring of various bugs to focus on remaining bugs. If you think your bug
needs immediate attention for Mozilla 1.0, set the milestone to "--". I will
review these bugs (again) later.
Target Milestone: --- → Future
Updated 23 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
I believe this enhancement request should at least be thoughtfully considered,
for three reasons.
A. The basic feature does not require much work, as the mechanism is already
available in the browser. Also, the demand has been raised a few times in the
past, and will probably arise again in the future.
B. It extends the capabilities of the browser at no cost.
C. As this suggestion defines a variant of the URL scheme, if there is any
chance of implementing it in the future, I think it would be wise to do so
before a major release.
Target Milestone: Future → ---
+helpwanted - some engineers can decide if they want to future this. If they do
future it, please do not move the milestone back unless you find someone to own
and code the feature.
Keywords: helpwanted
bug 64286: web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar)
extends the compressed protocol scheme to single-file saved pages,
specifically including write support.
Blocks: 64286
Updated 16 years ago
Assignee: dveditz → nobody
Comment 7, 16 years ago
Well, it's been a long time since I've looked at this bug and bug 64286. When I wrote my comments on bug 64286, I hadn't discovered this one yet, so it's funny that Zvi and I arrived at pretty much the same conclusions. I just want to add that I think the handling of the "fragment" should depend on the MIME type. This can be tricky, because some web servers aren't properly configured and won't give proper MIME types; in cases where this flaw is obvious, Mozilla could perhaps take countermeasures and treat the file as the apparent compressed type, unless the magic (i.e., the file header) fails.
I agree that solid archive support should be left until later. I'm quite behind on HTTP changes and a lot has changed in Mozilla. Firefox now has an "Offline Storage", which appears to be separate from the cache. This would be the perfect mechanism for storing such files, which ideally wouldn't change often. I am not familiar with how this is communicated (I presume some HTTP header?) or what the rules around it are.
Finally, Zvi proposed downloading only the portions of a file that are needed; an excellent idea that I hadn't considered previously. This would require both a data file and a contents or hash file that records which portions of the data are downloaded and which aren't. More ideally, with a hash for each segment of the file, whether a segment is downloaded (and correct) can be determined at runtime, behaving a bit like a torrent peer when downloading.
As a programmer, this could be quite a nice feature for online API specifications (javadocs, doxygen, posix spec, etc.).
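[Editorial note: a minimal sketch of the segment map described in this comment, assuming fixed-size segments and one SHA-256 digest per segment; all names here are hypothetical.]

    import hashlib

    SEGMENT_SIZE = 64 * 1024  # illustrative segment size

    class SegmentMap:
        """Tracks which fixed-size segments of a remote file are present."""

        def __init__(self, total_size, expected_hashes):
            self.count = -(-total_size // SEGMENT_SIZE)  # ceiling division
            self.have = [False] * self.count
            self.expected = expected_hashes  # one SHA-256 digest per segment

        def store(self, index, data):
            # A segment only counts as downloaded if its hash verifies,
            # much like a torrent peer validating a received piece.
            if hashlib.sha256(data).digest() == self.expected[index]:
                self.have[index] = True
                return True
            return False  # corrupt: caller should re-request this segment

        def missing(self):
            return [i for i, ok in enumerate(self.have) if not ok]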
(In reply to comment #7)
> As a programmer, this could be quite a nice feature for online API
> specifications (javadocs, doxygen, posix spec, etc.).
I'm not sure what you're getting at, but FYI Mozilla's jar: file support already works for accessing compressed resources. If you save a web site as a .zip file, you can browse jar:file:///path/to/saved/archive.zip!/some/page.html and links and images work; if the web server serves the right MIME type, or you set network.jar.open-unsafe-types to true in about:config, you can browse the remote ZIP file without saving it locally.
This new scheme does seem to be an improvement.
Updated 9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX