Open Bug 64286 Opened 19 years ago Updated 4 years ago
web archive save/view support (KDE/Konqueror-like .war, .zip, .tgz, .tar.gz, .tar.bz2, tb2, .tbz2, .jar)
40.29 KB, application/octet-stream
It would be very practical to be able to browse directly any .zip, .tgz, .jar, or .tar file, like: file:///home/me/thing.zip/index.html or: http://my.site.com/dir/myarch.tgz/page1.html We would obviously need a new mime-type, like application/browseable-archive, which could be enabled on the web-server side. While browsing the archive, the root directory should be reset to the root of the archive. Then if the page http://my.site.com/dir/myarch.tgz/page1.html contains a link like: <a href="/dir2/file2.html"> it would be accessible with the URL: http://my.site.com/dir/myarch.tgz/dir2/file2.html corresponding to the file /dir2/file2.html in myarch.tgz. It would greatly simplify off-line browsing, saving pages with images, and transporting and updating web sites.
Summary: .zip, .tgz, .tar, or .jar archive off/on-line browsing → [RFE] .zip, .tgz, .tar, or .jar archive off/on-line browsing
This sounds quite good; I would really like this feature, too. Changing component to Networking and Status to NEW.
Status: UNCONFIRMED → NEW
Component: Browser-General → Networking
Ever confirmed: true
over to default owners. cc'ing mscott and bill in case they're interested...
Assignee: asa → neeti
QA Contact: doronr → tever
we already have the basics via the jar protocol. unfortunately it doesn't seem to return directory lists. i think we really need a browse protocol [possibly implied]. browse:ftp:, browse:file: browse:jar:file: browse:jar:ftp:. One thing i'm not sure about is whether we can use jar through multiple archives. jar:jar:file:a.jar!/b.jar!/
mass move, v2. qa to me.
QA Contact: tever → benc
This can also be used in a Save As capacity, like MHTML; see bug 82118 (perhaps bug 82118 should also depend on this?) as well as bug 18764 and bug 40873. The potential for Save As in archive files (tar, zip, tgz, ...) is great; you can preserve all files without changing the content of any (as "Save As Web Page, Complete" and what MHTML would do). The preservation would be performed in an index file of some sort (say "archive.index" or "archive.conf"), in which each line holds a full URI followed by a space and then the archive file's name (then a newline); the first entry would be the page viewed by Mozilla. This archive.index would also allow Mozilla to retain the http protocol, since it could simply look up a URI in the index to determine whether or not to display the contents from the archive. ...I seem to be the only person interested in the preservation of original content! (Example: an image on a page pointing to CNN is less likely to be a hoax than a local image.) (Another suggestion of mine is an option to add a <base href=uri> line to a saved html document, where uri is the page's full URI.)
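The index format proposed in this comment (one "full URI, space, archive file name" pair per line, first entry being the viewed page) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names and the sample entries are hypothetical, and nothing like this exists in Mozilla.

```python
def parse_archive_index(text):
    """Parse 'archive.index' lines of the form '<full URI> <archive file name>'.

    Hypothetical helper: the format follows the proposal in this comment,
    not any existing Mozilla file format.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Split on the first space only, so file names may contain spaces.
        uri, member = line.split(" ", 1)
        entries.append((uri, member))
    return entries


def lookup_member(entries, uri):
    """Return the archive file name stored for a URI, or None if absent."""
    for entry_uri, member in entries:
        if entry_uri == uri:
            return member
    return None


# Made-up sample index; the first entry is the page that was being viewed.
SAMPLE_INDEX = (
    "http://example.com/news/story.html story.html\n"
    "http://example.com/img/photo.jpg photo.jpg\n"
)
sample_entries = parse_archive_index(SAMPLE_INDEX)
viewed_page_uri = sample_entries[0][0]
```

With such a table the viewer could keep showing the original http URI and only fall back to the network when `lookup_member` returns None.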
FYI: A similar enhancement I proposed -- BUG #132008.
I just wrote the following script: http://relativity.yi.org/rss/permalink/1021881873-1.shtml It works within Konqueror and allows browsing of HTML files and their page requisites offline. It is just an index.html file wrapped in a .tar.gz but named .war. Mozilla should/could support the same mechanism. I don't know why it doesn't already...
Another point I wanted to make. Konqueror supports a WAR format (web archive) that is just a tar.gz but renamed to a .war extension. When this bug is fixed, it would be nice if Mozilla supported Konqueror .war files. I will attach a testcase. It should be pretty easy to support. 1. See if the file being opened is .war 2. Handle it as tar.gz 3. If index.html exists, automatically view it.
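The three steps above can be sketched with Python's standard tarfile module. This is a minimal sketch, not Konqueror's or Mozilla's code; the function name is hypothetical, and it assumes the start page is stored under the exact member name "index.html" (some archives store it as "./index.html", which this does not handle).

```python
import tarfile


def read_war_start_page(path, start_page="index.html"):
    """Return the bytes of the start page in a Konqueror-style .war, or None."""
    # Step 1: see if the file being opened is a .war.
    if not path.lower().endswith(".war"):
        return None
    # Step 2: handle it as a gzip-compressed tar archive.
    with tarfile.open(path, mode="r:gz") as archive:
        # Step 3: if index.html exists, hand it back for viewing.
        try:
            member = archive.getmember(start_page)
        except KeyError:
            return None  # no start page in this archive
        extracted = archive.extractfile(member)
        return extracted.read() if extracted is not None else None
```

A real viewer would also have to serve the page requisites (images, stylesheets) from the same archive when the start page references them.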
Once we support a tar.gz format it should be easy to support Konqueror style .war files. See my comments.
*** Bug 149871 has been marked as a duplicate of this bug. ***
Could this be implemented with Gnome VFS? http://www.ximian.com/devzone/tech/gnome-vfs.html Also, this bug seems to have stagnated. Is anyone working on this? Could we get a Target Milestone other than the ambiguous and non-specific "Future"? I think this one is important.
moving neeti's futured bugs for triaging.
Assignee: neeti → new-network-bugs
Component: Networking → File Handling
Summary: [RFE] .zip, .tgz, .tar, or .jar archive off/on-line browsing → .zip, .tgz, .tar, or .jar archive off/on-line browsing
added as dependency to tracking bug 82118 and mentioned in bug 176722 i would REALLY love to see this working in 1.4a ...it already exists in many browsers and .war compressed web archive save/view support would be a 'killer app' feature. added self to cc
summary change was: .zip, .tgz, .tar, or .jar archive off/on-line browsing is now: web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar) hope that makes this bug more visible (and clear) should this bug depend on bug 132008?
Summary: .zip, .tgz, .tar, or .jar archive off/on-line browsing → web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar)
And when the mechanism works, it shouldn't be too complicated to implement CHM (BUG 123320) and MHTML (BUG 18764) filters too, just as alternative decompressors. BTW, I once tried to compile Mozilla under Windows. Does anyone know where I can find MSDEV project for it?
Assignee: new-network-bugs → law
QA Contact: benc → petersen
*** Bug 186283 has been marked as a duplicate of this bug. ***
I have a sample for this; see jar:http://www.geocities.com/bijumaillist/mozilla/mozilla.zip!/mozilla.htm (if you are unable to see it, try http://geocities.com/bijumaillist/go.html#jar:http://www.geocities.com/bijumaillist/mozilla/mozilla.zip!/mozilla.htm or the snipped url http://snipurl.com/56pg ) (In reply to comment #3) > One thing i'm not sure about is whether we can use jar through multiple > archives. jar:jar:file:a.jar!/b.jar!/ To see an example of a jar inside a jar, go to http://forums.mozillazine.org/viewtopic.php?t=55620 (but you can't see the dir listing)
The JAR protocol enables viewing/entering ZIP (=JAR/XPI) files. This feature has existed for a long time, and is a good start. In bug 132008, I suggested that rather than using the complicated scheme of jar:http://www.geocities.com/bijumaillist/mozilla/mozilla.zip!/mozilla.htm we use http://www.geocities.com/bijumaillist/mozilla/mozilla.zip#mozilla.htm which is far simpler to remember and use, and does not require the definition of new schemes for each archive format. Besides ZIP/JAR/XPI, there are a lot of other archive formats which are not readable by Mozilla using any scheme (WAR, TAR/TGZ, MHTML, CHM, etc.).
(Apologies if this comment is redundant.) You cannot use '#' in these URIs because that breaks relative URI references in the HTML files in the archive. For example, if a document 'http://example.com/foo.war#bar.html' links to 'baz.html' then this resolves to 'http://example.com/baz.html', which is not what you wanted. More importantly, I don't think you need this feature at all for HTTP URIs. It would make sense to use URIs like 'http://example.com/foo/bar.html', and let the web server worry about extracting resources from the WAR file. You get compression and so on as HTTP 1.1 features. This is how some servers already work (using WAR files in the Java servlet sense). On the other hand, it does make sense to store archives locally as file URIs (as in 'file:///home/bill/foo.war/bar.html' or 'file:///home/bill/foo.mhtml/bar.html'). But they should not need any special syntax to be processed by Mozilla: after establishing that there is no file '/home/bill/foo.war/bar.html', Mozilla could recheck the path element by element looking for archives or other special cases. In short, you don't need a new protocol for pulling things out of archives: for most URI schemes, it is the server's problem, and for local archives, a fairly simple elaboration of the 'file' protocol handler can do the job.
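Both points in this comment can be checked with a short Python sketch. The first part uses the standard urllib.parse.urljoin to show how a relative reference resolves against a base that carries a fragment. The second part is only a rough illustration of the "recheck the path element by element" idea; the function name and the suffix list are assumptions, not an existing Mozilla API.

```python
import os
from urllib.parse import urljoin

# Standard relative resolution discards the base URI's fragment, so a link
# to 'baz.html' resolves next to foo.war on the server, not inside it.
resolved = urljoin("http://example.com/foo.war#bar.html", "baz.html")

# Hypothetical list of suffixes a file handler might treat as archives.
ARCHIVE_SUFFIXES = (".war", ".zip", ".jar", ".tgz", ".tar", ".mhtml")


def split_archive_path(path):
    """Walk a '/'-separated local path element by element.

    Returns (archive_file, inner_path) for the first prefix that is an
    existing regular file with an archive-like suffix, or None.
    """
    parts = path.split("/")
    for i in range(1, len(parts) + 1):
        candidate = "/".join(parts[:i])
        if (candidate
                and os.path.isfile(candidate)
                and candidate.lower().endswith(ARCHIVE_SUFFIXES)):
            return candidate, "/".join(parts[i:])
    return None
```

A file handler would call `split_archive_path` only after establishing that the full path does not exist as an ordinary file, as the comment describes.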
Your comment is not redundant, but posting it in BUG 132008 might have been more suitable. First, you cannot use the slash sign as the archive separator. In http://example.com/foo.war/bar.html, foo.war is considered the path for bar.html. The browser asks the server for foo.war/bar.html. Since this file doesn't exist (foo.war is a file and not a path), the browser gets the 404.html file (with a 200 OK response). It isn't even aware that it got an error page. I think the # sign is more suitable since it is already used as the fragment marker in the standard URI scheme. Naturally, the fragments of an archive file are the files inside it. Also, this behavior does not contradict the relevant RFCs. As for your example, if baz.html is linked at http://example.com/foo/bar.html, you expect it to be resolved as http://example.com/foo/baz.html. Similarly, if it is linked at http://example.com/foo.war#bar.html, it will be resolved as http://example.com/foo.war#baz.html. There is only one non-obvious behavior in my proposal. If http://example.com/foo.war#path/bar.html links to /baz.html, should it be resolved to http://example.com/baz.html, or to http://example.com/foo.war#bar.html? In other words, what is the root of http://example.com/foo.war#path/bar.html?
(In reply to comment #22) > You comment is not redundant, but posting it in BUG 132008 might have been more > suitable. > > First, you cannot use the slash sign as archive separator. In > http://example.com/foo.war/bar.html, foo.war is considered as the path for > bar.html. The browser asks the server for foo.war/bar.html. Since this file > doesn't exist (foo.war is file and not path), the server gets the 404.html file > (with 200 OK respond). It doesn't even aware that it got an error page. > > I think # sign is more suitable since it is already used as the fragmentation > marker in the standard URI scheme. Naturally fragments of an archive file are > the files inside it. Also this behavior does not contradict the relevant RFC's. > > As for your example, If baz.html is linked at http://example.com/foo/bar.html, > you expect is to be resolved as http://example.com/foo/baz.html. > Similarly, if it is linked at http://example.com/foo.war#bar.html, it will be > resolved as http://example.com/foo.war#baz.html. > > There is only one non-obvious behavior in my proposal. If > http://example.com/foo.war#path/bar.html links to /baz.html, should it be > resolved to http://example.com/baz.html, or to http://example.com/foo.war#bar.html. > In other words, what is the root of http://example.com/foo.war#path/bar.html. Remember that the "root" gets determined by the client (browser), and I personally think that “#” should indicate the root of the archive file. Once we specify a URI to an archive file, we no longer care about the server it came from, only the archive itself and its contents. Also, keep in mind that # may be used to designate a place within an html file.
Suppose I encounter the link "bar.html#credits" within the file bar.html located in the archive foo.zip (or foo.war; a WAR file in J2EE terms is actually designed not to be disclosed by the application server, since it usually contains information that could be used to compromise the server, so that's probably not a good example, but whatever). If I use the path http://example.com/foo.war#bar.html to access the file bar.html and it has a link of "#credits", then do we end up with "http://example.com/foo.war#bar.html#credits" or "http://example.com/foo.war#credits"? To clarify this issue, let's first define a few things based on RFC 2396 "Uniform Resource Identifiers (URI): Generic Syntax" (http://www.faqs.org/rfcs/rfc2396.html). A "URI Reference" is defined in section 4 of RFC 2396 as follows: URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] So in the example above of “http://example.com/foo.war#credits”, using RFC 2396 rules, "http://example.com/foo.war" would be the URI (absolute in this case) and "credits" would be the fragment (note the "#" is excluded because it's just the character that marks the end of the URI and the beginning of the fragment). Thusly, in the example of “http://example.com/foo.war#bar.html#credits”, “http://example.com/foo.war#credits” is the URI again and “bar.html#credits” is the fragment. This time, only the 1st “#” character was excluded as per the RFC; it says nothing about stripping out subsequent “#” characters. Section 4.1 of RFC 2396 states that "the format and interpretation of fragment identifiers is dependent on the media type"; by media type it means MIME type, and it refers to the MIME RFC. Thusly, it would seem to make sense that resolution should occur in a hierarchical manner, using the “#” as a delimiter between levels of fragmentation as well as the delimiter between the URI and the fragment itself.
Each level of fragmentation should be resolved based upon the MIME type of the item being referred to. For right now, the only MIME types that I am aware of respecting (or commonly respecting) URI fragments are text/html and our proposal (which would probably include all of the compressed MIME types). I DO NOT recommend or advocate creating a new MIME type as mentioned in an earlier post, because MIME types already exist for each of these compressed formats. So by these rules, let’s look at the implications of a few examples. URI Reference: http://example.com/foo.zip#credits The MIME type of foo.zip is application/zip, which would tell Mozilla to download and CACHE the file foo.zip and then retrieve the file named credits from the root folder of the archive if it exists. URI Reference: http://example.com/foo.zip#bar.html#credits Once again, the MIME type of foo.zip would be application/zip. Mozilla looks at the 1st item in the fragment list, which is “bar.html”, and would retrieve bar.html from the root of the archive. Then Mozilla will examine the MIME type of this bar.html, which is text/html. Since Mozilla knows how to handle fragments for text/html files, it will look for an anchor with this name. URI Reference: http://example.com/fu.zip#bar.tar.gz#iraq.jar#src/forignpolicy/screwups/invasions/LetsGetSadam.java Here is a more complex example. Assume that fu.zip is a valid application/zip (.zip) file that contains a file in its root named bar.tar.gz, which is a valid application/x-gtar (gzipped tape archive) file that contains a file in its root named iraq.jar, which is a valid application/java-archive file that contains the file named LetsGetSadam.java in the path src/forignpolicy/screwups/invasions. Then Mozilla should handle the item LetsGetSadam.java according to its rules for this MIME type, if it has any. I believe that all .java files get treated as “text/plain” if they are identified at all.
A key thing to remember is that the web server is only going to give us a MIME type for the object defined by the URI itself. Mozilla is going to be responsible for identifying MIME types for everything else, the same way that it would if a file on the local machine were entered with the file:// protocol. In summary, this proposal should provide a clean method of solving the problem that, as Zvi Devir brought up, complies with the relevant RFCs while being flexible, powerful and simple to implement. I would encourage people to please read this thread and the relevant RFCs well, and to post any other ideas, criticisms or recommendations. Daniel
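The hierarchical resolution proposed in this comment can be sketched in Python as a planning step that decides, level by level, whether a fragment names a file inside an archive or an anchor inside a document. This is only a sketch of the proposal: the function names and the suffix table are hypothetical, the container type is guessed from the name rather than from a real MIME lookup, and the sketch does not check whether such a reference is valid under the RFC grammar.

```python
# Hypothetical stand-in for a MIME type lookup: names with these suffixes
# are treated as archives, everything else as a plain document.
ARCHIVE_SUFFIXES = (".zip", ".jar", ".war", ".tar.gz", ".tgz")


def split_reference(uri_reference):
    """Split at the first '#' into the URI and its list of fragment levels."""
    uri, _, fragment = uri_reference.partition("#")
    levels = fragment.split("#") if fragment else []
    return uri, levels


def plan_resolution(uri_reference):
    """Return a list of (action, container, target) steps.

    'extract' means: fetch `target` from inside the archive `container`.
    'anchor' means: look for an anchor named `target` inside `container`.
    """
    uri, levels = split_reference(uri_reference)
    steps = []
    container = uri
    for level in levels:
        if container.lower().endswith(ARCHIVE_SUFFIXES):
            steps.append(("extract", container, level))
        else:
            steps.append(("anchor", container, level))
        # The item just named becomes the container for the next level.
        container = level
    return steps
```

For "http://example.com/foo.zip#bar.html#credits" the plan is to extract bar.html from foo.zip and then look for the anchor "credits" inside bar.html, which matches the second example in the comment.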
One more noteworthy item on archive caching: if the archive file is being retrieved remotely, I think that it's very important that the archive file be cached (otherwise the advantage of retrieving the file from the archive on the server is largely lost). This brings up a very important point. By default, Mozilla allocates 50MB of disk cache. Using archives like this may create the need for greater amounts of cache space. This may mean that at some point we will have to consider having a separate setting (and/or folder) for cached archives, to allow adequate room for these archives to be cached while not allowing small images and junk files to sit around accumulating in numbers that cause directory indexes to become a performance problem (either at the app or OS/file system level).
MAF (http://maf.mozdev.org/) is an extension which might be useful.
(In reply to comment #25) > MAF (http://maf.mozdev.org/) is an extension which might be useful. Yes, MAF is definitely a solution to our problem. It can load and save the uncompressed Microsoft MHTML archive as well as special zip-archived formats. Thanks to the developers! I hope it can be integrated into Firefox as a standard save method. This is a very basic function. Up to now, it's using some external commands and scripts. Now, the only thing left is to agree on a standard. I don't want to save all my pages in the *.maff format and later have some other format become the standard. So for now I only trust the Microsoft .mht MHTML format that MAF can write. It will have a long future, I guess. Can some of the core developers decide on an "official" Mozilla archive standard? What about just compressed MHTML, so it's close to the official RFC?
*** Bug 279557 has been marked as a duplicate of this bug. ***
*** Bug 289868 has been marked as a duplicate of this bug. ***
(In reply to comment #25 and comment #26) MAF would not fulfill the requirements of this feature request. MAF does provide a means to save and view a single web page with its images, objects and metadata, but it has some crucial limitations, specifically that it is limited to a single page and that it uses a proprietary Microsoft standard. The significance of this feature would be greatly diluted if that is what it was limited to. The ability of the browser to access any number of files in any type of archive would greatly augment Mozilla. This type of support should not be very difficult to implement and would inherently provide the ability to read a web page from a MAF file. It should also be designed in such a way that existing rules aren't broken, requiring "exceptions" to those rules within the code (this makes code get ugly fast). This is why I propose the solution in comment #23, which adheres to all relevant RFCs and would produce a mechanism for this to be implemented. Let me give you one more example. Browse to http://java.sun.com/j2se/1.5.0/download.jsp and download the J2SE 5.0 Documentation (you have to click an "I agree" and such). This is a 44MB .zip file and contains 10,779 files, most of which are fairly small. Once you decompress it, it expands to 223MB. However, because files are allocated in blocks (varying in size from 512 bytes to 4096), there is an average of between 256 and 2048 bytes of wasted space per file. So this means an additional 2.6MB to 21MB of disk space used, taking up to 244MB of disk space. Also, most disks aren't formatted with 512 byte blocks anymore, so the number is more likely to be at the higher end. As you can see, being able to access this directly from the 44MB .zip archive would be very helpful.
Also, I have a correction to my comment #23, the second sentence of the 4th paragraph should read as follows: Thusly, in the example of “http://example.com/foo.war#bar.html#credits”, “http://example.com/foo.war” is the URI again and “bar.html#credits” is the fragment. Daniel
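The wasted-space estimate in this comment can be reproduced with a few lines of Python. The sketch assumes, as the comment implicitly does, that each file wastes half a block on average in its last block; the function name is hypothetical and the figures are the ones quoted above.

```python
FILE_COUNT = 10_779   # files in the J2SE 5.0 documentation archive, per the comment
MIB = 1024 * 1024


def expected_slack_bytes(file_count, block_size):
    """Estimate wasted space, assuming half a block is lost per file on average."""
    return file_count * block_size // 2


low = expected_slack_bytes(FILE_COUNT, 512)    # 512-byte blocks
high = expected_slack_bytes(FILE_COUNT, 4096)  # 4096-byte blocks

print(f"{low / MIB:.1f} MB to {high / MIB:.1f} MB of slack")
```

This gives roughly 2.6 MB for 512-byte blocks and about 21 MB for 4096-byte blocks, in line with the range stated in the comment.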
If Firefox started supporting saving as WAR, then it would be a great thing because currently Konqueror supports WAR, but WAR files cannot be sent to people who do not have Linux/KDE. If Firefox supports WAR, people on all platforms will be able to read WAR files and then the usage of WARs will grow. My humble opinion, and my vote.
Summary: web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar) → web archive save/view support (KDE/Konqueror-like .war, .zip, .tgz, .tar.gz, .tar.bz2, tb2, .tbz2, .jar)
Assignee: law → nobody
QA Contact: chrispetersen → file-handling
Product: Core → Firefox
Target Milestone: Future → ---
Version: Trunk → unspecified