Open Bug 64286 Opened 20 years ago Updated 4 years ago

web archive save/view support (KDE/Konqueror-like .war, .zip, .tgz, .tar.gz, .tar.bz2, tb2, .tbz2, .jar)

Categories

(Firefox :: File Handling, enhancement)

People

(Reporter: forgeau, Unassigned)

Attachments

(1 file)

It would be very practical to be able to browse directly any .zip, .tgz, .jar,
or .tar file, like:

file:///home/me/thing.zip/index.html

or:

http://my.site.com/dir/myarch.tgz/page1.html

	We would obviously need a new MIME type, like application/browseable-archive,
that can be enabled on the web-server side.

	While browsing the archive, the root directory should be reset to the root
of the archive. Then if the page
http://my.site.com/dir/myarch.tgz/page1.html
contains a link like:

 <a href="/dir2/file2.html">

it would be accessible with the URL:

http://my.site.com/dir/myarch.tgz/dir2/file2.html

corresponding to the file /dir2/file2.html in myarch.tgz.
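
The root-reset rule above can be sketched in a few lines of Python. This is only an illustration of the proposal, not anything Mozilla implements: the function name `resolve_in_archive` and the suffix list are made up here, and the archive is recognised purely by its file name.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Assumption: an archive is recognised by its file-name suffix only.
ARCHIVE_SUFFIXES = (".zip", ".tgz", ".tar", ".jar")


def resolve_in_archive(page_url: str, href: str) -> str:
    """Resolve href against page_url, treating the archive as the root directory."""
    parts = urlsplit(page_url)
    segments = parts.path.split("/")
    # Find the first path segment that looks like an archive.
    for i, segment in enumerate(segments):
        if segment.lower().endswith(ARCHIVE_SUFFIXES):
            break
    else:
        raise ValueError("no archive segment in URL")
    archive_path = "/".join(segments[: i + 1])      # e.g. /dir/myarch.tgz
    inner_path = "/" + "/".join(segments[i + 1:])   # e.g. /page1.html
    if href.startswith("/"):
        # Root-relative link: the root is the archive, not the server.
        new_inner = posixpath.normpath(href)
    else:
        new_inner = posixpath.normpath(
            posixpath.join(posixpath.dirname(inner_path), href))
    return urlunsplit((parts.scheme, parts.netloc, archive_path + new_inner, "", ""))


print(resolve_in_archive("http://my.site.com/dir/myarch.tgz/page1.html",
                         "/dir2/file2.html"))
```

With the example from this report, the link /dir2/file2.html stays inside myarch.tgz instead of escaping to the server root.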

	It would greatly simplify off-line browsing, saving pages with images, web-site
transporting and updating.....
Summary: .zip, .tgz, .tar, or .jar archive off/on-line browsing → [RFE] .zip, .tgz, .tar, or .jar archive off/on-line browsing
This sounds quite good; I would really like this feature, too.
Changing component to Networking and Status to new.
Status: UNCONFIRMED → NEW
Component: Browser-General → Networking
Ever confirmed: true
over to default owners. cc'ing mscott and bill in case they're interested...
Assignee: asa → neeti
QA Contact: doronr → tever
we already have the basics via the jar protocol. unfortunately it doesn't 
seem to return directory lists.

i think we really need a browse protocol [possibly implied]. browse:ftp:, 
browse:file: browse:jar:file: browse:jar:ftp:.

One thing i'm not sure about is whether we can use jar through multiple 
archives. jar:jar:file:a.jar!/b.jar!/
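
To make the nesting question concrete, here is a rough Python sketch that takes a jar: URL apart as it is written in this bug (one "jar:" prefix and one "!/" separator per archive level). The helper `split_jar_url` is hypothetical; it only mirrors the syntax used here and says nothing about how Mozilla's own parser behaves.

```python
from typing import List, Tuple


def split_jar_url(url: str) -> Tuple[str, List[str]]:
    """Split jar:<inner-url>!/<entry>[!/<entry>...] into (inner URL, entries)."""
    depth = 0
    while url.startswith("jar:"):
        url = url[len("jar:"):]
        depth += 1
    if depth == 0:
        raise ValueError("not a jar: URL")
    pieces = url.split("!/")
    # Assumption: each jar: prefix is matched by exactly one "!/" separator.
    if len(pieces) != depth + 1:
        raise ValueError("jar: prefixes and !/ separators do not match")
    return pieces[0], pieces[1:]


print(split_jar_url("jar:jar:file:a.jar!/b.jar!/"))
```

An empty last entry, as in the example above, would correspond to the directory listing of the innermost archive.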
Target Milestone: --- → Future
I would *really* love to see this.  I think that it would allow Mozilla to work
with distributed (P2P) applications without a web server.  I posted this
message to the .general newsgroup.

=========

I have wanted to have this for about 4 years now.  I think the technology is
finally starting to come around... thanks to Mozilla :)

Basically what I want is to be able to save a WHOLE web site structure into a
file and then have that file portable so that I can hand whole sets of
documentation and cool websites to friends or e-mail.  This also helps out with
application distribution as you can package all your XSLT, javascript, XHTML,
gifs, jpgs, Java, etc in one file and give that out to clients as one single
download.

In Neal Stephenson's "Snow Crash" they talked about a "Hyper Card" which seemed
like a business card but actually had ~'huge stores of information' attached to
the card.  I thought that HyperCard was a perfect name for this, but now I
think it should be called an HTML Archive or Content Archive (car).  (WAR is
already used in Java circles as a Web Archive for the Servlet API.)  HyperCard
is also a trademarked name of Apple Computer and can't be used :(

Basically the jar: protocol in Mozilla solves this.  you should be able to
create a jar file and then enter a URL like:

jar:resource:///home/burton/test.har!/index.html

The only problem is that when I do this Mozilla just says:

> Document jar:resource:///home/burton/test.har!/index.html loaded successfully

But I don't get any HTML in the browser. ( I should see a Hello World message)

It might be possible to write a javascript and then use protozilla to come up
with a car:// URI which will be smart enough to display index.html from the
CAR.  

This isn't theoretical.  I need this for our Reptile project
(http://www.openprivacy.org/projects/reptile.shtml) because we depend heavily on
smart clients.  The goal is to hand over a reptile.car file to the client which
would contain java, xml, css and javascript so that the whole application can
function within a P2P environment *without* the server.
mass move, v2.
qa to me.
QA Contact: tever → benc
This can also be used in a Save As capacity, like MHTML; see bug 82118 (perhaps
bug 82118 should also depend on this?) as well as bug 18764 and bug 40873.  

The potential for Save As in archive files (tar, zip, tgz, ...) is great; you
can preserve all files without changing the content of any (as "Save As Web
Page, Complete" and what MHTML would do).  The preservation would be performed
in a index file of some sort (say "archive.index" or "archive.conf"), which
would be arranged with full URI followed by a space and then the archive file's
name (then a newline); the first entry would be the page viewed by Mozilla. 
This archive.index would also allow Mozilla to retain the http protocol, since it
could simply look for the file to determine whether or not to display the contents.
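
A minimal Python sketch of the index format described above (one "full URI, space, file name" pair per line, first entry being the page that was viewed). The function names and the sample file names are invented for illustration; it assumes URIs are percent-encoded and therefore contain no spaces.

```python
from typing import List, Tuple

Entry = Tuple[str, str]  # (original URI, file name inside the archive)


def write_index(entries: List[Entry]) -> str:
    """Serialise entries as 'URI<space>filename' lines; first entry is the viewed page."""
    return "".join("%s %s\n" % (uri, name) for uri, name in entries)


def parse_index(text: str) -> List[Entry]:
    """Parse archive.index text back into (URI, filename) pairs."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first space only, so file names may contain spaces.
        uri, name = line.split(" ", 1)
        entries.append((uri, name))
    return entries


index_text = write_index([
    ("http://example.com/page.html", "page.html"),
    ("http://example.com/img/logo.png", "logo.png"),
])
print(parse_index(index_text)[0])  # the page that was viewed
```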

...I seem to be the only person interested in the preservation of original
content! (example: an image on a page pointed to CNN is less likely a hoax than
a local image.)

(another suggestion of mine is to allow an option to add a <base href=uri> line
to a saved html document, where uri is the page's full URI.)
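
For the <base href> idea, something like the following Python sketch would do for simple documents. The helper `add_base_href` is hypothetical, and a real implementation would work on the parsed document rather than on text with a regular expression.

```python
import html
import re


def add_base_href(document: str, uri: str) -> str:
    """Insert <base href="uri"> right after the opening <head> tag."""
    tag = '<base href="%s">' % html.escape(uri, quote=True)
    new_document, count = re.subn(
        r"(<head\b[^>]*>)",
        lambda match: match.group(1) + tag,
        document,
        count=1,
        flags=re.IGNORECASE,
    )
    # Assumption: if there is no <head>, prepending the tag is good enough.
    return new_document if count else tag + document


print(add_base_href("<html><head><title>x</title></head></html>",
                    "http://example.com/a/"))
```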
FYI: a similar enhancement I proposed: bug 132008.
I just wrote the following script:

http://relativity.yi.org/rss/permalink/1021881873-1.shtml

It works within Konqueror and allows browsing of HTML files and their page
requisites offline.

It is just an index.html file wrapped in a .tar.gz but named .war

Mozilla should/could support the same mechanism.  I don't know why it doesn't
already...
Another point I wanted to make.

Konqueror supports a WAR format (web archive) that is just a tar.gz but renamed
to a .war extension.

When this bug is fixed... it would be nice if Mozilla supported konqueror .war
files.  I will attach a testcase.

It should be pretty easy to support.

1.  See if the file being opened is .war
2.  Handle it as tar.gz
3.  If index.html exists automatically view it.
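
The three steps above could look roughly like this in Python, using the standard tarfile module. It is only a sketch: `read_war_index` is a made-up name, and it assumes index.html is stored at the top level of the archive under exactly that name.

```python
import tarfile
from typing import Optional


def read_war_index(path: str) -> Optional[bytes]:
    """Return the bytes of index.html from a Konqueror-style .war, or None."""
    # 1. See if the file being opened is .war
    if not path.lower().endswith(".war"):
        return None
    # 2. Handle it as tar.gz
    with tarfile.open(path, mode="r:gz") as archive:
        # 3. If index.html exists, hand it back so it can be viewed
        try:
            member = archive.getmember("index.html")
        except KeyError:
            return None
        extracted = archive.extractfile(member)
        return extracted.read() if extracted is not None else None
```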
Once we support a tar.gz format it should be easy to support Konqueror style
.war files.  See my comments.
*** Bug 149871 has been marked as a duplicate of this bug. ***
Could this be implemented with Gnome VFS?

http://www.ximian.com/devzone/tech/gnome-vfs.html

Also... this bug seems to have stagnated.  Is anyone working on this?  Could we
get a Target Milestone other than the ambiguous and non-specific "Future"?

I think this one is important.
moving neeti's futured bugs for triaging.
Assignee: neeti → new-network-bugs
Component: Networking → File Handling
Summary: [RFE] .zip, .tgz, .tar, or .jar archive off/on-line browsing → .zip, .tgz, .tar, or .jar archive off/on-line browsing
Blocks: 82118
added as dependency to tracking bug 82118
and mentioned in bug 176722

i would REALLY love to see this working in 1.4a
...it already exists in many browsers 
and .war compressed web archive save/view support 
would be a 'killer app' feature.

added self to cc
summary change
was:
  .zip, .tgz, .tar, or .jar archive off/on-line browsing
is now:
  web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar)

hope that makes this bug more visible (and clear)

should this bug depend on bug 132008?
Summary: .zip, .tgz, .tar, or .jar archive off/on-line browsing → web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar)
Depends on: 132008
And when the mechanism works, it shouldn't be too complicated to implement CHM 
(BUG 123320) and MHTML (BUG 18764) filters too, just as alternative decompressors.

BTW, I once tried to compile Mozilla under Windows. Does anyone know where I can
find MSDEV project for it?
-> defaults
Assignee: new-network-bugs → law
QA Contact: benc → petersen
*** Bug 186283 has been marked as a duplicate of this bug. ***
I have a sample for this; see 

jar:http://www.geocities.com/bijumaillist/mozilla/mozilla.zip!/mozilla.htm

(if you are unable to see it, try
http://geocities.com/bijumaillist/go.html#jar:http://www.geocities.com/bijumaillist/mozilla/mozilla.zip!/mozilla.htm
or the snipped url http://snipurl.com/56pg )

(In reply to comment #3)
> One thing i'm not sure about is whether we can use jar through multiple 
> archives. jar:jar:file:a.jar!/b.jar!/

To see an example of a jar inside a jar, go to 
http://forums.mozillazine.org/viewtopic.php?t=55620

(But you can't see the directory listing.)
The JAR protocol enables viewing/entering ZIP (=JAR/XPI) files. This feature
has existed for a long time, and is a good start.
In bug 132008, I suggested that rather than using the complicated scheme of 
   jar:http://www.geocities.com/bijumaillist/mozilla/mozilla.zip!/mozilla.htm
we use 
   http://www.geocities.com/bijumaillist/mozilla/mozilla.zip#mozilla.htm
which is far simpler and easier to remember and use, and does not require the
definition of new schemes for any archive formats.

Besides ZIP/JAR/XPI, there are a lot of other archive formats which are not
readable by Mozilla using any scheme (WAR, TAR/TGZ, MHTML, CHM, etc.).
(Apologies if this comment is redundant.)

You cannot use '#' in these URIs because that breaks relative URI references in
the HTML files in the archive. For example, if a document
'http://example.com/foo.war#bar.html' links to 'baz.html' then this resolves to
'http://example.com/baz.html', which is not what you wanted.
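
This is easy to check with any standard relative-reference resolver. For example, Python's urllib.parse.urljoin drops the fragment of the base before resolving:

```python
from urllib.parse import urljoin

# The fragment of the base URI plays no part in relative resolution.
resolved = urljoin("http://example.com/foo.war#bar.html", "baz.html")
print(resolved)  # http://example.com/baz.html
```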

More importantly, I don't think you need this feature at all for HTTP URIs. It
would make sense to use URIs like 'http://example.com/foo/bar.html', and let the
web server worry about extracting resources from the WAR file. You get
compression and so on as HTTP 1.1 features. This is how some servers already
work (using WAR files in the Java servlet sense).

On the other hand, it does make sense to store archives locally as file URIs (as
in 'file:///home/bill/foo.war/bar.html' or
'file:///home/bill/foo.mhtml/bar.html'). But they should not need any special
syntax to be processed by Mozilla: after establishing that there is no file
'/home/bill/foo.war/bar.html', Mozilla could recheck the path element by element
looking for archives or other special cases. 
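
As a rough sketch of that element-by-element recheck, in Python and under some loud assumptions: the helper name `split_archive_path` is invented, only zip-format archives are detected (via zipfile.is_zipfile), and paths use "/" as the separator.

```python
import os
import zipfile
from typing import Optional, Tuple


def split_archive_path(path: str) -> Optional[Tuple[str, str]]:
    """Return (archive file, member path) if a leading part of path is a zip file."""
    if os.path.exists(path):
        return None  # an ordinary file or directory: nothing special to do
    parts = path.split("/")
    # Recheck the path element by element, shortest prefix first.
    for i in range(1, len(parts)):
        candidate = "/".join(parts[:i]) or "/"
        if os.path.isfile(candidate):
            if zipfile.is_zipfile(candidate):
                return candidate, "/".join(parts[i:])
            return None  # a regular file that is not an archive
    return None
```

For '/home/bill/foo.war/bar.html' this would yield ('/home/bill/foo.war', 'bar.html') when foo.war is a zip-format file, and the 'file' handler could then read bar.html out of the archive.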

In short, you don't need a new protocol for pulling things out of archives: for
most URI schemes, it is the server's problem, and for local archives, a fairly
simple elaboration of the 'file' protocol handler can do the job.
Your comment is not redundant, but posting it in bug 132008 might have been more
suitable. 

First, you cannot use the slash sign as the archive separator. In
http://example.com/foo.war/bar.html, foo.war is considered the path for
bar.html. The browser asks the server for foo.war/bar.html. Since this file
doesn't exist (foo.war is a file and not a path), the server returns its 404.html file
(with a 200 OK response). The browser isn't even aware that it got an error page.

I think the # sign is more suitable since it is already used as the fragment
marker in the standard URI scheme. Naturally, the fragments of an archive file are
the files inside it. Also, this behavior does not contradict the relevant RFCs.

As for your example, if baz.html is linked at http://example.com/foo/bar.html,
you expect it to be resolved as http://example.com/foo/baz.html. 
Similarly, if it is linked at http://example.com/foo.war#bar.html, it will be
resolved as http://example.com/foo.war#baz.html. 

There is only one non-obvious behavior in my proposal. If
http://example.com/foo.war#path/bar.html links to /baz.html, should it be
resolved to http://example.com/baz.html, or to http://example.com/foo.war#baz.html? 
In other words, what is the root of http://example.com/foo.war#path/bar.html?
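
A small Python sketch of the resolution I have in mind, where the fragment is treated as a path inside the archive. The function name `resolve_hash_ref` is made up, and for the open question it simply picks one of the two options (root-relative links resolve against the archive root); that choice is an assumption, not a settled rule.

```python
import posixpath
from urllib.parse import urldefrag


def resolve_hash_ref(base: str, href: str) -> str:
    """Resolve href against base, treating the fragment as a path in the archive."""
    archive_url, member = urldefrag(base)
    if href.startswith("/"):
        # Assumption: "/" means the root of the archive, not of the server.
        new_member = posixpath.normpath(href).lstrip("/")
    else:
        new_member = posixpath.normpath(
            posixpath.join(posixpath.dirname(member), href))
    return archive_url + "#" + new_member


print(resolve_hash_ref("http://example.com/foo.war#bar.html", "baz.html"))
print(resolve_hash_ref("http://example.com/foo.war#path/bar.html", "/baz.html"))
```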
(In reply to comment #22)
> You comment is not redundant, but posting it in BUG 132008 might have been more
> suitable. 
> 
> First, you cannot use the slash sign as archive separator. In
> http://example.com/foo.war/bar.html, foo.war is considered as the path for
> bar.html. The browser asks the server for foo.war/bar.html. Since this file
> doesn't exist (foo.war is file and not path), the server gets the 404.html file
> (with 200 OK respond). It doesn't even aware that it got an error page.
> 
> I think # sign is more suitable since it is already used as the fragmentation
> marker in the standard URI scheme. Naturally fragments of an archive file are
> the files inside it. Also this behavior does not contradict the relevant RFC's.
> 
> As for your example, If baz.html is linked at http://example.com/foo/bar.html,
> you expect is to be resolved as http://example.com/foo/baz.html. 
> Similarly, if it is linked at http://example.com/foo.war#bar.html, it will be
> resolved as http://example.com/foo.war#baz.html. 
> 
> There is only one non-obvious behavior in my proposal. If
> http://example.com/foo.war#path/bar.html links to /baz.html, should it be
> resolved to http://example.com/baz.html, or to
> http://example.com/foo.war#bar.html. 
> In other words, what is the root of http://example.com/foo.war#path/bar.html.

Remember that the "root" gets determined by the client (browser), and I
personally think that “#” should indicate the root of the archive file.
Once we specify a URI to an archive file, we don’t care about the
server it came from anymore, only the archive itself and its contents.

Also, keep in mind that # may be used to designate a place within an HTML file,
for example the link "bar.html#credits" within the file bar.html located
in the archive foo.zip. (The same applies to foo.war, although a WAR file in
J2EE terms is designed not to be disclosed by the application server, since it
usually contains information that could be used to compromise the server, so
that's probably not a good example.) So if I use the path
http://example.com/foo.war#bar.html to access the file bar.html and it has a
link of "#credits", do we end up with
"http://example.com/foo.war#bar.html#credits" or
"http://example.com/foo.war#credits"?

To clarify this issue, let's first define a few things based off of RFC 2396
"Uniform Resource Identifiers (URI): Generic Syntax"
(http://www.faqs.org/rfcs/rfc2396.html). A "URI Reference" in section 4 of RFC
2396 is defined as follows:

    URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]

So in example above of “http://example.com/foo.war#credits” using RFC 2396
rules, "http://example.com/foo.war" would be the URI (absolute in this case) and
"credits" would be the fragment (note the "#" excluded because it's just the
character that marks the end of the URI and the beginning of the fragment).
Thusly, in the example of “http://example.com/foo.war#bar.html#credits”,
“http://example.com/foo.war#credits” is the URI again and “bar.html#credits” is
the fragment. This time, only the 1st “#” character was excluded as per the RFC;
it says nothing about stripping out subsequent “#” characters.

In section 4.1 of RFC 2396, it states that "the format and interpretation of
fragment identifiers is dependent on the media type"; by media type it means MIME
type and refers to the MIME RFC.

Thus, it would seem to make sense that resolution should occur in a
hierarchical manner, using “#” as the delimiter between levels of fragmentation
as well as the delimiter between the URI and the fragment itself. Each
level of fragmentation should be resolved based on the MIME type of the item
being referred to. For right now, the only MIME types that I am aware of that
respect (or commonly respect) URI fragments are text/html and our proposal
(which would probably include all of the compressed MIME types). I DO NOT
recommend or advocate creating a new MIME type as mentioned in an earlier post,
because MIME types already exist for each of these compressed
formats. So by these rules, let’s look at the implications of a few examples.

URI Reference:
http://example.com/foo.zip#credits

The mime type of foo.zip is application/zip which would tell Mozilla to download
and CACHE the file foo.zip and then retrieve the file named credits from the
root folder of the archive if it exists.

URI Reference:
http://example.com/foo.zip#bar.html#credits

Once again, the mime type of foo.zip would be application/zip. Mozilla looks at
the 1st item in the fragment list which is “bar.html”, and would retrieve
bar.html from the root of the archive. Then Mozilla will examine the MIME type
of this bar.html and which is text/html. Since Mozilla knows how to handle
fragments for text/html files it will look for an anchor with this name.

URI Reference:
http://example.com/fu.zip#bar.tar.gz#iraq.jar#src/forignpolicy/screwups/invasions/LetsGetSadam.java

Here is a more complex example. Assuming that fu.zip is a valid application/zip
(.zip) file that contains a file in its root named bar.tar.gz that is a valid
application/x-gtar (gzipped tape archive) file that contains a file in its root
named iraq.jar that is a valid application/java-archive file that contains the
file in the path src/forignpolicy/screwups/invasions named LetsGetSadam.java,
then Mozilla should handle the item LetsGetSadam.java according to its rules
for this MIME type if it has one. I believe that all .java files get treated as
“text/plain” if they are identified at all. 
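
To show that the hierarchical resolution is mechanical, here is a simplified Python sketch. The names `split_levels` and `resolve_in_zip` are invented for this comment, and the sketch only descends through zip-format levels (detected by content with zipfile.is_zipfile); tar.gz and other formats would need their own handlers. Whatever levels are left over when a non-archive is reached belong to that item's media type, e.g. an anchor name for text/html.

```python
import io
import zipfile
from typing import List, Tuple


def split_levels(uri_reference: str) -> Tuple[str, List[str]]:
    """Split a URI reference into the URI and its '#'-separated fragment levels."""
    uri, _, rest = uri_reference.partition("#")
    return uri, rest.split("#") if rest else []


def resolve_in_zip(archive_bytes: bytes, levels: List[str]) -> Tuple[bytes, List[str]]:
    """Descend through nested zip members; return the final data and leftover levels."""
    data = archive_bytes
    remaining = list(levels)
    while remaining and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            data = archive.read(remaining.pop(0))
    return data, remaining


print(split_levels("http://example.com/foo.zip#bar.html#credits"))
```

Here the URI is what gets downloaded and cached, "bar.html" is looked up in the archive, and "credits" is left for the text/html handler.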

One key thing to remember is that the web server is only going to give us the MIME
type for the object identified by the URI itself. Mozilla is going to be
responsible for identifying MIME types for everything else, the same way that it
would if a file on the local machine were entered with the file:// protocol.

In summary, this proposal should provide a clean method of solving the problem
that Zvi Devir brought up: it complies with the relevant RFCs while being
flexible, powerful and simple to implement.

I would encourage people to please read this thread and the relevant RFCs well and
post any other ideas, criticisms or recommendations.

Daniel
One more noteworthy item on archive caching:

If the archive file is being retrieved remotely, I think that it's very
important that the archive file be cached (otherwise the advantage of retrieving
the file from the archive on the server is sort of lost). This brings up
a very important point. By default, Mozilla allocates 50MB of disk cache. Using
archives like this may call for greater amounts of cache space. This
may mean that at some point we will have to consider having a separate setting
(and/or folder) for cached archives, to allow adequate room for these archives to
be cached while not allowing small images and junk files to sit around
accumulating in numbers that cause directory indexes to become a performance
problem (either at the app or OS/file system level).
MAF (http://maf.mozdev.org/) is an extension which might be useful.
(In reply to comment #25)
> MAF (http://maf.mozdev.org/) is an extension which might be useful.

Yes, MAF is definitely a solution to our problem.
It can load and save the uncompressed Microsoft MHTML archive
as well as special zip-archived formats.
Thanks to the developers! I hope it can be integrated into Firefox
as a standard save method. This is a very basic function.
Up to now, it has been using some external commands and scripts.

Now, the only thing left is to agree on a standard. I don't want to save
all my pages in the *.maff format and later find that some other format
has become the standard. So for now I only trust the Microsoft .mht MHTML 
format that MAF can write. It will have a long future, I guess.

Can some of the core developers decide on an "official" Mozilla
archive standard? What about just compressed MHTML, so it's
close to the official RFC?
*** Bug 279557 has been marked as a duplicate of this bug. ***
*** Bug 289868 has been marked as a duplicate of this bug. ***
(In reply to comment #25 and comment #26)

MAF would not fulfill the requirements of this feature request.  MAF does provide a means to save and view a single web page with its images, objects and metadata, but it has some crucial limitations: it is limited to a single page, and it uses a proprietary Microsoft standard.  The significance of this feature would be greatly diluted if that is what it was limited to.

The ability of the browser to access any number of files in any type of archive would greatly augment Mozilla.  This type of support should not be very difficult to implement and would inherently provide the ability to read a web page from a MAF file.  It should also be designed in such a way that existing rules aren't broken, which would require "exceptions" to those rules within the code (this makes code get ugly fast).  This is why I propose the solution in comment #23, which adheres to all relevant RFCs and would produce a mechanism for this to be implemented.

Let me give you one more example.  Browse to http://java.sun.com/j2se/1.5.0/download.jsp and download the J2SE 5.0 Documentation (you have to click an "I agree" and such).  This is a 44MB .zip file and contains 10,779 files, most of which are fairly small.  Once you decompress it, it expands to 223MB.  However, because files are allocated in blocks (varying in size from 512 bytes to 4096), there is an average of between 256 and 2048 bytes of wasted space per file.  So this means an additional 2.6MB to 21MB of disk space used taking up to 244MB of disk space.  Also, most disks aren't formatted with 512 byte blocks anymore so the number is more likely to be at the higher end.  As you can see, being able to access this directly from the 44MB .zip archive would be very helpful.
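
The slack-space estimate above can be reproduced with a few lines of Python. It assumes, as the estimate does, that on average half of each file's last block is wasted, and it takes MB to mean 2^20 bytes; the file count and unpacked size are the figures quoted above.

```python
FILE_COUNT = 10_779      # files in the documentation archive
UNPACKED_MIB = 223       # size after decompression


def slack_mib(block_size: int, files: int = FILE_COUNT) -> float:
    """Estimated wasted space, assuming half a block is lost per file."""
    return files * (block_size / 2) / (1024 * 1024)


for block_size in (512, 4096):
    waste = slack_mib(block_size)
    print(f"{block_size:>4}-byte blocks: ~{waste:.1f} MiB wasted, "
          f"~{UNPACKED_MIB + waste:.0f} MiB on disk")
```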

Also, I have a correction to my comment #23, the second sentence of the 4th paragraph should read as follows:

Thusly, in the example of “http://example.com/foo.war#bar.html#credits”,
“http://example.com/foo.war” is the URI again and “bar.html#credits” is
the fragment.

Daniel
If Firefox started supporting saving as WAR, then it would be a great thing because currently Konqueror supports WAR, but WAR files cannot be sent to people who do not have Linux/KDE. If Firefox supports WAR, people on all platforms will be able to read WAR files and then the usage of WARs will grow.

My humble opinion, and my vote.
Duplicate of this bug: 392092
Summary: web archive save/view support (.war, .zip, .tgz, .tar.gz, .jar) → web archive save/view support (KDE/Konqueror-like .war, .zip, .tgz, .tar.gz, .tar.bz2, tb2, .tbz2, .jar)
Assignee: law → nobody
QA Contact: chrispetersen → file-handling
Duplicate of this bug: 502528
Product: Core → Firefox
Target Milestone: Future → ---
Version: Trunk → unspecified