Open Bug 121793 Opened 19 years ago Updated 1 year ago

RFE: Save complete webpage in one file using data: protocol (RFC 2397)

Categories

(Firefox :: File Handling, enhancement)

enhancement
Not set

Tracking

REOPENED

People

(Reporter: sinchi, Unassigned)

Attachments

(2 files)

Thanks to the XML data source syntax, it's possible to save an HTML document with
its images, embeds, external styles and scripts in one file. This can be done via
Base64 encoding and the "data:" protocol.
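
For illustration, a minimal Python sketch of the embedding step: take the bytes of a resource and wrap them in an RFC 2397 data: URI. The helper name `to_data_uri` is hypothetical, and the six-byte payload is only the GIF signature, not a complete image.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str) -> str:
    """Return an RFC 2397 data: URI carrying `payload` as Base64."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder payload: just the GIF signature, not a displayable image.
uri = to_data_uri(b"GIF89a", "image/gif")
print(f'<img src="{uri}" alt="embedded">')
```

The resulting string can be used wherever the page previously referenced the external file.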

This option would be a third entry in the drop-down menu of the "Save as" dialog, for example:
Save file as type: Web page, complete, with separate files (*.htm, *.html)
                   Web page, complete, in one whole file (*.htm, *.html)
                   Web page, HTML only (*.htm, *.html)

See the demonstration of image embedding in the attachment.
This isn't XHTML-specific. NS4.x knows inline images as well.
OS: Windows 2000 → All
Hardware: PC → All
Yes, that's right :)
But this is a standard XML feature.
OS: All → Windows 2000
Hardware: All → PC
OS: Windows 2000 → All
Hardware: PC → All
IE doesn't show the image. Valid RFE anyway.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Is this the same as bug 40873?
It is.  There is no reason to make up our own format when there is a standard
format for this.

*** This bug has been marked as a duplicate of 40873 ***
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → DUPLICATE
No, bug 40873 proposes using multipart MIME HTML documents with boundaries.

But in this case, I propose using the XML "data:" protocol, without breaking
the document into parts. This feature makes it possible to get a fully W3C
standards-compliant document, which can be opened with any standard browser,
placed on a Web server, etc.

I think this is a more advanced and useful technology than MHTML.
This isn't about RFC 2557, but RFC 2397.
It's either wontfix or new, but it's not a duplicate.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Summary: It's possible to save complete page with all embedded external objects as one whole file. → RFE: Save as RFC 2397 HTML; complete webpage in one file
Ah.  Ok, I had misunderstood... A point of reference (from rfc 2397):

   The "data:" URL scheme is only useful for short values. Note that
   some applications that use URLs may impose a length limit; for
   example, URLs embedded within <A> anchors in HTML have a length limit
   determined by the SGML declaration for HTML [RFC1866]. The LITLEN
   (1024) limits the number of characters which can appear in a single
   attribute value literal, the ATTSPLEN (2100) limits the sum of all
   lengths of all attribute value specifications which appear in a tag,
   and the TAGLEN (2100) limits the overall length of a tag.

Thus if we do this we have to be careful to only offer it as an option in cases
when all the linked content is smaller than the relevant limits.
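
As a rough sketch of such a check (in Python, for illustration only): given the LITLEN value quoted above, how many raw bytes can a Base64 data: URI carry before the attribute value exceeds the limit? The function names are hypothetical, and treating LITLEN as the only relevant limit is an assumption.

```python
LITLEN = 1024  # attribute value literal limit quoted from RFC 2397


def fits_litlen(data_uri: str, limit: int = LITLEN) -> bool:
    """True if the whole data: URI fits in one attribute value literal."""
    return len(data_uri) <= limit


def max_payload_bytes(mime_type: str, limit: int = LITLEN) -> int:
    """Largest raw payload whose Base64 data: URI stays within `limit`."""
    prefix = len(f"data:{mime_type};base64,")
    # Base64 emits 4 characters for every (possibly padded) group of 3 bytes.
    return (limit - prefix) // 4 * 3


print(max_payload_bytes("image/gif"))
```

Under that limit only very small images (spacers, bullets) would qualify.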
Summary doesn't match comments in this bug, changing
Summary: RFE: Save as RFC 2397 HTML; complete webpage in one file → RFE: Save complete webpage in one file using data: protocol
So, where's the difference now? Did you read RFC 2397? At least let the RFC
number stay in the summary, so someone can search for it.
Summary: RFE: Save complete webpage in one file using data: protocol → RFE: Save complete webpage in one file using data: protocol (RFC 2397)
Markus: sorry... I misread the summary to read MHTML, especially as it was
marked a duplicate of that bug. I didn't look at the RFC numbers, but thought
they were the same, so I didn't include it
I think LITLEN is not a very significant limitation. To quote RFC 2397:
   The effect of using long "data" URLs in applications is currently
   unknown; some software packages may exhibit unreasonable behavior
   when confronted with data that exceeds its allocated buffer size.

If Mozilla can read big images from "data:" URLs without problems, all will be OK.
You said:

> This feature makes it possible to get a fully W3C standards-compliant
> document, which can be opened with any standard browser

I was just pointing out that this is not exactly true.  It's very likely to
cause at least some standards-compliant browsers, especially a stricter browser
on a more memory-limited platform, to do odd things....

Maybe you're right. But MHTML has the same limitation: the size of an MHTML file
and the size of a file in RFC 2397 format are practically equal. Moreover, MHTML
support is a less obvious thing than the data: protocol.

I don't know of any browser that understands the data: protocol except Mozilla
and Netscape 4.x. If files in RFC 2397 format become prevalent, support for this
standards-compliant format will be added to strict browsers. Actually, this
feature isn't sophisticated and doesn't demand a lot of system resources: a
strict browser, reading HTML data and seeing "data:" in an object location, would
cut out the base64 piece, save the object as a file in a temporary directory,
substitute the corresponding URL, then continue HTML parsing. This scheme, I
think, takes little memory and CPU time compared with usual HTML parsing.
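
The steps described above (cut out the base64 piece, save the object to a temporary directory, substitute the URL) can be sketched roughly as follows. This is a Python illustration only, not how any browser implements it; the function names and the simple regular expression are assumptions.

```python
import base64
import re
import tempfile
from pathlib import Path
from urllib.parse import unquote_to_bytes

# Matches a data: URI up to a closing quote, parenthesis, or whitespace.
DATA_URI = re.compile(r"""data:[^"'\s)]+""")


def decode_data_uri(uri):
    """Split an RFC 2397 data: URI into (media type, decoded bytes)."""
    header, _, body = uri.partition(",")
    media = header[len("data:"):]
    if media.endswith(";base64"):
        media = media[: -len(";base64")]
        payload = base64.b64decode(body)
    else:
        payload = unquote_to_bytes(body)
    # RFC 2397 default when the media type is omitted.
    return media or "text/plain;charset=US-ASCII", payload


def externalize(html, directory):
    """Write each embedded object to `directory` and point the markup at it."""
    def replace(match):
        _, payload = decode_data_uri(match.group(0))
        with tempfile.NamedTemporaryFile(dir=directory, delete=False) as handle:
            handle.write(payload)
        return Path(handle.name).resolve().as_uri()
    return DATA_URI.sub(replace, html)


print(decode_data_uri("data:image/gif;base64,R0lGODlh"))
```

Calling `externalize(html, some_directory)` returns the markup with every data: URI replaced by a file: URL.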

And, finally, it is just the user's choice: save with separate files, use the
complicated MHTML format, or use the transparent RFC 2397 format, which is simple
to interpret. Mostly, a Web page is saved to local disk for private use, and
later it will be opened by the same browser.
There are several problems with this approach:

It only makes sense for images and other objects. Linked stylesheets and
javascript would have to be 'included inline' (just as the C-preprocessor does).
While this is probably ok for Javascript, I don't think it will work for
(alternate) stylesheets and other linked resources.

It will bloat a file which reuses images a lot. Think of these spacer and bullet
GIFs.

It will break on objects attached via stylesheets (eg list bullets).
To Martin Kutschker:

No problem.

It really is possible to use the data: protocol within a style sheet (both
inline and in a separate file), for example:
list-style-image:url(data:image/gif;base64,.....);

It really is possible to use the data: protocol within <link>:
<link href="data:text/css;base64,......" rel="stylesheet" type="text/css" />

Moreover, I tested a "Russian matryoshka": a <link> using the data: protocol,
with an image embedded within the CSS data. It all worked.
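
A small Python sketch of how such a nested ("matryoshka") link could be generated, for illustration. The helper `to_data_uri` is hypothetical, and the image payload is only the six-byte GIF signature standing in for a real bullet image.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str) -> str:
    """Return an RFC 2397 data: URI carrying `payload` as Base64."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Inner layer: the image, embedded in a CSS rule (placeholder bytes only).
image_uri = to_data_uri(b"GIF89a", "image/gif")
css = f"ul {{ list-style-image: url({image_uri}); }}"

# Outer layer: the whole style sheet, embedded in the <link> element.
css_uri = to_data_uri(css.encode("utf-8"), "text/css")
link = f'<link href="{css_uri}" rel="stylesheet" type="text/css" />'
print(link)
```

Note that the image is Base64-encoded twice here, so each nesting level adds roughly another third to its size.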
Just view the source of this file ;)
Amazing!

Still an (implementation) problem: stylesheets that include other stylesheets,
and trusted JavaScript that 'includes' JS files via XPCOM (though they are a
problem for any save-as-a-whole strategy).

Has anyone tried saving this file? My Mozilla 0.9.7 on Linux always creates a
directory and files for the embedded (!) images. It does so even when I save as
"HTML only". Is there already a bug on this?

So what is missing (?) is a way to reuse resources.

What is working is this:

<object style="display: none" id="embed" name="embed2" type="image/gif"
data="
AAAAACH5BAEAAAEALAAAAAAPAA8AAAIujA2Zx5EC4WIgWnnq
vQBJLTyhE4khaG5Wqn4tp4ErFnMY+Sll9naUfGpkFL5DAQA7" />

<img src="javascript:this.src=document.getElementById('embed').data" id="test">

<script>document.getElementById('test').src =
document.getElementById('embed').data</script>

But this requires Javascript. Is there a better way to set the src/data of the
image?
> Has anyone tried saving this file? My Mozilla 0.9.7 on Linux 

Please don't test saving with 0.9.7.  Your comment touches on 2 or 3 separate
bugs in the save as impl in 0.9.7 (it all got completely rewritten right before
the milestone, with the ensuing issues).  All the bugs you mention are fixed in
current nightlies.

will have to wait for a future release, post mozilla1.0
Target Milestone: --- → Future
QA Contact: sairuh → benc
Blocks: 115634
Blocks: 116008
adding self to cc list
Blocks: 82118
QA Contact: benc → sairuh
QA Contact: sairuh → petersen
Blocks: 144766
*** Bug 199757 has been marked as a duplicate of this bug. ***
BTW, Opera 7.20 and later also supports data: URLs.
Is this being explored for Firefox?
This bug is unrelated to Seamonkey/Firefox fork.
(In reply to comment #25)
> Is this being explored for Firefox?
AFAIK, the data scheme is implemented in Gecko (Firefox, Mozilla etc.), and
works under all Mozilla variants.
I'm using the data scheme to save space of some html pages with a lot of tiny
GIF's inside them. If anyone is interested, I can post a small perl script that
does the trick.

However, I don't think that this RFE should be implemented in Mozilla/Firefox. It's
more reasonable to implement it as an extension for Mozilla/Firefox.
(In reply to comment #27)
> I'm using the data scheme to save space of some html pages with a lot of tiny
> GIF's inside them. If anyone is interested, I can post a small perl script that
> does the trick.

You save space using the data: scheme? I'd like to see that.

> However, I don't think that this RFE should be implemented in Mozilla/Firefox.
> It's more reasonable to implement it as an extension for Mozilla/Firefox.

Mozilla Archive Format is a must-have to be able to read single file webpage formats like MHT (EML). http://maf.mozdev.org/ https://addons.mozilla.org/firefox/2925/
(In reply to comment #28)
...
> Mozilla Archive Format is a must-have to be able to read single file webpage
> formats like MHT (EML). http://maf.mozdev.org/
> https://addons.mozilla.org/firefox/2925/

"MAF 0.7.0 is currently under development and will be compatible with Firefox 1.5 only." which means it is soon to become obsolete with Firefox 2.0 coming out soon, unless there is some sort of secret development of this going on, but usually open source is more, um, "open".
(In reply to comment #28)
> (In reply to comment #27)
> > I'm using the data scheme to save space of some html pages with a lot of tiny
> > GIF's inside them. If anyone is interested, I can post a small perl script that
> > does the trick.
> 
> You save space using the data: scheme? I'd like to see that.
> 
Try it with images that are < 512 bytes. Every small file eats at least one full sector/inode plus metadata (filename, ...), so you CAN save space.
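
To put numbers on that, a back-of-the-envelope Python sketch. The 4096-byte allocation unit is an assumption (it varies by filesystem), per-file metadata is ignored, and the function names are made up for illustration.

```python
import math


def on_disk_size(n_bytes: int, block: int = 4096) -> int:
    """Space a separate file occupies if allocated in whole blocks (assumed size)."""
    return max(1, math.ceil(n_bytes / block)) * block


def inline_size(n_bytes: int, mime_type: str = "image/gif") -> int:
    """Characters the same payload needs as a Base64 data: URI."""
    return len(f"data:{mime_type};base64,") + 4 * math.ceil(n_bytes / 3)


size = 300  # a tiny spacer or bullet GIF
print(on_disk_size(size), inline_size(size))
```

With these assumptions a 300-byte image costs a full block as a separate file but only a few hundred characters inline; the saving disappears once an image approaches the block size or is reused many times in one page.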
Hm, no comment for over a year. That's unfortunate because I think this RFE is an extremely good idea and should be implemented. I think a perfect implementation would:

* include JavaScript and CSS code inline (i.e. convert <script> tags with a 'src'
  attribute and <link> tags referencing stylesheets to <script> and <style> tags
  containing the contents). There is no need for the data: URI here; base64
  encoding would only make it use more space and hurt readability.
* recursively walk CSS @import clauses to include all the CSS.
* convert images - whether <img> tags in HTML or url() clauses in CSS - to data:
  URIs.
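
The three steps listed above could be sketched like this in Python. It is only an illustration under loud assumptions: `resources` is a hypothetical dict mapping each URL exactly as written in the page to its bytes (no network access, no relative URL resolution), the markup is matched with simple regular expressions rather than a real HTML/CSS parser, and `<link>` is expected to have `rel` before `href`.

```python
import base64
import mimetypes
import re

IMPORT = re.compile(r"""@import\s+url\(["']?([^"')]+)["']?\)\s*;""")
CSS_URL = re.compile(r"""url\(["']?(?!data:)([^"')]+)["']?\)""")
SCRIPT = re.compile(r'<script\s+src="([^"]+)"\s*>\s*</script>')
LINK = re.compile(r'<link\s+rel="stylesheet"\s+href="([^"]+)"\s*/?>')
IMG = re.compile(r'(<img\b[^>]*?\bsrc=")(?!data:)([^"]+)(")')


def embed(url, resources):
    """Return the resource at `url` as a Base64 data: URI."""
    mime = mimetypes.guess_type(url)[0] or "application/octet-stream"
    return f"data:{mime};base64," + base64.b64encode(resources[url]).decode("ascii")


def flatten_css(url, resources, seen=None):
    """Inline @import rules recursively, then convert url() references."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against circular imports
        return ""
    seen.add(url)
    css = resources[url].decode("utf-8")
    css = IMPORT.sub(lambda m: flatten_css(m.group(1), resources, seen), css)
    return CSS_URL.sub(lambda m: f"url({embed(m.group(1), resources)})", css)


def inline_page(html, resources):
    """Produce a single-file version of `html`."""
    html = SCRIPT.sub(
        lambda m: "<script>" + resources[m.group(1)].decode("utf-8") + "</script>",
        html)
    html = LINK.sub(
        lambda m: "<style>" + flatten_css(m.group(1), resources) + "</style>",
        html)
    return IMG.sub(
        lambda m: m.group(1) + embed(m.group(2), resources) + m.group(3), html)
```

A real implementation would also have to handle scripts containing "</script>", media attributes on <link>, and alternate stylesheets, none of which this sketch attempts.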

To clarify the distinction in the 'Save As' UI, I think the option that is currently called "HTML only" should perhaps be called "original HTML only" (to communicate that it's the unaltered HTML as output by the webserver). The one currently called "complete" could be renamed to "complete - multiple files", so the new option introduced by this RFE could then be called "complete - single HTML file".
Assignee: law → nobody
QA Contact: chrispetersen → file-handling
Duplicate of this bug: 583451
As all browsers support data URLs now, shouldn't this be relatively easy to implement? Or are there still unresolved issues?
You also need to Base64-encode any audio and video files.
How are really large files handled by Base64?
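
On the encoding side at least, size is not a fundamental obstacle: Base64 output is about 4/3 the size of the input, and it can be produced in chunks without holding the whole file in memory. A minimal Python sketch follows; the function names are hypothetical, and it assumes `src.read(n)` returns exactly `n` bytes until the end of the stream, as `io.BytesIO` and buffered files do. How browsers cope with very long data: URLs is a separate question that this does not answer.

```python
import base64
import io
import math


def encoded_length(n_bytes: int) -> int:
    """Length of the padded Base64 text for `n_bytes` of input."""
    return 4 * math.ceil(n_bytes / 3)


def encode_stream(src, dst, chunk_size: int = 3 * 1024) -> None:
    """Base64-encode `src` into `dst` chunk by chunk.

    The chunk size is a multiple of 3, so padding appears only at the end.
    """
    while chunk := src.read(chunk_size):
        dst.write(base64.b64encode(chunk))


source = io.BytesIO(bytes(range(256)) * 50)  # stand-in for a large media file
target = io.BytesIO()
encode_stream(source, target)
print(len(target.getvalue()), encoded_length(256 * 50))
```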
Product: Core → Firefox
Target Milestone: Future → ---
Version: Trunk → unspecified
Firefox Quantum won't work with the MHTML file extensions, so it would be really nice to have this resolved.