Save as "Web Page, Complete" for HTML should include meta charset

RESOLVED FIXED in mozilla1.9alpha5

Status

()

Core
Serializers
--
major
RESOLVED FIXED
11 years ago
10 years ago

People

(Reporter: Justin Cooper, Assigned: Ryan Jones)

Tracking

({intl})

Trunk
mozilla1.9alpha5
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite ?

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 2 obsolete attachments)

(Reporter)

Description

11 years ago
User-Agent:       Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1

This happens when you save a .htm file with Web Page complete with a web page with the unicode character ® This particular case produces this result:

While viewing source prior to saving this is displayed: <sup>&#174;</sup>
View saved source, this is displayed: <sup>®</sup>.

I think the desired output would be simply leaving the &#174; the same, and not converting it into the actual ® symbol.

Reproducible: Always

Steps to Reproduce:
1. Create a test.htm file.
2. Enter this code into the file: <sup>&#174;</sup>
3. Save the file.
4. Open the file with Firefox.
5. Save the file with Save As..Web Page, Complete.
6. Open the file.
Actual Results:  
This is displayed: ®
This is the source:
<html><head></head><body><sup>®</sup></body></html>

Expected Results:  
This should be displayed: ®
This should be the results: 
<html><head></head><body><sup>&#174;</sup></body></html>

This occurs with default install.  Default everything.

This happens on multiple computers.

This is confirmed by my organizations QA department.

Comment 1

11 years ago
Created attachment 216561 [details]
Test case with the code provided by the reporter.

Comment 2

11 years ago
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060327 Firefox/1.6a1

Works for me, with current build.
(Reporter)

Comment 3

11 years ago
Ok.  I found a perfect test case (My site is an internal secure site).

Take a look at Mozilla.com.

1.  Save Mozilla.com as Web Page Complete.

2.  Open saved file

3.  The Copyright symbol © (in the source it is &#169;) at the bottom has the strange  added to the beginning.

This looks like it's only happening when the site is encoded natively at UTF-8.  It also is only happening when the &#169; is used insted of &copy; is used.
(Reporter)

Updated

11 years ago
Summary: Web Page Complete parses unicode incorrectly. → "Web Page, Complete" Parses UTF-8 Incorrectly.
Version: unspecified → 1.5.0.x Branch

Comment 4

11 years ago
Thank you very much for that testcase. I am now able to confirm this bug.

Comment 5

11 years ago
Created attachment 216581 [details]
Save As Webpage, Complete of Mozilla.com

Updated

11 years ago
Attachment #216561 - Attachment is obsolete: true

Comment 6

11 years ago
This is just normal behaviour: When saving as "Web Page, complete", Firefox has to parse and modify the page before saving, so all the data passes through the DOM. There simply is no way to know afterwards how characters were originally encoded, so output can (and often will) differ substantially in such regard from the original page source. See also bug 225979 about a different effect caused by this.

If you want the unmodified original page source, save as "HTML only" instead.
(Reporter)

Comment 7

11 years ago
Wouldn't it be possible to detect the encoding of the original page, before the page is parsed?

Wouldn't it also be effective to leave the &#169, &#174, etc. in the code without modifying it to &copy; or whatnot?  There doesn't seem to be a reason (that I can think of) to change this part of the code.

It seems that the parser could be just a tad smarter.

Thank you.

Comment 8

11 years ago
This is not a parser problem:

After parsing, all the possible HTML code representations of a character look the same internally (otherwise rendering would become very inefficient). This part of the page is not changed because is has to, but it is just a side effect of the universal representation of characters inside the browser's DOM tree.

It is simply not possible that saving as "Web Page, complete" can completely match the original page's HTML source code. That's what "HTML only" is for.

Comment 9

11 years ago
As filed, I think this bug is WONTFIX -- Firefox will not remember whether characters in text nodes came from entities or directly from the source, because the DOM is equivalent and it would be wasteful to store that information for every text node even for pages that are not going to be saved.

But there's a real problem here: if you save mozilla.com, most of the "Other languages" links become unreadable (as well as the apostrophe in "It's free, and easy to use").  Firefox saves the page as UTF-8, but when you load it in Firefox or Safari, it gets treated as ISO-8859-1 because the page doesn't state its charset.

IMO, Firefox should do one of the following when saving HTML pages, at least for "Web page, complete":

1) Include a meta tag with the charset, e.g. <meta http-equiv="content-type" content="text/html; charset=UTF-8">.

2) Encode all non-ASCII characters as character entities.  (This would bloat Japanese pages quite a bit.)  See also the bug 119146, which was fixed by making certain characters (such as the character &nbsp; becomes) be output as entities.

The same problem exists for "Save as HTML only", but I'm not sure how that can be fixed without changing the HTML.

I'm morphing this bug to "Save as 'Web Page, Complete' for HTML should include meta charset or encode all non-ASCII characters as entities" and confirming it.
Blocks: 115634
Status: UNCONFIRMED → NEW
Component: General → DOM to Text Conversion
Ever confirmed: true
OS: Windows 2000 → All
Product: Firefox → Core
Hardware: PC → All
Summary: "Web Page, Complete" Parses UTF-8 Incorrectly. → Save as "Web Page, Complete" for HTML should include meta charset or encode all non-ASCII characters as entities
Version: 1.5.0.x Branch → Trunk

Updated

11 years ago
Assignee: nobody → dom-to-text
QA Contact: general

Updated

11 years ago
Severity: normal → major

Updated

11 years ago
Keywords: intl
Blocks: 105689
(Assignee)

Comment 10

10 years ago
Created attachment 257045 [details] [diff] [review]
Patch v1

Patch v1

* Add a meta tag directly to the document from nsHTMLContentSerializer. This will then add a sort of compatibility to the XML serializer which does something similar by adding the XML processing instructions tag to the start of the document.
Assignee: dom-to-text → sciguyryan
Status: NEW → ASSIGNED
Attachment #257045 - Flags: superreview?(bzbarsky)
Attachment #257045 - Flags: review?(bzbarsky)
I think I was too closely involved with this patch to review it.  I suggest review from an editor person (e.g. glazou) and probably sr from peterv or sicking.
(Assignee)

Updated

10 years ago
Attachment #257045 - Flags: superreview?(peterv)
Attachment #257045 - Flags: superreview?(bzbarsky)
Attachment #257045 - Flags: review?(daniel)
Attachment #257045 - Flags: review?(bzbarsky)
(Assignee)

Comment 12

10 years ago
Created attachment 257124 [details] [diff] [review]
Patch v1.1

Patch v1.1

A little change from Patch v1. I guess we should really obey the line-break rules for a normal meta tag by adding a new line before and after the added tag. (See |nsHTMLContentSerializer::LineBreakBeforeOpen|and |nsHTMLContentSerializer::LineBreakAfterOpen|)
Attachment #257045 - Attachment is obsolete: true
Attachment #257124 - Flags: superreview?(peterv)
Attachment #257124 - Flags: review?(daniel)
Attachment #257045 - Flags: superreview?(peterv)
Attachment #257045 - Flags: review?(daniel)
Comment on attachment 257124 [details] [diff] [review]
Patch v1.1

Sorry guys, I strongly disagree with the proposed changes.
First, there is no reason AT ALL why the document tree of the page should be changed when we save a document. Turning an entity reference into the corresponding char is not a doc tree change from our perspective BUT forcing a META is.
Second, this would SEVERELY impact the maintainability of non us-ascii and latin* pages, Jesse's example of japanese web pages is excellent.

From my point of view, this is clearly a no-go. On another hand, I think a possible solution is perhaps an enhanced "save page" dialog allowing you the options Nvu has for special chars : encode only & < > ' and nbsp, encode the above and latin1, encode all html4 special chars, use &#..; for all non ascii chars. Of course, the problem is a bit trickier for attribute values.
Attachment #257124 - Flags: review?(daniel) → review-

Comment 14

10 years ago
> Turning an entity reference into the corresponding char is not a doc tree change from our perspective BUT forcing a META is.

Adding a meta tag is less of a change than causing the page to be interpreted under the wrong character set when it is loaded locally.

> Second, this would SEVERELY impact the maintainability of non us-ascii and latin* pages, Jesse's example of japanese web pages is excellent.

How would adding a (correct) charset meta tag impact the maintainability of Japanese pages?

> On another hand, I think a possible solution is perhaps an enhanced "save page" dialog allowing you the options Nvu has for special chars...

Adding options is rarely the best way to fix buggy behavior.  What's wrong with just using entities for characters that can't be represented in the given charset (if any) along with a few special ones such as nbsp?

> Of course, the problem is a bit trickier for attribute values.

Only in that the double-quote character has to be escaped to "&quot;", right?
(Assignee)

Comment 15

10 years ago
increase in the page size.(In reply to comment #13)
> (From update of attachment 257124 [details] [diff] [review])
> Sorry guys, I strongly disagree with the proposed changes.
> First, there is no reason AT ALL why the document tree of the page should be
> changed when we save a document. Turning an entity reference into the
> corresponding char is not a doc tree change from our perspective BUT forcing a
> META is.

We do something simmilar for XML already by adding and changing the XML processing instruction to include the character set. See here: http://lxr.mozilla.org/seamonkey/source/content/base/src/nsXMLContentSerializer.cpp#985

> Second, this would SEVERELY impact the maintainability of non us-ascii and
> latin* pages, Jesse's example of japanese web pages is excellent.

I'm not sure I understand the reasoning here. If for example we are working in a Japanese character set then wouldn't it be beneficial to have that character set clearly defined to the document? Too me it makes more sence to do it this way than to encode the characters themselves which _would_ reduce readability and maintainability of the source code.
> First, there is no reason AT ALL why the document tree of the page should be
> changed when we save a document.

Because when loaded over HTTP the charset can come from the headers, not from the document.  When loaded from disk, this can't happen, so we have to make sure the document supplies the correct charset.

Note that the original use case is "save as complete web page" but I felt that hacking this in inside web browser persist wasn't as useful as just having the serializer do the right thing.  That said, I _am_ concerned about editor impact, so if there's a problem I would love to know what it is.  Example page where this would cause issues?
(In reply to comment #16)

> Because when loaded over HTTP the charset can come from the headers, not from
> the document.  When loaded from disk, this can't happen, so we have to make
> sure the document supplies the correct charset.

Good catch...

> Note that the original use case is "save as complete web page" but I felt that
> hacking this in inside web browser persist wasn't as useful as just having the
> serializer do the right thing.  That said, I _am_ concerned about editor
> impact, so if there's a problem I would love to know what it is.  Example page
> where this would cause issues?

People often "save web page as complete" to edit it locally. Then, they tweak
the URLs in the page to make it grokable again by the web site and finally
upload the page. But outputting entities everywhere could then multiply the size
of the document by a rather big factor...
To complete what I just said above, I don't really know what to do here. All
solutions seem to me a bit harmful to the document...
Where does the "outputting entities everywhere" part come in?  The whole point of this change is to store the original document charset so we can store the original bytes as much as possible instead of having to entity-encode them, no?
go ahead...
Attachment #257124 - Flags: review- → review+
Duplicate of this bug: 377568

Updated

10 years ago
Duplicate of this bug: 379632
Comment on attachment 257124 [details] [diff] [review]
Patch v1.1

>Index: content/base/src/nsHTMLContentSerializer.cpp
>===================================================================

>+    AppendToString(mLineBreak, aStr);
>+    AppendToString(NS_LITERAL_STRING("<meta http-equiv=\"content-type\""),
>+                   aStr);
>+    AppendToString(NS_LITERAL_STRING(" content=\"text/html; "), aStr);
>+    AppendToString(NS_ConvertASCIItoUTF16(mCharset), aStr);
>+    AppendToString(NS_LITERAL_STRING("\">"), aStr);
>+    AppendToString(mLineBreak, aStr);

Shouldn't this use LineBreakBeforeOpen/LineBreakAfterClose?
Attachment #257124 - Flags: superreview?(peterv) → superreview+
(Assignee)

Comment 24

10 years ago
(In reply to comment #23)
> Shouldn't this use LineBreakBeforeOpen/LineBreakAfterClose?
> 

If you think it matters I can use them (but I did follow the breaking rules for meta tags in any case).
Hmm, I guess it doesn't matter since we're adding a node that wasn't in the original document.
(Assignee)

Updated

10 years ago
Whiteboard: [checkin needed]

Comment 26

10 years ago
mozilla/content/base/src/nsHTMLContentSerializer.cpp  1.111
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Whiteboard: [checkin needed]
Target Milestone: --- → mozilla1.9alpha5

Updated

10 years ago
Summary: Save as "Web Page, Complete" for HTML should include meta charset or encode all non-ASCII characters as entities → Save as "Web Page, Complete" for HTML should include meta charset

Comment 27

10 years ago
The fix specifies charset incorrectly, see bug 380659.
Depends on: 380659
Depends on: 380668
(Assignee)

Updated

10 years ago
Depends on: 390735

Comment 28

10 years ago
Bug 105689 is fixed with this?!
Probably.  Feel free to test!
You need to log in before you can comment on or make changes to this bug.