331991 - Save as "Web Page, Complete" for HTML should include meta charset

Reporter

Description

•

19 years ago

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1

This happens when you use Save As "Web Page, Complete" on a .htm page that contains the Unicode character ®. This particular case produces this result: while viewing source prior to saving, this is displayed: &#174;. When viewing the saved source, this is displayed: Â®. I think the desired output would be to simply leave the &#174; as it is, and not convert it into the actual ® symbol.

Reproducible: Always

Steps to Reproduce:
1. Create a test.htm file.
2. Enter this code into the file: &#174;
3. Save the file.
4. Open the file with Firefox.
5. Save the file with Save As..Web Page, Complete.
6. Open the file.

Actual Results: This is displayed: Â® This is the source: <html><head></head><body>Â®</body></html>

Expected Results: This should be displayed: ® This should be the results: <html><head></head><body>&#174;</body></html>

This occurs with a default install. Default everything. This happens on multiple computers. This is confirmed by my organization's QA department.

Mark Romero

Comment 1

•

19 years ago

Attached file Test case with the code provided by the reporter. (obsolete) — Details

Mark Romero

Comment 2

•

19 years ago

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060327 Firefox/1.6a1 Works for me, with current build.

Justin Cooper

Reporter

Comment 3

•

19 years ago

Ok. I found a perfect test case (my site is an internal secure site). Take a look at Mozilla.com.
1. Save Mozilla.com as Web Page Complete.
2. Open the saved file.
3. The copyright symbol © (in the source it is &#169;) at the bottom has the strange Â added to the beginning.

This looks like it's only happening when the site is natively encoded as UTF-8. It also only happens when the © is used instead of ©.

Justin Cooper

Reporter

Updated

•

19 years ago

Summary: Web Page Complete parses unicode incorrectly. → "Web Page, Complete" Parses UTF-8 Incorrectly.

Version: unspecified → 1.5.0.x Branch

Mark Romero

Comment 4

•

19 years ago

Thank you very much for that testcase. I am now able to confirm this bug.

Mark Romero

Comment 5

•

19 years ago

Attached file Save As Webpage, Complete of Mozilla.com — Details

Mark Romero

Updated

•

19 years ago

Attachment #216561 - Attachment is obsolete: true

Elmar Ludwig

Comment 6

•

19 years ago

This is just normal behaviour: When saving as "Web Page, complete", Firefox has to parse and modify the page before saving, so all the data passes through the DOM. There simply is no way to know afterwards how characters were originally encoded, so output can (and often will) differ substantially in such regard from the original page source. See also bug 225979 about a different effect caused by this. If you want the unmodified original page source, save as "HTML only" instead.

Justin Cooper

Reporter

Comment 7

•

19 years ago

Wouldn't it be possible to detect the encoding of the original page, before the page is parsed? Wouldn't it also be effective to leave the &#169;, &#174;, etc. in the code without modifying them to © or whatnot? There doesn't seem to be a reason (that I can think of) to change this part of the code. It seems that the parser could be just a tad smarter. Thank you.

Elmar Ludwig

Comment 8

•

19 years ago

This is not a parser problem: After parsing, all the possible HTML code representations of a character look the same internally (otherwise rendering would become very inefficient). This part of the page is not changed because it has to be; it is just a side effect of the universal representation of characters inside the browser's DOM tree. It is simply not possible for saving as "Web Page, complete" to completely match the original page's HTML source code. That's what "HTML only" is for.
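The point about a single internal representation can be illustrated outside the browser. The following is a minimal Python sketch, not Firefox code; it uses the standard library's `html.unescape` as a stand-in for the step in which a parser resolves character references into text.

```python
import html

# Four source spellings of the registered-trademark sign:
# decimal reference, hex reference, named entity, literal character.
spellings = ["&#174;", "&#xAE;", "&reg;", "\u00ae"]

# html.unescape resolves character references, roughly as an HTML
# parser does when it builds text nodes.
decoded = [html.unescape(s) for s in spellings]

# Every spelling collapses to the same single code point, so the
# decoded text carries no record of which spelling the author used.
unique = set(decoded)
print(unique)  # {'®'}
```

Once the text is in this form, a serializer has to pick one output spelling for every occurrence, which is why the saved file cannot be expected to match the original source byte for byte.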

Jesse Ruderman

Comment 9

•

19 years ago

As filed, I think this bug is WONTFIX -- Firefox will not remember whether characters in text nodes came from entities or directly from the source, because the DOM is equivalent and it would be wasteful to store that information for every text node even for pages that are not going to be saved.

But there's a real problem here: if you save mozilla.com, most of the "Other languages" links become unreadable (as well as the apostrophe in "It's free, and easy to use"). Firefox saves the page as UTF-8, but when you load it in Firefox or Safari, it gets treated as ISO-8859-1 because the page doesn't state its charset.

IMO, Firefox should do one of the following when saving HTML pages, at least for "Web page, complete":
1) Include a meta tag with the charset, e.g. <meta http-equiv="content-type" content="text/html; charset=UTF-8">.
2) Encode all non-ASCII characters as character entities. (This would bloat Japanese pages quite a bit.)

See also bug 119146, which was fixed by making certain characters (such as the character &nbsp; becomes) be output as entities.

The same problem exists for "Save as HTML only", but I'm not sure how that can be fixed without changing the HTML.

I'm morphing this bug to "Save as 'Web Page, Complete' for HTML should include meta charset or encode all non-ASCII characters as entities" and confirming it.
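The diagnosis in comment 9 can be reproduced with a small Python sketch. This is an illustration only: `decode_saved_page` is a hypothetical helper and a much-simplified stand-in for real charset detection, assuming the loader looks for `charset=` near the top of the file and otherwise falls back to ISO-8859-1.

```python
import re

# Hypothetical helper: look for charset=... in the first 1024 bytes,
# otherwise fall back to ISO-8859-1 (the fallback described above).
def decode_saved_page(raw: bytes, fallback: str = "iso-8859-1") -> str:
    match = re.search(rb"charset=[\"']?([A-Za-z0-9_-]+)", raw[:1024])
    charset = match.group(1).decode("ascii") if match else fallback
    return raw.decode(charset)

body = "<body>\u00ae</body>"
meta = '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'

# The same UTF-8 bytes, saved without and with a charset declaration.
without_meta = decode_saved_page(body.encode("utf-8"))
with_meta = decode_saved_page((meta + body).encode("utf-8"))

print(without_meta)              # <body>Â®</body>
print(with_meta.endswith(body))  # True
```

The stray Â appears because ® is the two bytes C2 AE in UTF-8, and each of those bytes is a printable character of its own in ISO-8859-1.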

Blocks: 115634

Status: UNCONFIRMED → NEW

Component: General → DOM to Text Conversion

Ever confirmed: true

OS: Windows 2000 → All

Product: Firefox → Core

Hardware: PC → All

Summary: "Web Page, Complete" Parses UTF-8 Incorrectly. → Save as "Web Page, Complete" for HTML should include meta charset or encode all non-ASCII characters as entities

Version: 1.5.0.x Branch → Trunk

Jesse Ruderman

Updated

•

19 years ago

Assignee: nobody → dom-to-text

QA Contact: general

Jesse Ruderman

Updated

•

19 years ago

Severity: normal → major

Jesse Ruderman

Updated

•

19 years ago

Keywords: intl

Boris Zbarsky [:bzbarsky]

Updated

•

18 years ago

Blocks: 105689

Ryan Jones-Ward [:sciguyryan]

Assignee

Comment 10

•

18 years ago

Attached patch Patch v1 (obsolete) — Details — Splinter Review

Patch v1
* Add a meta tag directly to the document from nsHTMLContentSerializer. This brings it in line with the XML serializer, which does something similar by adding the XML processing instruction to the start of the document.

Assignee: dom-to-text → sciguyryan

Status: NEW → ASSIGNED

Attachment #257045 - Flags: superreview?(bzbarsky)

Attachment #257045 - Flags: review?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 11

•

18 years ago

I think I was too closely involved with this patch to review it. I suggest review from an editor person (e.g. glazou) and probably sr from peterv or sicking.

Ryan Jones-Ward [:sciguyryan]

Assignee

Updated

•

18 years ago

Attachment #257045 - Flags: superreview?(peterv)

Attachment #257045 - Flags: superreview?(bzbarsky)

Attachment #257045 - Flags: review?(daniel)

Attachment #257045 - Flags: review?(bzbarsky)

Ryan Jones-Ward [:sciguyryan]

Assignee

Comment 12

•

18 years ago

Attached patch Patch v1.1 — Details — Splinter Review

Patch v1.1
A little change from Patch v1. I guess we should really obey the line-break rules for a normal meta tag by adding a new line before and after the added tag. (See |nsHTMLContentSerializer::LineBreakBeforeOpen| and |nsHTMLContentSerializer::LineBreakAfterOpen|)

Attachment #257045 - Attachment is obsolete: true

Attachment #257124 - Flags: superreview?(peterv)

Attachment #257124 - Flags: review?(daniel)

Attachment #257045 - Flags: superreview?(peterv)

Attachment #257045 - Flags: review?(daniel)

Daniel Glazman (:glazou) (not active in Mozilla any more)

Comment 13

•

18 years ago

Comment on attachment 257124
Patch v1.1

Sorry guys, I strongly disagree with the proposed changes.

First, there is no reason AT ALL why the document tree of the page should be changed when we save a document. Turning an entity reference into the corresponding char is not a doc tree change from our perspective BUT forcing a META is.

Second, this would SEVERELY impact the maintainability of non us-ascii and latin* pages, Jesse's example of Japanese web pages is excellent.

From my point of view, this is clearly a no-go. On the other hand, I think a possible solution is perhaps an enhanced "save page" dialog allowing you the options Nvu has for special chars: encode only & < > ' and nbsp, encode the above and latin1, encode all html4 special chars, use &#..; for all non ascii chars. Of course, the problem is a bit trickier for attribute values.

Attachment #257124 - Flags: review?(daniel) → review-

Jesse Ruderman

Comment 14

•

18 years ago

> Turning an entity reference into the corresponding char is not a doc tree change from our perspective BUT forcing a META is.

Adding a meta tag is less of a change than causing the page to be interpreted under the wrong character set when it is loaded locally.

> Second, this would SEVERELY impact the maintainability of non us-ascii and latin* pages, Jesse's example of japanese web pages is excellent.

How would adding a (correct) charset meta tag impact the maintainability of Japanese pages?

> On another hand, I think a possible solution is perhaps an enhanced "save page" dialog allowing you the options Nvu has for special chars...

Adding options is rarely the best way to fix buggy behavior. What's wrong with just using entities for characters that can't be represented in the given charset (if any) along with a few special ones such as nbsp?

> Of course, the problem is a bit trickier for attribute values.

Only in that the double-quote character has to be escaped to "&quot;", right?
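The idea of using entities only for characters the target charset cannot represent can be sketched in Python with the standard `xmlcharrefreplace` error handler. This is an illustration, not the serializer's logic: `encode_for_charset` is a hypothetical name, and the sketch does not special-case nbsp or escape `&`, `<` and `>`.

```python
import html

def encode_for_charset(text: str, charset: str) -> bytes:
    """Encode text, falling back to numeric character references only
    for characters the target charset has no representation for."""
    return text.encode(charset, errors="xmlcharrefreplace")

sample = "caf\u00e9 \u2013 \u20ac"  # e-acute, en dash, euro sign

as_latin1 = encode_for_charset(sample, "iso-8859-1")
as_utf8 = encode_for_charset(sample, "utf-8")

print(as_latin1)                          # b'caf\xe9 &#8211; &#8364;'
print(as_utf8 == sample.encode("utf-8"))  # True

# Attribute values additionally need the double quote escaped.
attr = html.escape('say "hi"', quote=True)
print(attr)  # say &quot;hi&quot;
```

With UTF-8 as the target nothing needs a reference at all, so this approach only costs extra bytes when the chosen charset is narrower than the page content.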

Ryan Jones-Ward [:sciguyryan]

Assignee

Comment 15

•

18 years ago

(In reply to comment #13)
> (From update of attachment 257124)
> Sorry guys, I strongly disagree with the proposed changes.
> First, there is no reason AT ALL why the document tree of the page should be
> changed when we save a document. Turning an entity reference into the
> corresponding char is not a doc tree change from our perspective BUT forcing a
> META is.

We do something similar for XML already by adding and changing the XML processing instruction to include the character set. See here: http://lxr.mozilla.org/seamonkey/source/content/base/src/nsXMLContentSerializer.cpp#985

> Second, this would SEVERELY impact the maintainability of non us-ascii and
> latin* pages, Jesse's example of japanese web pages is excellent.

I'm not sure I understand the reasoning here. If, for example, we are working in a Japanese character set, then wouldn't it be beneficial to have that character set clearly defined in the document? To me it makes more sense to do it this way than to encode the characters themselves, which _would_ reduce readability and maintainability of the source code.

Boris Zbarsky [:bzbarsky]

Comment 16

•

18 years ago

> First, there is no reason AT ALL why the document tree of the page should be
> changed when we save a document.

Because when loaded over HTTP the charset can come from the headers, not from the document. When loaded from disk, this can't happen, so we have to make sure the document supplies the correct charset.

Note that the original use case is "save as complete web page" but I felt that hacking this in inside web browser persist wasn't as useful as just having the serializer do the right thing. That said, I _am_ concerned about editor impact, so if there's a problem I would love to know what it is. Example page where this would cause issues?
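The difference between the two loading paths can be sketched as follows. This is a hedged Python illustration using the standard library's header parsing, not browser code; `charset_from_content_type` is a hypothetical helper name.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(header_value: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, or None
    when there is no header or no charset parameter."""
    if header_value is None:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

# Over HTTP the server can state the charset outside the document.
over_http = charset_from_content_type("text/html; charset=UTF-8")

# A file opened from disk arrives with no headers at all.
from_disk = charset_from_content_type(None)

print(over_http)  # utf-8
print(from_disk)  # None
```

In the second case the only place left to carry the charset is the document itself, which is the argument for having the serializer write it there.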

Daniel Glazman (:glazou) (not active in Mozilla any more)

Comment 17

•

18 years ago

(In reply to comment #16)
> Because when loaded over HTTP the charset can come from the headers, not from
> the document. When loaded from disk, this can't happen, so we have to make
> sure the document supplies the correct charset.

Good catch...

> Note that the original use case is "save as complete web page" but I felt that
> hacking this in inside web browser persist wasn't as useful as just having the
> serializer do the right thing. That said, I _am_ concerned about editor
> impact, so if there's a problem I would love to know what it is. Example page
> where this would cause issues?

People often "save web page as complete" to edit it locally. Then, they tweak the URLs in the page to make it grokable again by the web site and finally upload the page. But outputting entities everywhere could then multiply the size of the document by a rather big factor...
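To put a rough number on the size concern, here is a small Python sketch comparing raw UTF-8 with an all-entities encoding. The three-character sample string is an arbitrary choice for illustration.

```python
japanese = "\u65e5\u672c\u8a9e"  # three kanji

# Raw UTF-8: three bytes per character for these code points.
as_utf8 = japanese.encode("utf-8")

# Every non-ASCII character written as a numeric character reference.
as_entities = japanese.encode("ascii", errors="xmlcharrefreplace")

print(len(as_utf8))      # 9
print(as_entities)       # b'&#26085;&#26412;&#35486;'
print(len(as_entities))  # 24
```

For this sample the all-entities form is a little under three times the size of the UTF-8 form; the ratio for a real page would depend on how much of it is markup and ASCII text.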

Daniel Glazman (:glazou) (not active in Mozilla any more)

Comment 18

•

18 years ago

To complete what I just said above, I don't really know what to do here. All solutions seem to me a bit harmful to the document...

Boris Zbarsky [:bzbarsky]

Comment 19

•

18 years ago

Where does the "outputting entities everywhere" part come in? The whole point of this change is to store the original document charset so we can store the original bytes as much as possible instead of having to entity-encode them, no?

Daniel Glazman (:glazou) (not active in Mozilla any more)

Comment 20

•

18 years ago

go ahead...

Daniel Glazman (:glazou) (not active in Mozilla any more)

Updated

•

18 years ago

Attachment #257124 - Flags: review- → review+

Peter Van der Beken [:peterv]

Comment 23

•

18 years ago

Comment on attachment 257124
Patch v1.1

>Index: content/base/src/nsHTMLContentSerializer.cpp
>===================================================================
>+ AppendToString(mLineBreak, aStr);
>+ AppendToString(NS_LITERAL_STRING("<meta http-equiv=\"content-type\""),
>+ aStr);
>+ AppendToString(NS_LITERAL_STRING(" content=\"text/html; "), aStr);
>+ AppendToString(NS_ConvertASCIItoUTF16(mCharset), aStr);
>+ AppendToString(NS_LITERAL_STRING("\">"), aStr);
>+ AppendToString(mLineBreak, aStr);

Shouldn't this use LineBreakBeforeOpen/LineBreakAfterClose?
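For readers who want to see what the quoted lines would emit, here is a Python transliteration of the string concatenation. The string pieces are copied from the quoted patch; the function and parameter names are hypothetical, and the sketch assumes mCharset holds a bare charset name such as UTF-8, which the quoted lines do not show.

```python
# Hypothetical stand-in for the quoted AppendToString calls;
# `charset` plays the role of mCharset, `line_break` of mLineBreak.
def meta_as_quoted(charset: str, line_break: str = "\n") -> str:
    return (
        line_break
        + '<meta http-equiv="content-type"'
        + ' content="text/html; '
        + charset
        + '">'
        + line_break
    )

emitted = meta_as_quoted("UTF-8").strip()

# The form given as an example in comment 9.
suggested = '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'

print(emitted)               # <meta http-equiv="content-type" content="text/html; UTF-8">
print(emitted == suggested)  # False
```

Under that assumption the emitted tag lacks the `charset=` prefix that the example in comment 9 includes.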

Attachment #257124 - Flags: superreview?(peterv) → superreview+

Ryan Jones-Ward [:sciguyryan]

Assignee

Comment 24

•

18 years ago

(In reply to comment #23)
> Shouldn't this use LineBreakBeforeOpen/LineBreakAfterClose?

If you think it matters I can use them (but I did follow the breaking rules for meta tags in any case).

Peter Van der Beken [:peterv]

Comment 25

•

18 years ago

Hmm, I guess it doesn't matter since we're adding a node that wasn't in the original document.

Ryan Jones-Ward [:sciguyryan]

Assignee

Updated

•

18 years ago

Whiteboard: [checkin needed]

Nickolay_Ponomarev

Comment 26

•

18 years ago

mozilla/content/base/src/nsHTMLContentSerializer.cpp 1.111

Status: ASSIGNED → RESOLVED

Closed: 18 years ago

Flags: in-testsuite?

Resolution: --- → FIXED

Whiteboard: [checkin needed]

Target Milestone: --- → mozilla1.9alpha5

Steffen Wilberg

Updated

•

18 years ago

Summary: Save as "Web Page, Complete" for HTML should include meta charset or encode all non-ASCII characters as entities → Save as "Web Page, Complete" for HTML should include meta charset

[:Aleksej]

Comment 27

•

18 years ago

The fix specifies charset incorrectly, see bug 380659.

Boris Zbarsky [:bzbarsky]

Updated

•

18 years ago

Depends on: 380659

Boris Zbarsky [:bzbarsky]

Updated

•

18 years ago

Depends on: 380668

Ryan Jones-Ward [:sciguyryan]

Assignee

Updated

•

18 years ago

Depends on: 390735

j.j.

Comment 28

•

18 years ago

Bug 105689 is fixed with this?!

Boris Zbarsky [:bzbarsky]

Comment 29

•

18 years ago

Probably. Feel free to test!

Gingerbread Man

Updated

•

6 years ago

Test case with the code provided by the reporter: 19 years ago, Mark Romero, 51 bytes, text/html
Save As Webpage, Complete of Mozilla.com: 19 years ago, Mark Romero, 10.12 KB, text/html
Patch v1: 18 years ago, Ryan Jones-Ward [:sciguyryan], 2.13 KB, patch
Patch v1.1: 18 years ago, Ryan Jones-Ward [:sciguyryan], 2.33 KB, patch (glazou: review+, peterv: superreview+)