829543 - Refactor hintCharset from nsDocumentViewer to reloadCharset on docshell

Assignee

Description

•

13 years ago

We have a couple of mystery properties on nsIMarkupDocumentViewer: hintCharacterSet and hintCharacterSetSource. We should get someone to investigate what they are for and why. If they are useless, they should be removed. If they are useful, they should be properly documented in nsIMarkupDocumentViewer.idl.

Henri Sivonen (:hsivonen)

Assignee

Comment 1

•

13 years ago

Also forceCharacterSet.

Henri Sivonen (:hsivonen)

Assignee

Comment 2

•

13 years ago

Argh. These are used by comm-central: /editor/ui/composer/content/editingOverlay.js (View Hg log or Hg annotations) line 122 -- contentViewer.forceCharacterSet = aCharset; /mailnews/base/src/nsMsgPrintEngine.cpp (View Hg log or Hg annotations) line 522 -- muDV->SetForceCharacterSet(NS_LITERAL_CSTRING("UTF-8")); /mailnews/compose/src/nsMsgCompose.cpp (View Hg log or Hg annotations) line 1578 -- NS_ENSURE_SUCCESS(markupCV->SetForceCharacterSet(msgCharSet), NS_ERROR_FAILURE); /mailnews/base/src/nsMessenger.cpp (View Hg log or Hg annotations) line 307 -- muDV->SetHintCharacterSet(aCharset); line 308 -- muDV->SetHintCharacterSetSource(kCharsetFromMetaTag);

Jonathan Protzenko [:protz]

Comment 3

•

13 years ago

This is also used by all other extensions that wish to display a message themselves. In Thunderbird Conversations, I'm doing a weird dance to have the message display properly: let cv = iframe.docShell.contentViewer; cv.QueryInterface(Ci.nsIMarkupDocumentViewer); cv.hintCharacterSet = "UTF-8"; cv.hintCharacterSetSource = kCharsetFromChannel; (before streaming a given message into the iframe). I believe that: 1) kCharsetFromMetaTag is incorrect; in Thunderbird Conversations I figured that out quite some time ago, and changed it to kCharsetFromChannel (see bug 594646 which is about precisely fixing that) 2) this is pretty ugly (as you already stated :)); any code that removes this would make me immensely happy. From what I recall, the code responsible for decoding a message is in the infamous libmime, and libmime outputs UTF-8 already. The code responsible for briding libmime and the rest of the streaming process is in mailnews/mime/src/mimemoz2.cpp and the code that makes sure libmime outputs utf8 is apparently in comi18n.cpp ; mime_convert_charset in mimemoz2.cpp seems interesting. If we could just add the right calls in mimemoz2.cpp so that they output a BOM and remove all this nonsense altogether, that would be a big win.

Josh Yuan

Comment 4

•

13 years ago

Hello there, My friend and I are interested in helping out. Is this bug still active? If so, where/how do we begin? Thanks.

Henri Sivonen (:hsivonen)

Assignee

Comment 5

•

13 years ago

(In reply to Josh Yuan from comment #4) > My friend and I are interested in helping out. Is this bug still active? > If so, where/how do we begin? Please see the comment previous to yours.

Rahul Gandhi

Comment 6

•

12 years ago

Hello sir, I wants to work on this bug. I have read all the comments and i am interested working on this project

Tony Mechelynck [:tonymec]

Comment 7

•

12 years ago

In reply to comment #3: Maybe I'm just clueless but what about the code (in mailnews and in Composer) which _sends_ a message or a page in a given charset? In any case, IMO there's no such thing as too detailed documentation (as long as it is accurate, of course).

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

12 years ago

(In reply to Rahul Gandhi from comment #6) > I wants to work on this bug. I have read all the comments and i am > interested working on this project Please go ahead. It appears that this is not being actively worked on by someone else. (In reply to Tony Mechelynck [:tonymec] from comment #7) > In reply to comment #3: Maybe I'm just clueless but what about the code (in > mailnews and in Composer) which _sends_ a message or a page in a given > charset? Does mailnews control serializer charset by forcing the docshell charset? If it was up to me, I'd make Composer always output UTF-8. Also, one has to wonder why mailnews couldn't always send UTF-8 these days. It's not like the ability to read UTF-8 at the other end is a super-new thing.

Tony Mechelynck [:tonymec]

Comment 9

•

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #8) > (In reply to Tony Mechelynck [:tonymec] from comment #7) > > In reply to comment #3: Maybe I'm just clueless but what about the code (in > > mailnews and in Composer) which _sends_ a message or a page in a given > > charset? > > Does mailnews control serializer charset by forcing the docshell charset? > > If it was up to me, I'd make Composer always output UTF-8. Also, one has to > wonder why mailnews couldn't always send UTF-8 these days. It's not like the > ability to read UTF-8 at the other end is a super-new thing. I don't know how it does it, but the user has the following choices (in Preferences unless otherwise shown): - select a "default charset" for new messages - select a charset (Options → Character Encoding) for the current message - decide whether replies should be sent in the charset of the incoming message of forced to the default charset - decide whether to use quoted-printable or 8bit if the message contains bytes above 0x7F I would also prefer UTF-8 but there are reasons for and against: - in my experience, Google Groups garbles 8-bit characters in messages which it doesn't receive in UTF-8 (that's a bug in Google, but I don't expect it to be fixed any time soon); - UTF-8 uses more bandwidth than a charset tailored to the language being used: for instance 2 bytes per Cyrillic letter compared to 1 in koi8-r or koi8-u (similarly with other single-byte encodings for Greek, Hebrew, Arabic, etc.); 3 bytes per hanzi compared to 2 in GB2312; etc.

Henri Sivonen (:hsivonen)

Assignee

Comment 10

•

12 years ago

(In reply to Tony Mechelynck [:tonymec] from comment #9) > I would also prefer UTF-8 but there are reasons for and against: Filed bug 862292. > - in my experience, Google Groups garbles 8-bit characters in messages which > it doesn't receive in UTF-8 (that's a bug in Google, but I don't expect it > to be fixed any time soon); So that's a reason *for*. > - UTF-8 uses more bandwidth than a charset tailored to the language being > used: for instance 2 bytes per Cyrillic letter compared to 1 in koi8-r or > koi8-u (similarly with other single-byte encodings for Greek, Hebrew, > Arabic, etc.); 3 bytes per hanzi compared to 2 in GB2312; etc. These days, there's so much video and images flying around on fast intertubes that trying to optimize the transmission size of text like that seems like the wrong optimization considering the complexity that arises from legacy encodings.

gillh11

Comment 11

•

12 years ago

Can I have this assigned to me?

Magnus Melin [:mkmelin]

Comment 12

•

12 years ago

Certainly.

Assignee: nobody → gillh11

Henri Sivonen (:hsivonen)

Assignee

Comment 13

•

12 years ago

(In reply to Tony Mechelynck [:tonymec] from comment #9) > - UTF-8 uses more bandwidth than a charset tailored to the language being > used: for instance 2 bytes per Cyrillic letter compared to 1 in koi8-r or > koi8-u The Russian localization of Thunderbird already default to UTF-8 for outgoing email.

Jonathan Protzenko [:protz]

Comment 14

•

12 years ago

I think you'd have to check with David Bienvenu, who said at some point that a significant fraction of mobile phones in Japan still weren't able to deal with UTF-8. That was a few years ago, though, iirc.

Nobody; OK to take it and work on it

Updated

•

11 years ago

Mentor: hsivonen

Whiteboard: [mentor=hsivonen][lang=C++] → [lang=C++]

Henri Sivonen (:hsivonen)

Assignee

Comment 15

•

8 years ago

(In reply to Jonathan Protzenko [:protz] from comment #3) > This is also used by all other extensions that wish to display a message > themselves. Is this still relevant in the post-57 world? Has Thunderbird already figured out its extension story going forward?

Magnus Melin [:mkmelin]

Comment 16

•

8 years ago

The plan is that Thunderbird will still support (all the) old style add-ons, as long as it's feasible.

Henri Sivonen (:hsivonen)

Assignee

Comment 18

•

5 years ago

It appears that hintCharset is used when the HTML parser requests a charset reload. Rename it accordingly and move it to the docshell.

Assignee: gillh11 → nobody

Summary: Investigate the purpose of nsIMarkupDocumentViewer.hintCharacterSet and then either document or remove it → Refactor hintCharset from nsDocumentViewer to reloadCharset on docshell

Henri Sivonen (:hsivonen)

Assignee

Comment 20

•

4 years ago

From the duplicate:

We use the "hint charset" as a place where we keep the encoding over a late meta or late encoding detection reload.

I don't see the propagation to children making any sense, since the children are destroyed by the reload and new child viewers created when the corresponding iframes are eventually re-created.

I suggest removing the CallChildren part and checking if the encoding reload stuff still works. https://searchfox.org/mozilla-central/source/testing/web-platform/tests/encoding-detection has reload tests, but they don't have iframes.

Sneha k

Comment 21

•

4 years ago

I want to work on this, can I get some information regarding this bug?

Henri Sivonen (:hsivonen)

Assignee

Comment 22

•

4 years ago

(In reply to Sneha k from comment #21)

I want to work on this, can I get some information regarding this bug?

The comment previous to your has the most up-to-date suggestion.

Henri Sivonen (:hsivonen)

Assignee

Comment 23

•

4 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #22)

(In reply to Sneha k from comment #21)

I want to work on this, can I get some information regarding this bug?

The comment previous to your has the most up-to-date suggestion.

Sorry about not waiting. I'm taking this to get the docshell stuff here off Fission audits.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5219a9d3ea58e2ecd9e716bedabb58d0ad7490fd

Assignee: nobody → hsivonen

Status: NEW → ASSIGNED

Henri Sivonen (:hsivonen)

Assignee

Comment 24

•

4 years ago

Attached file Bug 829543 - Rename hintCharset to reloadEncoding and remove unnecessary code. — Details

Pulsebot

Comment 25

•

4 years ago

Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/996f57ff4094 Rename hintCharset to reloadEncoding and remove unnecessary code. r=emk

Magnus Melin [:mkmelin]

Updated

•

4 years ago

Comment 26

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/996f57ff4094

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox90: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 90 Branch