Closed Bug 829543 Opened 13 years ago Closed 4 years ago

Refactor hintCharset from nsDocumentViewer to reloadCharset on docshell

Categories

(Core :: DOM: Navigation, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox90 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen, Mentored)

References

Details

(Whiteboard: [lang=C++])

Attachments

(1 file)

We have a couple of mystery properties on nsIMarkupDocumentViewer: hintCharacterSet and hintCharacterSetSource. We should get someone to investigate what they are for and why. If they are useless, they should be removed. If they are useful, they should be properly documented in nsIMarkupDocumentViewer.idl.
Also forceCharacterSet.
Argh. These are used by comm-central: /editor/ui/composer/content/editingOverlay.js (View Hg log or Hg annotations) line 122 -- contentViewer.forceCharacterSet = aCharset; /mailnews/base/src/nsMsgPrintEngine.cpp (View Hg log or Hg annotations) line 522 -- muDV->SetForceCharacterSet(NS_LITERAL_CSTRING("UTF-8")); /mailnews/compose/src/nsMsgCompose.cpp (View Hg log or Hg annotations) line 1578 -- NS_ENSURE_SUCCESS(markupCV->SetForceCharacterSet(msgCharSet), NS_ERROR_FAILURE); /mailnews/base/src/nsMessenger.cpp (View Hg log or Hg annotations) line 307 -- muDV->SetHintCharacterSet(aCharset); line 308 -- muDV->SetHintCharacterSetSource(kCharsetFromMetaTag);
This is also used by all other extensions that wish to display a message themselves. In Thunderbird Conversations, I'm doing a weird dance to have the message display properly: let cv = iframe.docShell.contentViewer; cv.QueryInterface(Ci.nsIMarkupDocumentViewer); cv.hintCharacterSet = "UTF-8"; cv.hintCharacterSetSource = kCharsetFromChannel; (before streaming a given message into the iframe). I believe that: 1) kCharsetFromMetaTag is incorrect; in Thunderbird Conversations I figured that out quite some time ago, and changed it to kCharsetFromChannel (see bug 594646 which is about precisely fixing that) 2) this is pretty ugly (as you already stated :)); any code that removes this would make me immensely happy. From what I recall, the code responsible for decoding a message is in the infamous libmime, and libmime outputs UTF-8 already. The code responsible for briding libmime and the rest of the streaming process is in mailnews/mime/src/mimemoz2.cpp and the code that makes sure libmime outputs utf8 is apparently in comi18n.cpp ; mime_convert_charset in mimemoz2.cpp seems interesting. If we could just add the right calls in mimemoz2.cpp so that they output a BOM and remove all this nonsense altogether, that would be a big win.
Hello there, My friend and I are interested in helping out. Is this bug still active? If so, where/how do we begin? Thanks.
(In reply to Josh Yuan from comment #4) > My friend and I are interested in helping out. Is this bug still active? > If so, where/how do we begin? Please see the comment previous to yours.
Hello sir, I wants to work on this bug. I have read all the comments and i am interested working on this project
In reply to comment #3: Maybe I'm just clueless but what about the code (in mailnews and in Composer) which _sends_ a message or a page in a given charset? In any case, IMO there's no such thing as too detailed documentation (as long as it is accurate, of course).
(In reply to Rahul Gandhi from comment #6) > I wants to work on this bug. I have read all the comments and i am > interested working on this project Please go ahead. It appears that this is not being actively worked on by someone else. (In reply to Tony Mechelynck [:tonymec] from comment #7) > In reply to comment #3: Maybe I'm just clueless but what about the code (in > mailnews and in Composer) which _sends_ a message or a page in a given > charset? Does mailnews control serializer charset by forcing the docshell charset? If it was up to me, I'd make Composer always output UTF-8. Also, one has to wonder why mailnews couldn't always send UTF-8 these days. It's not like the ability to read UTF-8 at the other end is a super-new thing.
(In reply to Henri Sivonen (:hsivonen) from comment #8) > (In reply to Tony Mechelynck [:tonymec] from comment #7) > > In reply to comment #3: Maybe I'm just clueless but what about the code (in > > mailnews and in Composer) which _sends_ a message or a page in a given > > charset? > > Does mailnews control serializer charset by forcing the docshell charset? > > If it was up to me, I'd make Composer always output UTF-8. Also, one has to > wonder why mailnews couldn't always send UTF-8 these days. It's not like the > ability to read UTF-8 at the other end is a super-new thing. I don't know how it does it, but the user has the following choices (in Preferences unless otherwise shown): - select a "default charset" for new messages - select a charset (Options → Character Encoding) for the current message - decide whether replies should be sent in the charset of the incoming message of forced to the default charset - decide whether to use quoted-printable or 8bit if the message contains bytes above 0x7F I would also prefer UTF-8 but there are reasons for and against: - in my experience, Google Groups garbles 8-bit characters in messages which it doesn't receive in UTF-8 (that's a bug in Google, but I don't expect it to be fixed any time soon); - UTF-8 uses more bandwidth than a charset tailored to the language being used: for instance 2 bytes per Cyrillic letter compared to 1 in koi8-r or koi8-u (similarly with other single-byte encodings for Greek, Hebrew, Arabic, etc.); 3 bytes per hanzi compared to 2 in GB2312; etc.
(In reply to Tony Mechelynck [:tonymec] from comment #9) > I would also prefer UTF-8 but there are reasons for and against: Filed bug 862292. > - in my experience, Google Groups garbles 8-bit characters in messages which > it doesn't receive in UTF-8 (that's a bug in Google, but I don't expect it > to be fixed any time soon); So that's a reason *for*. > - UTF-8 uses more bandwidth than a charset tailored to the language being > used: for instance 2 bytes per Cyrillic letter compared to 1 in koi8-r or > koi8-u (similarly with other single-byte encodings for Greek, Hebrew, > Arabic, etc.); 3 bytes per hanzi compared to 2 in GB2312; etc. These days, there's so much video and images flying around on fast intertubes that trying to optimize the transmission size of text like that seems like the wrong optimization considering the complexity that arises from legacy encodings.
Can I have this assigned to me?
Certainly.
Assignee: nobody → gillh11
(In reply to Tony Mechelynck [:tonymec] from comment #9) > - UTF-8 uses more bandwidth than a charset tailored to the language being > used: for instance 2 bytes per Cyrillic letter compared to 1 in koi8-r or > koi8-u The Russian localization of Thunderbird already default to UTF-8 for outgoing email.
I think you'd have to check with David Bienvenu, who said at some point that a significant fraction of mobile phones in Japan still weren't able to deal with UTF-8. That was a few years ago, though, iirc.
Mentor: hsivonen
Whiteboard: [mentor=hsivonen][lang=C++] → [lang=C++]
(In reply to Jonathan Protzenko [:protz] from comment #3) > This is also used by all other extensions that wish to display a message > themselves. Is this still relevant in the post-57 world? Has Thunderbird already figured out its extension story going forward?
The plan is that Thunderbird will still support (all the) old style add-ons, as long as it's feasible.

It appears that hintCharset is used when the HTML parser requests a charset reload. Rename it accordingly and move it to the docshell.

Assignee: gillh11 → nobody
Summary: Investigate the purpose of nsIMarkupDocumentViewer.hintCharacterSet and then either document or remove it → Refactor hintCharset from nsDocumentViewer to reloadCharset on docshell

From the duplicate:

We use the "hint charset" as a place where we keep the encoding over a late meta or late encoding detection reload.

I don't see the propagation to children making any sense, since the children are destroyed by the reload and new child viewers created when the corresponding iframes are eventually re-created.

I suggest removing the CallChildren part and checking if the encoding reload stuff still works. https://searchfox.org/mozilla-central/source/testing/web-platform/tests/encoding-detection has reload tests, but they don't have iframes.

I want to work on this, can I get some information regarding this bug?

(In reply to Sneha k from comment #21)

I want to work on this, can I get some information regarding this bug?

The comment previous to your has the most up-to-date suggestion.

(In reply to Henri Sivonen (:hsivonen) from comment #22)

(In reply to Sneha k from comment #21)

I want to work on this, can I get some information regarding this bug?

The comment previous to your has the most up-to-date suggestion.

Sorry about not waiting. I'm taking this to get the docshell stuff here off Fission audits.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5219a9d3ea58e2ecd9e716bedabb58d0ad7490fd

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/996f57ff4094 Rename hintCharset to reloadEncoding and remove unnecessary code. r=emk
See Also: → 1708178
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: