Closed Bug 1713786 Opened 5 months ago Closed 18 days ago

Fix "Repair Text Encoding" menu item not doing anything

Categories

(MailNews Core :: Internationalization, defect)

Thunderbird 91
defect

Tracking

(thunderbird_esr91+ affected)

RESOLVED FIXED
94 Branch
Tracking Status
thunderbird_esr91 + affected

People

(Reporter: freaktechnik, Assigned: freaktechnik)

References

Details

Attachments

(2 files, 2 obsolete files)

After bug 1704749 the "Repair Text Encoding" menu item will not do anything, since assigning the magic auto charset to the message window charset throws (this was already the case with the "Automatic" charset item before).

We can fix the display auto detection by skipping the message window parts. However, for optimal operation we should actually determine the detected charset and use that for all the operations. That way quoting the fixed message also uses the updated charset.

this was already the case with the "Automatic" charset item before

... and the "Japanese" entry already in TB 78, see bug 1677104.

Attached patch 1713786.patch - Solution idea (obsolete) — Splinter Review

This works on a couple of plain text messages I tried. Maybe this is the way to go. It will need refinement for multipart messages depending on whether the plaintext or HTML part is displayed, see pref mailnews.display.html_as.

Attachment #9224827 - Attachment is patch: true
Version: unspecified → Thunderbird 91

Comment on attachment 9224827 [details] [diff] [review]
1713786.patch - Solution idea

I'm adding Magnus for review to get action on the bug.

Attachment #9224827 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9224827 [details] [diff] [review]
1713786.patch - Solution idea

Review of attachment 9224827 [details] [diff] [review]:
-----------------------------------------------------------------

I'll let Martin check this
Attachment #9224827 - Flags: review?(mkmelin+mozilla) → review?(martin)

Comment on attachment 9224827 [details] [diff] [review]
1713786.patch - Solution idea

This is not really ready for review, see comment #2. The patch also contains copied code which should be moved to a common location. Meanwhile I noticed this add-on:
https://addons.thunderbird.net/en-US/thunderbird/addon/charset-menu/ which reinstates the charset menu for those who want it.

Attachment #9224827 - Flags: review?(martin)

It's possible (haven't tried yet) that the solution doesn't work any more after
https://hg.mozilla.org/comm-central/rev/9473904d2047fc7d42476625eb8f4cf8d5a6f11f#l12.408
See my question in bug 1713145 comment #15.

The patch in bug 1713145 does not alter any core behavior. It fixes the newly introduces MimeParser.extractMimeMsg() jsmime based mime parser to align better with Thunderbirds core libmime based mime parser. It is currently only used in the messages WebExtension API for nntp messages.

The patch in bug 1713145 does not alter any core behavior.

Well, mailnews/mime/src/mimeParser.jsm appears to a Thunderbird core module, and yes, its behaviour has changed to the point where the patch here doesn't work any more. I've fixed it now by explicitly requesting strformat: "binarystring".

Refer to comment #2 for more work to be done here.

Attachment #9224827 - Attachment is obsolete: true

I was not aware (and did not check) that you actually use my emitter, since I added it just 2 weeks ago. With "does not alter any core behavior" I meant no core component is using it and therefore nothing could have changed.

You are only looking at the first parts body, so you do not need the full emitter, maybe it is better to use extractHeadersAndBody or add a simple emitter for your use case. I am not a friend of making the strformat of my emitter configurable.

Thanks for the further comment, extractHeadersAndBody() indeed sounds more suitable. I don't intend to fix this bug here. All I've done was to present a possible solution, which is basically:

   // get the relevant body part somehow, depends on what parts are present
   // and which part is displayed (pref mailnews.display.html_as)
+  let body = ...
+  let compUtils = Cc[
+    "@mozilla.org/messengercompose/computils;1"
+  ].createInstance(Ci.nsIMsgCompUtils);
+  let charset = compUtils.detectCharset(body);
+
+  msgWindow.mailCharacterSet = charset;

I hope someone will implement this correctly or come up with a different solution.

I am realizing that this particular issue is not the proper place for such complains, but it seems I am too late and more relevant tasks are closed already.

Is charset menu completely removed from thunderbird? During pre-unicode age, around year 2000. legacy 8-bit encodings were widely used. There are several charsets per language, e.g. koi8-r and windows-1251 for Russian (and cp866, iso8859-5, something rare for macs). Letters from Windows users often did not have explicit charset specified. Even on unix/linux it depended if localization were achieved with some hacks. I regularly received messages with some "assumed" encoding or event with wrongly specified encoding. E.g. in pine it was necessary to setup special filters to read such messages. Tools intended to guess encoding were unreliable. That is why I consider charset menu is important. It can reside deeper in some submenu but it should be available for a case when it is necessary to read an old message. It is not common nowadays, so telemetry could easily show that it is rarely used, especially for firefox since many of old sites are already dead. From mail user agent I expect more conservative feature set. Please, consider restoring of ability to manually choose message encoding.

Sorry if I missed something and this comment is a false alarm.

Is charset menu completely removed from thunderbird?

Yes, in bug 1704749, currently only visible in TB Daily. Did you see comment #5 mentioning https://addons.thunderbird.net/en-US/thunderbird/addon/charset-menu/?

IMHO the removal of features without hard facts to justify such removal is questionable (see discussion at the end of bug 1704749). Replacing the charset menu (mostly working, see bug 1677104) with a 100%-currently-not-working automatic feature is rather unfortunate. Let's hope it gets fixed before TB 91 goes to beta and later becomes the new TB 91 ESR. Currently fixing a wrong charset is completely broken in TB Daily for both message display (this bug here) and message source display (bug 1716059).

There is another good reason to restore the charset menu as discovered in bug 1716059. The "repair" is based on automatic detection of the charset and that only works if you feed it 8bit content. For messages with CTE quoted-printable there is now no way to correct the charset.

The patch here won't work if the message has CTE quoted-printable. In this case the retrieved body is ASCII and nothing can be detected. So at least QP needs to be decoded before passing the body to charset detection.

Apologies! Just ignore the last two comments. The patch in attachment 9226588 [details] [diff] [review] does work, even for QP messages. Looks like extractMimeMsg() already returns the QP-decoded result. So it's just a matter of addressing comment #2 and comment #9/comment #10.

(In reply to max from comment #11)

Tools intended to guess encoding were unreliable.

While chardetng isn't reliable for some Latin-script languages, notably Hungarian and Lithuanian, for email-length Russian inputs, it should be very, very accurate, though of course someone will find some counterexamples.

José, thank you for details. I have not tried quality of automatic detection yet (I hope, I can do it using experiments API for webextensions). In meanwhile a couple of additional notes:

  1. Bug #1687635 comment #54 mentions https://support.mozilla.org/en-US/kb/text-encoding-no-longer-available-firefox-menu It seems, in Firefox encodings list is removed from hamburger menu only and it is still available from (hidden by default) menu bar. I have not tried it however.

  2. Besides a body, a message may have attachments. By default text attachments are not shown, but configuration may be adjusted to display them. Encoding of some attachments may differ from encoding of the body. Choosing explicit encoding it was possible to read particular part even though all others became unreadable.

(In reply to max from comment #17)

José, thank you for details. I have not tried quality of automatic detection yet (I hope, I can do it using experiments API for webextensions). In meanwhile a couple of additional notes:

  1. Bug #1687635 comment #54 mentions https://support.mozilla.org/en-US/kb/text-encoding-no-longer-available-firefox-menu It seems, in Firefox encodings list is removed from hamburger menu only and it is still available from (hidden by default) menu bar. I have not tried it however.

The list was removed from the hamburger menu in 89. In the two remaining entry points, the list was replaced with the single action in 91.

  1. Besides a body, a message may have attachments. By default text attachments are not shown, but configuration may be adjusted to display them. Encoding of some attachments may differ from encoding of the body. Choosing explicit encoding it was possible to read particular part even though all others became unreadable.

When the detector is invoked, it would make sense to run it on a per-message-part basis.

Besides a body, a message may have attachments. ... Choosing explicit encoding it was possible to read particular part ...

Yes, being able to force a particular charset was useful, but somehow more an more niche features are removed from Thunderbird. You can now use the add-on I mentioned to do it. It has the added geek bonus that the charset name is mentioned explicitly, so you don't have to search for "Greek ISO" if you're interested, for example. Working automatic detection would be good for people who don't want to use "trial and error".

When the detector is invoked, it would make sense to run it on a per-message-part basis.

Good idea, but that will make the implementation harder. Besides, the entire message is one DOM document with many (attachment) parts strung together, so there can only be one charset. How would you select an inline attachment to use that part for the detection? I'd say all this is too geeky, you can save the attachment and view it. It's important that you can fix the message body if it's not properly displayed.

(In reply to José M. Muñoz from comment #19)

Besides, the entire message is one DOM document with many (attachment) parts strung together, so there can only be one charset.

Yes, mailnews converts to UTF-8 before handing the data to m-c code.

How would you select an inline attachment to use that part for the detection?

The detection needs to happen, as it, I believe, already does for message parts without charset, in the MIME engine after decoding QP or Base64 but before converting the parts to UTF-8 and joining them together.

(In reply to José M. Muñoz from comment #19)

Yes, being able to force a particular charset was useful, but somehow more an more niche features are removed from Thunderbird. You can now use the add-on I mentioned to do it.

I would say that ability to deal with partially broken messages has a real value for a mail user agent. It is rarely required last years, but still...

Concerning extension, for browsers WebExtensions API limits to some extent what an add-on can do. Charset menu in Thunderbird requires experiments API, so full access. As the consequence, the author should have ideal reputation, e.g. it is necessary to be sure that he will not accidentally lost control, will not sell the extension to untrusted 3rd party company, etc. This particular extension does not have a link to source code repository, no license specified (though GPL-3 inside .xpi). I do not think, it is comfortable situation when someone is in a hurry to read a message, but it is necessary to install a suspicious add-on.

I have tried charset detection for several messages having KOI8-R and windows-1251 encodings. Charset is recognized correctly for these examples. KOI8-U instead of KOI8-R is not a real problem since messages do not have characters in the range where encodings differ from each other.

Off-topic re. that add-on: Until TB 68 all add-ons were "privileged" (more than once breaking TB). The full source code is in the add-on. BTW, the author used to work for TB. For me the add-on works fine and I can't see anything rogue in the ~120 lines of JS.

I have tried charset detection for several messages having KOI8-R and windows-1251 encodings.

How did you do that since automatic detection is broken and has always been broken since it was never correctly implemented?

(In reply to Henri Sivonen (:hsivonen) from comment #20)

The detection needs to happen, as it, I believe, already does for message parts without charset, in the MIME engine after decoding QP or Base64 but before converting the parts to UTF-8 and joining them together.

As far as I can tell, there is no detection for messages without charset, this code
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/mime/src/comi18n.cpp#37
isn't triggered during my debugging. In general, the MIME code is 20+ years old, there are few (if any) people on the TB team with a working understanding of it. As you can see, the bug has not received any "official" attention yet, that's why I suggested a hacky solution to get the body and detect its charset.

(In reply to José M. Muñoz from comment #22)

the bug has not received any "official" attention yet

Its existence is "official" attention in the first place.

I agree that the solution of having the auto detection happen per mime part is probably the best way to do it.

Sorry, slip of the keyboard, s/any/much/. Hard to understand how TB does its scheduling, it's been broken for two weeks now and nobody is assigned to the bug. No priority/severity set although this is a user-facing regression and loss of functionality.

(In reply to max from comment #21)

KOI8-U instead of KOI8-R is not a real problem since messages do not have characters in the range where encodings differ from each other.

The detector never answers KOI8-R. That is, if the input looks like KOI8-R or KOI8-U, it answers KOI8-U without trying to distinguish between the two. Where the two encodings differ, it's 1) very unlikely that box drawing would be used on the Web or in email and 2) the failure mode of letters getting replaced with box segments is worse than the failure mode of box segments getting replaced with letters.

(In reply to José M. Muñoz from comment #22)

As far as I can tell, there is no detection for messages without charset, this code
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/mime/src/comi18n.cpp#37
isn't triggered during my debugging.

Interesting. It's rather unfortunate that there's code that gets updated for m-c changes without checking that it does what it's supposed to be doing.

Interesting. It's rather unfortunate that there's code that gets updated for m-c changes without checking that it does what it's supposed to be doing.

The other call site does get exercised so we know that charset detection works:
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/compose/src/nsMsgCompUtils.cpp#75
There's also a test: test_detectAttachmentCharset.js.

MIME_detect_charset and its callers are only called in mimetext.cpp, so only for plaintext mail:
https://searchfox.org/comm-central/search?q=MIME_detect_charset&redirect=false
No idea what that code does.

I tested with flowed and non-flowed plaintext mail and the code wasn't triggered, hence a missing charset wasn't replaced by a detected one, everything was displayed as if it were UTF-8.

A bit off-topic: Note that flowed plaintext mail is processed in mimetpfl.cpp (tpfl: t-ext p-lain fl-owed). FYI, MIME has that class model
https://searchfox.org/comm-central/rev/e5e75651c5fb70526ae298312d99bc37ffd1ad32/mailnews/mime/src/mimei.h#24
implemented in C, so every class implements its own methods or calls into the parent. All that's done by passing around function pointers. There was an attempt to straighten that out, see bug 1463289.

(In reply to José M. Muñoz from comment #22)

How did you do that since automatic detection is broken and has always been broken since it was never correctly implemented?

that's why I suggested a hacky solution to get the body and detect its charset.

I was afraid that charset detector behavior may be completely unreliable. So I created a small extension that is based on your "hacky" patch (by the way, const or let is missed before mimeMsg). I did not touch menu entries, etc. I tried several phrases and fragments of messages encoded as koi8-r and cp1251. I do not have a convincing example that charset detector does not work (however I would still prefer to have charset menu as a fallback). Since I use Thunderbird from Ubuntu packages, I do not bother too much concerning version 91 (update from 68 was not so much time ago). I hope, it would not be required to exercise with iconv to read a message.

var tst = class extends ExtensionCommon.ExtensionAPI {
	getAPI(context) {
		return { tst: {
			async charset(raw) {
				const mimeMsg = MimeParser.extractMimeMsg(raw, {
					includeAttachments: false,
					strformat: "binarystring",
				});
				let body = mimeMsg.parts[0].body;
				let compUtils = Cc[
					"@mozilla.org/messengercompose/computils;1"
				].createInstance(Ci.nsIMsgCompUtils);
				let charset = compUtils.detectCharset(body);
				return { body, charset };
			},

+ patched mimeParser.jsm

See Also: → 1717523

Thank you all for the discussion here. We've completed José's idea and included it in our product. John, we've also followed your advice and added another emitter:
https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/1713786-fix-repair-charset.patch

Hmm, this comes as a surprise, I didn't know there's another product now. Anyway, you're welcome and it's all FOSS anyway. We have a saying that goes (translated): "Competition stimulates business".

(In reply to José M. Muñoz from comment #12)

Is charset menu completely removed from thunderbird?

Yes, in bug 1704749, currently only visible in TB Daily. Did you see comment #5 mentioning https://addons.thunderbird.net/en-US/thunderbird/addon/charset-menu/?

IMHO the removal of features without hard facts to justify such removal is questionable (see discussion at the end of bug 1704749). Replacing the charset menu (mostly working, see bug 1677104) with a 100%-currently-not-working automatic feature is rather unfortunate. Let's hope it gets fixed before TB 91 goes to beta and later becomes the new TB 91 ESR. Currently fixing a wrong charset is completely broken in TB Daily for both message display (this bug here) and message source display (bug 1716059).

Looks like it didn't.

(In reply to José M. Muñoz from comment #24)

Sorry, slip of the keyboard, s/any/much/. Hard to understand how TB does its scheduling, it's been broken for two weeks now and nobody is assigned to the bug. No priority/severity set although this is a user-facing regression and loss of functionality.

So what is the solution? Back out bug 1704749?

Assignee: nobody → martin
Status: NEW → ASSIGNED
Attachment #9226588 - Attachment is obsolete: true

Pushed by alessandro@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/302c32860a1f
Do mime detection instead of charset override. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 18 days ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch

nsIMessenger.setDocumentCharset() also corrected the subject display in the header pane (not the thread pane), your solution doesn't. Here's a test message from a Russian news group.

I don't understand why the charset override has been removed, instead of detection being just added. Detection doesn't always work. If the user can't force the charset, that means that some messages will now be forever unreadable.

(This should eventually go uplift to 91, but best to let it bake on beta for the full cycle.)

You need to log in before you can comment on or make changes to this bug.