Fix "Repair Text Encoding" menu item not doing anything
Categories
(MailNews Core :: Internationalization, defect)
Tracking
(thunderbird_esr91+ fixed)
People
(Reporter: freaktechnik, Assigned: freaktechnik)
References
(Regressed 1 open bug)
Details
Attachments
(8 files, 3 obsolete files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
1.65 KB,
text/plain
|
Details | |
1.22 KB,
text/plain
|
Details | |
1.19 KB,
message/rfc822
|
Details | |
28.50 KB,
image/png
|
Details | |
50.77 KB,
patch
|
rjl
:
approval-comm-esr91+
|
Details | Diff | Splinter Review |
90.40 KB,
message/rfc822
|
Details | |
90.39 KB,
text/plain
|
Details |
After bug 1704749 the "Repair Text Encoding" menu item will not do anything, since assigning the magic auto charset to the message window charset throws (this was already the case with the "Automatic" charset item before).
We can fix the display auto detection by skipping the message window parts. However, for optimal operation we should actually determine the detected charset and use that for all the operations. That way quoting the fixed message also uses the updated charset.
Comment 1•4 years ago
|
||
this was already the case with the "Automatic" charset item before
... and the "Japanese" entry already in TB 78, see bug 1677104.
Comment 2•4 years ago
|
||
This works on a couple of plain text messages I tried. Maybe this is the way to go. It will need refinement for multipart messages depending on whether the plaintext or HTML part is displayed, see pref mailnews.display.html_as.
Updated•4 years ago
|
Updated•3 years ago
|
Comment 3•3 years ago
|
||
Comment on attachment 9224827 [details] [diff] [review]
1713786.patch - Solution idea
I'm adding Magnus for review to get action on the bug.
Comment 4•3 years ago
|
||
Comment 5•3 years ago
|
||
Comment on attachment 9224827 [details] [diff] [review]
1713786.patch - Solution idea
This is not really ready for review, see comment #2. The patch also contains copied code which should be moved to a common location. Meanwhile I noticed this add-on:
https://addons.thunderbird.net/en-US/thunderbird/addon/charset-menu/ which reinstates the charset menu for those who want it.
Comment 6•3 years ago
|
||
It's possible (haven't tried yet) that the solution doesn't work any more after
https://hg.mozilla.org/comm-central/rev/9473904d2047fc7d42476625eb8f4cf8d5a6f11f#l12.408
See my question in bug 1713145 comment #15.
Comment 7•3 years ago
|
||
The patch in bug 1713145 does not alter any core behavior. It fixes the newly introduces MimeParser.extractMimeMsg()
jsmime based mime parser to align better with Thunderbirds core libmime based mime parser. It is currently only used in the messages WebExtension API for nntp messages.
Comment 8•3 years ago
|
||
The patch in bug 1713145 does not alter any core behavior.
Well, mailnews/mime/src/mimeParser.jsm appears to a Thunderbird core module, and yes, its behaviour has changed to the point where the patch here doesn't work any more. I've fixed it now by explicitly requesting strformat: "binarystring"
.
Refer to comment #2 for more work to be done here.
Comment 9•3 years ago
|
||
I was not aware (and did not check) that you actually use my emitter, since I added it just 2 weeks ago. With "does not alter any core behavior" I meant no core component is using it and therefore nothing could have changed.
You are only looking at the first parts body, so you do not need the full emitter, maybe it is better to use extractHeadersAndBody or add a simple emitter for your use case. I am not a friend of making the strformat of my emitter configurable.
Comment 10•3 years ago
|
||
Thanks for the further comment, extractHeadersAndBody()
indeed sounds more suitable. I don't intend to fix this bug here. All I've done was to present a possible solution, which is basically:
// get the relevant body part somehow, depends on what parts are present
// and which part is displayed (pref mailnews.display.html_as)
+ let body = ...
+ let compUtils = Cc[
+ "@mozilla.org/messengercompose/computils;1"
+ ].createInstance(Ci.nsIMsgCompUtils);
+ let charset = compUtils.detectCharset(body);
+
+ msgWindow.mailCharacterSet = charset;
I hope someone will implement this correctly or come up with a different solution.
Updated•3 years ago
|
Comment 11•3 years ago
|
||
I am realizing that this particular issue is not the proper place for such complains, but it seems I am too late and more relevant tasks are closed already.
Is charset menu completely removed from thunderbird? During pre-unicode age, around year 2000. legacy 8-bit encodings were widely used. There are several charsets per language, e.g. koi8-r and windows-1251 for Russian (and cp866, iso8859-5, something rare for macs). Letters from Windows users often did not have explicit charset specified. Even on unix/linux it depended if localization were achieved with some hacks. I regularly received messages with some "assumed" encoding or event with wrongly specified encoding. E.g. in pine it was necessary to setup special filters to read such messages. Tools intended to guess encoding were unreliable. That is why I consider charset menu is important. It can reside deeper in some submenu but it should be available for a case when it is necessary to read an old message. It is not common nowadays, so telemetry could easily show that it is rarely used, especially for firefox since many of old sites are already dead. From mail user agent I expect more conservative feature set. Please, consider restoring of ability to manually choose message encoding.
Sorry if I missed something and this comment is a false alarm.
Comment 12•3 years ago
|
||
Is charset menu completely removed from thunderbird?
Yes, in bug 1704749, currently only visible in TB Daily. Did you see comment #5 mentioning https://addons.thunderbird.net/en-US/thunderbird/addon/charset-menu/?
IMHO the removal of features without hard facts to justify such removal is questionable (see discussion at the end of bug 1704749). Replacing the charset menu (mostly working, see bug 1677104) with a 100%-currently-not-working automatic feature is rather unfortunate. Let's hope it gets fixed before TB 91 goes to beta and later becomes the new TB 91 ESR. Currently fixing a wrong charset is completely broken in TB Daily for both message display (this bug here) and message source display (bug 1716059).
Comment 13•3 years ago
|
||
There is another good reason to restore the charset menu as discovered in bug 1716059. The "repair" is based on automatic detection of the charset and that only works if you feed it 8bit content. For messages with CTE quoted-printable there is now no way to correct the charset.
Comment 14•3 years ago
|
||
The patch here won't work if the message has CTE quoted-printable. In this case the retrieved body is ASCII and nothing can be detected. So at least QP needs to be decoded before passing the body to charset detection.
Comment 15•3 years ago
|
||
Apologies! Just ignore the last two comments. The patch in attachment 9226588 [details] [diff] [review] does work, even for QP messages. Looks like extractMimeMsg()
already returns the QP-decoded result. So it's just a matter of addressing comment #2 and comment #9/comment #10.
Comment 16•3 years ago
|
||
(In reply to max from comment #11)
Tools intended to guess encoding were unreliable.
While chardetng isn't reliable for some Latin-script languages, notably Hungarian and Lithuanian, for email-length Russian inputs, it should be very, very accurate, though of course someone will find some counterexamples.
Comment 17•3 years ago
|
||
José, thank you for details. I have not tried quality of automatic detection yet (I hope, I can do it using experiments API for webextensions). In meanwhile a couple of additional notes:
-
Bug #1687635 comment #54 mentions https://support.mozilla.org/en-US/kb/text-encoding-no-longer-available-firefox-menu It seems, in Firefox encodings list is removed from hamburger menu only and it is still available from (hidden by default) menu bar. I have not tried it however.
-
Besides a body, a message may have attachments. By default text attachments are not shown, but configuration may be adjusted to display them. Encoding of some attachments may differ from encoding of the body. Choosing explicit encoding it was possible to read particular part even though all others became unreadable.
Comment 18•3 years ago
|
||
(In reply to max from comment #17)
José, thank you for details. I have not tried quality of automatic detection yet (I hope, I can do it using experiments API for webextensions). In meanwhile a couple of additional notes:
- Bug #1687635 comment #54 mentions https://support.mozilla.org/en-US/kb/text-encoding-no-longer-available-firefox-menu It seems, in Firefox encodings list is removed from hamburger menu only and it is still available from (hidden by default) menu bar. I have not tried it however.
The list was removed from the hamburger menu in 89. In the two remaining entry points, the list was replaced with the single action in 91.
- Besides a body, a message may have attachments. By default text attachments are not shown, but configuration may be adjusted to display them. Encoding of some attachments may differ from encoding of the body. Choosing explicit encoding it was possible to read particular part even though all others became unreadable.
When the detector is invoked, it would make sense to run it on a per-message-part basis.
Comment 19•3 years ago
|
||
Besides a body, a message may have attachments. ... Choosing explicit encoding it was possible to read particular part ...
Yes, being able to force a particular charset was useful, but somehow more an more niche features are removed from Thunderbird. You can now use the add-on I mentioned to do it. It has the added geek bonus that the charset name is mentioned explicitly, so you don't have to search for "Greek ISO" if you're interested, for example. Working automatic detection would be good for people who don't want to use "trial and error".
When the detector is invoked, it would make sense to run it on a per-message-part basis.
Good idea, but that will make the implementation harder. Besides, the entire message is one DOM document with many (attachment) parts strung together, so there can only be one charset. How would you select an inline attachment to use that part for the detection? I'd say all this is too geeky, you can save the attachment and view it. It's important that you can fix the message body if it's not properly displayed.
Comment 20•3 years ago
|
||
(In reply to José M. Muñoz from comment #19)
Besides, the entire message is one DOM document with many (attachment) parts strung together, so there can only be one charset.
Yes, mailnews converts to UTF-8 before handing the data to m-c code.
How would you select an inline attachment to use that part for the detection?
The detection needs to happen, as it, I believe, already does for message parts without charset
, in the MIME engine after decoding QP or Base64 but before converting the parts to UTF-8 and joining them together.
Comment 21•3 years ago
|
||
(In reply to José M. Muñoz from comment #19)
Yes, being able to force a particular charset was useful, but somehow more an more niche features are removed from Thunderbird. You can now use the add-on I mentioned to do it.
I would say that ability to deal with partially broken messages has a real value for a mail user agent. It is rarely required last years, but still...
Concerning extension, for browsers WebExtensions API limits to some extent what an add-on can do. Charset menu in Thunderbird requires experiments API, so full access. As the consequence, the author should have ideal reputation, e.g. it is necessary to be sure that he will not accidentally lost control, will not sell the extension to untrusted 3rd party company, etc. This particular extension does not have a link to source code repository, no license specified (though GPL-3 inside .xpi). I do not think, it is comfortable situation when someone is in a hurry to read a message, but it is necessary to install a suspicious add-on.
I have tried charset detection for several messages having KOI8-R and windows-1251 encodings. Charset is recognized correctly for these examples. KOI8-U instead of KOI8-R is not a real problem since messages do not have characters in the range where encodings differ from each other.
Comment 22•3 years ago
|
||
Off-topic re. that add-on: Until TB 68 all add-ons were "privileged" (more than once breaking TB). The full source code is in the add-on. BTW, the author used to work for TB. For me the add-on works fine and I can't see anything rogue in the ~120 lines of JS.
I have tried charset detection for several messages having KOI8-R and windows-1251 encodings.
How did you do that since automatic detection is broken and has always been broken since it was never correctly implemented?
(In reply to Henri Sivonen (:hsivonen) from comment #20)
The detection needs to happen, as it, I believe, already does for message parts without
charset
, in the MIME engine after decoding QP or Base64 but before converting the parts to UTF-8 and joining them together.
As far as I can tell, there is no detection for messages without charset, this code
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/mime/src/comi18n.cpp#37
isn't triggered during my debugging. In general, the MIME code is 20+ years old, there are few (if any) people on the TB team with a working understanding of it. As you can see, the bug has not received any "official" attention yet, that's why I suggested a hacky solution to get the body and detect its charset.
Assignee | ||
Comment 23•3 years ago
|
||
(In reply to José M. Muñoz from comment #22)
the bug has not received any "official" attention yet
Its existence is "official" attention in the first place.
I agree that the solution of having the auto detection happen per mime part is probably the best way to do it.
Comment 24•3 years ago
|
||
Sorry, slip of the keyboard, s/any/much/. Hard to understand how TB does its scheduling, it's been broken for two weeks now and nobody is assigned to the bug. No priority/severity set although this is a user-facing regression and loss of functionality.
Comment 25•3 years ago
|
||
(In reply to max from comment #21)
KOI8-U instead of KOI8-R is not a real problem since messages do not have characters in the range where encodings differ from each other.
The detector never answers KOI8-R. That is, if the input looks like KOI8-R or KOI8-U, it answers KOI8-U without trying to distinguish between the two. Where the two encodings differ, it's 1) very unlikely that box drawing would be used on the Web or in email and 2) the failure mode of letters getting replaced with box segments is worse than the failure mode of box segments getting replaced with letters.
(In reply to José M. Muñoz from comment #22)
As far as I can tell, there is no detection for messages without charset, this code
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/mime/src/comi18n.cpp#37
isn't triggered during my debugging.
Interesting. It's rather unfortunate that there's code that gets updated for m-c changes without checking that it does what it's supposed to be doing.
Comment 26•3 years ago
|
||
Interesting. It's rather unfortunate that there's code that gets updated for m-c changes without checking that it does what it's supposed to be doing.
The other call site does get exercised so we know that charset detection works:
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/compose/src/nsMsgCompUtils.cpp#75
There's also a test: test_detectAttachmentCharset.js.
MIME_detect_charset and its callers are only called in mimetext.cpp, so only for plaintext mail:
https://searchfox.org/comm-central/search?q=MIME_detect_charset&redirect=false
No idea what that code does.
I tested with flowed and non-flowed plaintext mail and the code wasn't triggered, hence a missing charset wasn't replaced by a detected one, everything was displayed as if it were UTF-8.
A bit off-topic: Note that flowed plaintext mail is processed in mimetpfl.cpp (tpfl: t-ext p-lain fl-owed). FYI, MIME has that class model
https://searchfox.org/comm-central/rev/e5e75651c5fb70526ae298312d99bc37ffd1ad32/mailnews/mime/src/mimei.h#24
implemented in C, so every class implements its own methods or calls into the parent. All that's done by passing around function pointers. There was an attempt to straighten that out, see bug 1463289.
Comment 27•3 years ago
|
||
(In reply to José M. Muñoz from comment #22)
How did you do that since automatic detection is broken and has always been broken since it was never correctly implemented?
that's why I suggested a hacky solution to get the body and detect its charset.
I was afraid that charset detector behavior may be completely unreliable. So I created a small extension that is based on your "hacky" patch (by the way, const
or let
is missed before mimeMsg
). I did not touch menu entries, etc. I tried several phrases and fragments of messages encoded as koi8-r and cp1251. I do not have a convincing example that charset detector does not work (however I would still prefer to have charset menu as a fallback). Since I use Thunderbird from Ubuntu packages, I do not bother too much concerning version 91 (update from 68 was not so much time ago). I hope, it would not be required to exercise with iconv
to read a message.
var tst = class extends ExtensionCommon.ExtensionAPI {
getAPI(context) {
return { tst: {
async charset(raw) {
const mimeMsg = MimeParser.extractMimeMsg(raw, {
includeAttachments: false,
strformat: "binarystring",
});
let body = mimeMsg.parts[0].body;
let compUtils = Cc[
"@mozilla.org/messengercompose/computils;1"
].createInstance(Ci.nsIMsgCompUtils);
let charset = compUtils.detectCharset(body);
return { body, charset };
},
+
patched mimeParser.jsm
Comment 28•3 years ago
|
||
Thank you all for the discussion here. We've completed José's idea and included it in our product. John, we've also followed your advice and added another emitter:
https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/1713786-fix-repair-charset.patch
Comment 29•3 years ago
|
||
Hmm, this comes as a surprise, I didn't know there's another product now. Anyway, you're welcome and it's all FOSS anyway. We have a saying that goes (translated): "Competition stimulates business".
Comment 30•3 years ago
|
||
(In reply to José M. Muñoz from comment #12)
Is charset menu completely removed from thunderbird?
Yes, in bug 1704749, currently only visible in TB Daily. Did you see comment #5 mentioning https://addons.thunderbird.net/en-US/thunderbird/addon/charset-menu/?
IMHO the removal of features without hard facts to justify such removal is questionable (see discussion at the end of bug 1704749). Replacing the charset menu (mostly working, see bug 1677104) with a 100%-currently-not-working automatic feature is rather unfortunate. Let's hope it gets fixed before TB 91 goes to beta and later becomes the new TB 91 ESR. Currently fixing a wrong charset is completely broken in TB Daily for both message display (this bug here) and message source display (bug 1716059).
Looks like it didn't.
(In reply to José M. Muñoz from comment #24)
Sorry, slip of the keyboard, s/any/much/. Hard to understand how TB does its scheduling, it's been broken for two weeks now and nobody is assigned to the bug. No priority/severity set although this is a user-facing regression and loss of functionality.
So what is the solution? Back out bug 1704749?
Assignee | ||
Comment 31•3 years ago
|
||
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Comment 32•3 years ago
|
||
Pushed by alessandro@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/302c32860a1f
Do mime detection instead of charset override. r=mkmelin
Updated•3 years ago
|
Comment 33•3 years ago
|
||
nsIMessenger.setDocumentCharset() also corrected the subject display in the header pane (not the thread pane), your solution doesn't. Here's a test message from a Russian news group.
Comment 34•3 years ago
|
||
I don't understand why the charset override has been removed, instead of detection being just added. Detection doesn't always work. If the user can't force the charset, that means that some messages will now be forever unreadable.
Comment 35•3 years ago
|
||
(This should eventually go uplift to 91, but best to let it bake on beta for the full cycle.)
Comment 36•3 years ago
|
||
The linked bug 1738000 is, IMHO, an indication of why it is important to allow overriding the character set, and not rely on absolutely always getting the right charset.
Comment 37•3 years ago
|
||
Assignee | ||
Comment 38•3 years ago
|
||
[Approval Request Comment]
Regression caused by (bug #): bug 1704749
User impact if declined: "Repair Text Encoding" menu item does nothing
Testing completed (on c-c, etc.): tested on beta, c-c, has a browser test
Risk to taking this patch (and alternatives if risky): Worst thing I've seen (in previous versions of the patch) is that some actions could fail once the encoding has been repaired. Given this has been on beta for an entire cycle this seems fairly safe.
Comment 39•3 years ago
|
||
"Repair Text Encoding" don`t recognise Windows-1251
Comment 40•3 years ago
|
||
This is detected as windows-1251, isn't it?
Updated•3 years ago
|
Comment 41•3 years ago
|
||
Please don't change mail attachments to message/rfc822. BMO can't view those, see bug 1154521 comment #7.
Updated•3 years ago
|
Comment 43•3 years ago
|
||
Example win-1251 that don`t recognised correctly at 91.x
Comment 44•3 years ago
|
||
Show test file not correct
Comment 45•3 years ago
|
||
TB 91.3 is still broken. It works in the latest beta, right?
Comment 46•3 years ago
|
||
(In reply to Rachel Martin from comment #40)
Created attachment 9249603 [details]
windows-1251.emlThis is detected as windows-1251, isn't it?
Yes, and don`t show correct(In reply to Rachel Martin from comment #45)
TB 91.3 is still broken. It works in the latest beta, right?
Yes, at 95 it correct. Wait for release update.
Comment 47•3 years ago
|
||
(In reply to Artem from comment #43)
Created attachment 9249640 [details]
Inventory.emlExample win-1251 that don`t recognised correctly at 91.x
The detector detects this as windows-1251
; the problem is 91 not running the detector.
Comment 48•3 years ago
|
||
Comment on attachment 9249413 [details] [diff] [review]
bug1713786-esr91.diff
[Triage Comment]
Approved for esr91
And thanks for the test
Comment 49•3 years ago
|
||
Comment on attachment 9249413 [details] [diff] [review]
bug1713786-esr91.diff
[Triage Comment]
(really) Approved for esr91
Comment 50•3 years ago
|
||
bugherder uplift |
Thunderbird 91.3.1:
https://hg.mozilla.org/releases/comm-esr91/rev/87443dabfdc4
Comment 51•3 years ago
|
||
Backout Thunderbird 91.3.1:
https://hg.mozilla.org/releases/comm-esr91/rev/de65d8f20698
There is C++ code in comm-esr91 that has been removed from comm-central that fails to build with these changes applied. There's probably more.
ERROR - /builds/worker/checkouts/gecko/comm/mailnews/compose/src/nsMsgAttachmentHandler.cpp:579:52: error: cannot initialize a parameter of type 'bool' with an rvalue of type 'nullptr_t'
INFO - nullptr, nullptr,
INFO - ^~~~~~~
INFO - /builds/worker/workspace/obj-build/dist/include/nsIMsgMessageService.h:71:169: note: passing argument to parameter 'aCharsetOverride' here
INFO - JS_HAZ_CAN_RUN_SCRIPT NS_IMETHOD DisplayMessage(const char * aMessageURI, nsISupports *aDisplayConsumer, nsIMsgWindow *aMsgWindow, nsIUrlListener *aUrlListener, bool aCharsetOverride, nsIURI **aURL) = 0;
INFO - ^
INFO - 1 error generated.
ERROR - make[4]: *** [/builds/worker/checkouts/gecko/config/rules.mk:676: nsMsgAttachmentHandler.o] Error 1
INFO - make[4]: Leaving directory '/builds/worker/workspace/obj-build/comm/mailnews/compose/src'
ERROR - make[3]: *** [/builds/worker/checkouts/gecko/config/recurse.mk:72: comm/mailnews/compose/src/target-objects] Error 2
INFO - make[3]: *** Waiting for unfinished jobs....
Updated•3 years ago
|
Comment hidden (off-topic) |
Comment 53•3 years ago
|
||
Comment posted to the wrong bug? Timers? Rooms?
Comment 54•3 years ago
|
||
[Triage Comment]
Patch was previously approved for 91.3.1. Updated version fixes msMsgAttachmentHandler.cpp.
Comment 55•3 years ago
|
||
bugherder uplift |
Thunderbird 91.3.2:
https://hg.mozilla.org/releases/comm-esr91/rev/c45170901b9f
Comment 56•3 years ago
|
||
This message has two parts: Plain Text and Rich Text (Html).
Thunderbird can display correctly in Plain Text, but not in Rich Text.
"Repair Text Encoding" is not help.
Rich Text (html) part has "charset=big5" in <meta> tag but seems like Thunderbird ignored it.
Comment 57•3 years ago
|
||
Henri, could you please give us your opinion on this. The text/plain part is base64 encode and has these headers:
Content-Type: text/plain;
charset="big5"
Content-Transfer-Encoding: base64
The text/html part doesn't specify a charset. Opening the eml file in Notepad++, it detects Big5.
@Ryan: I don't think TB looks at the HTML content for charset detection. It would also be best to report this as a new bug.
Comment 58•3 years ago
|
||
Actually, the message displays fine in TB 95 beta where this code is present. So detection of the charset must have worked:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l3.18
This will eventually be in TB 91.x.
Comment 59•3 years ago
|
||
Sorry, I got confused. The original message doesn't display correctly, what does display correctly is the file I imported after writing it back with Notepad++. When comparing the original to that file, the compare utility complains about encoding errors.
So back to Henri. Is there anything strange in 'Example message in big5 encoding that "Repair Text Encoding" is not working in 91.3.2'?
Comment 60•3 years ago
|
||
(In reply to Rachel Martin from comment #59)
So back to Henri. Is there anything strange in 'Example message in big5 encoding that "Repair Text Encoding" is not working in 91.3.2'?
As far as I can tell, this is not a detector bug but an enveloping bug in whatever software generated the email. It appears that the HTML part has first been encoded as Big5 and then CRLF line breaks have been inserted every 998 bytes (so that every line with CRLF included is 1000 bytes).
These line breaks not only occur in places that are inappropriate for the HTML markup but also between the bytes of two-byte Big5 characters. As a result, the stream of bytes ends up not looking like Big5, since in has a Big5 lead byte followed by a carriage return even though to there should be a proper Big5 trail byte.
Since I'm not aware of this failure mode occurring on the Web and this seems specific to bad email enveloping CRLF injection, I don't intend to change the detector due to this, but I leave it up to you if you wish to explore some hacks to remove CRLFs occurring at regular byte intervals in Content-Transfer-Encoding: 8bit
before further processing.
Comment 61•3 years ago
|
||
Thanks for the insight, Henri, much appreciated. So the charset detection is thrown off the rails by CRLF being injected into two-byte Big5 characters. Notepad++ detects Big5 and apparently removes the broken characters, so once the file is written back, it's all valid Big5 with some missing characters, in fact, Notepad++ saves 10 bytes less.
The message was produced by "Microsoft CDO for Windows 2000", well, the year 2000 was 21 years ago, perhaps this sort of message won't happen all that often any more. Pity the facility to explicitly set a charset was removed, otherwise the user could have set the correct charset and read most of the message, apart from the five broken characters.
Reporter, if you want to take the issue further, please file a new bug with the summary:
Charset detection doesn't work on Big5-encoded message were some two-byte Big5 characters were injected with CRLF.
Needless to say that the client which produced the message is in error, much like TB before version 45 and before bug 1225904 was fixed. As some may recall, TB regularly corrupted CJK messages. These days TB mostly uses base64 encoding to work around long lines.
Comment 62•3 years ago
|
||
Reporter, if you want to take the issue further, please file a new bug etc.
Note that this situation is one of the reasons why a manual override of the encoding remains necessary for the foreseeable future. There are, and will remain, situations in which the auto-detection fails, or cannot succeed, for some reason.
Comment 63•3 years ago
|
||
Bug 1743059 was actually filed for the Big5 issue.
Comment hidden (off-topic) |
Description
•