Adapt Thunderbird to mozilla::EncodingDetector (port bug 1551276)
Categories
(MailNews Core :: Internationalization, task)
Tracking
(Not tracked)
People
(Reporter: hsivonen, Assigned: mkmelin)
References
Details
Attachments
(3 files, 3 obsolete files)
11.54 KB,
patch
|
mkmelin
:
review+
|
Details | Diff | Splinter Review |
1.16 KB,
patch
|
mkmelin
:
review+
|
Details | Diff | Splinter Review |
3.61 KB,
patch
|
jorgk-bmo
:
review+
|
Details | Diff | Splinter Review |
Bug 1551276 changes m-c encoding detection again. Sorry. It takes away the Cyrillic detector and adds mozilla::EncodingDetector
as a general-purpose detector. Thunderbird probably wants to run mozilla::EncodingDetector
in place of the current three-step detection using the Japanese, Cyrillic, and UTF-8 detection steps.
Assignee | ||
Comment 2•4 years ago
|
||
https://searchfox.org/comm-central/search?q=nsICharsetDetector&path=
https://searchfox.org/comm-central/search?q=nsICharsetDetectionObserver&path=
As well as https://hg.mozilla.org/mozilla-central/rev/96b8dc8fd886#l2.27 + below
(at least)
Assignee | ||
Comment 4•4 years ago
|
||
Assignee | ||
Comment 5•4 years ago
|
||
This compiles. Not sure about working.
Don't know why the if (encoding) gives a "error: attempt to use a deleted function"
Reporter | ||
Comment 6•4 years ago
|
||
Comment on attachment 9115746 [details] [diff] [review] bug1602816.patch Review of attachment 9115746 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/base/util/nsMsgUtils.cpp @@ +1822,5 @@ > + bool forceEncodingDetectorToCyrillicOnly = > + detectorName.EqualsLiteral("ruprob") || > + detectorName.EqualsLiteral("ukprob"); > + nsCString tld; > + if (forceEncodingDetectorToCyrillicOnly) { FWIW, in Firefox, I put in the pref to get this kind of behavior mainly as extra prudence to have a way to soft-unship in beta if something really problematic turns out. Since you are not checking `intl.charset.detector.ng.enabled`, this (and the bit about Japanese below) will end up constraining the detection for the Russian, Ukrainian, and Japanese localizations compared to everything else. If you don't want to have pref to maybe soft-unship in an unlikely scenario, I suggest just using `mozilla::EncodingDetector` without trying `mozilla::JapaneseDetector` and without examining the old `intl.charset.detector` pref. @@ +1853,4 @@ > } > + } else { > + Unused << detector->Feed(src, true); > + auto encoding = detector->Guess(tld, true); The call to `Guess` should be after the loop when the whole stream has been fed to the detector. @@ +1866,5 @@ > // Rewind file again. > seekStream->Seek(nsISeekableStream::NS_SEEK_SET, 0); > > // Check UTF-8. > if (IsStreamUTF8(inputStream)) { This check is obsoleted by passing `true` as the second argument of `mozilla::EncodingDetector::Guess()`. ::: mailnews/mime/src/comi18n.cpp @@ +43,5 @@ > + bool forceEncodingDetectorToCyrillicOnly = > + detectorName.EqualsLiteral("ruprob") || > + detectorName.EqualsLiteral("ukprob"); > + nsCString tld; > + if (forceEncodingDetectorToCyrillicOnly) { The same comment applies here. @@ +82,1 @@ > if (mozilla::IsUtf8(mozilla::MakeSpan(aBuf, aLength))) { This check is obsoleted, too.
Reporter | ||
Comment 7•4 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #5)
Created attachment 9115746 [details] [diff] [review]
bug1602816.patchThis compiles. Not sure about working.
Don't know why the if (encoding) gives a "error: attempt to use a deleted function"
You are null-checking a mozilla::NotNull
.
Assignee | ||
Comment 8•4 years ago
|
||
Thanks Henri, that certainly simplifies the code.
Assignee | ||
Updated•4 years ago
|
Reporter | ||
Comment 9•4 years ago
|
||
Comment on attachment 9115761 [details] [diff] [review] bug1602816.patch Review of attachment 9115761 [details] [diff] [review]: ----------------------------------------------------------------- r+ with the comments addressed. ::: mailnews/base/util/nsMsgUtils.cpp @@ +1820,5 @@ > + numRead = 0; > + while (NS_SUCCEEDED(inputStream->Read(buffer, sizeof(buffer), &numRead))) { > + mozilla::Span<const uint8_t> src = > + mozilla::AsBytes(mozilla::MakeSpan(buffer, sizeof(buffer))); > + Unused << detector->Feed(src, true); The second argument needs to be `false` in the loop. Then after the loop, there should be a call with an empty span and `true`. @@ +1827,3 @@ > } > } > + auto encoding = detector->Guess(nullptr, true); I'm a bit surprised that the compiler accepts `nullptr` here, but if it does, cool. @@ +1831,2 @@ > > + return !aCharset.IsEmpty() ? NS_OK : NS_ERROR_FAILURE; This will never be empty, so this line could be `return NS_OK;`. ::: mailnews/mime/src/comi18n.cpp @@ +46,2 @@ > > + return !aCharset.IsEmpty() ? NS_OK : NS_ERROR_FAILURE; Also just `return NS_OK;` here.
Assignee | ||
Comment 10•4 years ago
|
||
Thx!
Assignee | ||
Comment 11•4 years ago
|
||
Thx!
Comment 12•4 years ago
|
||
Have you checked the tests? Especially https://searchfox.org/comm-central/source/mailnews/compose/test/unit/test_detectAttachmentCharset.js.
Assignee | ||
Comment 13•4 years ago
|
||
No try run. I tried that locally now and it's not happy :/
Comment 14•4 years ago
|
||
No surprise. That test sets a localised pref to trigger some localised detection. No idea how this is going to work now. Ask Henri.
Comment 15•4 years ago
|
||
Would it make sense to just adapt the Firefox tests for this and eliminate the now beoken test?
Comment 16•4 years ago
|
||
These tests fail now:
mailnews/compose/test/unit/test_detectAttachmentCharset.js
mailnews/compose/test/unit/test_messageHeaders.js
mailnews/import/test/unit/test_shiftjis_csv.js
Comment 18•4 years ago
|
||
Thanks. So bug 1603982 is a dupe of this bug.
I took the patch here and applied it to my local tree and could "configure" the tree for build.
Unfortunately, then, I got another error which looks completely unrelated to the issue here...
Bug 1604016
build failure: dist/include/mozilla/TypeTraits.h:392:8: error: partial specialization ‘struct mozilla::IsConst<T>’ does not specialize any template arguments;
Comment 19•4 years ago
|
||
Trunk is double-busted, here and in bug 1603649.
Assignee | ||
Comment 20•4 years ago
|
||
(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #16)
These tests fail now:
mailnews/compose/test/unit/test_detectAttachmentCharset.js
Only testUTF16BE, testUTF16LE work. Not even testUTF8. So maybe there's something wrong with my patch.
Comment 21•4 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #20)
Only testUTF16BE, testUTF16LE work. Not even testUTF8. So maybe there's something wrong with my patch.
They seem to pass OK for me locally when the patch from bug 1603649 is in place...
hmm.. but the try build complains about unused static function IsStreamUTF8() from nsMsgUtils.cpp? Why is my local build letting it through?
in any case, let's try again, after removing that function:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=658df7fa60e715b9b7db799bf7d5325001ce98bc
(have to head out soon, will try and check back in later this evening)
Comment 22•4 years ago
|
||
Gah. test_detectAttachmentCharset.js
still failing on the try build (at least on OSX which is the only one finished as I head out the door!)
Comment 23•4 years ago
|
||
Ben, see comment #16.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 25•4 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=0db2d160793fe733dd7c60fdd2a12d5f3cd7182e&selectedJob=281328743 looks pretty decent so I'll land this in a bit to get builds going.
Let's keep it open to figure out what more needs doing here.
Comment 26•4 years ago
|
||
Pushed by mkmelin@iki.fi: https://hg.mozilla.org/comm-central/rev/028ba2e01818 Adapt Thunderbird to mozilla::EncodingDetector (port bug 1551276). r=hsivonen
Assignee | ||
Updated•4 years ago
|
Comment 27•4 years ago
|
||
Pushed by mkmelin@iki.fi: https://hg.mozilla.org/comm-central/rev/449906178fe3 temporarily disable charset detection related test: test_detectAttachmentCharset.js, test_messageHeaders.js, test_shiftjis_csv.js. rs=bustage-fix
Assignee | ||
Updated•4 years ago
|
Comment 28•4 years ago
|
||
Nice simple little fix - extra cruft was being passing into the charset detection, which was swamping the algorithm.
I think this fixes up:
mailnews/compose/test/unit/test_detectAttachmentCharset.js
mailnews/import/test/unit/test_shiftjis_csv.js
..and the utf-8 test in:
mailnews/compose/test/unit/test_detectAttachmentCharset.js
But:
testWindows1252()
fails - the old encoder didn't detect windows-1252
but the new one does. It seems reasonable to change the expected outcome on this, right?
testKOI8R()
fails: I think the test is relying on a preferences hint which is no longer respected. Not quite sure about this one.
The test is:
async function testKOI8R() {
Services.prefs.setStringPref("intl.charset.detector", "ruprob");
await createMessage(do_get_file("data/test-KOI8-R.txt"));
checkAttachmentCharset("KOI8-R");
}
I don't know how significant the intl.charset.detector
hint is... (the new encoder returns KOI8-U
rather than KOI8-R
). I need to do a little more digging.
Reporter | ||
Comment 29•4 years ago
|
||
(In reply to Ben Campbell from comment #28)
Nice simple little fix - extra cruft was being passing into the charset detection, which was swamping the algorithm.
Sorry about not catching that earlier.
testWindows1252()
fails - the old encoder didn't detectwindows-1252
but the new one does. It seems reasonable to change the expected outcome on this, right?
Yes.
testKOI8R()
fails: I think the test is relying on a preferences hint which is no longer respected. Not quite sure about this one.
...
I don't know how significant theintl.charset.detector
hint is... (the new encoder returnsKOI8-U
rather thanKOI8-R
). I need to do a little more digging.
As documented at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/README.md#82 , the new detector always detects both KOI8-R and KOI8-U as KOI8-U. The difference between the two encodings is that the latter replaces some box drawing characters with letters. Since box drawing is generally not a thing on the Web, not distinguishing the two avoids some reloads in the Web case, and having a couple of box drawing characters replaced by letters is not a disaster.
Comment 30•4 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #29)
As documented at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/README.md#82 , the new detector always detects both KOI8-R and KOI8-U as KOI8-U. The difference between the two encodings is that the latter replaces some box drawing characters with letters. Since box drawing is generally not a thing on the Web, not distinguishing the two avoids some reloads in the Web case, and having a couple of box drawing characters replaced by letters is not a disaster.
Yep, that's more or less what I figured - thanks Henri!
(hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)
I'll update the tests and re-enable them.
Comment 31•4 years ago
|
||
Oh, that's infuriating. The windows-1252
test comes back as windows-1254
.
The test data is: "ä - This is text in windows-1252.", and it's perfectly valid windows-1254
, but it doesn't seem the obvious choice :-(
I suspect there's something more going on there, but we're kind of at the point of diminishing returns now, so I just changed the test data to "FensterStraße" which does come back as windows-1252. Cough.
Comment 32•4 years ago
|
||
Henri, how is it possible to reliably detect any ANSI encoding like windows-1252. How do you know it's not Greek or Polish for example? Or Turkish (windows-1254). Anecdotal evidence shows that Turkish uses more öü than German, they put two into every word, like a local Turkish bike shop is called "Ürün Bikes" ;-)
Comment 33•4 years ago
|
||
Comment on attachment 9116334 [details] [diff] [review] 1602816-reinstate-chardet-tests-1.patch Review of attachment 9116334 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/compose/test/unit/data/test-windows-1252.txt @@ +1,1 @@ > +FensterStraße Well, maybe use a real world example here, like: Buenos días - François a été à Paris - Budapester Straße, Berlin. Maybe only the ß does the trick and the accents won't.
Reporter | ||
Comment 34•4 years ago
•
|
||
(In reply to Ben Campbell from comment #30)
(hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)
The encoding that the Web calls "KOI8-U" is actually KOI8-RU. Somewhat like ISO-8859-1 is an alias of windows-1252, and EUC-KR really means with windows-949 extensions included.
(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #32)
Henri, how is it possible to reliably detect any ANSI encoding like windows-1252.
The reliability increases as the length of input increases. Indeed, a single letter ä without any word context for it is not a great test case.
How do you know it's not Greek
Greek words are completely non-ASCII. windows-1252 words tend to mix ASCII and non-ASCII, and four or more consecutive non-ASCII is rare.
or Polish for example?
Less well, but it's pretty good after a couple of sentences.
Or Turkish (windows-1254)
Even less well, but it's pretty OK after a couple of sentences. windows-1252 has two language models: One for Icelandic and Faroese and another that merges everything else. There are Turkish letters that overlap some Icelandic/Faroese-only letters which are penalized in the "everything else" windows-1252 model.
Reporter | ||
Comment 35•4 years ago
|
||
"Model" here means relative frequencies of character pairs where at least one participant in the pair is non-ASCII.
Comment 36•4 years ago
|
||
Henri, thanks for the clarification. Good to know that there's a huge amount of expertise behind all this :-) - Sorry about the bad example of Greek. It was motivated by the fact that Windows-1253 still includes ASCII, and that the highbit characters could also come from any other ANSI table, but of course you end up with nonsensical words only consisting of accents. Just out of interest, where can I see the code for this detection? I'd like to see how Greek is detected.
Reporter | ||
Comment 37•4 years ago
|
||
The code for Greek is at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/src/lib.rs#168 and the data is at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/src/data.rs#655 . (255 is a penalty marker; the highest score is 254. You can see that the final sigma followed by any letter yields the penalty marker. The comments that labels the rows and columns give the first character of a character class. You can find the classes at https://github.com/hsivonen/detector_char_classes/blob/master/src/lib.rs .)
Assignee | ||
Comment 38•4 years ago
|
||
Comment on attachment 9116307 [details] [diff] [review] 1602816-fix-sniff-buffersize.patch Review of attachment 9116307 [details] [diff] [review]: ----------------------------------------------------------------- Doh, thanks for finding that! r=mkmelin
Comment 39•4 years ago
|
||
Pushed by mkmelin@iki.fi: https://hg.mozilla.org/comm-central/rev/fb67882ad9b9 followup - Fix wrong buffersize passed into EncodingDetector, when guessing attachment charsets. r=mkmelin https://hg.mozilla.org/comm-central/rev/69b26858a259 reinstate charset detection related tests: test_detectAttachmentCharset.js, test_messageHeaders.js, test_shiftjis_csv.js.
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Description
•