Closed Bug 1602816 Opened 4 months ago Closed 4 months ago

Adapt Thunderbird to mozilla::EncodingDetector (port bug 1551276)

Categories

(MailNews Core :: Internationalization, task, blocker)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 73.0

People

(Reporter: hsivonen, Assigned: mkmelin)

References

Details

(Whiteboard: [thunderbird-disabled-test] )

Attachments

(3 files, 3 obsolete files)

Bug 1551276 changes m-c encoding detection again. Sorry. It takes away the Cyrillic detector and adds mozilla::EncodingDetector as a general-purpose detector. Thunderbird probably wants to run mozilla::EncodingDetector in place of the current three-step detection using the Japanese, Cyrillic, and UTF-8 detection steps.

Duplicate of this bug: 1603687
Severity: normal → blocker
Summary: Adapt to mozilla::EncodingDetector → Adapt Thunderbird to mozilla::EncodingDetector (port bug 1551276)

Going to look at this.

Assignee: nobody → mkmelin+mozilla
Attached patch bug1602816.patch (obsolete) — Splinter Review

This compiles. Not sure about working.

Don't know why the if (encoding) gives a "error: attempt to use a deleted function"

Comment on attachment 9115746 [details] [diff] [review]
bug1602816.patch

Review of attachment 9115746 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/nsMsgUtils.cpp
@@ +1822,5 @@
> +   bool forceEncodingDetectorToCyrillicOnly =
> +      detectorName.EqualsLiteral("ruprob") ||
> +      detectorName.EqualsLiteral("ukprob");
> +  nsCString tld;
> +  if (forceEncodingDetectorToCyrillicOnly) {

FWIW, in Firefox, I put in the pref to get this kind of behavior mainly as extra prudence to have a way to soft-unship in beta if something really problematic turns out. Since you are not checking `intl.charset.detector.ng.enabled`, this (and the bit about Japanese below) will end up constraining the detection for the Russian, Ukrainian, and Japanese localizations compared to everything else.

If you don't want to have pref to maybe soft-unship in an unlikely scenario, I suggest just using `mozilla::EncodingDetector` without trying `mozilla::JapaneseDetector` and without examining the old `intl.charset.detector` pref.

@@ +1853,4 @@
>        }
> +    } else {
> +      Unused << detector->Feed(src, true);
> +      auto encoding = detector->Guess(tld, true);

The call to `Guess` should be after the loop when the whole stream has been fed to the detector.

@@ +1866,5 @@
>    // Rewind file again.
>    seekStream->Seek(nsISeekableStream::NS_SEEK_SET, 0);
>  
>    // Check UTF-8.
>    if (IsStreamUTF8(inputStream)) {

This check is obsoleted by passing `true` as the second argument of `mozilla::EncodingDetector::Guess()`.

::: mailnews/mime/src/comi18n.cpp
@@ +43,5 @@
> +   bool forceEncodingDetectorToCyrillicOnly =
> +      detectorName.EqualsLiteral("ruprob") ||
> +      detectorName.EqualsLiteral("ukprob");
> +  nsCString tld;
> +  if (forceEncodingDetectorToCyrillicOnly) {

The same comment applies here.

@@ +82,1 @@
>    if (mozilla::IsUtf8(mozilla::MakeSpan(aBuf, aLength))) {

This check is obsoleted, too.

(In reply to Magnus Melin [:mkmelin] from comment #5)

Created attachment 9115746 [details] [diff] [review]
bug1602816.patch

This compiles. Not sure about working.

Don't know why the if (encoding) gives a "error: attempt to use a deleted function"

You are null-checking a mozilla::NotNull.

Attached patch bug1602816.patch (obsolete) — Splinter Review

Thanks Henri, that certainly simplifies the code.

Attachment #9115746 - Attachment is obsolete: true
Attachment #9115761 - Flags: review?(hsivonen)
Status: NEW → ASSIGNED
Comment on attachment 9115761 [details] [diff] [review]
bug1602816.patch

Review of attachment 9115761 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the comments addressed.

::: mailnews/base/util/nsMsgUtils.cpp
@@ +1820,5 @@
> +  numRead = 0;
> +  while (NS_SUCCEEDED(inputStream->Read(buffer, sizeof(buffer), &numRead))) {
> +    mozilla::Span<const uint8_t> src =
> +        mozilla::AsBytes(mozilla::MakeSpan(buffer, sizeof(buffer)));
> +    Unused << detector->Feed(src, true);

The second argument needs to be `false` in the loop. Then after the loop, there should be a call with an empty span and `true`.

@@ +1827,3 @@
>      }
>    }
> +  auto encoding = detector->Guess(nullptr, true);

I'm a bit surprised that the compiler accepts `nullptr` here, but if it does, cool.

@@ +1831,2 @@
>  
> +  return !aCharset.IsEmpty() ? NS_OK : NS_ERROR_FAILURE;

This will never be empty, so this line could be `return NS_OK;`.

::: mailnews/mime/src/comi18n.cpp
@@ +46,2 @@
>  
> +  return !aCharset.IsEmpty() ? NS_OK : NS_ERROR_FAILURE;

Also just `return NS_OK;` here.
Attachment #9115761 - Flags: review?(hsivonen) → review+
Attached patch bug1602816.patch (obsolete) — Splinter Review

Thx!

Attachment #9115761 - Attachment is obsolete: true
Attachment #9115845 - Flags: review+
Attached patch bug1602816.patchSplinter Review

Thx!

Attachment #9115845 - Attachment is obsolete: true
Attachment #9115846 - Flags: review+

No try run. I tried that locally now and it's not happy :/

No surprise. That test sets a localised pref to trigger some localised detection. No idea how this is going to work now. Ask Henri.

Would it make sense to just adapt the Firefox tests for this and eliminate the now beoken test?

These tests fail now:
mailnews/compose/test/unit/test_detectAttachmentCharset.js
mailnews/compose/test/unit/test_messageHeaders.js
mailnews/import/test/unit/test_shiftjis_csv.js

See Also: → 1603982
Duplicate of this bug: 1603982

Thanks. So bug 1603982 is a dupe of this bug.

I took the patch here and applied it to my local tree and could "configure" the tree for build.
Unfortunately, then, I got another error which looks completely unrelated to the issue here...
Bug 1604016
build failure: dist/include/mozilla/TypeTraits.h:392:8: error: partial specialization ‘struct mozilla::IsConst<T>’ does not specialize any template arguments;

Trunk is double-busted, here and in bug 1603649.

(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #16)

These tests fail now:
mailnews/compose/test/unit/test_detectAttachmentCharset.js

Only testUTF16BE, testUTF16LE work. Not even testUTF8. So maybe there's something wrong with my patch.

(In reply to Magnus Melin [:mkmelin] from comment #20)

Only testUTF16BE, testUTF16LE work. Not even testUTF8. So maybe there's something wrong with my patch.

They seem to pass OK for me locally when the patch from bug 1603649 is in place...

hmm.. but the try build complains about unused static function IsStreamUTF8() from nsMsgUtils.cpp? Why is my local build letting it through?
in any case, let's try again, after removing that function:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=658df7fa60e715b9b7db799bf7d5325001ce98bc

(have to head out soon, will try and check back in later this evening)

Gah. test_detectAttachmentCharset.js still failing on the try build (at least on OSX which is the only one finished as I head out the door!)

Ben, see comment #16.

https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=0db2d160793fe733dd7c60fdd2a12d5f3cd7182e&selectedJob=281328743 looks pretty decent so I'll land this in a bit to get builds going.

Let's keep it open to figure out what more needs doing here.

Keywords: leave-open
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/028ba2e01818
Adapt Thunderbird to mozilla::EncodingDetector (port bug 1551276). r=hsivonen
Target Milestone: --- → Thunderbird 73.0
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/449906178fe3
temporarily disable charset detection related test: test_detectAttachmentCharset.js, test_messageHeaders.js, test_shiftjis_csv.js. rs=bustage-fix
Whiteboard: [thunderbird-disabled-test]

Nice simple little fix - extra cruft was being passing into the charset detection, which was swamping the algorithm.
I think this fixes up:
mailnews/compose/test/unit/test_detectAttachmentCharset.js
mailnews/import/test/unit/test_shiftjis_csv.js

..and the utf-8 test in:
mailnews/compose/test/unit/test_detectAttachmentCharset.js

But:

testWindows1252() fails - the old encoder didn't detect windows-1252 but the new one does. It seems reasonable to change the expected outcome on this, right?

testKOI8R() fails: I think the test is relying on a preferences hint which is no longer respected. Not quite sure about this one.
The test is:

async function testKOI8R() {
  Services.prefs.setStringPref("intl.charset.detector", "ruprob");
  await createMessage(do_get_file("data/test-KOI8-R.txt"));
  checkAttachmentCharset("KOI8-R");
}

I don't know how significant the intl.charset.detector hint is... (the new encoder returns KOI8-U rather than KOI8-R). I need to do a little more digging.

Attachment #9116307 - Flags: review?(mkmelin+mozilla)

(In reply to Ben Campbell from comment #28)

Nice simple little fix - extra cruft was being passing into the charset detection, which was swamping the algorithm.

Sorry about not catching that earlier.

testWindows1252() fails - the old encoder didn't detect windows-1252 but the new one does. It seems reasonable to change the expected outcome on this, right?

Yes.

testKOI8R() fails: I think the test is relying on a preferences hint which is no longer respected. Not quite sure about this one.
...
I don't know how significant the intl.charset.detector hint is... (the new encoder returns KOI8-U rather than KOI8-R). I need to do a little more digging.

As documented at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/README.md#82 , the new detector always detects both KOI8-R and KOI8-U as KOI8-U. The difference between the two encodings is that the latter replaces some box drawing characters with letters. Since box drawing is generally not a thing on the Web, not distinguishing the two avoids some reloads in the Web case, and having a couple of box drawing characters replaced by letters is not a disaster.

(In reply to Henri Sivonen (:hsivonen) from comment #29)

As documented at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/README.md#82 , the new detector always detects both KOI8-R and KOI8-U as KOI8-U. The difference between the two encodings is that the latter replaces some box drawing characters with letters. Since box drawing is generally not a thing on the Web, not distinguishing the two avoids some reloads in the Web case, and having a couple of box drawing characters replaced by letters is not a disaster.

Yep, that's more or less what I figured - thanks Henri!
(hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)

I'll update the tests and re-enable them.

Oh, that's infuriating. The windows-1252 test comes back as windows-1254.
The test data is: "ä - This is text in windows-1252.", and it's perfectly valid windows-1254, but it doesn't seem the obvious choice :-(
I suspect there's something more going on there, but we're kind of at the point of diminishing returns now, so I just changed the test data to "FensterStraße" which does come back as windows-1252. Cough.

Attachment #9116334 - Flags: review?(mkmelin+mozilla)

Henri, how is it possible to reliably detect any ANSI encoding like windows-1252. How do you know it's not Greek or Polish for example? Or Turkish (windows-1254). Anecdotal evidence shows that Turkish uses more öü than German, they put two into every word, like a local Turkish bike shop is called "Ürün Bikes" ;-)

Comment on attachment 9116334 [details] [diff] [review]
1602816-reinstate-chardet-tests-1.patch

Review of attachment 9116334 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/compose/test/unit/data/test-windows-1252.txt
@@ +1,1 @@
> +FensterStraße

Well, maybe use a real world example here, like:
Buenos días - François a été à Paris - Budapester Straße, Berlin.
Maybe only the ß does the trick and the accents won't.
Attachment #9116334 - Flags: review?(mkmelin+mozilla) → review+

(In reply to Ben Campbell from comment #30)

(hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)

The encoding that the Web calls "KOI8-U" is actually KOI8-RU. Somewhat like ISO-8859-1 is an alias of windows-1252, and EUC-KR really means with windows-949 extensions included.

(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #32)

Henri, how is it possible to reliably detect any ANSI encoding like windows-1252.

The reliability increases as the length of input increases. Indeed, a single letter ä without any word context for it is not a great test case.

How do you know it's not Greek

Greek words are completely non-ASCII. windows-1252 words tend to mix ASCII and non-ASCII, and four or more consecutive non-ASCII is rare.

or Polish for example?

Less well, but it's pretty good after a couple of sentences.

Or Turkish (windows-1254)

Even less well, but it's pretty OK after a couple of sentences. windows-1252 has two language models: One for Icelandic and Faroese and another that merges everything else. There are Turkish letters that overlap some Icelandic/Faroese-only letters which are penalized in the "everything else" windows-1252 model.

"Model" here means relative frequencies of character pairs where at least one participant in the pair is non-ASCII.

Henri, thanks for the clarification. Good to know that there's a huge amount of expertise behind all this :-) - Sorry about the bad example of Greek. It was motivated by the fact that Windows-1253 still includes ASCII, and that the highbit characters could also come from any other ANSI table, but of course you end up with nonsensical words only consisting of accents. Just out of interest, where can I see the code for this detection? I'd like to see how Greek is detected.

The code for Greek is at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/src/lib.rs#168 and the data is at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/src/data.rs#655 . (255 is a penalty marker; the highest score is 254. You can see that the final sigma followed by any letter yields the penalty marker. The comments that labels the rows and columns give the first character of a character class. You can find the classes at https://github.com/hsivonen/detector_char_classes/blob/master/src/lib.rs .)

Comment on attachment 9116307 [details] [diff] [review]
1602816-fix-sniff-buffersize.patch

Review of attachment 9116307 [details] [diff] [review]:
-----------------------------------------------------------------

Doh, thanks for finding that! r=mkmelin
Attachment #9116307 - Flags: review?(mkmelin+mozilla) → review+
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/fb67882ad9b9
followup - Fix wrong buffersize passed into EncodingDetector, when guessing attachment charsets. r=mkmelin
https://hg.mozilla.org/comm-central/rev/69b26858a259
reinstate charset detection related tests: test_detectAttachmentCharset.js, test_messageHeaders.js, test_shiftjis_csv.js.
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.