Closed Bug 1602816 Opened 6 years ago Closed 6 years ago

Adapt Thunderbird to mozilla::EncodingDetector (port bug 1551276)

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

Thunderbird 73.0

People

(Reporter: hsivonen, Assigned: mkmelin)

References

Details

Attachments

(3 files, 3 obsolete files)

bug1602816.patch 6 years ago Magnus Melin [:mkmelin] 12.42 KB, patch		Details \| Diff \| Splinter Review
bug1602816.patch 6 years ago Magnus Melin [:mkmelin] 11.56 KB, patch	hsivonen : review+	Details \| Diff \| Splinter Review
bug1602816.patch 6 years ago Magnus Melin [:mkmelin] 11.54 KB, patch	mkmelin : review+	Details \| Diff \| Splinter Review
bug1602816.patch 6 years ago Magnus Melin [:mkmelin] 11.54 KB, patch	mkmelin : review+	Details \| Diff \| Splinter Review
1602816-fix-sniff-buffersize.patch 6 years ago Ben Campbell 1.16 KB, patch	mkmelin : review+	Details \| Diff \| Splinter Review
1602816-reinstate-chardet-tests-1.patch 6 years ago Ben Campbell 3.61 KB, patch	jorgk-bmo : review+	Details \| Diff \| Splinter Review

Henri Sivonen (:hsivonen)

Reporter

Description

•

6 years ago

Bug 1551276 changes m-c encoding detection again. Sorry. It takes away the Cyrillic detector and adds mozilla::EncodingDetector as a general-purpose detector. Thunderbird probably wants to run mozilla::EncodingDetector in place of the current three-step detection using the Japanese, Cyrillic, and UTF-8 detection steps.

Magnus Melin [:mkmelin]

Assignee

Comment 2

•

6 years ago

https://searchfox.org/comm-central/search?q=nsICharsetDetector&path=
https://searchfox.org/comm-central/search?q=nsICharsetDetectionObserver&path=

As well as https://hg.mozilla.org/mozilla-central/rev/96b8dc8fd886#l2.27 + below

(at least)

Severity: normal → blocker

Summary: Adapt to mozilla::EncodingDetector → Adapt Thunderbird to mozilla::EncodingDetector (port bug 1551276)

Magnus Melin [:mkmelin]

Assignee

Comment 3

•

6 years ago

Going to look at this.

Assignee: nobody → mkmelin+mozilla

Magnus Melin [:mkmelin]

Assignee

Comment 4

•

6 years ago

Also https://searchfox.org/comm-central/search?q=nsCyrillicDetector&case=false&regexp=false&path=

Magnus Melin [:mkmelin]

Assignee

Comment 5

•

6 years ago

Attached patch bug1602816.patch (obsolete) — Details — Splinter Review

This compiles. Not sure about working.

Don't know why the if (encoding) gives a "error: attempt to use a deleted function"

Henri Sivonen (:hsivonen)

Reporter

Comment 6

•

6 years ago

Comment on attachment 9115746 [details] [diff] [review] bug1602816.patch Review of attachment 9115746 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/base/util/nsMsgUtils.cpp @@ +1822,5 @@ > + bool forceEncodingDetectorToCyrillicOnly = > + detectorName.EqualsLiteral("ruprob") || > + detectorName.EqualsLiteral("ukprob"); > + nsCString tld; > + if (forceEncodingDetectorToCyrillicOnly) { FWIW, in Firefox, I put in the pref to get this kind of behavior mainly as extra prudence to have a way to soft-unship in beta if something really problematic turns out. Since you are not checking `intl.charset.detector.ng.enabled`, this (and the bit about Japanese below) will end up constraining the detection for the Russian, Ukrainian, and Japanese localizations compared to everything else. If you don't want to have pref to maybe soft-unship in an unlikely scenario, I suggest just using `mozilla::EncodingDetector` without trying `mozilla::JapaneseDetector` and without examining the old `intl.charset.detector` pref. @@ +1853,4 @@ > } > + } else { > + Unused << detector->Feed(src, true); > + auto encoding = detector->Guess(tld, true); The call to `Guess` should be after the loop when the whole stream has been fed to the detector. @@ +1866,5 @@ > // Rewind file again. > seekStream->Seek(nsISeekableStream::NS_SEEK_SET, 0); > > // Check UTF-8. > if (IsStreamUTF8(inputStream)) { This check is obsoleted by passing `true` as the second argument of `mozilla::EncodingDetector::Guess()`. ::: mailnews/mime/src/comi18n.cpp @@ +43,5 @@ > + bool forceEncodingDetectorToCyrillicOnly = > + detectorName.EqualsLiteral("ruprob") || > + detectorName.EqualsLiteral("ukprob"); > + nsCString tld; > + if (forceEncodingDetectorToCyrillicOnly) { The same comment applies here. @@ +82,1 @@ > if (mozilla::IsUtf8(mozilla::MakeSpan(aBuf, aLength))) { This check is obsoleted, too.

Henri Sivonen (:hsivonen)

Reporter

Comment 7

•

6 years ago

(In reply to Magnus Melin [:mkmelin] from comment #5)

Created attachment 9115746 [details] [diff] [review]
bug1602816.patch

This compiles. Not sure about working.

Don't know why the if (encoding) gives a "error: attempt to use a deleted function"

You are null-checking a mozilla::NotNull.

Magnus Melin [:mkmelin]

Assignee

Comment 8

•

6 years ago

Attached patch bug1602816.patch (obsolete) — Details — Splinter Review

Thanks Henri, that certainly simplifies the code.

Attachment #9115746 - Attachment is obsolete: true

Attachment #9115761 - Flags: review?(hsivonen)

Magnus Melin [:mkmelin]

Assignee

Updated

•

6 years ago

Status: NEW → ASSIGNED

Henri Sivonen (:hsivonen)

Reporter

Comment 9

•

6 years ago

Comment on attachment 9115761 [details] [diff] [review] bug1602816.patch Review of attachment 9115761 [details] [diff] [review]: ----------------------------------------------------------------- r+ with the comments addressed. ::: mailnews/base/util/nsMsgUtils.cpp @@ +1820,5 @@ > + numRead = 0; > + while (NS_SUCCEEDED(inputStream->Read(buffer, sizeof(buffer), &numRead))) { > + mozilla::Span<const uint8_t> src = > + mozilla::AsBytes(mozilla::MakeSpan(buffer, sizeof(buffer))); > + Unused << detector->Feed(src, true); The second argument needs to be `false` in the loop. Then after the loop, there should be a call with an empty span and `true`. @@ +1827,3 @@ > } > } > + auto encoding = detector->Guess(nullptr, true); I'm a bit surprised that the compiler accepts `nullptr` here, but if it does, cool. @@ +1831,2 @@ > > + return !aCharset.IsEmpty() ? NS_OK : NS_ERROR_FAILURE; This will never be empty, so this line could be `return NS_OK;`. ::: mailnews/mime/src/comi18n.cpp @@ +46,2 @@ > > + return !aCharset.IsEmpty() ? NS_OK : NS_ERROR_FAILURE; Also just `return NS_OK;` here.

Attachment #9115761 - Flags: review?(hsivonen) → review+

Magnus Melin [:mkmelin]

Assignee

Comment 10

•

6 years ago

Attached patch bug1602816.patch (obsolete) — Details — Splinter Review

Thx!

Attachment #9115761 - Attachment is obsolete: true

Attachment #9115845 - Flags: review+

Magnus Melin [:mkmelin]

Assignee

Comment 11

•

6 years ago

Attached patch bug1602816.patch — Details — Splinter Review

Thx!

Attachment #9115845 - Attachment is obsolete: true

Attachment #9115846 - Flags: review+

Jorg K (CEST = GMT+2)

Comment 12

•

6 years ago

Have you checked the tests? Especially https://searchfox.org/comm-central/source/mailnews/compose/test/unit/test_detectAttachmentCharset.js.

Magnus Melin [:mkmelin]

Assignee

Comment 13

•

6 years ago

No try run. I tried that locally now and it's not happy :/

Jorg K (CEST = GMT+2)

Comment 14

•

6 years ago

No surprise. That test sets a localised pref to trigger some localised detection. No idea how this is going to work now. Ask Henri.

mac198442

Comment 15

•

6 years ago

Would it make sense to just adapt the Firefox tests for this and eliminate the now beoken test?

Jorg K (CEST = GMT+2)

Comment 16

•

6 years ago

These tests fail now:
mailnews/compose/test/unit/test_detectAttachmentCharset.js
mailnews/compose/test/unit/test_messageHeaders.js
mailnews/import/test/unit/test_shiftjis_csv.js

ISHIKAWA, Chiaki

Updated

•

6 years ago

Comment 18

•

6 years ago

Thanks. So bug 1603982 is a dupe of this bug.

I took the patch here and applied it to my local tree and could "configure" the tree for build.
Unfortunately, then, I got another error which looks completely unrelated to the issue here...
Bug 1604016
build failure: dist/include/mozilla/TypeTraits.h:392:8: error: partial specialization ‘struct mozilla::IsConst<T>’ does not specialize any template arguments;

Jorg K (CEST = GMT+2)

Comment 19

•

6 years ago

Trunk is double-busted, here and in bug 1603649.

Magnus Melin [:mkmelin]

Assignee

Comment 20

•

6 years ago

(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #16)

These tests fail now:
mailnews/compose/test/unit/test_detectAttachmentCharset.js

Only testUTF16BE, testUTF16LE work. Not even testUTF8. So maybe there's something wrong with my patch.

Ben Campbell

Comment 21

•

6 years ago

(In reply to Magnus Melin [:mkmelin] from comment #20)

Only testUTF16BE, testUTF16LE work. Not even testUTF8. So maybe there's something wrong with my patch.

They seem to pass OK for me locally when the patch from bug 1603649 is in place...

hmm.. but the try build complains about unused static function IsStreamUTF8() from nsMsgUtils.cpp? Why is my local build letting it through?
in any case, let's try again, after removing that function:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=658df7fa60e715b9b7db799bf7d5325001ce98bc

(have to head out soon, will try and check back in later this evening)

Ben Campbell

Comment 22

•

6 years ago

Gah. test_detectAttachmentCharset.js still failing on the try build (at least on OSX which is the only one finished as I head out the door!)

Jorg K (CEST = GMT+2)

Comment 23

•

6 years ago

Ben, see comment #16.

Comment hidden (Intermittent Failures Robot)

Magnus Melin [:mkmelin]

Assignee

Comment 25

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=0db2d160793fe733dd7c60fdd2a12d5f3cd7182e&selectedJob=281328743 looks pretty decent so I'll land this in a bit to get builds going.

Let's keep it open to figure out what more needs doing here.

Keywords: leave-open

Pulsebot

Comment 26

•

6 years ago

Pushed by mkmelin@iki.fi: https://hg.mozilla.org/comm-central/rev/028ba2e01818 Adapt Thunderbird to mozilla::EncodingDetector (port bug 1551276). r=hsivonen

Magnus Melin [:mkmelin]

Assignee

Updated

•

6 years ago

Target Milestone: --- → Thunderbird 73.0

Pulsebot

Comment 27

•

6 years ago

Pushed by mkmelin@iki.fi: https://hg.mozilla.org/comm-central/rev/449906178fe3 temporarily disable charset detection related test: test_detectAttachmentCharset.js, test_messageHeaders.js, test_shiftjis_csv.js. rs=bustage-fix

Magnus Melin [:mkmelin]

Assignee

Updated

•

6 years ago

Whiteboard: [thunderbird-disabled-test]

Ben Campbell

Comment 28

•

6 years ago

Attached patch 1602816-fix-sniff-buffersize.patch — Details — Splinter Review

Nice simple little fix - extra cruft was being passing into the charset detection, which was swamping the algorithm.
I think this fixes up:
mailnews/compose/test/unit/test_detectAttachmentCharset.js
mailnews/import/test/unit/test_shiftjis_csv.js

..and the utf-8 test in:
mailnews/compose/test/unit/test_detectAttachmentCharset.js

But:

testWindows1252() fails - the old encoder didn't detect windows-1252 but the new one does. It seems reasonable to change the expected outcome on this, right?

testKOI8R() fails: I think the test is relying on a preferences hint which is no longer respected. Not quite sure about this one.
The test is:

async function testKOI8R() {
  Services.prefs.setStringPref("intl.charset.detector", "ruprob");
  await createMessage(do_get_file("data/test-KOI8-R.txt"));
  checkAttachmentCharset("KOI8-R");
}

I don't know how significant the intl.charset.detector hint is... (the new encoder returns KOI8-U rather than KOI8-R). I need to do a little more digging.

Attachment #9116307 - Flags: review?(mkmelin+mozilla)

Henri Sivonen (:hsivonen)

Reporter

Comment 29

•

6 years ago

(In reply to Ben Campbell from comment #28)

Nice simple little fix - extra cruft was being passing into the charset detection, which was swamping the algorithm.

Sorry about not catching that earlier.

testWindows1252() fails - the old encoder didn't detect windows-1252 but the new one does. It seems reasonable to change the expected outcome on this, right?

Yes.

testKOI8R() fails: I think the test is relying on a preferences hint which is no longer respected. Not quite sure about this one.
...
I don't know how significant the intl.charset.detector hint is... (the new encoder returns KOI8-U rather than KOI8-R). I need to do a little more digging.

As documented at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/README.md#82 , the new detector always detects both KOI8-R and KOI8-U as KOI8-U. The difference between the two encodings is that the latter replaces some box drawing characters with letters. Since box drawing is generally not a thing on the Web, not distinguishing the two avoids some reloads in the Web case, and having a couple of box drawing characters replaced by letters is not a disaster.

Ben Campbell

Comment 30

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #29)

As documented at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/README.md#82 , the new detector always detects both KOI8-R and KOI8-U as KOI8-U. The difference between the two encodings is that the latter replaces some box drawing characters with letters. Since box drawing is generally not a thing on the Web, not distinguishing the two avoids some reloads in the Web case, and having a couple of box drawing characters replaced by letters is not a disaster.

Yep, that's more or less what I figured - thanks Henri!
(hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)

I'll update the tests and re-enable them.

Ben Campbell

Comment 31

•

6 years ago

Attached patch 1602816-reinstate-chardet-tests-1.patch — Details — Splinter Review

Oh, that's infuriating. The windows-1252 test comes back as windows-1254.
The test data is: "ä - This is text in windows-1252.", and it's perfectly valid windows-1254, but it doesn't seem the obvious choice :-(
I suspect there's something more going on there, but we're kind of at the point of diminishing returns now, so I just changed the test data to "FensterStraße" which does come back as windows-1252. Cough.

Attachment #9116334 - Flags: review?(mkmelin+mozilla)

Jorg K (CEST = GMT+2)

Comment 32

•

6 years ago

Henri, how is it possible to reliably detect any ANSI encoding like windows-1252. How do you know it's not Greek or Polish for example? Or Turkish (windows-1254). Anecdotal evidence shows that Turkish uses more öü than German, they put two into every word, like a local Turkish bike shop is called "Ürün Bikes" ;-)

Jorg K (CEST = GMT+2)

Comment 33

•

6 years ago

Comment on attachment 9116334 [details] [diff] [review] 1602816-reinstate-chardet-tests-1.patch Review of attachment 9116334 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/compose/test/unit/data/test-windows-1252.txt @@ +1,1 @@ > +FensterStraße Well, maybe use a real world example here, like: Buenos días - François a été à Paris - Budapester Straße, Berlin. Maybe only the ß does the trick and the accents won't.

Attachment #9116334 - Flags: review?(mkmelin+mozilla) → review+

Henri Sivonen (:hsivonen)

Reporter

Comment 34

•

6 years ago

•

Edited

(In reply to Ben Campbell from comment #30)

(hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)

The encoding that the Web calls "KOI8-U" is actually KOI8-RU. Somewhat like ISO-8859-1 is an alias of windows-1252, and EUC-KR really means with windows-949 extensions included.

(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #32)

Henri, how is it possible to reliably detect any ANSI encoding like windows-1252.

The reliability increases as the length of input increases. Indeed, a single letter ä without any word context for it is not a great test case.

How do you know it's not Greek

Greek words are completely non-ASCII. windows-1252 words tend to mix ASCII and non-ASCII, and four or more consecutive non-ASCII is rare.

or Polish for example?

Less well, but it's pretty good after a couple of sentences.

Or Turkish (windows-1254)

Even less well, but it's pretty OK after a couple of sentences. windows-1252 has two language models: One for Icelandic and Faroese and another that merges everything else. There are Turkish letters that overlap some Icelandic/Faroese-only letters which are penalized in the "everything else" windows-1252 model.

Henri Sivonen (:hsivonen)

Reporter

Comment 35

•

6 years ago

"Model" here means relative frequencies of character pairs where at least one participant in the pair is non-ASCII.

Jorg K (CEST = GMT+2)

Comment 36

•

6 years ago

Henri, thanks for the clarification. Good to know that there's a huge amount of expertise behind all this :-) - Sorry about the bad example of Greek. It was motivated by the fact that Windows-1253 still includes ASCII, and that the highbit characters could also come from any other ANSI table, but of course you end up with nonsensical words only consisting of accents. Just out of interest, where can I see the code for this detection? I'd like to see how Greek is detected.

Henri Sivonen (:hsivonen)

Reporter

Comment 37

•

6 years ago

The code for Greek is at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/src/lib.rs#168 and the data is at https://searchfox.org/mozilla-central/source/third_party/rust/chardetng/src/data.rs#655 . (255 is a penalty marker; the highest score is 254. You can see that the final sigma followed by any letter yields the penalty marker. The comments that labels the rows and columns give the first character of a character class. You can find the classes at https://github.com/hsivonen/detector_char_classes/blob/master/src/lib.rs .)

Magnus Melin [:mkmelin]

Assignee

Comment 38

•

6 years ago

Comment on attachment 9116307 [details] [diff] [review] 1602816-fix-sniff-buffersize.patch Review of attachment 9116307 [details] [diff] [review]: ----------------------------------------------------------------- Doh, thanks for finding that! r=mkmelin

Attachment #9116307 - Flags: review?(mkmelin+mozilla) → review+

Pulsebot

Comment 39

•

6 years ago

Pushed by mkmelin@iki.fi: https://hg.mozilla.org/comm-central/rev/fb67882ad9b9 followup - Fix wrong buffersize passed into EncodingDetector, when guessing attachment charsets. r=mkmelin https://hg.mozilla.org/comm-central/rev/69b26858a259 reinstate charset detection related tests: test_detectAttachmentCharset.js, test_messageHeaders.js, test_shiftjis_csv.js.

Jorg K (CEST = GMT+2)

Updated

•

6 years ago

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Takanori MATSUURA

Updated

•

6 years ago

Keywords: leave-open

Magnus Melin [:mkmelin]

Assignee

Updated

•

5 years ago

Whiteboard: [thunderbird-disabled-test]

You need to log in before you can comment on or make changes to this bug.