Port bug 1488659: Replace XPCOM use of nsICharsetDetector

RESOLVED FIXED in Thunderbird 64.0

Status

enhancement
RESOLVED FIXED
8 months ago
8 months ago

People

(Reporter: jorgk, Assigned: jorgk)

Tracking

Trunk
Thunderbird 64.0
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 5 obsolete attachments)

Assignee

Description

8 months ago
As per bug 1488659 comment #6. Sample code in:
https://phabricator.services.mozilla.com/D5392#C122482NL203
but let's see the final version since there was a review nit.
Assignee

Updated

8 months ago
Blocks: 1488659
Assignee

Comment 1

8 months ago
Hi Henri and Ehsan, can you please take a look. Sorry about the double review request, one will surely do, but bug 1489949 is already on inbound, so we're a bit in a hurry.

This is copied from
https://hg.mozilla.org/integration/mozilla-inbound/rev/d94ad30a3001#l11.48

In that changeset I can't see any evidence of
  NS_CHARSET_DETECTOR_CONTRACTID_BASE "universal_charset_detector"
ever existing, so hence the:
-  // First try the universal charset detector
-  nsCOMPtr<nsICharsetDetector> detector
-    = do_CreateInstance(NS_CHARSET_DETECTOR_CONTRACTID_BASE
-                        "universal_charset_detector");
-  if (!detector) {
That never worked, or am I missing something?

Strangely the string is still here:
https://dxr.mozilla.org/mozilla-central/search?q=universal_charset_detector&redirect=false
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Attachment #9007760 - Flags: review?(hsivonen)
Attachment #9007760 - Flags: review?(ehsan)
Comment on attachment 9007760 [details] [diff] [review]
1489949-replace-xpcom-nsICharsetDetector.patch

Review of attachment 9007760 [details] [diff] [review]:
-----------------------------------------------------------------

I never realized there is a c-c consumer for this, sorry about that!

Do you think it's worth adding a constructor function to m-c that reads this pref and returns a nsICharsetDetector* so that you can call that directly instead of having this code duplicated in two places?  If yes, I'd be happy to add that for you...

r=me for now so that c-c doesn't get busted.

::: mailnews/base/util/nsMsgUtils.cpp
@@ -1893,5 @@
>  {
> -  // First try the universal charset detector
> -  nsCOMPtr<nsICharsetDetector> detector
> -    = do_CreateInstance(NS_CHARSET_DETECTOR_CONTRACTID_BASE
> -                        "universal_charset_detector");

So this will always fail currently (even before bug 1488659) since this component doesn't exist.

@@ +1905,5 @@
> +      detector = new nsRUProbDetector();
> +    } else if (detectorName.EqualsLiteral("ukprob")) {
> +      detector = new nsUKProbDetector();
> +    } else if (detectorName.EqualsLiteral("ja_parallel_state_machine")) {
> +      detector = new nsJAPSMDetector();

Then yes, this is exactly like the code I landed on m-c.
Attachment #9007760 - Flags: review?(hsivonen)
Attachment #9007760 - Flags: review?(ehsan)
Attachment #9007760 - Flags: review+
Assignee

Comment 3

8 months ago
(In reply to :Ehsan Akhgari from comment #2)
> Do you think it's worth adding a constructor function to m-c that reads this
> pref and returns a nsICharsetDetector* so that you can call that directly
> instead of having this code duplicated in two places?  If yes, I'd be happy
> to add that for you...
Yes please :-)

> r=me for now so that c-c doesn't get busted.
Thanks. I'll make it r=Ehsan, right? - although previously IIRC we used to use lower case.
(I don't care about the case.  :-) )
(In reply to Jorg K (GMT+2) from comment #3)
> (In reply to :Ehsan Akhgari from comment #2)
> > Do you think it's worth adding a constructor function to m-c that reads this
> > pref and returns a nsICharsetDetector* so that you can call that directly
> > instead of having this code duplicated in two places?  If yes, I'd be happy
> > to add that for you...
> Yes please :-)

That would probably lead to *more* c-c churn later this year, when I intend to add more conditions to the detector instantiation in m-c.
(In reply to Jorg K (GMT+2) from comment #1)
>   NS_CHARSET_DETECTOR_CONTRACTID_BASE "universal_charset_detector"
> ever existing

Removed in bug 844115.
Assignee

Updated

8 months ago
Keywords: leave-open
Version: 24 → Trunk
Assignee

Comment 8

8 months ago
Ehsan and Henri: There is another call site which is now busted:
https://searchfox.org/comm-central/rev/27b3d9c5fd2543f47c3943590c72fa560a5b6b9a/mailnews/mime/src/comi18n.cpp#49

That one uses the "string versions" which you also removed. I tried a similar approach, but since you also removed nsJAStringPSMDetector, nsRUStringProbDetector, nsUKStringProbDetector, this code doesn't compile.
Flags: needinfo?(hsivonen)
Flags: needinfo?(ehsan)
Assignee

Comment 9

8 months ago
So for the time being, I have to remove the busted code.
(In reply to Jorg K (GMT+2) from comment #8)
> Created attachment 9007837 [details] [diff] [review]
> 1489949-replace-xpcom-nsICharsetDetector-part2.patch - does NOT compile
> 
> Ehsan and Henri: There is another call site which is now busted:
> https://searchfox.org/comm-central/rev/
> 27b3d9c5fd2543f47c3943590c72fa560a5b6b9a/mailnews/mime/src/comi18n.cpp#49
> 
> That one uses the "string versions" which you also removed. I tried a
> similar approach, but since you also removed nsJAStringPSMDetector,
> nsRUStringProbDetector, nsUKStringProbDetector, this code doesn't compile.

You should probably add the removed classes back to c-c and instantiate them from here then.  This stuff wasn't used in m-c and was therefore removed from there.
Flags: needinfo?(ehsan)

Comment 11

8 months ago
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/cacb2b5c2373
Port bug 1488659: Replace XPCOM use of nsICharsetDetector, part 1. r=Ehsan
https://hg.mozilla.org/comm-central/rev/2275b3b53ada
Port bug 1488659: Temporarily remove code using StringProbDetector. rs=bustage-fix
Assignee

Comment 12

8 months ago
(In reply to :Ehsan Akhgari from comment #10)
> You should probably add the removed classes back to c-c and instantiate them
> from here then.
Unfortunately that didn't occur to me, ah, well, here goes. I'll back out the bustage fix separately.
Attachment #9007837 - Attachment is obsolete: true
Flags: needinfo?(hsivonen)
Attachment #9007852 - Flags: review?(ehsan)
Assignee

Updated

8 months ago
Keywords: leave-open
Target Milestone: --- → Thunderbird 64.0

Comment 13

8 months ago
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/9588a123fb8d
backed out changeset 2275b3b53ada to remove temporary measure. a=backout
https://hg.mozilla.org/comm-central/rev/995fe17c5348
Port bug 1488659: Replace XPCOM use of nsICharsetDetector, part 2. rs=bustage-fix
Status: ASSIGNED → RESOLVED
Last Resolved: 8 months ago
Resolution: --- → FIXED
I think it's not worthwhile to put the nsIStringCharsetDetector.h interface in c-c. You should be able to get the same result by calling DoIt once followed by Done on nsICharsetDetector.
Assignee

Comment 15

8 months ago
Well, the DoIt() has a different interface
https://hg.mozilla.org/comm-central/rev/995fe17c5348#l1.98
something with confidence.

Would you like to give me a snippet here?
Comment on attachment 9007852 [details] [diff] [review]
1489949-replace-xpcom-nsICharsetDetector-part2.patch

I think Henri is a better reviewer... :-)
Attachment #9007852 - Flags: review?(ehsan)
Assignee

Comment 17

8 months ago
Comment on attachment 9007852 [details] [diff] [review]
1489949-replace-xpcom-nsICharsetDetector-part2.patch

I sometimes have to land stuff to fix bustage and a follow-up will be necessary. Unless it's grossly wrong and needs to be backed out, an r+ with the condition for a follow-up would be appreciated (see preceding comments).
Attachment #9007852 - Flags: review?(hsivonen)
Comment on attachment 9007852 [details] [diff] [review]
1489949-replace-xpcom-nsICharsetDetector-part2.patch

Review of attachment 9007852 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/mime/src/comi18n.cpp
@@ +38,5 @@
>                                          default_charset, override_charset,
>                                          eatContinuations, result);
>  }
>  
> +class nsJAStringPSMDetector : public nsXPCOMStringDetector

It looks like the classes being subclassed here are dead code in m-c, so instead of subclassing them in c-c, we should be removing them from m-c.

@@ +69,2 @@
>    *aCharset = nullptr;
> +  nsCOMPtr<nsIStringCharsetDetector> detector;

Please use nsICharsetDetector here.

@@ +73,5 @@
>  
> +  if (!detectorName.IsEmpty()) {
> +    // We recognize one of the three magic strings for the following languages.
> +    if (detectorName.EqualsLiteral("ruprob")) {
> +      detector = new nsRUStringProbDetector();

And then please instantiate the nsICharsetDetector implementation classes in these branches.

@@ +81,5 @@
> +      detector = new nsJAStringPSMDetector();
> +    }
> +  }
> +
> +  if (detector) {

The here you need to initialize the nsICharsetDetector with an implementation of nsICharsetDetectionObserver that stores copies the arguments of Notify to its fields.

Then you can call DoIt and Done on the nsICharsetDetector and then retrieve the values from the fields of your nsICharsetDetectionObserver implemenatation.
Attachment #9007852 - Flags: review?(hsivonen) → review-
(In reply to Henri Sivonen (:hsivonen) from comment #18)
> It looks like the classes being subclassed here are dead code in m-c, so
> instead of subclassing them in c-c, we should be removing them from m-c.

(Bug 1490212)
(In reply to Henri Sivonen (:hsivonen) from comment #18)
> The here you need to initialize the nsICharsetDetector with an
> implementation of nsICharsetDetectionObserver that stores copies the
> arguments of Notify to its fields.

Correction: In what's a very questionable use of XPCOM, the const char* argument of Notify() always points to a literal string, so if the c-c callers assume they don't have to free the pointer, I guess you shouldn't copy that string but just pass that pointer along.
So why exactly do we want to preserve the sniffing in comm-central?
If m-c is able to remove it's usage, if anything I'd imagine mail content with unlabeled charsets would be even more uncommon.
Assignee

Comment 22

8 months ago
Thanks for the review, Henri. Here's a version as you described it. I'll backout the incorrect version separately.

I've done the tweaks in nsMsgUtils.cpp according to your comment #20.

I'm not sure about passing back the charset from MIME_detect_charset(). Now I'm just passing the pointer back. MsgDetectCharsetFromFile() passes back a copy in a nsACString.
Attachment #9007852 - Attachment is obsolete: true
Attachment #9007990 - Flags: review?(hsivonen)
Comment on attachment 9007990 [details] [diff] [review]
1489949-replace-xpcom-nsICharsetDetector-part2.patch (v2)

Review of attachment 9007990 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks and sorry about the churn.
Attachment #9007990 - Flags: review?(hsivonen) → review+
Assignee

Comment 24

8 months ago
(In reply to Henri Sivonen (:hsivonen) from comment #23)
> Thanks and sorry about the churn.
Thanks, the churn was mostly my fault landing incorrect solutions as the trigger-happy sheriff/developer ;-)

M-C sees a whole lot of backouts, so I don't think we should worry about some backouts on C-C.

So the 
+      if (oConfident == eBestAnswer || oConfident == eSureAnswer) {
+        *aCharset = observer->GetDetectedCharset();
is OK?
(In reply to Jorg K (GMT+2) from comment #24)
> (In reply to Henri Sivonen (:hsivonen) from comment #23)
> > Thanks and sorry about the churn.
> Thanks, the churn was mostly my fault landing incorrect solutions as the
> trigger-happy sheriff/developer ;-)
> 
> M-C sees a whole lot of backouts, so I don't think we should worry about
> some backouts on C-C.
> 
> So the 
> +      if (oConfident == eBestAnswer || oConfident == eSureAnswer) {
> +        *aCharset = observer->GetDetectedCharset();
> is OK?

I believe so. I believe both before and after what gets assigned to *aCharset is a pointed to a POD literal, so if it worked previously without anyone trying to free the literal, it should work now.
Assignee

Comment 26

8 months ago
Hmm, well, the const char* pointer we return gets passed into Notify(). All we know is that Notify() shouldn't/can't modify it. As far as I'm aware, the is no guarantee as to the longevity of the data pointed to. The CharsetDetectionObserver will be destroyed once the function exits.

There is no "prior art" here, at least not strictly. Before that code used the string detector versions and the DoIt() function returned the charset, and I don't know where that came from.

All I can observe is that in the equivalent code in MsgDetectCharsetFromFile(), a copy is passed back a copy in a nsACString. That would be 100% water-proof. Shall I change it to that so we can sleep better?

BTW, I'm happy to be wrong on all the above, just asking ;-)

Note that the Japanese version is in fact used:
https://dxr.mozilla.org/l10n-central/rev/fe8de2043f654a30ec39111799bba92c53cc67e0/ja/toolkit/chrome/global/intl.properties#39
Flags: needinfo?(hsivonen)
(In reply to Jorg K (GMT+2) from comment #26)
> All I can observe is that in the equivalent code in
> MsgDetectCharsetFromFile(), a copy is passed back a copy in a nsACString.
> That would be 100% water-proof. Shall I change it to that so we can sleep
> better?

Changing the aCharset outparam of MIME_detect_charset from const char** aCharset to nsACString& aCharset would be better.
Flags: needinfo?(hsivonen)
MIME_detect_charset has C linkage but is only called from C++, so changing its linkage to C++ linkage should be fine.
Assignee

Comment 29

8 months ago
So like this?
Attachment #9008035 - Flags: review?(hsivonen)
Before proceeding, is there reason to think we need this? comment #21

(Because it's in use isn't a reason. Of course it's used since the localization note says to set it.)
Assignee

Comment 31

8 months ago
Good point, but perhaps we should fix the bustage and move the removal, if so decided, two another bug. Note that there are various call sites doing this.
Assignee

Comment 32

8 months ago
MsgDetectCharsetFromFile() fixed in part 1 has two call sites. Does your comment include those or only part 2?
Comment on attachment 9008035 [details] [diff] [review]
1489949-replace-xpcom-nsICharsetDetector-part2.patch (v3)

Review of attachment 9008035 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/nsMsgUtils.cpp
@@ +1883,5 @@
>    {
>      mCharset = aCharset;
>      return NS_OK;
>    };
> +  const char *GetDetectedCharset() { return mCharset; }

This should have nsACString& as outparam instead of returning const char*.

@@ +1888,4 @@
>  
>  private:
>    virtual ~CharsetDetectionObserver() {}
> +  const char* mCharset;

This should stay as nsCString.

::: mailnews/mime/src/comi18n.cpp
@@ +47,5 @@
> +  NS_DECL_ISUPPORTS
> +  CharsetDetectionObserver() {};
> +  NS_IMETHOD Notify(const char* aCharset, nsDetectionConfident aConf) override
> +  {
> +    mCharset = aCharset;

AssignASCII here.
Assignee

Comment 34

8 months ago
Tweaked string processing in both source files as per Henri's comments. Applied all three comments everywhere.

Magnus will be disappointed that his question wasn't answered. I guess MsgDetectCharsetFromFile() from part 1 has its justification, but MIME_detect_charset() appears a bit doubtful. It's only ever configured for Japanese, so as far as I can see, no Russian/Ukrainian/Cyrillic detection. And for Japanese, will it really distinguish - say - UTF-8, Shift-JIS from ISO-2022-JP?

Best discussed in a new bug, I think.
Attachment #9007990 - Attachment is obsolete: true
Attachment #9008035 - Attachment is obsolete: true
Attachment #9008035 - Flags: review?(hsivonen)
Attachment #9008124 - Flags: review?(hsivonen)
Assignee

Comment 35

8 months ago
Oops, forgot on AssignASCII(). Sorry about the noise. Interdiff with v3.
Attachment #9008124 - Attachment is obsolete: true
Attachment #9008124 - Flags: review?(hsivonen)
Attachment #9008125 - Flags: review?(hsivonen)
Comment on attachment 9008125 [details] [diff] [review]
1489949-replace-xpcom-nsICharsetDetector-part2.patch (v3c)

Review of attachment 9008125 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks.

::: mailnews/mime/src/comi18n.h
@@ +31,3 @@
>  
>  #ifdef __cplusplus
>  } /* extern "C" */

I'm surprised it compiles without removing the C linkage. If it turns out you need to remove the C linkage, the r+ holds either way.
Attachment #9008125 - Flags: review?(hsivonen) → review+

Comment 37

8 months ago
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/9dc7f65220a4
Backed out changeset 995fe17c5348 for being incorrect. a=backout
https://hg.mozilla.org/comm-central/rev/4d4e022e6697
Port bug 1488659: Replace XPCOM use of nsICharsetDetector, part 2, take 2. r=hsivonen
Assignee

Comment 38

8 months ago
Landed with two changes from the reviewed version:

-  if (NS_SUCCEEDED(res) && detectedCharset && *detectedCharset)  {
+  if (NS_SUCCEEDED(res) && !detectedCharset.IsEmpty()) {
     PR_FREEIF(text->charset);
-    text->charset = strdup(detectedCharset);
+    text->charset = ToNewCString(detectedCharset);       // [1]
 
     //update MsgWindow charset if we are instructed to do so
-    if (text->needUpdateMsgWinCharset && *text->charset)
+    if (text->needUpdateMsgWinCharset && text->charset)  // [2]

[1] Used to ToNewCString(detectedCharset) instead of strdup(detectedCharset.get())
[2] ToNewCString() can return nullptr, so we need to check.
    Checking *text->charset was wrong since
      a) it would have dereferenced null
      b) we checked before that the string wasn't empty, even in the original code.
Thanks. That's indeed better that strdup.
You need to log in before you can comment on or make changes to this bug.