Closed Bug 1342397 Opened 3 years ago Closed 3 years ago

Crash in InvalidArrayIndex_CRASH | mozilla::safebrowsing::Classifier::CopyInUseLookupCacheForUpdate

Categories

(Toolkit :: Safe Browsing, defect, critical)

Unspecified
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- unaffected
firefox53 --- unaffected
firefox54 --- fixed

People

(Reporter: calixte, Assigned: hchang)

References

(Blocks 1 open bug)

Details

(Keywords: crash, Whiteboard: [clouseau])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-56ecd061-96d7-408d-80f8-b49642170223.
=============================================================

There are 4 crashes in nightly 54 with buildid 20170222030329 and 20170223030204.
In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1338970.

[1] https://hg.mozilla.org/mozilla-central/rev?node=2c1fc455e81e0c7744e38a36b6df8657f462a369
Flags: needinfo?(hchang)
There are 2 crashes with signature "nsTArray_Impl<T>::RemoveElementsAt | nsUrlClassifierPrefixSet::~nsUrlClassifierPrefixSet". The backtrace shows that the files modified to fix bug 1338970 are involved too, it's why I'm adding this signature.
Crash Signature: [@ InvalidArrayIndex_CRASH | mozilla::safebrowsing::Classifier::CopyInUseLookupCacheForUpdate] → [@ InvalidArrayIndex_CRASH | mozilla::safebrowsing::Classifier::CopyInUseLookupCacheForUpdate] [@ nsTArray_Impl<T>::RemoveElementsAt | nsUrlClassifierPrefixSet::~nsUrlClassifierPrefixSet ]
Assignee: nobody → hchang
Flags: needinfo?(hchang)
I can think of one case which may lead to the crash signature [@ InvalidArrayIndex_CRASH | mozilla::safebrowsing::Classifier::CopyInUseLookupCacheForUpdate]:

In [1] we iterate mLookupCache and create new LookupCache for update accordingly.
However, creating new LookupCache (i.e. GetLookupCacheForUpdate()) may cause 
mLookupCache to be cleared [2].

My solution is to only remove update intermediary when GetLookupCacheForUpdate() is failed 
(no matter it's due to file corruption or not. For example, if it's due to OOM, we still
have to remove every thing temporary for update.)

So, after Bug 1339760, 

- GetLookupCacheForUpdate() will only on update thread and may only remove "mNewLookupCaches" and   "safebrowsing-update"; 

- GetLookupCache() will only on worker thread and may only remove "mLookupCache" and "safebrowsing".

[1] http://searchfox.org/mozilla-central/rev/60ae6514e4c559c0c234f0e7aefccb101b8beb2e/toolkit/components/url-classifier/Classifier.cpp#1012
[2] http://searchfox.org/mozilla-central/rev/60ae6514e4c559c0c234f0e7aefccb101b8beb2e/toolkit/components/url-classifier/Classifier.cpp#1352
Attachment #8840939 - Flags: review?(gpascutto)
The line you point to in the above comment, and the added comments in this patch point to Reset() "Not including any intermediary for update.". Was that meant as a comment on its (existent) behavior, or is something missing from this patch?
(In reply to Gian-Carlo Pascutto [:gcp] from comment #7)
> The line you point to in the above comment, and the added comments in this
> patch point to Reset() "Not including any intermediary for update.". Was
> that meant as a comment on its (existent) behavior, or is something missing
> from this patch?

That's the existent behavior and I just wanted to make Reset() more clear. 
I have ever considered to extend Reset() to cover update intermediaries 
but then realized that would be a mess when we introduce more concurrency 
between update and lookup.
Note that my patch can only explain the crash inside CopyInUseLookupCacheForUpdate().

What the patch cannot explain are:

1) Why the copied files will be corrupted (and only on window and android)?
   If the source is corrupted, it should have been removed. Or does it imply
   "copy then open too soon" is problematic on windows and android?

2) The double delete issue of the other crash signature (from CloseDb())
1) Can have many reasons. There are many differences in how Windows handles basic file access, varying from not updating access/modification times in the same manner (this can be observed in safebrowsing directory time), things like not being able to open files multiple times (I forgot exact conditions), etc. Also, there are much more users on Windows so they will trigger more problems. Android by design will kill Firefox randomly so it's much more likely to surface any unsoundness in the update procedure.
Comment on attachment 8840939 [details]
Bug 1342397 - Properly reset stuff when failing to create LookupCache for update.

https://reviewboard.mozilla.org/r/115326/#review116838

It's late and I'm tired, but given the

::: toolkit/components/url-classifier/Classifier.cpp:1327
(Diff revision 4)
>  LookupCache *
>  Classifier::GetLookupCache(const nsACString& aTable, bool aForUpdate)
>  {
> -  if (aForUpdate) {
> -    return GetLookupCacheFrom(aTable, mNewLookupCaches, mUpdatingDirectory);
> -  }
> +  nsTArray<LookupCache*>& lookupCaches = aForUpdate ? mNewLookupCaches
> +                                                    : mLookupCaches;
> +  nsIFile* rootStoreDirectory = aForUpdate ? mUpdatingDirectory

If it cannot be nullptr one can try to make it a reference.

::: toolkit/components/url-classifier/Classifier.cpp:1364
(Diff revision 4)
> +  // GetLookupCache for update and for other usage will run on update thread
> +  // and worker thread respectively (Bug 1339760). Removing stuff only in
> +  // their own realms potentially increases the concurrency.
> +
> +  if (aForUpdate) {
> +    // Remove intermediaries no matter it's due to file corruption or not.

nit: no matter if it's due
Attachment #8840939 - Flags: review?(gpascutto) → review+
Comment on attachment 8840939 [details]
Bug 1342397 - Properly reset stuff when failing to create LookupCache for update.

https://reviewboard.mozilla.org/r/115326/#review116838

Oops, Mozreview saved this from last week :) I asked the question I was going to ask here in a separate comment.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #10)
> 1) Can have many reasons. There are many differences in how Windows handles
> basic file access, varying from not updating access/modification times in
> the same manner (this can be observed in safebrowsing directory time),
> things like not being able to open files multiple times (I forgot exact
> conditions), etc. Also, there are much more users on Windows so they will
> trigger more problems. Android by design will kill Firefox randomly so it's
> much more likely to surface any unsoundness in the update procedure.

Looking again the crash reports, all |InvalidArrayIndex_CRASH| are on Windows.
(I slightly remember there were some on Android. Maybe it was also too late for me...)

From the current reports I cannot tell the corruption rate is high or low.
If it is high, even if my patch fixed the crash, the update will still fail:

If the corruption is due to the source file corruption, then 
1) If the file is never, (usually in the very early update), we have nothing to do.
2) If the file is opened, (which means we already have a LookupCache for it), we can get around
   the "copied file corruption" issue by copying LookupCache internal. (what we are doing right
   now is to re-open the copied file.) See Bug 1341183.


As for the other crash signature, it seems to be a long-standing issue: We happened to
have two pointers for one LookupCache instance, which then leads to double delete issue.
Bug 1338970 changes the call sequence in regard to "removing LookupCaches".
Pushed by hchang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/313bfccbbd9f
Properly reset stuff when failing to create LookupCache for update. r=gcp
https://hg.mozilla.org/mozilla-central/rev/313bfccbbd9f
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
The patch has landed for almost one week and the crash signature 
"InvalidArrayIndex_CRASH | mozilla::safebrowsing::Classifier::CopyInUseLookupCacheForUpdate"
never occurs on nightly which has this patch. Good news :)
Whiteboard: [clouseau]
(In reply to Henry Chang [:henry][:hchang] from comment #17)
> The patch has landed for almost one week and the crash signature 
> "InvalidArrayIndex_CRASH |
> mozilla::safebrowsing::Classifier::CopyInUseLookupCacheForUpdate"
> never occurs on nightly which has this patch. Good news :)

Good job!  Thanks, Henry!
You need to log in before you can comment on or make changes to this bug.