Closed Bug 1074196 Opened 5 years ago Closed 5 years ago

Nightly/Aurora startup crash spike in nsUrlClassifierPrefixSet::Contains(unsigned int, bool*)

Categories

(Toolkit :: Safe Browsing, defect, critical)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla35
Tracking Status
firefox32 --- unaffected
firefox33 --- unaffected
firefox34 + verified
firefox35 + verified

People

(Reporter: kairo, Assigned: gcp)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-fd16294d-0bbf-4530-b2d4-65c2d2140929.
=============================================================

Top frames:
0 	xul.dll 	nsUrlClassifierPrefixSet::Contains(unsigned int, bool*) 	toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
1 	xul.dll 	mozilla::safebrowsing::LookupCache::Has(mozilla::safebrowsing::SafebrowsingHash<32, mozilla::safebrowsing::CompletionComparator> const&, bool*, bool*) 	toolkit/components/url-classifier/LookupCache.cpp
2 	xul.dll 	mozilla::safebrowsing::Classifier::Check(nsACString_internal const&, nsACString_internal const&, nsTArray<mozilla::safebrowsing::LookupResult>&) 	toolkit/components/url-classifier/Classifier.cpp
3 	xul.dll 	nsUrlClassifierDBServiceWorker::DoLookup(nsACString_internal const&, nsACString_internal const&, nsIUrlClassifierLookupCallback*) 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
4 	xul.dll 	nsUrlClassifierDBServiceWorker::HandlePendingLookups() 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
5 	xul.dll 	nsUrlClassifierDBServiceWorker::Lookup(nsIPrincipal*, nsACString_internal const&, nsIUrlClassifierCallback*) 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
6 	xul.dll 	UrlClassifierDBServiceWorkerProxy::LookupRunnable::Run() 	toolkit/components/url-classifier/nsUrlClassifierProxies.cpp
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
[...]

We had a few of those crashes in the days before, but this started *really* spike on 2014-09-28 (yesterday), mostly with the builds from the 27th, but builds from previous days see it as well with lower volume.

This is mostly on startup, on both Nightly and Aurora, is the #1 crash by a large margin for that date.

URLs are not interesting, as you's expect on startup, mostly about:blank, about:sessionrestore, about:home, etc. Looking at installs vs. crashes, it looks like on average, we have 3 crashes per install, so the same people are crashing repeatedly with this one.
In the while-loop here: http://hg.mozilla.org/mozilla-central/annotate/818f353b7aa6/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp#l222

The loop spins around until it hits an unmapped page. |diff| is extremely 'negative' (e.g. 0x8071c2df, 0xabd90594). I guess it never reached exactly zero so it kept going. At the time of crash, mIndexDeltas[i][deltaIndex] is the poison value 0x5a5a, but it's not clear whether entire the array was dead, or whether we're accessing a bad index.

gcp can you help?
Flags: needinfo?(gpascutto)
Would be interesting to get a "safebrowsing" folder from someone affected. It's in
\Users\XXX\AppData\Local\Mozilla\Firefox\Profiles\XXXX\safebrowsing on Windows, and IIRC in a similar place in .cache on Unix.

This spike happening on a certain date could mean that build changes the memory layout such that the crash is triggered, or that an update happened that causes us to write invalid data. Being limited to Aurora/Nightly is likely because bug 1046038 completely changes the memory layout. Notably, it makes the array we're running past much smaller.

If there's a logic error in the code, it's not immediately obvious to me, which is why having a crashing profile would help. If you hit this bug you'd be unable to start Firefox until you blow away that directory in your profile.
Flags: needinfo?(gpascutto)
// Now search through the deltas for the target.
uint32_t diff = target - mIndexPrefixes[i];
uint32_t deltaSize = mIndexDeltas[i].Length();
uint32_t deltaIndex = 0;
while (diff > 0 && deltaIndex < deltaSize) {
  diff -= mIndexDeltas[i][deltaIndex];
  deltaIndex++;
} 

If the mIndexDelta[i][deltaIndex] is running into poison, that must mean "i" is already invalid at this point, because deltaIndex is checked against the size of the array (deltaSize) at each loop iteration.

That's reinforced by for example this: https://crash-stats.mozilla.com/report/index/deaa3289-e0a9-490b-83b2-52fc62140925
which is crashing here: http://hg.mozilla.org/mozilla-central/annotate/372ce1f36116/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp#l224
Can the array change while we're in this loop?

The crash spike here is likely triggered by an updated safe-browsing list, not a change in binary layout or anything like that.
Assignee: nobody → gpascutto
Going into those lines, the i comes directly from BinSearch, and that can only return garbage if mIndexPrefixes.Length() is 0 but mHasPrefixes is true, which seems unlikely.

The alternate explanation is that mIndexPrefixes and mIndexDeltas are of different length. I'm checking now for logic errors that could cause that to happen. MakePrefixSet looks like it can't violate that invariant, but for LoadFromFd I'm not so sure:

http://hg.mozilla.org/mozilla-central/annotate/818f353b7aa6/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp#l301
http://hg.mozilla.org/mozilla-central/annotate/818f353b7aa6/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp#l314

If deltaSize is 0, then the mIndexDeltas array won't be set up properly.

Now, constructing a prefixset that has zero deltas is hard, but it's certainly possible if on a fresh profile the first update is very small.
No idea if it fixes the crash spike but this was a real bug. I included
a test to catch it.
Attachment #8497654 - Flags: review?(mmc)
Comment on attachment 8497654 [details] [diff] [review]
Correctly initialize PrefixSets with no deltas. r=mmc

Review of attachment 8497654 [details] [diff] [review]:
-----------------------------------------------------------------

This is a bug but I don't think it fixes the crash. It looks like in Contains() we access mIndexPrefixes[0] without checking to see if it has at least one element. That seems like a bug.
Attachment #8497654 - Flags: review?(mmc) → review+
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #8)
 
> This is a bug but I don't think it fixes the crash. It looks like in
> Contains() we access mIndexPrefixes[0] without checking to see if it has at
> least one element. 

What makes you believe that? As noted in comment 6, that should be caught by the mHasPrefixes check.
http://hg.mozilla.org/mozilla-central/annotate/372ce1f36116/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp#l197
https://hg.mozilla.org/mozilla-central/rev/3c8818fdb16c
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Setting NEEDINFO for liz to verify whether this actually goes away in crash-stats.
QA Whiteboard: [qa+]
Flags: needinfo?(lhenry)
Keywords: verifyme
QA Contact: lhenry
(In reply to Gian-Carlo Pascutto [:gcp] from comment #9)
> (In reply to [:mmc] Monica Chew (please use needinfo) from comment #8)
>  
> > This is a bug but I don't think it fixes the crash. It looks like in
> > Contains() we access mIndexPrefixes[0] without checking to see if it has at
> > least one element. 
> 
> What makes you believe that? As noted in comment 6, that should be caught by
> the mHasPrefixes check.
> http://hg.mozilla.org/mozilla-central/annotate/372ce1f36116/toolkit/
> components/url-classifier/nsUrlClassifierPrefixSet.cpp#l197

Re-reading again, you are right. Splitting this into multiple state variables instead of using mIndexPrefixes.Length() makes following this logic unnecessarily confusing.
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #13)

> Re-reading again, you are right. Splitting this into multiple state
> variables instead of using mIndexPrefixes.Length() makes following this
> logic unnecessarily confusing.

I fully agree, patches welcome ;-)

I'll fix it up in the next iteration of the crash doesn't die down.
Thanks Benjamin. I'll have a look tomorrow and see if the crash rate is going down. If it does over the next few days, maybe this should get uplifted to 34 as well.
Flags: in-testsuite+
There are 35 crashes for the 2014100103 build of Nightly, and none on mozilla-central builds after that - they all look to be on Aurora.  Looks good!
Flags: needinfo?(lhenry)
Comment on attachment 8497654 [details] [diff] [review]
Correctly initialize PrefixSets with no deltas. r=mmc

Approval Request Comment
[Feature/regressing bug #]: Bug 1046038
[User impact if declined]: In very specific situations, startup crash that's only recoverable by clearing profile.
[Describe test coverage new/current, TBPL]: Landed in m-c few days ago, adds additional tests.
[Risks and why]: Very low. Clear bugfix with new test coverage.
[String/UUID change made/needed]: N/A.
Attachment #8497654 - Flags: approval-mozilla-aurora?
Comment on attachment 8497654 [details] [diff] [review]
Correctly initialize PrefixSets with no deltas. r=mmc

Aurora+
Attachment #8497654 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
The stats in Socorro [1] for the past 4 weeks (excluding Firefox 35 prior to October 2nd and Firefox 34 prior to October 7th) show:
- 1 crash in 34.0a2 (20141010004001)
- 1 crash on 35.0a1 (20141012030203) 
- 1 crash in 35.0a2 (20141019004001)
- 5 crashes in 34.0b1 (20141014134955)
- 1 crash in 34.0b2 (20141020184313)

There is clearly no spike anymore. I'm guessing this can be marked as Verified (unless you think the few new crashes need additional work).

[1] https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=nsUrlClassifierPrefixSet%3A%3AContains%28unsigned+int%2C+bool%2A%29#tab-reports
Florin, I agree, one or two crashes is nothing like the hundreds/ thousands we were seeing before. Marking verified for 34.
32 and 33 were only showing a few crashes per week so I think they can be safely marked as wontfix.
Status: RESOLVED → VERIFIED
Thank you Liz!
QA Whiteboard: [qa+] → [qa!]
Keywords: verifyme
I am getting crashes like this with the current tree. Filed bug 1120145 to track it.
You need to log in before you can comment on or make changes to this bug.