Closed Bug 702217 Opened 13 years ago Closed 13 years ago

OOM crash in nsUrlClassifierStore::ReadPrefixes or nsUrlClassifierPrefixSet::SetPrefixes

Categories

(Toolkit :: Safe Browsing, defect)

9 Branch
x86
All
defect
Not set
critical

Tracking


RESOLVED FIXED
Firefox 12
Tracking Status
firefox9 + affected
firefox10 + affected
firefox11 + fixed
firefox12 --- fixed
firefox-esr10 - wontfix

People

(Reporter: scoobidiver, Assigned: gcp)

References

Details

(Keywords: crash, Whiteboard: [qa-])

Crash Data

Attachments

(3 files, 2 obsolete files)

It's a new crash signature that first appeared in 9.0a1/20110911.
It's #20 top crasher in 9.0b1.

There are two kinds of stack traces:
0 	mozalloc.dll 	mozalloc_abort 	memory/mozalloc/mozalloc_abort.cpp:77
1 	mozalloc.dll 	mozalloc_handle_oom 	memory/mozalloc/mozalloc_oom.cpp:54
2 	xul.dll 	nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity 	obj-firefox/dist/include/nsTArray-inl.h:106
3 	xul.dll 	nsTArray<nsTimerImpl*,nsTArrayDefaultAllocator>::AppendElements<nsTimerImpl*> 	obj-firefox/dist/include/nsTArray.h:811
4 	xul.dll 	nsUrlClassifierStore::ReadPrefixes 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:3562
5 	xul.dll 	nsUrlClassifierDBServiceWorker::ConstructPrefixSet 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:3589
6 	xul.dll 	nsUrlClassifierDBServiceWorker::ApplyUpdate 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:3159
7 	xul.dll 	nsUrlClassifierDBServiceWorker::FinishUpdate 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:3194
8 	xul.dll 	nsRunnableMethodImpl<unsigned int 	obj-firefox/dist/include/nsThreadUtils.h:345
9 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:631
10 	xul.dll 	nsRunnable::Release 	obj-firefox/xpcom/build/nsThreadUtils.cpp:55
...

0 	mozalloc.dll 	mozalloc_abort 	memory/mozalloc/mozalloc_abort.cpp:77
1 	mozalloc.dll 	mozalloc_handle_oom 	memory/mozalloc/mozalloc_oom.cpp:54
2 	xul.dll 	nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity 	obj-firefox/dist/include/nsTArray-inl.h:77
3 	xul.dll 	nsTArray<nsTimerImpl*,nsTArrayDefaultAllocator>::AppendElements<nsTimerImpl*> 	obj-firefox/dist/include/nsTArray.h:811
4 	xul.dll 	nsUrlClassifierPrefixSet::SetPrefixes 	toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp:122
5 	xul.dll 	nsUrlClassifierDBServiceWorker::ConstructPrefixSet 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:3611
6 	xul.dll 	nsUrlClassifierDBServiceWorker::ApplyUpdate 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:3159
7 	xul.dll 	nsUrlClassifierDBServiceWorker::FinishStream 	
8 	xul.dll 	nsRunnableMethodImpl<unsigned int 	obj-firefox/dist/include/nsThreadUtils.h:345
9 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:631
10 	xul.dll 	nsRunnable::Release 	obj-firefox/xpcom/build/nsThreadUtils.cpp:55
11 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:426
...

More reports at:
https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char%20const*%20const%29%20|%20mozalloc_handle_oom%28%29%20|%20nsTArray_base%3CnsTArrayDefaultAllocator%3E%3A%3AEnsureCapacity%28unsigned%20int%2C%20unsigned%20int%29%20|%20nsTArray%3CnsTimerImpl*%2C%20nsTArrayDefaultAllocator%3E%3A%3AAppendElements%3CnsTimerImpl*%3E%28nsTimerImpl*%20const*%2C%20unsign...
Quite a few of these are repeated, i.e. if it crashes, it will do so multiple times for the same user (I'm unsure how Socorro works, but I presume the same install date+time means the same user). Could these be corrupted databases?

Maybe we should add a sanity check when constructing the tree, and axe the databases if they appear corrupted? Maybe add Telemetry for that, too?
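A minimal sketch of what such a sanity check could look like (hedged; the bound and the function/constant names are made up purely to illustrate the idea):

  // Hypothetical upper bound on plausible prefix counts; anything above this
  // is treated as a corrupted database instead of being attempted as an allocation.
  static const PRUint32 MAX_SANE_PREFIXES = 5 * 1000 * 1000;

  nsresult SanityCheckPrefixCount(PRUint32 aCount)
  {
    if (aCount > MAX_SANE_PREFIXES) {
      // Looks corrupted: the caller should reset the store (and report telemetry).
      return NS_ERROR_FILE_CORRUPTED;
    }
    return NS_OK;
  }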
Crash Signature: unsign... ] → unsign... ] [@ mozalloc_abort(char const* const) | mozalloc_handle_oom() | nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) | nsTArray<unsigned int nsTArrayDefaultAllocator>::AppendElements<unsigned int>(unsigned int co…
Sorry, filed a dupe because of the signature being slightly different, here's my comment from there:

This is topcrash #33 in Firefox 9 data from yesterday. It mostly happens in 9.0b2 builds, but I'm also seeing 10.0a2 Aurora and 11.0a1 Nightly crashes with that signature in https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char%20const*%20const%29%20|%20mozalloc_handle_oom%28%29%20|%20nsTArray_base%3CnsTArrayDefaultAllocator%3E%3A%3AEnsureCapacity%28unsigned%20int%2C%20unsigned%20int%29%20|%20nsTArray%3Cunsigned%20int%2C%20nsTArrayDefaultAllocator%3E%3A%3AAppendElements%3Cunsigned%20int%3E%28unsigned%20int%20const*%2C%20unsign...
Component: General → Phishing Protection
Product: Toolkit → Firefox
QA Contact: general → phishing.protection
Crash Signature: nsTArrayDefaultAllocator>::AppendElements<unsigned int>(unsigned int const*, unsign... ] → nsTArrayDefaultAllocator>::AppendElements<unsigned int>(unsigned int const*, unsign... ] [@ mozalloc_abort(char const* const) | mozalloc_handle_oom() | nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int unsigned int) | nsTArray<gfxGlyph…
Question about this crash signature: some other code I am working on is doing AppendElements and then checking if that returns a NULL pointer, to detect OOM conditions. However, given that the traces here are aborts inside the allocator and not null pointer dereferences, am I correct that those checks are useless?
I don't think we can do much more than add a sanity check and reset the (corrupted) database if things seem to be going bonkers.
Assignee: nobody → gpascutto
Status: NEW → ASSIGNED
Attachment #578214 - Flags: review?(dcamp)
Attachment #578214 - Flags: approval-mozilla-beta?
Attachment #578214 - Flags: approval-mozilla-aurora?
FWIW, I do think we want to fix/work around this, because users who encounter these issues likely won't be able to use Firefox at all unless they downgrade or kill their entire profile. Uninstall/reinstall won't fix it either.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #4)
> Question about this crash signature: some other code I am working on is
> doing AppendElements and then checking if that returns a NULL pointer, to
> detect OOM conditions. However, given that the traces here are aborts
> inside the allocator and not null pointer dereferences, am I correct that
> those checks are useless?

By default nsTArray is infallible (it will OOM-abort rather than return null). IIRC it used to be fallible.

You can switch back to a fallible version by using FallibleTArray instead of nsTArray. That might be a good idea here, but it could be done in a follow-up.
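A minimal sketch of the difference (assuming the nsTArray/FallibleTArray headers of this era; the helper function is made up for illustration):

  #include "nsTArray.h"
  #include "nsError.h"

  // Made-up helper, only to contrast the two allocation behaviours.
  nsresult AppendPrefix(PRUint32 aPrefix)
  {
    // Infallible (default): on OOM this aborts via mozalloc_handle_oom,
    // which is exactly the abort at the top of the stacks quoted in this bug,
    // so checking the result afterwards never catches anything.
    nsTArray<PRUint32> infallible;
    infallible.AppendElement(aPrefix);

    // Fallible: the append reports failure instead of aborting, so the
    // check below is meaningful and the caller can recover.
    FallibleTArray<PRUint32> fallible;
    if (!fallible.AppendElement(aPrefix)) {
      return NS_ERROR_OUT_OF_MEMORY;
    }
    return NS_OK;
  }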
Comment on attachment 578214 [details] [diff] [review]
Patch 1. Sanity check prefixes size. Pass down error codes.

Review of attachment 578214 [details] [diff] [review]:
-----------------------------------------------------------------

Would it be reasonable to note in telemetry data how often this is happening?

This is a reasonable sanity check, so r+.
Attachment #578214 - Flags: review?(dcamp) → review+
Comment on attachment 578214 [details] [diff] [review]
Patch 1. Sanity check prefixes size. Pass down error codes.

[Triage Comment]
Significant crasher and new to FF9. Let's take it on Aurora and Beta.
Attachment #578214 - Flags: approval-mozilla-beta?
Attachment #578214 - Flags: approval-mozilla-beta+
Attachment #578214 - Flags: approval-mozilla-aurora?
Attachment #578214 - Flags: approval-mozilla-aurora+
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=4be41994deb7
https://hg.mozilla.org/mozilla-central/rev/636ea2bf3366
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 11
Scoobidiver, I have no idea why you set this to fixed for 9 and 10, but I can't see any checkin on mozilla-aurora or mozilla-beta for this, so I'm setting it to affected for them; it should be marked fixed on those by whoever checks in on the branches.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #12)
> Scoobidiver, no idea why you did set this to fixed for 9 and 10 but I can't
> see any checkin on mozilla-aurora or mozilla-beta for this
My bad. I saw the Aurora and Beta approval flags, but I didn't check the landings on the branches.
(In reply to Ed Morley [:edmorley] from comment #11)
> https://hg.mozilla.org/mozilla-central/rev/636ea2bf3366

Please land asap on aurora and beta to make this week's build.
Is there something QA can do to verify this fix?
Whiteboard: [qa?]
>Is there something QA can do to verify this fix?

Constructing a corrupted urlclassifier3.sqlite database that has this effect is tricky - I didn't succeed by adding random errors. If you can't get one from one of the users that were seeing this, then no, you can't verify this.
QA- based on comment 18. Can someone who was previously experiencing this bug please verify the fix on Firefox 9 and 10?
Whiteboard: [qa?] → [qa-]
>is rising somewhat on that version. Is this something different?

I don't see anything in that link?

This one is in the same module, and probably related (I didn't look at the code in detail, but it could potentially have the same cause as this one):
https://bugzilla.mozilla.org/show_bug.cgi?id=691364
>is rising somewhat on that version. Is this something different?

Got it now:

https://crash-stats.mozilla.com/report/list?range_value=7&range_unit=days&date=2011-12-12&signature=mozalloc_abort%28char%20const*%20const%29%20|%20mozalloc_handle_oom%28%29%20|%20nsTArray_base%3CnsTArrayDefaultAllocator%3E%3A%3AEnsureCapacity%28unsigned%20int%2C%20unsigned%20int%29%20|%20nsTArray%3CnsTimerImpl*%2C%20nsTArrayDefaultAllocator%3E%3A%3AAppendElements%3CnsTimerImpl*%3E%28nsTimerImpl*%20const*%2C%20unsign...&version=Firefox%3A9.0b5

It's the same issue as in this bug indeed. (The nsTimerImpl in the signature is bogus.) Should we reopen this one or file a new bug?

Looking at:

http://hg.mozilla.org/releases/mozilla-beta/annotate/dea4e14dc88a/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#l3578

The fix that was done limits it to 5M entries in all cases, each 4 bytes, so 20M, say 40M worst case depending on how nsTArray allocates memory. So we're OOMing trying to allocate at most 40M of RAM, and in typical usage we know it will max out at about 5M of RAM (because the real DB is 600-900k entries at most).

What I'm suspecting is that something else entirely is leaking RAM, and UrlClassifier is hitting it because it's one of the only activities that runs automatically in the background, even when the browser is idle, and allocates memory.
Assignee: gpascutto → nobody
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
> What I'm suspecting is that something else entirely is leaking RAM

That's entirely possible, although it strikes me as a little far-fetched that we'd so often run *almost* entirely out of RAM and then, when idle, OOM.

You could switch your TArrays to fallible.  But I think I'd also put a fatal release-build assertion (in Nightly only) checking that the URL classifier is not using more memory than you think, because it strikes me as more likely that the problem is somehow self-contained.
>You could switch your TArrays to fallible.

What do I do if the allocation fails? I can't just disable SafeBrowsing. 

>But I think I'd also put a fatal release-build assertion (in Nightly only) 
>checking that the URL classifier is not using more memory than you think, because 
>it strikes me as more likely that the problem is somehow self-contained.

The initial patch in this bug puts a hard limit on the maximum memory use of 40M, as explained in comment 22. The code still OOMs. Help?

The OOMs happen in this loop, which has the hard protection. The protection is not very complicated; please check it and see if I'm missing something obvious:
http://hg.mozilla.org/releases/mozilla-beta/annotate/745b004fad4a/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#l3594

or here, which is just making a copy to sort:
http://hg.mozilla.org/releases/mozilla-beta/annotate/745b004fad4a/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp#l122
(I think the copy in this case might be avoidable because the callers can tolerate that what they pass gets clobbered, so maybe I should just change the code)

In real use cases (600k entries), let's say even including the extra copy, this code is allocating 10M of RAM. Why should it OOM? There seems to be a disproportionate number of cases on Windows XP, so I'd almost fear that it really is people running out of memory on tiny old systems. But there are many places in Firefox that use way more temporary memory than this, surely?
> The code still OOMs. Help?

Hmm.  So on crash-stats, there's a large spread in uptimes over which this crash occurs.  There are plenty of single-digit uptimes, and there are also some five-digit uptimes.

That suggests this may not be just "collateral OOM" caused by FF using lots of memory.  If it were, I'd expect the crashes to be weighted towards longer sessions.

I guess that's not much help, is it?  :)

Do you need to process the whole thing all at once?  Can you bring it in piecewise?
So should we back this out of beta/aurora or keep it in?
>So should we back this out of beta/aurora or keep it in?

Back what out? The patch here is harmless even if it didn't stop the problem. 

Backing out all recent UrlClassifier changes because of this crash which we don't understand? Bleh, that makes it difficult to progress forward as this is not reproducible.

It might be that we're hitting a *peak* memory usage with the changes that's higher than before, but not more than about 15M (unless there's another copy I missed). I can attempt to reduce this, but that'll be taking time away from getting the *new* urlclassifier landed, so I'd prefer not to unless we judge the amount of crashes we're getting from this to be problematic.
(From comment 26)
> Do you need to process the whole thing all at once?  Can you bring it in piecewise?
UrlClassifierPrefixSet supports piecewise construction, but the entries must be brought in in sorted order. The latter part can (currently) be offloaded to SQLite, though.

So yeah, that should be possible.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #28)
> >So should we back this out of beta/aurora or keep it in?
> 
> Back what out? The patch here is harmless even if it didn't stop the
> problem. 
> 
> Backing out all recent UrlClassifier changes because of this crash which we
> don't understand? Bleh, that makes it difficult to progress forward as this
> is not reproducible.
> 
> It might be that we're hitting a *peak* memory usage with the changes that's
> higher than before, but not more than about 15M (unless there's another copy
> I missed). I can attempt to reduce this, but that'll be taking time away
> from getting the *new* urlclassifier landed, so I'd prefer not to unless we
> judge the amount of crashes we're getting from this to be problematic.

We took a change into aurora and beta to fix a crash and it did not seem to fix it. We should generally back out to the state that was tested on the channels for many weeks to reduce risk, as the reward we were seeking doesn't look to be there.

That being said, it's your call on what you want to do.
Unfortunately the two things I said above didn't actually work, so I had to try something different:

1) Clobbering what's passed to SetPrefixes is annoying because it's an XPCOM-exposed interface. It's simpler to just document that the caller is responsible for sorting the array before passing it in, and do exactly that. The code is adjusted like that now; this saves a complete copy of the array that's OOMing.

2) Although PrefixSet can be constructed in parts, we need the array sorted to do this, and that *isn't* possible from within SQLite because we still rehash the array's entries afterwards, messing up the sorted order. So we need at least one full copy in RAM. However, we can probe what size the array will be before constructing it, and SetCapacity it to the correct size. This might save another doubling of the array size in the worst case.

3) I made the arrays involved fallible, and added telemetry to detect when this triggers. This will cause us to disable SafeBrowsing if we OOM, at least until the next update (30-45 minutes). Disabling a security feature isn't nice, but if the alternative is to crash the browser with OOM... Maybe the Telemetry can help us narrow down the circumstances in which this happens. I considered falling back to SQLite-only lookups, but this (a) complicates the code severely and (b) is likely to give a terrible experience, as there won't be RAM for the SQLite cache either.
Assignee: nobody → gpascutto
Attachment #585702 - Flags: review?(justin.lebar+bug)
Attachment #585702 - Flags: feedback?(dcamp)
This looks virtuous to me, but f- because I'd like to take another look.

># HG changeset patch
># Parent f301341f2e02f2ad33525f616dac019dc0f1e158
># User Gian-Carlo Pascutto <gpascutto@mozilla.com>
>Bug 702217 - Avoid double allocation in UrlClassifier. Handle OOM conditions. r=
>
>diff --git a/toolkit/components/build/nsToolkitCompsCID.h b/toolkit/components/build/nsToolkitCompsCID.h
>--- a/toolkit/components/build/nsToolkitCompsCID.h
>+++ b/toolkit/components/build/nsToolkitCompsCID.h
>@@ -158,19 +158,19 @@
>-// {7902e243-9b00-4673-9288-1b7fc246a8f8}
>+// {f806c3b6-a7cd-4ac9-bb5d-8c8a20a6e175}
> #define NS_URLCLASSIFIERPREFIXSET_CID \
>-{ 0x7902e243, 0x9b00, 0x4673, { 0x92, 0x88, 0x1b, 0x7f, 0xc2, 0x46, 0xa8, 0xf8} }
>+{ 0xf806c3b6, 0xa7cd, 0x4ac9, { 0xbb, 0x5d, 0x8c, 0x8a, 0x20, 0xa6, 0xe1, 0x75} }

The CID (class ID) doesn't need to match the IID (interface ID).  In fact, it
probably shouldn't.  I'd change the CID to a new random value and leave the IID
unchanged, since you didn't modify the interface.

>diff --git a/toolkit/components/url-classifier/nsIUrlClassifierPrefixSet.idl b/toolkit/components/url-classifier/nsIUrlClassifierPrefixSet.idl
>--- a/toolkit/components/url-classifier/nsIUrlClassifierPrefixSet.idl
>+++ b/toolkit/components/url-classifier/nsIUrlClassifierPrefixSet.idl
>@@ -36,17 +36,17 @@
>  *
>  * ***** END LICENSE BLOCK ***** */
> 
> #include "nsISupports.idl"
> #include "nsIFile.idl"
> 
> interface nsIArray;
> 
>-[scriptable, uuid(7902e243-9b00-4673-9288-1b7fc246a8f8)]
>+[scriptable, uuid(f806c3b6-a7cd-4ac9-bb5d-8c8a20a6e175)]
> interface nsIUrlClassifierPrefixSet : nsISupports

You didn't modify the interface, so no need to change the interface ID.
(Unless you want to modify the IID because you changed the contract, requiring
the input to setPrefixes to be sorted?  I guess that's OK, although it's kind
of a bummer that none of the methods are commented here.)

> {
>   void setPrefixes([const, array, size_is(aLength)] in unsigned long aPrefixes,
>                    in unsigned long aLength);
>   void addPrefixes([const, array, size_is(aLength)] in unsigned long aPrefixes,
>                    in unsigned long aLength);
>   boolean contains(in unsigned long aPrefix);
>   boolean probe(in unsigned long aPrefix, in unsigned long aKey,

>-nsresult nsUrlClassifierStore::ReadPrefixes(nsTArray<PRUint32>& array,
>+nsresult nsUrlClassifierStore::ReadPrefixes(FallibleTArray<PRUint32>& array,
>                                             PRUint32 aKey)
> {
>-  mozStorageStatementScoper scoper(mAllPrefixStatement);
>+  mozStorageStatementScoper scoper(mAllPrefixGetStatement);
>+  mozStorageStatementScoper scoperToo(mAllPrefixCountStatement);
>   bool hasMoreData;
>   PRUint32 pcnt = 0;
>   PRUint32 fcnt = 0;
> 
> #if defined(PR_LOGGING)
>   PRIntervalTime clockStart = 0;
>   if (LOG_ENABLED()) {
>     clockStart = PR_IntervalNow();
>   }
> #endif
> 
>-  while (NS_SUCCEEDED(mAllPrefixStatement->ExecuteStep(&hasMoreData)) && hasMoreData) {
>+  // Make sure we allocate no more than we really need, so first
>+  // check how much entries there are
>+  if (NS_SUCCEEDED(mAllPrefixCountStatement->ExecuteStep(&hasMoreData)) && hasMoreData) {
>+    PRUint32 count = mAllPrefixCountStatement->AsInt32(0);
>+    if (!array.SetCapacity(count)) {
>+      return NS_ERROR_OUT_OF_MEMORY;
>+    }
>+  } else {
>+    return NS_ERROR_FILE_CORRUPTED;
>+  }
>+
>+  while (NS_SUCCEEDED(mAllPrefixGetStatement->ExecuteStep(&hasMoreData)) && hasMoreData) {
>     PRUint32 prefixval;
>     PRUint32 domainval;
>     PRUint32 size;
> 
>-    const PRUint8 *blobdomain = mAllPrefixStatement->AsSharedBlob(0, &size);
>+    const PRUint8 *blobdomain = mAllPrefixGetStatement->AsSharedBlob(0, &size);
>     if (!blobdomain || (size != DOMAIN_LENGTH))
>       return false;
> 
>     domainval = *(reinterpret_cast<const PRUint32*>(blobdomain));
> 
>-    const PRUint8 *blobprefix = mAllPrefixStatement->AsSharedBlob(1, &size);
>+    const PRUint8 *blobprefix = mAllPrefixGetStatement->AsSharedBlob(1, &size);
>     if (!blobprefix || (size != PARTIAL_LENGTH)) {
>-      const PRUint8 *blobfull = mAllPrefixStatement->AsSharedBlob(2, &size);
>+      const PRUint8 *blobfull = mAllPrefixGetStatement->AsSharedBlob(2, &size);
>       if (!blobfull || (size != COMPLETE_LENGTH)) {
>         prefixval = domainval;
>         fcnt++;
>       } else {
>         prefixval = *(reinterpret_cast<const PRUint32*>(blobfull));
>       }
>     } else {
>       prefixval = *(reinterpret_cast<const PRUint32*>(blobprefix));
>     }
> 
>     PRUint32 keyedVal;
>     nsresult rv = KeyedHash(prefixval, domainval, aKey, &keyedVal);
>     NS_ENSURE_SUCCESS(rv, rv);
> 
>-    array.AppendElement(keyedVal);
>+    if (!array.AppendElement(keyedVal)) {
>+      return NS_ERROR_OUT_OF_MEMORY;
>+    }

This shouldn't fail because we already called SetCapacity.  Can you assert
false if it fails here?  (You could use MOZ_ASSERT(), which is fatal in debug
builds.)

The DB isn't going to change under us, will it?

>@@ -3618,47 +3639,65 @@ nsresult
> nsUrlClassifierDBServiceWorker::ConstructPrefixSet()
> {
>   Telemetry::AutoTimer<Telemetry::URLCLASSIFIER_PS_CONSTRUCT_TIME> timer;
> 
>   PRUint32 key;
>   nsresult rv = mPrefixSet->GetKey(&key);
>   NS_ENSURE_SUCCESS(rv, rv);
> 
>-  nsTArray<PRUint32> array;
>+  FallibleTArray<PRUint32> array;
>   rv = mMainStore.ReadPrefixes(array, key);
>-  NS_ENSURE_SUCCESS(rv, rv);
>+  if (NS_FAILED(rv)) {
>+    goto error_bailout;
>+  }
> 
> #ifdef HASHFUNCTION_COLLISION_TEST
>   array.Sort();
>   PRUint32 collisions = 0;
>   for (int i = 1; i < array.Length(); i++) {
>     if (array[i - 1] == array[i]) {
>       collisions++;
>     }
>   }
>   LOG(("%d collisions in the set", collisions));
> #endif
> 
>+  if (array.IsEmpty()) {
>+    // DB is empty, put a sentinel to show that we loaded it
>+    array.AppendElement(0);

Array is now fallible; need to check return value.

>+  }
>+  // SetPrefixes requires sorted arrays
>+  array.Sort();
>+
>   // clear old tree
>   rv = mPrefixSet->SetPrefixes(nsnull, 0);
>   NS_ENSURE_SUCCESS(rv, rv);
>-  if (array.IsEmpty()) {
>-    // DB is empty, but put a sentinel to show that we looked
>-    array.AppendElement(0);
>+
>+  // construct new prefixset
>+  rv = mPrefixSet->SetPrefixes(array.Elements(), array.Length());
>+  if (NS_FAILED(rv)) {
>+    goto error_bailout;
>   }
>-  // construct new one
>-  rv = mPrefixSet->SetPrefixes(array.Elements(), array.Length());
>-  NS_ENSURE_SUCCESS(rv, rv);
> 
>   // store the new tree to disk
>   rv = mPrefixSet->StoreToFile(mPSFile);
>   NS_WARN_IF_FALSE(NS_SUCCEEDED(rv), "failed to store the prefixset");
> 
>   return NS_OK;
>+
>+ error_bailout:
>+  // load an empty prefixset so the browser can work
>+  array.Clear();
>+  array.AppendElement(0);
>+  mPrefixSet->SetPrefixes(array.Elements(), array.Length());
>+  if (rv == NS_ERROR_OUT_OF_MEMORY) {
>+    Telemetry::Accumulate(Telemetry::URLCLASSIFIER_PS_OOM, 1);

Don't you want to accumulate (URLCLASSIFIER_PS_OOM, 0) on success?

Maybe on failure we can fire a memory pressure event and then reschedule the
URL classifier refresh to run sooner than in 30 minutes.  I guess running
without it is better than crashing, but it's not a lot better!

You can fire memory pressure with:

  nsCOMPtr<nsIObserverService> os = services::GetObserverService();
  NS_ENSURE_TRUE(os, NS_ERROR_FAILURE);
  os->NotifyObservers(nsnull, "memory-pressure",
                      NS_LITERAL_STRING("low-memory").get());

>+  }
>+  return rv;
> }
>diff --git a/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp b/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
>--- a/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
>+++ b/toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
>@@ -195,21 +195,17 @@ nsUrlClassifierPrefixSet::SetPrefixes(co
>       LOG(("Clearing PrefixSet"));
>       mDeltas.Clear();
>       mIndexPrefixes.Clear();
>       mIndexStarts.Clear();
>       mHasPrefixes = false;
>     }
>   }
>   if (aLength > 0) {
>-    // Ensure they are sorted before adding
>-    nsTArray<PRUint32> prefixes;
>-    prefixes.AppendElements(aArray, aLength);
>-    prefixes.Sort();
>-    AddPrefixes(prefixes.Elements(), prefixes.Length());
>+    AddPrefixes(aArray, aLength);
>   }

Could you add debug-only code which checks that the array is sorted, please?

Also, AddPrefixes can now return NS_ERROR_OOM, so we need to propagate this
error out of SetPrefixes.
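For illustration, a debug-only check along those lines might look like this (a sketch only, not part of the attached patch; parameter names follow the quoted code):

  #ifdef DEBUG
    // Verify the caller honoured the new "input must be sorted" contract.
    for (PRUint32 i = 1; i < aLength; i++) {
      // Fatal in debug builds only.
      MOZ_ASSERT(aArray[i - 1] <= aArray[i]);
    }
  #endif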

>   return NS_OK;
> }
> 
> NS_IMETHODIMP
> nsUrlClassifierPrefixSet::AddPrefixes(const PRUint32 * prefixes, PRUint32 aLength)
> {
>@@ -224,22 +220,28 @@ nsUrlClassifierPrefixSet::AddPrefixes(co
>   mNewIndexPrefixes.AppendElement(prefixes[0]);
>   mNewIndexStarts.AppendElement(mNewDeltas.Length());

These variables aren't members, so their names shouldn't start with 'm'.  Can
you please change that while you're here?

> 
>   PRUint32 numOfDeltas = 0;
>   PRUint32 currentItem = prefixes[0];
>   for (PRUint32 i = 1; i < aLength; i++) {
>     if ((numOfDeltas >= DELTAS_LIMIT) ||
>           (prefixes[i] - currentItem >= MAX_INDEX_DIFF)) {
>-      mNewIndexStarts.AppendElement(mNewDeltas.Length());
>-      mNewIndexPrefixes.AppendElement(prefixes[i]);
>+      if (!mNewIndexStarts.AppendElement(mNewDeltas.Length())) {
>+        return NS_ERROR_OUT_OF_MEMORY;
>+      }
>+      if (!mNewIndexPrefixes.AppendElement(prefixes[i])) {
>+        return NS_ERROR_OUT_OF_MEMORY;
>+      }
>       numOfDeltas = 0;
>     } else {
>       PRUint16 delta = prefixes[i] - currentItem;
>-      mNewDeltas.AppendElement(delta);
>+      if (!mNewDeltas.AppendElement(delta)) {
>+        return NS_ERROR_OUT_OF_MEMORY;
>+      }

Don't you need to change mNewIndexPrefixes and friends from nsTArray to FallibleTArray?

Actually, I'm kind of confused about this whole method: It looks like we copy
mIndexPrefixes/mStarts/mDeltas, modify them, and then discard the originals.

Why don't we just modify the originals?
Comment on attachment 585702 [details] [diff] [review]
Patch 2. Fix double allocation. Reduce peak size. Allow fallible alloc.

(I don't feel comfortable enough with this code to give a formal review, but I'm happy to provide feedback.)
Attachment #585702 - Flags: review?(justin.lebar+bug) → feedback-
>(Unless you want to modify the IID because you changed the contract, requiring
>the input to setPrefixes to be sorted?  I guess that's OK, although it's kind
>of a bummer that none of the methods are commented here.)

Right, the idea is that the interface *did* change. I guess I should move the comments from the .h to the .idl.

>Array is now fallible; need to check return value.

OK. In the error_bailout case we absolutely need it to work, so I guess I should use an nsAutoTArray<PRUint32, 1> there.
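As a rough sketch of that bailout path (names follow the quoted patch; nsAutoTArray keeps its single element in inline storage, so the append cannot hit the allocator):

  // Bailout: load a one-element sentinel prefix set so lookups keep working.
  nsAutoTArray<PRUint32, 1> sentinel;
  sentinel.AppendElement(0);  // inline storage, no heap allocation
  mPrefixSet->SetPrefixes(sentinel.Elements(), sentinel.Length());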

>Don't you want to accumulate (URLCLASSIFIER_PS_OOM, 0) on success?

None of the HISTOGRAM_BOOLEAN users I looked at do this; they only signal the "true" case. I think that's because HISTOGRAM_BOOLEANs are just normal histograms with 2 bins, and not doing anything is the same as accumulating 0. (CC taras)

>Also, AddPrefixes can now return NS_ERROR_OOM, so we need to propagate this
>error out of SetPrefixes.

Oh, f---, missing that is a serious mistake :P

>Actually, I'm kind of confused about this whole method: It looks like we copy
>mIndexPrefixes/mStarts/mDeltas, modify them, and then discard the originals.
>
>Why don't we just modify the originals?

Because they're being accessed by another thread. Note that we don't take the lock at the entry of that function, only just before the pointer swaps.

Now, actually, the way these functions are being used is completely wrong. AddPrefixes was meant to be used to construct the PrefixSet gradually, while not blocking the main thread. But this ended up not being possible, and what is in there right now ends up blocking the main thread anyway because of the SetPrefixes(nsnull) clearing in the callers. SetPrefixes should be folded in with AddPrefixes.
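A rough sketch of the pattern described above - build the replacement arrays off the lock and only publish them under it. This assumes a mozilla::Mutex member (here called mPrefixSetLock, a hypothetical name) and assumes the member arrays end up with the same fallible type; MakePrefixSet is the method Patch 2 later introduces:

  nsresult
  nsUrlClassifierPrefixSet::MakePrefixSet(const PRUint32* aPrefixes, PRUint32 aLength)
  {
    // Build the new index without holding the lock; the old arrays stay
    // readable by the lookup thread the whole time.
    FallibleTArray<PRUint32> indexPrefixes;
    FallibleTArray<PRUint32> indexStarts;
    FallibleTArray<PRUint16> deltas;
    // ... fill the three arrays from aPrefixes, checking every append for OOM ...

    {
      // Only the publish step needs the lock, so readers are blocked just for
      // the duration of the swaps.
      MutexAutoLock lock(mPrefixSetLock);
      mIndexPrefixes.SwapElements(indexPrefixes);
      mIndexStarts.SwapElements(indexStarts);
      mDeltas.SwapElements(deltas);
      mHasPrefixes = (aLength > 0);
    }
    return NS_OK;
  }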
I don't know if it is right or not, but I signal both cases in the boolean telemetry in bug 706164.
gcp's telemetry approach is OK.
Telemetry::Accumulate(Telemetry::URLCLASSIFIER_PS_OOM, rv == NS_ERROR_OUT_OF_MEMORY) would produce more data, but it's not clear that it would be very interesting. Due to limitations in the current telemetry web UI, gcp's existing approach would actually be easier to read.
In most cases it makes sense to report 0/1 because you are counting
a) frequency of something occurring
b) distribution of binary outcome

We are pretty much guaranteed to do url classifier stuff on startup, so recording a) is pointless... existing approach is fine for b)
> existing approach is fine for b)

How do we get this information out of the histogram with the approach currently in the patch?  Aren't we just going to get a graph which says "N out-of-memory events in this time period"?  Then we'd have to correlate this with the number of startups in order to figure out how frequently the OOM occurs?  That's kind of lame.
(In reply to Justin Lebar [:jlebar] from comment #39)
> > existing approach is fine for b)
> 
> How do we get this information out of the histogram with the approach
> currently in the patch?  Aren't we just going to get a graph which says "N
> out-of-memory events in this time period"?  Then we'd have to correlate this
> with the number of startups in order to figure out how frequently the OOM occurs?
> That's kind of lame.

or compare to frequency of another metric. Works fine for tracking extremes.
>We are pretty much guaranteed to do url classifier stuff on startup, so recording 
>a) is pointless... existing approach is fine for b)

What I specifically want is to be able to look at which telemetry submissions had OOM=1, and then compare *all other data* for those submissions to "normal" data. I'm less interested in knowing the frequency than I am in figuring out what's causing it.

We can directly infer that a Telemetry submission that does not have OOM=1 means OOM=0, right? It's one or the other. I understand this is what you mean by (b).
Yup, sounds like your approach is perfect for that. You'll have to get a hadooper to pull the data for you; we don't yet filter by histogram values.
Attachment #585702 - Attachment is obsolete: true
Attachment #585702 - Flags: feedback?(dcamp)
Attachment #586108 - Flags: review?(dcamp)
Attachment #586108 - Flags: feedback?(justin.lebar+bug)
https://tbpl.mozilla.org/?tree=Try&rev=af13fe4bfb0d

Patch 2 implements the review comments. It also gets rid of gradual construction in AddPrefixes and reworks it into MakePrefixSet, which no longer blocks the main thread during PrefixSet construction. This code will disable SafeBrowsing if we run out of memory (and send telemetry back).

Patch 3 implements the fallback to SQLite (instead of disabling) in the latter case. This will make the browser slower, but perhaps it's better than crashing or running with SafeBrowsing disabled. If the user's system is already in a state where we can't allocate a few MB without OOMing, it's probably not going to be "fast" to begin with. Unfortunately, this isn't a trivial change due to the thread interactions.
Attachment #586111 - Flags: review?(dcamp)
Attachment #586111 - Flags: feedback?(justin.lebar+bug)
I added the Mac crash signature.
Crash Signature: unsigned int) | nsTArray<gfxGlyphExtents*, nsTArrayDefaultAllocator>::AppendElements<gfxGlyphExtents*>(gfxGlyphExtents* co... ] → unsigned int) | nsTArray<gfxGlyphExtents*, nsTArrayDefaultAllocator>::AppendElements<gfxGlyphExtents*>(gfxGlyphExtents* co... ] [@ mozalloc_abort | mozalloc_handle_oom | moz_xrealloc]
Depends on: 636113
OS: Windows XP → All
Comment on attachment 586108 [details] [diff] [review]
Patch 2. v2 Fix double allocation. Reduce peak size. Fallible alloc.

Just some nits on the comments; otherwise, it looks sane to me!

> interface nsIUrlClassifierPrefixSet : nsISupports
> {
>+  // Can send an empty Array to clear the tree. A truly "empty tree"
>+  // cannot be represented, so put a sentinel value if that is required
>+  // Requires array to be sorted
>   void setPrefixes([const, array, size_is(aLength)] in unsigned long aPrefixes,
>                    in unsigned long aLength);

Although it may be obvious from the name, please use the first sentence of the
comment to describe what the method does.

Please check the punctuation here (missing periods at the end of the second and third lines).

>+  // Does the PrefixSet contain this prefix? not thread-safe
>   boolean contains(in unsigned long aPrefix);

Please capitalize "not".

Also, if this method is not thread-safe, are the rest of them thread-safe?

>+  // Do a lookup in the PrefixSet
>+  // if aReady is set, we will block until there are any entries
>+  // if not set, we will return in aReady whether we were ready or not
>   boolean probe(in unsigned long aPrefix, in unsigned long aKey,
>                 inout boolean aReady);

Again, please punctuate and capitalize this properly.

>+  // Return a key that is used to randomize the collisions in the prefixes
>   PRUint32 getKey();

Does this return "a" or "the" key used to randomize collisions?
Attachment #586108 - Flags: feedback?(justin.lebar+bug) → feedback+
Comment on attachment 586111 [details] [diff] [review]
Patch 3. Fallback to SQLite lookups at doomsday

Have you used a browser with the SQLite fallback forced-on?  Is it noticeably slower?  Is it usable?
>Have you used a browser with the SQLite fallback forced-on?  Is it noticeably 
>slower?  Is it usable?

I forced it on to test the codepath. I didn't notice anything, but this box has 6G RAM and an SSD, so that's meaningless. I'm obviously not going to see a real OOM on this system either.
It's one async SQLite lookup for each HTML document retrieved, is that right?
Attachment #586111 - Flags: feedback?(justin.lebar+bug) → feedback+
>It's one async SQLite lookup for each HTML document retrieved, is that right?

The lookup is async, but it blocks the page load. Two lookups due to the domain being tried in different permutations.
>The lookup is async, but it blocks the page load. Two lookups due to the domain 
>being tried in different permutations.

And not only for the HTML, but also for the JS and CSS.
>Also, if this method is not thread-safe, are the rest of them thread-safe?

Yes, as implied by the IMPL_THREADSAFE, I think. This method can probably be removed from the XPCOM interface.
OS: All → Windows XP
OS: Windows XP → All
> Yes, as implied by the IMPL_THREADSAFE,

I'd like all the thread-safe/unsafe-ness to be explicitly laid out in the interface comments, if we're going the route of documenting the interface.
Attachment #586108 - Flags: review?(dcamp) → review+
Attachment #586111 - Flags: review?(dcamp) → review+
https://hg.mozilla.org/mozilla-central/rev/a0fb6ed985c6
https://hg.mozilla.org/mozilla-central/rev/9bee22d394a9
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Target Milestone: Firefox 11 → Firefox 12
Attachment #586108 - Flags: approval-mozilla-aurora?
Attachment #586111 - Flags: approval-mozilla-aurora?
This has baked for a few days in Nightly. I no longer see any new crashes. I also don't see any hits on the telemetry that was added, which suggests that reducing the peak memory usage made the problem far less likely to trigger.

- Fixes a topcrasher.
- Reduces peak Firefox memory usage.
- Prevents blocking the main thread during updates.

User impact if declined: occasional OOM crashes, small responsiveness hits every 30-45 minutes
Risk to taking this patch (and alternatives if risky): these fixes are more fundamental and involved, and hence touch more code, than the previous patch.
How many people do you see hitting the SQLite fallback in telemetry?
>How many people do you see hitting the SQLite fallback in telemetry?

>> I also don't see any hits on the telemetry that was added,...
(In reply to Gian-Carlo Pascutto (:gcp) from comment #56)
> - Fixes a topcrasher.

Sorry if this is hidden in the comments already - did the previously landed patches for aurora/beta not mitigate the issue?
The first patch protects against a potential corruption case, but data from the field shows that it is not the cause of the problem, as the number of crashes didn't drop. (There's no reason to back out the additional error protection though, so patches 2 and 3 build on it.)
Comment on attachment 586108 [details] [diff] [review]
Patch 2. v2 Fix double allocation. Reduce peak size. Fallible alloc.

[Triage Comment]
Approved for Aurora - we're going to take the hit on beta given the fact that this carries non-zero risk.
Attachment #586108 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #586111 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I know why the telemetry isn't triggering:

https://crash-stats.mozilla.com/report/index/1d2909f4-3a8f-461e-9948-0c0772120118

We're getting an OOM abort when a FallibleTArray is calling SetCapacity. This appears to be in violation of SetCapacity's documented behavior. So this looks like a bug in nsTArray or in our memory allocation, not in the code fixed here.
Depends on: 719531
It's currently #9 top crasher in the first day of 10.0b5.
I've looked at a bunch of crash reports, and they mostly look like:

"TotalVirtualMemory":     "2147352576"
"AvailableVirtualMemory": "1848963072", 
"SystemMemoryUsePercentage": "90"

"TotalVirtualMemory":     "2147352576"
"AvailableVirtualMemory": "1977462784", 
"SystemMemoryUsePercentage": "80"

These people have plenty of free RAM.
In the second case, we're reporting that the machine is using 162mb of virtual memory.  But on my Windows 7 machine, Firefox uses more than 200mb of vmem after I start up and load about:memory.

Are these crashes happening during startup?
Crash Signature: unsigned int) | nsTArray<gfxGlyphExtents*, nsTArrayDefaultAllocator>::AppendElements<gfxGlyphExtents*>(gfxGlyphExtents* co... ] [@ mozalloc_abort | mozalloc_handle_oom | moz_xrealloc] → unsigned int) | nsTArray<gfxGlyphExtents* nsTArrayDefaultAllocator>::AppendElements<gfxGlyphExtents*>(gfxGlyphExtents* co... ] [@ mozalloc_abort | mozalloc_handle_oom | moz_xrealloc] [@ mozalloc_abort | mozalloc_handle_oom | moz_xrealloc | nsTArray_base…
>In the second case, we're reporting that the machine is using 162mb of virtual 
>memory.  But on my Windows 7 machine, Firefox uses more than 200mb of vmem after I 
>start up and load about:memory.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366589%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366770%28v=vs.85%29.aspx

This is reporting the address space of the Firefox process. So what it's saying is that Firefox is using 162Mb there.

To really see the free memory at OOM, we should log ullAvailPageFile. It's true we don't really know how much free RAM there is.
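For reference, a minimal sketch of reading those fields with the standard Win32 GlobalMemoryStatusEx call (nothing Firefox-specific here):

  #include <windows.h>
  #include <stdio.h>

  int main()
  {
    MEMORYSTATUSEX stat;
    stat.dwLength = sizeof(stat);
    if (GlobalMemoryStatusEx(&stat)) {
      // Address space of this process: what the crash reports call
      // Total/AvailableVirtualMemory.
      printf("TotalVirtualMemory:     %llu\n", stat.ullTotalVirtual);
      printf("AvailableVirtualMemory: %llu\n", stat.ullAvailVirtual);
      // Commit space still available system-wide (RAM + pagefile); this is
      // the ullAvailPageFile value suggested above.
      printf("AvailablePageFile:      %llu\n", stat.ullAvailPageFile);
      // What the reports call SystemMemoryUsePercentage.
      printf("MemoryLoad:             %lu%%\n", stat.dwMemoryLoad);
    }
    return 0;
  }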

>Are these crashes happening during startup?

They're all over the place.
> This is reporting the address space of the Firefox process. So what it's saying is that Firefox is 
> using 162Mb there.

Correct; I should have said "*Firefox* is using 162mb of virtual memory."

What I'm saying is: I've never seen Firefox use so little virtual memory, except, presumably, while it's in the process of starting up.

Are specifically the two crashes you cited startup crashes?  (You should be able to tell by the uptime figure.)
>Are specifically the two crashes you cited startup crashes?  (You should be able 
>to tell by the uptime figure.)

This is the second one:

https://crash-stats.mozilla.com/report/index/ca4a7e4f-33ad-4394-9e9e-f31e32120121

7 seconds, so pretty much. Sampling a few with higher uptimes shows correspondingly more memory in use. This all looks completely normal. Which is creepy.
What I think we should do is:

 * Add AvailablePagefile, like you suggested, and see what that reports.
 * If we're seeing very low numbers there, consider enabling MALLOC_PAGEFILE in jemalloc for Windows, which will let jemalloc swap even when the machine is out of swap.
 * In parallel, fix bug 716638, which will let us report how much memory is being requested with malloc.  (It's still conceivable that we're somehow requesting way too much memory with malloc.)

gcp, do you mind filing a bug on the first item there?
Depends on: 720444
The new code from bug 673470 has a path that should've used a fallible array but didn't. Patch attached.

After Bug 719531 landed, I now see URLCLASSIFIER_OOM Telemetry hits. So we're still OOMing here in some cases. It's too bad that not crashing here makes us unable to use the extra information Bug 720444 or Bug 716638 would bring. This is a top-20 crasher in 10.0, so I would've liked to understand the underlying issue.

Taras, does it make sense to add Telemetry for what Bug 720444 gathers? We could then look at the Telemetry submissions with OOM=1 and see if they really are low on memory. I think reordering the Telemetry recording (done in the patch) to before the memory allocation will give us information on the allocation size, similar to what bug 716638 does.
Attachment #595994 - Flags: review?(dcamp)
Attachment #595994 - Flags: feedback?(justin.lebar+bug)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It's hard to follow with the tracking flags, as it's marked fixed in 11 and 12 but not in 13.
But according to comments:
* patch 1 landed in Fx 9
* patch 2 & 3 landed in Fx 11
* patch 4 will land in Fx 13
Correct. The code fixed here (in ff9<->ff12) was replaced in Firefox 13, and the new code was missing a fix, so it could potentially crash again with this or a similar signature.
Target Milestone: Firefox 12 → Firefox 13
The problem is in Firefox 13, not in the others. They don't even have the files the patch applies to.
This new patch needs to be in a separate bug, please.  Too many landings here.
Attachment #595994 - Attachment is obsolete: true
Attachment #595994 - Flags: review?(dcamp)
Attachment #595994 - Flags: feedback?(justin.lebar+bug)
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Target Milestone: Firefox 13 → Firefox 12
It's currently #4 top browser crasher in 10.0.5 ESR.
(In reply to Scoobidiver from comment #76)
> It's currently #4 top browser crasher in 10.0.5 ESR.

What does the crash data look like for Firefox 11 and 12? Given that QA does not have a reproducible case for verification, and assuming 11 and 12 show no crashes, it might be worth considering landing this fix on ESR10.

Follow-up question: do we let this fix ride to ESR17, fix it in time for a normally scheduled 10.0.6 ESR release, or fix it for a quick 10.0.6 ESR release?
This likely needs bug 719531 to land too to be effective.
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #77)
> it might be worth considering landing this fix on ESR10.

> Follow-up question, do we let this fix ride to ESR17, fix in time for a
> normally scheduled 10.0.6 ESR release, or fix for a quick 10.0.6 ESR release?

Sure thing. A stability fix for ESR is valid, and we can take this for the next ESR (10.0.6) instead of waiting until ESR17 - we won't do a chemspill ESR for this, though. Please go ahead and nominate it for ESR landing.
We're still waiting for a landing for this patch and the one in bug 719531 on ESR to fix this top ESR crasher.
I guess I'll nom the patches, then...
Comment on attachment 578214 [details] [diff] [review]
Patch 1. Sanity check prefixes size. Pass down error codes.

esr10? per comment 80.
Attachment #578214 - Flags: approval-mozilla-esr10?
Attachment #586108 - Flags: approval-mozilla-esr10?
Attachment #586111 - Flags: approval-mozilla-esr10?
I sent some email on 2012-07-22 to point out that I didn't think I could land this. Since it seems to have been missed, I'll repost it here:

This bug depends on Bug 719531 to have any effect. I took a look at
applying the patches to the mozilla-esr10 tree. Bug 719531 applies
pretty cleanly, but bug 702217 needed some rebasing (though nothing that
looked particularly worrying).

However, the result doesn't compile because it depends on
HISTOGRAM_BOOLEAN (and potentially other Telemetry changes) which are
not in the esr10 tree.

Now, I could go and fix that up as well, but that's already 2 layers of
"new" rebasing changes. And ESR releases do not go through
nightly-aurora-beta as far as I know, so making new patches would be
releasing untested code.
Oh, I remember that e-mail now.  I'm sorry I forgot it, Gian-Carlo.
Attachment #578214 - Flags: approval-mozilla-esr10? → approval-mozilla-esr10+
Attachment #586108 - Flags: approval-mozilla-esr10? → approval-mozilla-esr10+
Attachment #586111 - Flags: approval-mozilla-esr10? → approval-mozilla-esr10+
Thanks for the clarification, Gian-Carlo. If we don't have the Telemetry code in esr10, then I'm going to wontfix this, since it doesn't meet ESR criteria anyway (and will also wontfix bug 719531 for ESR). The next ESR (17) will have this included.
Attachment #578214 - Flags: approval-mozilla-esr10+
Attachment #586108 - Flags: approval-mozilla-esr10+
Attachment #586111 - Flags: approval-mozilla-esr10+
Product: Firefox → Toolkit