Bug 726002 - Potential OOM in new UrlClassifier.
Status: RESOLVED FIXED
Product: Toolkit
Classification: Components
Component: Safe Browsing
Version: 13 Branch
Platform: All All
Importance: -- normal
Target Milestone: Firefox 17
Assigned To: Gian-Carlo Pascutto [:gcp]
Reported: 2012-02-10 07:57 PST by Gian-Carlo Pascutto [:gcp]
Modified: 2014-05-27 12:25 PDT


Attachments
Patch 4. Fix a potential OOM in the new code. Reorder telemetry. (2.72 KB, patch)
2012-02-10 07:57 PST, Gian-Carlo Pascutto [:gcp]
dcamp: review+
Ms2ger: review-
justin.lebar+bug: feedback+
Patch 5. Clear some nsTArrays as quickly as possible (5.70 KB, patch)
2012-02-11 09:25 PST, Gian-Carlo Pascutto [:gcp]
dcamp: review+
Patch 6. Followup for wrong return value. (974 bytes, patch)
2012-02-13 06:14 PST, Gian-Carlo Pascutto [:gcp]
no flags
Patch 6. Followup for wrong return value. (1.50 KB, patch)
2012-02-13 06:25 PST, Gian-Carlo Pascutto [:gcp]
bugs: review+
Patch 1. Fix a potential OOM in the new code. Reorder telemetry. (2.81 KB, patch)
2012-02-13 09:02 PST, Gian-Carlo Pascutto [:gcp]
dcamp: review+
Patch 2. Reduce max memory by freeing arrays as early as possible. (5.80 KB, patch)
2012-02-13 09:07 PST, Gian-Carlo Pascutto [:gcp]
dcamp: review+
Patch. Backout part 1 (2.72 KB, patch)
2012-04-19 09:20 PDT, Gian-Carlo Pascutto [:gcp]
akeybl: approval‑mozilla‑aurora+
mark.finkle: approval‑mozilla‑central+
Patch. Backout part 2 (5.69 KB, patch)
2012-04-19 09:21 PDT, Gian-Carlo Pascutto [:gcp]
akeybl: approval‑mozilla‑aurora+
mark.finkle: approval‑mozilla‑central+

Description Gian-Carlo Pascutto [:gcp] 2012-02-10 07:57:33 PST
Created attachment 596051
Patch 4. Fix a potential OOM in the new code. Reorder telemetry.

The new code from bug 673470 has a path that should've used a fallible array but didn't. Patch attached.

After Bug 719531 landed, I now see URLCLASSIFIER_OOM Telemetry hits. So we're still OOMing here in some cases. Unfortunately, not crashing here means we can't use the extra information that Bug 720444 or Bug 716638 would bring. This is a top 20 crasher in 10.0, so I would've liked to understand the underlying issue.

Taras, does it make sense to add Telemetry for what Bug 720444 gathers? We could then look at the Telemetry submissions with OOM=1 and see if they are really low on memory. I think recording the Telemetry before the memory allocation (the reordering done in the patch) will give us information on the allocation size, similar to what bug 716638 does.
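
For readers unfamiliar with the pattern, here is a minimal sketch of a fallible allocation with the size recorded before the allocation is attempted. It uses std::vector rather than nsTArray, and TryReserve, PrepareArray and BuildStats are hypothetical names invented for illustration; this is not the code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a fallible reservation: reports failure to the
// caller instead of letting the allocation failure propagate.
bool TryReserve(std::vector<uint32_t>& aArray, std::size_t aCount) {
  try {
    aArray.reserve(aCount);
  } catch (const std::bad_alloc&) {
    return false;  // the allocator could not provide the memory
  } catch (const std::length_error&) {
    return false;  // the request exceeds what the container can ever hold
  }
  return true;
}

// Hypothetical stand-in for the telemetry probes discussed above.
struct BuildStats {
  std::size_t requestedEntries = 0;
  bool oom = false;
};

bool PrepareArray(std::vector<uint32_t>& aArray, std::size_t aCount,
                  BuildStats& aStats) {
  // Record the requested size first, so it is available even if the
  // allocation below fails.
  aStats.requestedEntries = aCount;
  if (!TryReserve(aArray, aCount)) {
    aStats.oom = true;
    return false;
  }
  assert(aArray.capacity() >= aCount);
  return true;
}
```

With this ordering, a report that has oom set also carries the size of the request that failed.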
Comment 1 Justin Lebar (not reading bugmail) 2012-02-10 08:02:20 PST
> Taras, does it make sense to add Telemetry for what Bug 720444 gathers?

Problem is, if we collect this data when telemetry ping runs, it's not necessarily going to reflect the state of things when we ran out of memory here.

Why don't we, in a separate bug, temporarily turn all the relevant fallible TArrays into infallible arrays and then look at the resulting crash reports?
Comment 2 Gian-Carlo Pascutto [:gcp] 2012-02-10 08:23:02 PST
>Problem is, if we collect this data when telemetry ping runs, it's not necessarily 
>going to reflect the state of things when we ran out of memory here.

Good point.

What about making the OOM not a boolean, but transmit the free-RAM value instead? (I'd guess that will confuse our backend right now, so it'd have to be renamed)

>Why don't we, in a separate bug, temporarily turn all the relevant fallible 
>TArrays into infallible arrays and then look at the resulting crash reports?

There aren't that many OOMs (I see 4 pings so far, which might have been from 1 person for all we know), so we might have to do this on Aurora, and maybe even Beta, then back out again.

That's not very pleasant.
Comment 3 Justin Lebar (not reading bugmail) 2012-02-10 08:31:44 PST
> What about making the OOM not a boolean, but transmit the free-RAM value instead? (I'd guess that 
> will confuse our backend right now, so it'd have to be renamed)

We could do that, but we'd lose the stack trace and data on exactly which TArray operation was causing the failure.  (I guess you could add one histogram for each possible failure point.)

I guess we could send the available-commit-space number and investigate further only if it's abnormally high...
Comment 4 Justin Lebar (not reading bugmail) 2012-02-10 09:26:24 PST
Comment on attachment 596051
Patch 4. Fix a potential OOM in the new code. Reorder telemetry.

This looks sane, although I really don't know this code.  Note that mCompletions (used in LookupCache::Build) is an *infallible* TArray.

gcp and I discussed on IRC that it might be possible to reduce the working memory of LookupCache::Build by destroying aAddCompletes immediately after constructing mCompletions.

Indeed, you could go further than that and free aAddCompletes *while* you build mCompletions.  Add an item, remove the corresponding item, and periodically call TArray::Compact().  You could do the same thing in ConstructPrefixSet.
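
A rough sketch of that idea, assuming std::vector in place of nsTArray and shrink_to_fit() in place of TArray::Compact(). DrainInto and aCompactEvery are hypothetical names, and the real LookupCache::Build is not shown in this bug.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Moves every element of aSource into aDest while handing aSource's storage
// back in steps, so the two arrays are never both at full size.
// Elements are taken from the back (cheap removal), so aDest receives them in
// reverse order; reverse or sort afterwards if order matters.
void DrainInto(std::vector<uint32_t>& aSource, std::vector<uint32_t>& aDest,
               std::size_t aCompactEvery) {
  assert(&aSource != &aDest);  // draining an array into itself never ends
  std::size_t moved = 0;
  while (!aSource.empty()) {
    aDest.push_back(aSource.back());
    aSource.pop_back();
    // Periodically ask the source to release its unused capacity.
    // shrink_to_fit() is a non-binding request in standard C++.
    if (aCompactEvery != 0 && ++moved % aCompactEvery == 0) {
      aSource.shrink_to_fit();
    }
  }
  aSource.shrink_to_fit();
}
```

Compacting too often costs time in reallocation and copying, so the interval is a trade-off between peak memory and speed.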
Comment 5 Gian-Carlo Pascutto [:gcp] 2012-02-11 09:25:56 PST
Created attachment 596341
Patch 5. Clear some nsTArrays as quickly as possible

This basically implements the easiest part of jlebar's suggestion. And I found another place where we keep a temporary big array longer than needed.

Will deal with OOM telemetry in a separate bug.

https://tbpl.mozilla.org/?tree=Try&rev=3fadf2578425
Comment 6 :Ms2ger (⌚ UTC+1/+2) 2012-02-13 04:41:10 PST
Comment on attachment 596051
Patch 4. Fix a potential OOM in the new code. Reorder telemetry.

Funnily enough, nsTArray::SetCapacity returns a boolean. Please back out.
Comment 7 Gian-Carlo Pascutto [:gcp] 2012-02-13 06:14:38 PST
Created attachment 596667
Patch 6. Followup for wrong return value.

This is a trivial fix - let's do a followup instead.
Comment 8 Gian-Carlo Pascutto [:gcp] 2012-02-13 06:25:25 PST
Created attachment 596669
Patch 6. Followup for wrong return value.

It'll be better if the fix actually compiles.

Previous patches already landed in inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e0e2cc5570ac
https://hg.mozilla.org/integration/mozilla-inbound/rev/91eed468cff8
Comment 9 :Ms2ger (⌚ UTC+1/+2) 2012-02-13 07:50:16 PST
Are you landing or not?
Comment 10 Gian-Carlo Pascutto [:gcp] 2012-02-13 08:01:35 PST
The second fix also doesn't compile (I'm glad I checked before landing...). I'm going to back out.

(error_bailout assumes rv contains the error value, so the SetCapacity check should set it before jumping there; that's *another* error.)
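
To illustrate the two errors, here is a minimal sketch of the pattern, assuming a boolean-returning capacity call and a shared bailout label that returns rv. Result, SetCapacityFallible and BuildTable are hypothetical stand-ins built on std::vector, not the actual UrlClassifier code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for nsresult.
enum class Result { Ok, OutOfMemory };

// Hypothetical stand-in for a fallible SetCapacity: failure is a bool,
// not an error code.
bool SetCapacityFallible(std::vector<uint32_t>& aArray, std::size_t aCount) {
  try {
    aArray.reserve(aCount);
  } catch (const std::bad_alloc&) {
    return false;
  } catch (const std::length_error&) {
    return false;
  }
  return true;
}

Result BuildTable(std::size_t aCount, std::vector<uint32_t>& aOut) {
  Result rv = Result::Ok;
  aOut.clear();

  // The call returns a bool, so it must be tested as one, and rv must be set
  // explicitly: the bailout path returns whatever rv holds at that point.
  if (!SetCapacityFallible(aOut, aCount)) {
    rv = Result::OutOfMemory;
    goto error_bailout;
  }

  for (std::size_t i = 0; i < aCount; ++i) {
    aOut.push_back(static_cast<uint32_t>(i));
  }
  assert(aOut.size() == aCount);
  return Result::Ok;

error_bailout:
  aOut.clear();
  return rv;
}
```

If the assignment to rv is left out, the bailout path would report Ok for a failed allocation.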
Comment 12 Gian-Carlo Pascutto [:gcp] 2012-02-13 09:02:21 PST
Created attachment 596697
Patch 1. Fix a potential OOM in the new code. Reorder telemetry.
Comment 13 Gian-Carlo Pascutto [:gcp] 2012-02-13 09:07:03 PST
Created attachment 596699
Patch 2. Reduce max memory by freeing arrays as early as possible.

https://tbpl.mozilla.org/?tree=Try&rev=b25710891998

I also did some manual checking that this doesn't break the protection. Our testsuite doesn't actually exercise the "update online from Google" path, and I think the first patch here actually did break that. Good that it was caught...
Comment 16 Gian-Carlo Pascutto [:gcp] 2012-04-19 09:20:32 PDT
Created attachment 616611
Patch. Backout part 1

[Approval Request Comment]
Backout due to bug 744993:

a01cf079ee0b Bug 730247
1a6d008acb4f Bug 729928
f8bf3795b851 Bug 729640
35bf0d62cc30 Bug 726002
a010dcf1a973 Bug 726002
e9291f227d63 Bug 725597
db52b4916cde Bug 673470
173f90d397a8 Bug 673470
Comment 17 Gian-Carlo Pascutto [:gcp] 2012-04-19 09:21:04 PDT
Created attachment 616612
Patch. Backout part 2
Comment 20 Alex Keybl [:akeybl] 2012-04-24 07:11:19 PDT
Comment on attachment 616611
Patch. Backout part 1

Sorry - thought this bug was reopened because the backout was backed out. My mistake. Approved for Aurora 13 (or Beta 13 if the merge occurs before we land).
