Bug 820602 - areweslimyet.com shows ~40MB regression (~11%) in leaked windows
Status: RESOLVED FIXED
Whiteboard: [MemShrink:P1]
Keywords: regression
Product: Core
Classification: Components
Component: ImageLib
Version: Trunk
Platform: All, OS: All
Importance: -- normal with 4 votes
Target Milestone: mozilla21
Assigned To: Seth Fowler [:seth] [:s2h]
Duplicates: 828748
Depends on: 842756
Blocks: 815471
Reported: 2012-12-11 15:09 PST by Benoit Girard (:BenWa)
Modified: 2013-03-26 05:44 PDT (History)
CC: 37 users


Attachments
Remove change to decoder notification behavior that causes AWSY leak. (2.51 KB, patch)
2013-01-25 18:52 PST, Seth Fowler [:seth] [:s2h]
joe: review+
bajaj.bhavana: approval-mozilla-aurora+

Description Benoit Girard (:BenWa) 2012-12-11 15:09:37 PST
https://areweslimyet.com/

Note the last graph that gives a breakdown.
Comment 2 Nicholas Nethercote [:njn] 2012-12-11 15:16:31 PST
Details are wonderful things.

The increase is in the blue, green, yellow and pink lines.

There's a ~15 MiB spike on December 7.  The changeset for that run is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3bb2d53dde34&tochange=8432562db685.  Of those, https://hg.mozilla.org/mozilla-central/rev/c98fd0f8f3f3 (sfink) and https://hg.mozilla.org/mozilla-central/rev/d96db52bedc4 (billm) caught my eye.

johns said he would trigger additional runs to narrow down the regression range for the big spike.

Apart from that there's just been a gradual rise, which is a nightmare to debug.
Comment 3 Bill McCloskey (:billm) 2012-12-11 15:24:55 PST
I doubt that Steve's push caused this. My change might have, although it would be weird.
Comment 4 Benoit Girard (:BenWa) 2012-12-18 09:02:17 PST
This got significantly worse today :(. We're now looking at a 40MB regression.
Comment 5 John Schoenick [:johns] 2012-12-18 11:07:42 PST
Second regression range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2e70b718903a&tochange=dd277d439d31

I queued this range for per-changeset tests as well. The first range has completed tests, but because AWSY uses the push timestamp, all the builds have the same timestamp and are impossible to distinguish on the graph. I'm fixing that now so this data can actually be useful.
Comment 6 Nicholas Nethercote [:njn] 2012-12-18 14:46:20 PST
(In reply to Benoit Girard (:BenWa) from comment #4)
> This got significantly worse today :(. We're now looking at a 40MB
> regression.

Please file a separate bug.  Bonus points for including details like those in comment 2 (when it happened, which measurements were affected, etc).  Thanks.
Comment 7 Nicholas Nethercote [:njn] 2012-12-26 14:09:54 PST
johns:  any progress on bisection here?  It'd be great to work out what happened in both these cases.  Thanks.
Comment 8 Benoit Girard (:BenWa) 2013-01-08 11:39:13 PST
The regression is now at 40MB and is on aurora.
Comment 9 Olli Pettay [:smaug] (way behind * queues, especially ni? queue) 2013-01-09 17:10:34 PST
*** Bug 828748 has been marked as a duplicate of this bug. ***
Comment 10 Nicholas Nethercote [:njn] 2013-01-09 19:22:54 PST
The resident numbers were all over the shop in December.  In contrast, the explicit numbers were more stable, with a clear jump on December 18, and it looks like it's JS at fault.  So let's focus on that.

johns is working on AWSY to make bisecting 200+ changesets faster and simpler.
Comment 11 Brendan Eich [:brendan] 2013-01-10 15:44:56 PST
Bill: can you take assignment of this bug or assign it to someone? Thanks,

/be
Comment 12 Bill McCloskey (:billm) 2013-01-10 16:00:42 PST
John said he would take assignment of this until we get the regression ranges.
Comment 13 John Schoenick [:johns] 2013-01-14 20:11:36 PST
Okay, I updated http://areweslimyet.com a bit to make it easier to visualize this. If you click on any datapoint representing multiple builds, it will bring up a list of those builds with deltas between each, with a [view] button to load each one's about:memory dump.

The most suspicious commit is on Dec 18th:
https://hg.mozilla.org/mozilla-central/rev/b0c4f68563c3

Which shows a spike of ~41M in most lines. The commits immediately before and after it are fairly stable.

The second spike of interest is on December 8, where we regress by about 18M. We don't have per-changeset resolution here, but the spike looks to be in this range:

8a0c82cf76ab	Saurabh Anand — Backed out changeset cdb687853418 (bug 817785)
75e0d521bc2b	Ian Stakenvicius — Bug 809430 - Add symbol-versions to libxul and libmozjs for linux. r=glandium
d5179738abfe	Peter Van der Beken — Fix for bug 818219 (Replace HTMLElement quickstubs with new binding methods). r=bz.
7d70a78e4d5f	Jim Blandy — Bug 815045: Delete JOF_TYPE_IS_EXTENDED_JUMP; it is unused. Fix comment in js::frontend::AddToSrcNoteDelta. r=luke
144a65ad3059	Jim Blandy — Bug 815468: BaseProxyHandler::iteratorNext is never used; delete it. r=luke
cdb687853418	Saurabh Anand — Bug 817785 - Split out WebGLProgram into separate files

(The last commit in this range backs out the first, so those aren't likely)

Interestingly, though, the *resident* memory seems to go back down somewhere in this range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=718145f0d606&tochange=1cc19f36ee66

While explicit remains regressed. We don't have high resolution data for this range at all, and the tests are fairly noisy.

I've queued a ton[1] of builds to get per-changeset resolution on the remaining spots, as well as to retest the range around b0c to rule out very-unlucky noise. I'll update here when those are done, though it looks like this is two separate bugs.

[1] http://areweslimyet.com/status/
Comment 14 Nicholas Nethercote [:njn] 2013-01-14 20:24:45 PST
 > The most suspicious commit is on Dec 18th:
> https://hg.mozilla.org/mozilla-central/rev/b0c4f68563c3

Seth, Joe:  this is your change.  Could it have affected JS memory consumption?
Comment 15 Seth Fowler [:seth] [:s2h] 2013-01-14 21:54:53 PST
Quite odd that it would have affected JS memory consumption. I could easily imagine there being leaks hidden in any large patch, but I (perhaps naively) wouldn't have expected this patch to cause a JS memory leak.

Being still new to this codebase, I don't know enough about what is being measured here; is this specifically JS-level allocation? Or could it be things like wrappers around native objects? Could misuse of nsCOMPtrs and such show up as a JS memory leak if the objects are accessible from JavaScript?
Comment 16 Boris Zbarsky [:bz] (still a bit busy) 2013-01-14 22:24:30 PST
It's quite possible that the HTMLElement change would lead to larger JS memory consumption.  Instead of jspropertyops it's using JSNative getters and setters, which means more function objects and whatnot....
Comment 17 Andrew McCreight [:mccr8] 2013-01-14 22:41:37 PST
That should be easy enough to check looking at the about:memory log stuff that AWSY produces.
Comment 18 John Schoenick [:johns] 2013-01-15 14:22:02 PST
Looking at this more:

b0c4f68563c3 landed previously as
https://hg.mozilla.org/mozilla-central/rev/6066a23ba701
... which caused a ~48Meg spike, but was backed out by
https://hg.mozilla.org/mozilla-central/rev/47bd1f6fd8ed
... which dropped memory back to previous levels

So Bug 815471 is definitely the culprit here. Comparing the memory dumps (click the 'view' button next to any changeset on AWSY) shows that we are probably leaking windows. This explains why several values went up, including JS:

http://people.mozilla.com/~jschoenick/snapshot-1358288220.png

I will open a separate bug for the other regression once it's been narrowed down more
Comment 19 Seth Fowler [:seth] [:s2h] 2013-01-15 14:26:33 PST
Will look into this.
Comment 20 John Schoenick [:johns] 2013-01-15 14:31:23 PST
(In reply to John Schoenick [:johns] from comment #18)
> http://people.mozilla.com/~jschoenick/snapshot-1358288220.png

To be clearer about what's going on in that screenshot: in the "After TP5 [+30s, forced GC]" checkpoint, we previously would have a handful of active windows. After bug 815471, we also have a number of cached windows, which should not be around after waiting 30s and forcing a GC.
Comment 21 Seth Fowler [:seth] [:s2h] 2013-01-17 18:02:14 PST
So far I've had no luck reproducing this locally. John, can you share the exact procedure you used? I built a version of Firefox from mozilla-central at 116228:38889f86a465 (the last of my set of patches that you said eliminated the problem when they were backed out). I then ran a local install of Talos on the Tp5n pageset linked from |https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5|.

It wasn't obvious how to make Talos pause at the end of the run so I could check about:memory, so I added a "% wait.html" test at the end of the pageset, and then checked about:memory in a new tab. This may mean I was checking before the forced GC, but even so, in that case I'd expected _worse_ results, not better.

It's a bit difficult to fix something I can't measure, so I'd be very grateful for any help you can offer. Thanks!
Comment 22 John Schoenick [:johns] 2013-01-17 18:49:55 PST
(In reply to Seth Fowler [:seth] from comment #21)
> So far I've had no luck reproducing this locally. John, can you share the
> exact procedure you used? I built a version of Firefox from mozilla-central
> at 116228:38889f86a465 (the last of my set of patches that you said
> eliminated the problem when they were backed out). I then ran a local
> install of Talos on the Tp5n pageset linked from
> |https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5|.
> 
> It wasn't obvious how to make Talos pause at the end of the run so I could
> check about:memory, so I added a "% wait.html" test at the end of the
> pageset, and then checked about:memory in a new tab. This may mean I was
> checking before the forced GC, but even so, in that case I'd expected
> _worse_ results, not better.
> 
> It's a bit difficult to fix something I can't measure, so I'd be very
> grateful for any help you can offer. Thanks!

AWSY is using the TP5 pageset, but not talos directly. What it's doing is loading the full TP5 pageset into 30 tabs (so, overwriting the oldest tab for pages 31+). It then sits for 30 seconds, forces a GC, closes all tabs, sits for thirty seconds, forces another GC, and then repeats five times. You can see the fairly simple mozmill script that does this here:
https://github.com/Nephyrin/MozAreWeSlimYet/blob/master/mozmill_endurance_test/test1.js

If you want to run AWSY locally, the test looks for the TP5 pageset at http://localhost:8001 through http://localhost:8100 (an inelegant way to force separate domains/compartments for each page). The tester accomplishes this by just including each port in its nginx.conf:

>  listen 8004;
>  listen 8005;
>  listen 8006;
>  ...

If you can connect to the MV VPN you can also just replace localhost with albus.mv.mozilla.com in test1.js, which is already hosting TP5 on those ports.

To actually run a test you need mozmill installed (pip install mozmill --user) and the build you want to test

> ./run_slimtest.py -b path/to/firefox --sqlitedb mydatabase.sqlite --buildname <commit> --buildtime <commit timestamp>

The default test takes about 90 minutes to run, though, so you may want to do this in a VNC server:
> vncserver :9
> DISPLAY=:9 ./run_slimtest.py ...
> vncserver -kill :9

You can also edit slimtest_config.py to get an abbreviated test (though we don't know at what point the regression manifests). Reducing the per-tab timeout from 10s will greatly speed up the test, but if you set it to zero GC scheduling behaves very badly and you get erratic results. For local testing I usually set the tab timeout to a few seconds and two iterations. This reduces the test to ~4 minutes while staying somewhat similar to the full test.
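As a concrete illustration of the abbreviated-test tweak described above, an edit to slimtest_config.py might look like the sketch below; the variable names here are guesses for illustration only, so check the actual file in the AWSY repository.

```python
# Hypothetical sketch of an abbreviated-test tweak to slimtest_config.py.
# The real variable names may differ; check the file in the AWSY repository.
PER_TAB_PAUSE_SECONDS = 2   # per-tab timeout: full test uses 10s; 0 makes GC scheduling erratic
ITERATIONS = 2              # open/close cycles: full test uses 5
```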

To export data from the sqlitedb to a more useful format:
> mkdir data
> ./create_graph_json.py mydatabase.sqlite my_data ./data
> ./merge_graph_json.py my_data ./data  # (only required if you want to launch the website locally to view the data)

This will give you fairly straightforward JSON files with the memory dumps from each checkpoint. You can also start the AWSY site on a local webserver and use it to browse the data. (It looks for the json files in ./html/data/; if your webserver of choice doesn't support transparent .gz files, just uncompress the json first.)

Feel free to ping :johns on IRC if you run into any issues with this
Comment 23 Nicholas Nethercote [:njn] 2013-01-17 19:37:30 PST
Seth, is this a difficult patch to back out -- i.e. has other stuff landed on top of it?  Because at this point I think backing it out is the default option.
Comment 24 Seth Fowler [:seth] [:s2h] 2013-01-18 14:16:46 PST
Unfortunately this is a very difficult patch to back out. Let's give it a while before we start considering that. Major refactoring has gone on in imagelib recently and this is unfortunately nearly at the bottom of that stack of patches. (Wouldn't you know it?)

As of now this is officially my top priority. Hopefully a fix will be in next week and we won't need to worry about backing out.
Comment 25 Seth Fowler [:seth] [:s2h] 2013-01-18 14:18:52 PST
(In reply to John Schoenick [:johns] from comment #22)
> Feel free to ping :johns on IRC if you run into any issues with this

Thanks so much John! I really appreciate you taking the time to write this up. This will help a lot.
Comment 26 Seth Fowler [:seth] [:s2h] 2013-01-18 18:36:25 PST
OK, I've been able to run the AWSY tests locally and reproduced the leak in commit 6066a23ba701. Great. I'll try to reduce this to a minimal test case and hopefully track down the problem soon.
Comment 27 Nicholas Nethercote [:njn] 2013-01-20 14:50:27 PST
Hopefully you'll find the problem soon.  Because shipping a change that causes a 15% regression in our primary memory consumption benchmark *simply isn't acceptable*.
Comment 28 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2013-01-20 16:05:33 PST
If you need help you know where to find me.
Comment 29 Peter Van der Beken [:peterv] 2013-01-21 01:21:25 PST
(In reply to John Schoenick [:johns] from comment #13)
> The most suspicious commit is on Dec 18th:
> https://hg.mozilla.org/mozilla-central/rev/b0c4f68563c3

I'm probably missing something, but I have a question about this changeset, in particular this:

+      mImage = ImageFactory::CreateImage(aRequest, mStatusTracker.forget(), mContentType,
+                                         mURI, mIsMultiPartChannel, mInnerWindowId);

imgRequest's mStatusTracker is an nsRefPtr, so this transfers its reference to ImageFactory::CreateImage, but that just takes a raw imgStatusTracker*. The pointer eventually gets stored in ImageResource's mStatusTracker, which is also an nsRefPtr and so addrefs. So you didn't release imgRequest's reference; what code is supposed to do that? If you actually want to transfer that reference, it would be clearer to make CreateImage and the various functions that this gets passed through up to ImageResource's constructor take an already_AddRefed. Or simply null out mStatusTracker after the ImageFactory::CreateImage and don't use forget. Is this code so performance critical that we can't afford an extra addref/release pair?
Comment 30 Seth Fowler [:seth] [:s2h] 2013-01-21 21:13:48 PST
(In reply to Peter Van der Beken [:peterv] from comment #29)

Thanks for taking a look at this code, Peter! Unfortunately, I don't think this particular thing is the source of the issue.

> imgRequest's mStatusTracker is a nsRefPtr

At the time of this commit, it was an nsAutoPtr.

> Or simply null out mStatusTracker after the ImageFactory::CreateImage
> and don't use forget.

That's what the current version of the code does.
Comment 31 Seth Fowler [:seth] [:s2h] 2013-01-25 01:48:09 PST
OK, I've found the problem here. The patch makes changes to the Decoder class to allow it to send OnStartDecode lazily, when it receives its first Write() call and thus its first piece of data. This was needed because otherwise we couldn't call RasterImage::Init() from ImageFactory, since OnStartDecode's listeners assume that the image that owns the decoder is already available by calling GetImage() on the associated imgRequest. Unfortunately, AWSY shows that there are circumstances where that first Write() call doesn't happen until it's too late, because the decoder ends up getting shut down first. In that case we don't block page load correctly, which seems to be the cause of the issue we're discussing in this bug.

The options for fixing the problem are as follows:

1. Fix the OnStartDecode listeners (and their transitive listeners). I tried this before going with the solution that ended up in the patch, and it proved quite difficult. This may be the best long-term solution, but I think we should avoid it for the purposes of fixing the immediate regression.

2. Make ImageFactory's Create*() methods have an out parameter, and write to it before calling RasterImage::Init. This will work but we can't use a standard XPCOM-style out parameter for this. I tried that as well while creating this patch and the fact that getter_AddRefs nulls out the destination before the function is called _also_ breaks things. Stuff relies on the fact that the old image is still around until the new one is swapped in, I am pained to say. However, we could do something a bit more unconventional, like take an |nsRefPtr<Image>&| as the out parameter. Ugly, but it'd work.

3. Separate ImageFactory's Create*() methods into two phases, one which just constructs and returns an appropriate image and one which does the remaining initialization steps. This is also pretty ugly, especially because some of these methods take so many parameters and most or all of them would probably have to be supplied twice. Again, though, this would fix the problem.

I'm going to take a brief look at option 1 above again tomorrow morning to see whether things still look as bad as they did when I wrote the original patch. If so, I'll compare option 2 and option 3 and see which produces less ugly code. We should have a patch tomorrow, and hopefully if review/try goes well this will get fixed in the next couple of days.
Comment 32 Seth Fowler [:seth] [:s2h] 2013-01-25 02:42:35 PST
Actually there might be a fourth possibility: create a notification that just blocks page load, separate from the StartDecode notification. Something like OnPrepareDecode. I'll take a look at this possibility as well.
Comment 33 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2013-01-25 09:42:58 PST
There is a notification that just blocks page load.

http://hg.mozilla.org/mozilla-central/annotate/a207f33adc1a/image/public/imgIOnloadBlocker.idl#l53
Comment 34 Seth Fowler [:seth] [:s2h] 2013-01-25 10:13:11 PST
Hmm, good point, and imgStatusTracker already implements that interface. This may be the path of least resistance, then. I'll give it a shot.
Comment 35 Seth Fowler [:seth] [:s2h] 2013-01-25 13:32:07 PST
Hmm, took a deeper look and actually imgIOnloadBlocker is not what we want. (My mistake: imgStatusTracker doesn't implement it; it just has non-virtual versions of its methods.) It makes a lot more sense to make a version of BlockOnload() on imgIDecoderObserver that does not take an imgIRequest argument.

Regardless, I've turned most of the ideas discussed above into preliminary patches and pushed them to try. This stuff can be tricky to get right; I remember that this specific issue delayed the original patch that introduced this regression substantially. I'll take a look at the try results when I get back from lunch and it should give me an idea of where dragons are hiding here and which solutions are feasible. Hopefully one of the candidates will be good enough to turn into a review-worthy patch by the end of the day.
Comment 36 Seth Fowler [:seth] [:s2h] 2013-01-25 18:52:20 PST
Created attachment 706708 [details] [diff] [review]
Remove change to decoder notification behavior that causes AWSY leak.

OK, looks like we are ending up with a trivial fix here. As an experiment I
tried simply reverting the changes to the Decoder class and in my testing so far
it seems like everything's OK.

This is quite surprising to me, as I remember struggling with many try failures
related to notifications from the decoder that were solved by this change. It
may be that the issues I had were solved by later changes in the original patch,
or it could be that things were fixed by unrelated commits between the time when
I wrote the patch and when it got r+'d and checked in.

At any rate, the important thing right now is that this patch appears to solve
the problem. I've got a try job going here to verify that nothing has been
broken:

https://tbpl.mozilla.org/?tree=Try&rev=f7cdbe4ae1d5
Comment 37 Nicholas Nethercote [:njn] 2013-01-25 19:41:15 PST
Excellent news!
Comment 38 Seth Fowler [:seth] [:s2h] 2013-01-28 10:26:04 PST
Thanks for the review, Joe. Try looks OK too. Requesting checkin.
Comment 40 Ryan VanderMeulen [:RyanVM] 2013-01-28 16:08:27 PST
https://hg.mozilla.org/mozilla-central/rev/d9a0933dfe2a
Comment 41 Nicholas Nethercote [:njn] 2013-01-28 19:53:26 PST
The first post-landing AWSY run has happened, and it looks like this fixed the problem!  All the lines are down by approximately the same amount that they jumped on Dec 18.
Comment 42 Seth Fowler [:seth] [:s2h] 2013-01-29 11:10:55 PST
Great news!
Comment 43 Seth Fowler [:seth] [:s2h] 2013-01-29 11:16:08 PST
Comment on attachment 706708 [details] [diff] [review]
Remove change to decoder notification behavior that causes AWSY leak.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 815471
User impact if declined: Significant memory leak.
Testing completed (on m-c, etc.): Passes tests, landed on m-c. This change just reverts part of the code changed in bug 815471 to its state before that bug landed, so this is well-tested code that has been around for a long time.
Risk to taking this patch (and alternatives if risky): Low risk as we're simply reverting a small change. There is no real alternative other than living with the memory leak.
String or UUID changes made by this patch: None.
Comment 44 bhavana bajaj [:bajaj] 2013-01-30 12:42:35 PST
Comment on attachment 706708 [details] [diff] [review]
Remove change to decoder notification behavior that causes AWSY leak.

Low risk, memory win. Approving on aurora.
Comment 45 Daniel Holbert [:dholbert] 2013-01-30 13:45:50 PST
https://hg.mozilla.org/releases/mozilla-aurora/rev/4ae4e4a76ea7
Comment 46 Seth Fowler [:seth] [:s2h] 2013-01-30 13:46:20 PST
Thanks Daniel!
Comment 47 Seth Fowler [:seth] [:s2h] 2013-01-30 13:46:40 PST
And Bhavana!
Comment 48 David Baron :dbaron: ⌚️UTC-7 (busy September 14-25) 2013-02-05 20:06:11 PST
Did a bug ever get filed on the other regressions?  We're still substantially worse off than we were in mid-November on https://areweslimyet.com/ , and I didn't see anything here to indicate another bug that got spun off.  (It looks like this bug actually ended up covering the second regression originally noted in comment 4, and I don't see any sign of another bug being filed.)
Comment 49 Nicholas Nethercote [:njn] 2013-02-05 20:59:56 PST
> Did a bug ever get filed on the other regressions?  

Bug 833518.
Comment 50 Mihai Morar (:MihaiMorar) 2013-03-26 05:40:06 PDT
How can QA verify this fix? Is it necessary to test using Talos?
Comment 51 Justin Lebar (not reading bugmail) 2013-03-26 05:44:11 PDT
(In reply to MarioMi (:MarioMi) from comment #50)
> How can QA verify this fix? Is it necessary to test using Talos?

Talos and AWSY are not the same test.

You can verify this fix by looking at areweslimyet.com.
