Closed Bug 820602 Opened 12 years ago Closed 11 years ago

areweslimyet.com shows ~40MB regression (~11%) in leaked windows

Categories

(Core :: Graphics: ImageLib, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla21
Tracking Status
firefox20 + fixed
firefox21 + fixed

People

(Reporter: BenWa, Assigned: seth)

References

Details

(Keywords: regression, Whiteboard: [MemShrink:P1])

Attachments

(1 file)

https://areweslimyet.com/

Note the last graph that gives a breakdown.
Details are wonderful things.

The increase is in the blue, green, yellow and pink lines.

There's a ~15 MiB spike on December 7.  The changeset for that run is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3bb2d53dde34&tochange=8432562db685.  Of those, https://hg.mozilla.org/mozilla-central/rev/c98fd0f8f3f3 (sfink) and https://hg.mozilla.org/mozilla-central/rev/d96db52bedc4 (billm) caught my eye.

johns said he would trigger additional runs to narrow down the regression range for the big spike.

Apart from that there's just been a gradual rise, which is a nightmare to debug.
I doubt that Steve's push caused this. My change might have, although it would be weird.
This got significantly worse today :(. We're now looking at a 40MB regression.
Second regression range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2e70b718903a&tochange=dd277d439d31

I queued this range for per-changeset tests as well. The first range has completed tests, but because AWSY uses the push timestamp, all the builds have the same timestamp and are impossible to distinguish on the graph. I'm fixing that now so this data can actually be useful.
(In reply to Benoit Girard (:BenWa) from comment #4)
> This got significantly worse today :(. We're now looking at a 40MB
> regression.

Please file a separate bug.  Bonus points for including details like those in comment 2 (when it happened, which measurements were affected, etc).  Thanks.
johns:  any progress on bisection here?  It'd be great to work out what happened in both these cases.  Thanks.
The regression is now at 40MB and is on aurora.
The resident numbers were all over the shop in December.  In contrast, the explicit numbers were more stable, with a clear jump on December 18, and it looks like it's JS at fault.  So let's focus on that.

johns is working on AWSY to make bisecting 200+ changesets faster and simpler.
Whiteboard: [js:p1]
Bill: can you take assignment of this bug or assign it to someone? Thanks,

/be
John said he would take assignment of this until we get the regression ranges.
Assignee: general → jschoenick
Okay, I updated http://areweslimyet.com a bit to make it easier to visualize this. If you click on any datapoint representing multiple builds, it will bring up a list of those builds with deltas between each, with a [view] button to load each one's about:memory dump.

The most suspicious commit is on Dec 18th:
https://hg.mozilla.org/mozilla-central/rev/b0c4f68563c3

Which shows a spike of ~41M in most lines. The commits immediately before and after it are fairly stable.

The second spike of interest is on December 8, where we regress by about 18M. We don't have per-changeset resolution here, but the spike looks to be in this range:

8a0c82cf76ab	Saurabh Anand — Backed out changeset cdb687853418 (bug 817785)
75e0d521bc2b	Ian Stakenvicius — Bug 809430 - Add symbol-versions to libxul and libmozjs for linux. r=glandium
d5179738abfe	Peter Van der Beken — Fix for bug 818219 (Replace HTMLElement quickstubs with new binding methods). r=bz.
7d70a78e4d5f	Jim Blandy — Bug 815045: Delete JOF_TYPE_IS_EXTENDED_JUMP; it is unused. Fix comment in js::frontend::AddToSrcNoteDelta. r=luke
144a65ad3059	Jim Blandy — Bug 815468: BaseProxyHandler::iteratorNext is never used; delete it. r=luke
cdb687853418	Saurabh Anand — Bug 817785 - Split out WebGLProgram into separate files

(The last commit in this range backs out the first, so those aren't likely)

Interestingly, though, the *resident* memory seems to go back down somewhere in this range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=718145f0d606&tochange=1cc19f36ee66

While explicit remains regressed. We don't have high resolution data for this range at all, and the tests are fairly noisy.

I've queued a ton[1] of builds to get per-changeset resolution on the remaining spots, as well as to retest the range around b0c to rule out very-unlucky noise. I'll update here when those are done, though it looks like this is two separate bugs.

[1] http://areweslimyet.com/status/
> The most suspicious commit is on Dec 18th:
> https://hg.mozilla.org/mozilla-central/rev/b0c4f68563c3

Seth, Joe:  this is your change.  Could it have affected JS memory consumption?
Quite odd that it would have affected JS memory consumption. I could easily imagine there being leaks hidden in any large patch, but I (perhaps naively) wouldn't have expected this patch to cause a JS memory leak.

Being still new to this codebase, I don't know enough about what is being measured here; is this specifically JS-level allocation? Or could it be things like wrappers around native objects? Could misuse of nsCOMPtrs and such show up as a JS memory leak if the objects are accessible from JavaScript?
It's quite possible that the HTMLElement change would lead to larger JS memory consumption.  Instead of JSPropertyOps, that's now using JSNative getters and setters, which means more function objects and whatnot....
That should be easy enough to check looking at the about:memory log stuff that AWSY produces.
Looking at this more:

b0c4f68563c3 landed previously as
https://hg.mozilla.org/mozilla-central/rev/6066a23ba701
... which caused a ~48Meg spike, but was backed out by
https://hg.mozilla.org/mozilla-central/rev/47bd1f6fd8ed
... which dropped memory back to previous levels

So Bug 815471 is definitely the culprit here. Comparing the memory dumps (click the 'view' button next to any changeset on AWSY) shows that we are probably leaking windows. This explains why several values went up, including JS:

http://people.mozilla.com/~jschoenick/snapshot-1358288220.png

I will open a separate bug for the other regression once it's been narrowed down more
Assignee: jschoenick → nobody
Blocks: 815471
Component: JavaScript Engine → ImageLib
OS: Mac OS X → All
Hardware: x86 → All
Summary: areweslimyet.com shows 25MB JS regression → areweslimyet.com shows ~40MB regression in leaked windows
Whiteboard: [js:p1]
Will look into this.
(In reply to John Schoenick [:johns] from comment #18)
> http://people.mozilla.com/~jschoenick/snapshot-1358288220.png

To be clearer with what's going on in that screenshot: In the "After TP5 [+30s, forced GC]" checkpoint, we previously would have a handful of active windows. After bug 815471, we also have a number of cached windows, which should not be around after waiting 30s and forcing a GC.
Whiteboard: [MemShrink]
So far I've had no luck reproducing this locally. John, can you share the exact procedure you used? I built a version of Firefox from mozilla-central at 116228:38889f86a465 (the last of my set of patches that you said eliminated the problem when they were backed out). I then ran a local install of Talos on the Tp5n pageset linked from |https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5|.

It wasn't obvious how to make Talos pause at the end of the run so I could check about:memory, so I added a "% wait.html" test at the end of the pageset, and then checked about:memory in a new tab. This may mean I was checking before the forced GC, but even so, in that case I'd have expected _worse_ results, not better.

It's a bit difficult to fix something I can't measure, so I'd be very grateful for any help you can offer. Thanks!
(In reply to Seth Fowler [:seth] from comment #21)
> So far I've had no luck reproducing this locally. John, can you share the
> exact procedure you used? I built a version of Firefox from mozilla-central
> at 116228:38889f86a465 (the last of my set of patches that you said
> eliminated the problem when they were backed out). I then ran a local
> install of Talos on the Tp5n pageset linked from
> |https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5|.
> 
> It wasn't obvious how to make Talos pause at the end of the run so I could
> check about:memory, so I added a "% wait.html" test at the end of the
> pageset, and then checked about:memory in a new tab. This may mean I was
> checking before the forced GC, but even so, in that case I'd have expected
> _worse_ results, not better.
> 
> It's a bit difficult to fix something I can't measure, so I'd be very
> grateful for any help you can offer. Thanks!

AWSY is using the TP5 pageset, but not talos directly. What it's doing is loading the full TP5 pageset into 30 tabs (so, overwriting the oldest tab for pages 31+). It then sits for 30 seconds, forces a GC, closes all tabs, sits for thirty seconds, forces another GC, and then repeats five times. You can see the fairly simple mozmill script that does this here:
https://github.com/Nephyrin/MozAreWeSlimYet/blob/master/mozmill_endurance_test/test1.js

If you want to run AWSY locally, the test looks for the TP5 pageset at http://localhost:8001 through http://localhost:8100 (an inelegant way to force separate domains/compartments for each page). The tester accomplishes this by just including each port in its nginx.conf:

>  listen 8004;
>  listen 8005;
>  listen 8006;
>  ...

If you can connect to the MV VPN you can also just replace localhost with albus.mv.mozilla.com in test1.js, which is already hosting TP5 on those ports.

To actually run a test you need mozmill installed (pip install mozmill --user) and the build you want to test:

> ./run_slimtest.py -b path/to/firefox --sqlitedb mydatabase.sqlite --buildname <commit> --buildtime <commit timestamp>

The default test takes about 90 minutes to run, though, so you may want to do this in a VNC server:
> vncserver :9
> DISPLAY=:9 ./run_slimtest.py ...
> vncserver -kill :9

You can also edit slimtest_config.py to get an abbreviated test (though we don't know at what point the regression manifests). Reducing the per-tab timeout from 10s will greatly speed up the test, but if you set it to zero, GC scheduling behaves very badly and you get erratic results. For local testing I usually set the tab timeout to a few seconds and two iterations. This reduces the test to ~4 minutes while staying somewhat similar to the full test.

To export data from the sqlitedb to a more useful format:
> mkdir data
> ./create_graph_json.py mydatabase.sqlite my_data ./data
> ./merge_graph_json.py my_data ./data  # (only required if you want to launch the website locally to view the data)

This will give you fairly straightforward JSON files with the memory dumps from each checkpoint. You can also start the AWSY site on a local webserver and use it to browse the data. (It looks for the JSON files in ./html/data/; if your webserver of choice doesn't support transparent .gz files, just uncompress the JSON first.)

Feel free to ping :johns on IRC if you run into any issues with this
Seth, is this a difficult patch to back out -- i.e. has other stuff landed on top of it?  Because at this point I think backing it out is the default option.
Unfortunately this is a very difficult patch to back out. Let's give it a while before we start considering that. Major refactoring has gone on in imagelib recently and this is unfortunately nearly at the bottom of that stack of patches. (Wouldn't you know it?)

As of now this is officially my top priority. Hopefully a fix will be in next week and we won't need to worry about backing out.
(In reply to John Schoenick [:johns] from comment #22)
> Feel free to ping :johns on IRC if you run into any issues with this

Thanks so much John! I really appreciate you taking the time to write this up. This will help a lot.
OK, I've been able to run the AWSY tests locally and reproduced the leak in commit 6066a23ba701. Great. I'll try to reduce this to a minimal test case and hopefully track down the problem soon.
Hopefully you'll find the problem soon.  Because shipping a change that causes a 15% regression in our primary memory consumption benchmark *simply isn't acceptable*.
Keywords: regression
If you need help you know where to find me.
(In reply to John Schoenick [:johns] from comment #13)
> The most suspicious commit is on Dec 18th:
> https://hg.mozilla.org/mozilla-central/rev/b0c4f68563c3

I'm probably missing something, but I have a question about this changeset, in particular this:

+      mImage = ImageFactory::CreateImage(aRequest, mStatusTracker.forget(), mContentType,
+                                         mURI, mIsMultiPartChannel, mInnerWindowId);

imgRequest's mStatusTracker is an nsRefPtr, so this transfers its reference to ImageFactory::CreateImage, but that just takes a raw imgStatusTracker*. The pointer eventually gets stored in ImageResource's mStatusTracker, which is also an nsRefPtr and so addrefs. So you never release imgRequest's reference; what code is supposed to do that? If you actually want to transfer that reference, it would be clearer to make CreateImage and the various functions it gets passed through up to ImageResource's constructor take an already_AddRefed. Or simply null out mStatusTracker after the ImageFactory::CreateImage call and don't use forget. Is this code so performance critical that we can't afford an extra addref/release pair?
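To illustrate the reference-count arithmetic in question, here is a minimal, self-contained sketch. The Tracker and RefPtr types below are simplified stand-ins for imgStatusTracker and nsRefPtr, not the real classes; it only shows how a forget() into a raw-pointer parameter that is then stored in another smart pointer leaves the object with one unreleased reference.

// Illustration only: Tracker stands in for imgStatusTracker and RefPtr for
// nsRefPtr. This is not imagelib code.
#include <cstdio>

struct Tracker {
  int mRefCnt = 0;
  void AddRef()  { ++mRefCnt; }
  void Release() { if (--mRefCnt == 0) delete this; }
};

template <class T>
struct RefPtr {
  T* mPtr = nullptr;
  RefPtr() = default;
  explicit RefPtr(T* aPtr) : mPtr(aPtr) { if (mPtr) mPtr->AddRef(); }
  ~RefPtr() { if (mPtr) mPtr->Release(); }
  // forget(): hand out the pointer without releasing it; whoever receives it
  // must eventually Release(), or the reference is leaked.
  T* forget() { T* p = mPtr; mPtr = nullptr; return p; }
};

// Stands in for ImageResource: stores the raw pointer in a smart pointer,
// which adds its own reference.
struct Image {
  RefPtr<Tracker> mStatusTracker;
  explicit Image(Tracker* aTracker) : mStatusTracker(aTracker) {}  // refcount 1 -> 2
};

int main() {
  Tracker* tracker = new Tracker();
  RefPtr<Tracker> requestTracker(tracker);             // refcount 1 (the "imgRequest" reference)
  Image* image = new Image(requestTracker.forget());   // forget(): no Release; ctor AddRefs -> 2
  delete image;                                        // Image's RefPtr releases -> 1
  std::printf("refcount = %d\n", tracker->mRefCnt);    // prints 1: the forgotten reference is never released
  return 0;
}

Taking an already_AddRefed (as suggested above), or dropping the forget() and nulling the member afterwards, would keep the AddRef/Release calls balanced.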
(In reply to Peter Van der Beken [:peterv] from comment #29)

Thanks for taking a look at this code, Peter! Unfortunately, I don't think this particular thing is the source of the issue.

> imgRequest's mStatusTracker is a nsRefPtr

At the time of this commit, it was an nsAutoPtr.

> Or simply null out mStatusTracker after the ImageFactory::CreateImage
> and don't use forget.

That's what the current version of the code does.
Assignee: nobody → seth
Whiteboard: [MemShrink] → [MemShrink P1]
Assignee: seth → nobody
Whiteboard: [MemShrink P1] → [MemShrink:P1]
Assignee: nobody → seth
Summary: areweslimyet.com shows ~40MB regression in leaked windows → areweslimyet.com shows ~40MB regression (~11%) in leaked windows
OK, I've found the problem here. The patch makes changes to the Decoder class to allow it to send OnStartDecode lazily, when it receives its first Write() call and thus its first piece of data. This was needed because otherwise we couldn't call RasterImage::Init() from ImageFactory, since OnStartDecode's listeners assume that the image that owns the decoder is already available by calling GetImage() on the associated imgRequest. Unfortunately, AWSY shows that there are circumstances where that first Write() call doesn't happen until it's too late, because the decoder ends up getting shut down first. In that case we don't block page load correctly, which seems to be the cause of the issue we're discussing in this bug.

The options for fixing the problem are as follows:

1. Fix the OnStartDecode listeners (and their transitive listeners). I tried this before going with the solution that ended up in the patch, and it proved quite difficult. This may be the best long-term solution, but I think we should avoid it for the purposes of fixing the immediate regression.

2. Make ImageFactory's Create*() methods have an out parameter, and write to it before calling RasterImage::Init. This will work, but we can't use a standard XPCOM-style out parameter for this. I tried that as well while creating this patch, and the fact that getter_AddRefs nulls out the destination before the function is called _also_ breaks things. Stuff relies on the fact that the old image is still around until the new one is swapped in, I am pained to say. However, we could do something a bit more unconventional, like take an |nsRefPtr<Image>&| as the out parameter. Ugly, but it'd work.

3. Separate ImageFactory's Create*() methods into two phases, one which just constructs and returns an appropriate image and one which does the remaining initialization steps. This is also pretty ugly, especially because some of these methods take so many parameters and most or all of them would probably have to be supplied twice. Again, though, this would fix the problem.

I'm going to take a brief look at option 1 above again tomorrow morning to see whether things still look as bad as they did when I wrote the original patch. If so, I'll compare option 2 and option 3 and see which produces less ugly code. We should have a patch tomorrow, and hopefully if review/try goes well this will get fixed in the next couple of days.
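To make option 2 above a bit more concrete, here is a compilable toy sketch of a factory that publishes the new object through a reference out parameter before finishing initialization. The names (Image, CreateImage, Init) and the use of std::shared_ptr are placeholders, not the real ImageFactory or nsRefPtr API:

#include <cstdio>
#include <memory>

struct Image {
  void Init() {
    // In the real code, Init() can trigger notifications whose listeners
    // expect GetImage() on the owning request to already return this object.
    std::printf("Init: the caller's slot already holds the new image\n");
  }
};

// Option 2 shape: take the caller's slot by reference and fill it in before
// Init(), so the old image is only replaced at this point (unlike a
// getter_AddRefs-style out parameter, which nulls the slot before the call).
void CreateImage(std::shared_ptr<Image>& aResult) {
  auto image = std::make_shared<Image>();
  aResult = image;  // publish the new image first
  image->Init();    // then finish initialization
}

int main() {
  std::shared_ptr<Image> mImage;  // stands in for the requesting code's image member
  CreateImage(mImage);
  return 0;
}

Option 3 would split those same two steps across two factory calls instead of reordering them inside one.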
Actually there might be a fourth possibility: create a notification that just blocks page load, separate from the StartDecode notification. Something like OnPrepareDecode. I'll take a look at this possibility as well.
Hmm, good point, and imgStatusTracker already implements that interface. This may be the path of least resistance, then. I'll give it a shot.
Hmm, took a deeper look and actually imgIOnloadBlocker is not what we want. (My mistake: imgStatusTracker doesn't implement it; it just has non-virtual versions of its methods.) It makes a lot more sense to make a version of BlockOnload() on imgIDecoderObserver that does not take an imgIRequest argument.

Regardless, I've turned most of the ideas discussed above into preliminary patches and pushed them to try. This stuff can be tricky to get right; I remember that this specific issue delayed the original patch that introduced this regression substantially. I'll take a look at the try results when I get back from lunch and it should give me an idea of where dragons are hiding here and which solutions are feasible. Hopefully one of the candidates will be good enough to turn into a review-worthy patch by the end of the day.
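For reference, here is a toy sketch of the block-onload idea from the last few comments: the decoder blocks page load up front, independently of the lazy start-decode notification, and shutdown always unblocks even if no Write() ever arrives. The Observer and Decoder names and methods below are hypothetical, not the real imgIDecoderObserver or imagelib Decoder:

#include <cstdio>

// Toy observer standing in for the real notification interface.
struct Observer {
  void BlockOnload()   { std::printf("page load blocked\n"); }
  void UnblockOnload() { std::printf("page load unblocked\n"); }
};

class Decoder {
public:
  explicit Decoder(Observer* aObserver) : mObserver(aObserver) {}

  void Init() {
    // Block page load up front, independent of the lazy start-decode
    // notification, so it no longer depends on the first Write() arriving.
    mObserver->BlockOnload();
    mBlockedOnload = true;
  }

  void Write(const char* /*aData*/, unsigned /*aLength*/) {
    // ... decode; start-decode notifications can still be sent lazily here ...
  }

  void Finish() {
    // Always pair the unblock with the block, even if Write() never ran.
    if (mBlockedOnload) {
      mObserver->UnblockOnload();
      mBlockedOnload = false;
    }
  }

private:
  Observer* mObserver;
  bool mBlockedOnload = false;
};

int main() {
  Observer obs;
  Decoder decoder(&obs);
  decoder.Init();
  decoder.Finish();  // no Write() at all: onload still gets unblocked
  return 0;
}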
OK, looks like we are ending up with a trivial fix here. As an experiment I tried simply reverting the changes to the Decoder class, and in my testing so far it seems like everything's OK.

This is quite surprising to me, as I remember struggling with many try failures related to notifications from the decoder that were solved by this change. It may be that the issues I had were solved by later changes in the original patch, or it could be that things were fixed by unrelated commits between the time when I wrote the patch and when it got r+'d and checked in.

At any rate, the important thing right now is that this patch appears to solve the problem. I've got a try job going here to verify that nothing has been broken:

https://tbpl.mozilla.org/?tree=Try&rev=f7cdbe4ae1d5
Attachment #706708 - Flags: review?(joe)
Excellent news!
Attachment #706708 - Flags: review?(joe) → review+
Thanks for the review, Joe. Try looks OK too. Requesting checkin.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/d9a0933dfe2a
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
The first post-landing AWSY run has happened, and it looks like this fixed the problem!  All the lines are down by approximately the same amount that they jumped on Dec 18.
Great news!
Comment on attachment 706708 [details] [diff] [review]
Remove change to decoder notification behavior that causes AWSY leak.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 815471
User impact if declined: Significant memory leak.
Testing completed (on m-c, etc.): Passes tests, landed on m-c. This change just reverts part of the code changed in bug 815471 to its state before that bug landed, so this is well-tested code that has been around for a long time.
Risk to taking this patch (and alternatives if risky): Low risk as we're simply reverting a small change. There is no real alternative other than living with the memory leak.
String or UUID changes made by this patch: None.
Attachment #706708 - Flags: approval-mozilla-aurora?
Comment on attachment 706708 [details] [diff] [review]
Remove change to decoder notification behavior that causes AWSY leak.

Low risk, memory win. Approving on aurora.
Attachment #706708 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Thanks Daniel!
And Bhavana!
Did a bug ever get filed on the other regressions?  We're still substantially worse off than we were in mid-November on https://areweslimyet.com/, and I didn't see anything here to indicate another bug that got spun off.  (It looks like this bug actually ended up covering the second regression originally noted in comment 4, and I don't see any sign of another bug being filed.)
> Did a bug ever get filed on the other regressions?  

Bug 833518.
Depends on: 842756
How can QA verify this fix? Is it necessary to test using Talos?
Flags: needinfo?
(In reply to MarioMi (:MarioMi) from comment #50)
> How can QA verify this fix? Is necessary to test using Talos?

Talos and AWSY are not the same test.

You can verify this fix by looking at areweslimyet.com.
Flags: needinfo?