Closed Bug 1100253 Opened 10 years ago Closed 10 years ago

~40M AWSY regression after bug 1084136

Categories

(Core :: Graphics: ImageLib, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
Tracking Status
firefox35 --- unaffected
firefox36 + fixed
firefox37 + fixed

People

(Reporter: Swatinem, Assigned: seth)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [MemShrink:P1][awsy+][awsy-])

There is a huge memory spike on November 7th. According to the pushlog, that’s when e10s got enabled by default on nightly. Is AWSY running and measuring that? If so, I would argue that the memory usage increase is to be expected.
It's definitely e10s. The pushlog for the jump is https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=e228bf3b5f02&tochange=c4b831696f15, which includes bug 1093691, which enabled e10s. The next step is to work out exactly what we're now measuring (I think it's the sum of the two processes, e.g. "resident" is "resident from process 1" + "resident from process 2") and whether that still makes sense. Measuring the processes separately is probably better.
Whiteboard: [MemShrink]
tracking-e10s: --- → ?
Summary: ~60M AWSY regression on November 7 → e10s: ~60M AWSY regression on November 7
Does the AWSY platform (linux?) have the shared page stuff from b2g? I assume we're not trying to use NUWA on desktop yet, right?
No NUWA (after looking up what it is). e10s desktop on *nix is fork+exec. It defaults to a single content process (which is a separate executable). Multiple content processes run well with very few problems if the preference is set. A blank-page content process takes ~70 MB RSS; a large chunk of that could probably be saved by using fork. I've been tempted to file a bug/feature request since I haven't spotted one suggesting it (but I've been running fine as is).
In order to fix AWSY after e10s landed I had to switch to mozmill 2.0; my understanding is that this actually disables e10s. So in this case I don't think it's actually a reporting problem (i.e. RSS + RSS vs. RSS + USS). This could indicate a couple of things:
- e10s is causing high memory usage even when disabled
- mozmill 2.0 is using more memory than mozmill 1.5
- there's still an extra content process even though e10s is disabled
I can confirm there's only one process w/ the e10s pref disabled, so it's definitely not AWSY numbers being skewed by multiple processes.
I've backported the pref changes from mozmill-2.0 to our 1.5 version of AWSY. I'll restart our server w/ the changes and we can see if the issue is mozmill-2.0 related.
Reverted to mozmill-1.5 and reran overnight; numbers look better.

Still elevated, but not by as much:
- RSS: After TP5
- RSS: After TP5 [+30s]
- RSS: After TP5 [+30s, forced GC]

Back to normal (more or less):
- RSS: After TP5, tabs closed
- RSS: After TP5, tabs closed [+30s]
- RSS: After TP5, tabs closed [+30s, forced GC]

Never really changed:
- RSS: Fresh start
- RSS: Fresh start [+30s]

So we can see that mozmill-2.0 had an impact, but there's still a regression. Whether or not it was e10s is hard to say for sure at this point. I've queued a rerun of the Nov 7 - Nov 17 timerange, but that's going to take a while.
Where can I find the actual data?
Flags: needinfo?(erahm)
The data isn't super reliable right now, but the values for the past day should have eliminated 2.0 from the mix: https://areweslimyet.com/ Zoom in to the past few weeks. Each node can have 1 or more changesets associated w/ it. You can click a [view] next to changeset you're interested in and dig into the actual memory numbers. Each iteration also has an [export] option that'll give you an about:memory compatible report.
Flags: needinfo?(erahm)
This looks like a 40MB regression on explicit for the "After TP5" categories. The primary offenders are "cache/window" entries. Below are the results of diffing an 11/6 result w/ an 11/18 result.

High-level:
> 40.46 MB (100.0%) -- explicit
> ├──31.79 MB (78.58%) ++ window-objects
> ├───3.97 MB (09.82%) ── heap-unclassified
> ├───3.55 MB (08.78%) ++ images

So that's *mostly* window-objects, let's look at a snippet under there:
> ├──31.79 MB (78.58%) -- window-objects
> │ ├───8.18 MB (20.23%) -- top(http://localhost:8088/tp5/icious.com/www.delicious.com/index.html, id=2484)
> │ │ ├──8.14 MB (20.11%) ++ cached/window(http://localhost:8058/tp5/dailymail.co.uk/www.dailymail.co.uk/ushome/index.html)
> │ │ └──0.05 MB (00.12%) ++ (2 tiny)
> │ ├───3.67 MB (09.06%) -- top(http://localhost:8083/tp5/etsy.com/www.etsy.com/category/geekery/videogame.html, id=2459)
> │ │ ├──3.73 MB (09.23%) ++ cached/window(http://localhost:8053/tp5/douban.com/www.douban.com/index.html)
> │ │ └──-0.07 MB (-0.17%) ++ (2 tiny)
> │ ├───3.33 MB (08.24%) -- top(http://localhost:8090/tp5/web.de/web.de/index.html, id=2490)
> │ │ ├──3.30 MB (08.15%) ++ cached/window(http://localhost:8060/tp5/indiatimes.com/www.indiatimes.com/index.html)
> │ │ └──0.04 MB (00.09%) ++ (2 tiny)
> │ ├───3.30 MB (08.15%) -- (41 tiny)
> │ │ ├──0.38 MB (00.95%) -- top(http://localhost:8056/tp5/digg.com/dads.new.digg.com/view.html@kw=zone%253A5&kw=mozilla&kw=nice&kw=logo&kw=firefox&kw=mozzilla&kw=new&kw=proposal&kw=really&kw=browser&kw=check&kw=pagetype%253Apermalink&template=5.html, id=2566)
> │ │ │ ├──0.38 MB (00.94%) ++ cached/window(http://localhost:8056/tp5/digg.com/dads.new.digg.com/view.html@kw=zone%253A5&kw=mozilla&kw=nice&kw=logo&kw=firefox&kw=mozzilla&kw=new&kw=proposal&kw=really&kw=browser&kw=check&kw=pagetype%253Apermalink&template=5.html)
> │ │ │ └──0.00 MB (00.00%) ++ active/window(http://localhost:8056/tp5/digg.com/dads.new.digg.com ...snip...

So that's virtually all under |window-objects/top/cached/window|. I get the feeling some caching behavior changed for the worse. There also seems to be a regression in images and heap-unclassified, but that could just be cache related or random noise.
24th Bug 1101193 a good time to watch.
Kyle, I heard you might have insight into what's going on w/ caching here.
Flags: needinfo?(khuey)
That's the bfcache. Are we evicting it properly in the content process?
Flags: needinfo?(khuey)
This is with e10s disabled.
So this doesn't have anything to do with e10s?
(In reply to Bill McCloskey (:billm) from comment #15)
> So this doesn't have anything to do with e10s?

The current thinking is no, AFAICT the issue just slipped in while AWSY was busted due to e10s being enabled. I'll clear the e10s stuff for now.
tracking-e10s: ? → ---
Summary: e10s: ~60M AWSY regression on November 7 → ~40M AWSY regression on November 7
Regression range is: Thu, 06 Nov 2014 18:54:00 GMT -> Fri, 07 Nov 2014 03:08:20 GMT
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=2033324d4571&tochange=e60e90aa209d

AWSY is currently backfilling for that date range.
:smaug, we think this regression is bfcache-related; can you help track it down? See comment 10 and comment 13.
Flags: needinfo?(bugs)
I don't see anything obviously DOM-related in there. Bug 1084136 is in there, but I only suspect that because of how tricky imagelib is. ;)
The only session history related changes recently landed 2014-10-15 (bug 855443) and 2014-11-14 (bug 1090918). What all did mozmill 2.0 change? Comment 10 is talking about 11/6 result which had mozmill 2.0, right?
Flags: needinfo?(bugs)
In that regression range, bug 1084136 sounds like a possible candidate for causing memory usage changes.
seth, would it be possible that bug 1084136 caused changes to memory usage patterns?
Flags: needinfo?(seth)
I'm not really sure how it would. It should have *reduced* memory usage, since it eliminated a bunch of state that we used to keep track of for every image. Bugs are always possible, though.
Flags: needinfo?(seth)
(In reply to Olli Pettay [:smaug] from comment #20)
> The only session history related changes recently landed 2014-10-15 (bug
> 855443) and 2014-11-14 (bug 1090918).
>
> What all did mozmill 2.0 change? Comment 10 is talking about 11/6 result
> which had mozmill 2.0, right?

That was a diff of runs that used 1.5 (albeit w/ 10 days in between).
And it looks like bug 1084136 is the offender:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=f129b17f9067&tochange=f245578c4fa4

I'm pretty sure the repro steps are:
#1 - open a tab to site A
#2 - navigate to site B in the same tab (maybe repeat)
#3 - open about:memory, measure
#4 - note the |window-objects/top/cached/window| entry
Depends on: 1084136
Flags: needinfo?(seth)
Summary: ~40M AWSY regression on November 7 → ~40M AWSY regression after bug 1084136
[Tracking Requested - why for this release]: Increased memory usage due to holding pages alive for longer.
Blocks: 1084136
No longer depends on: 1084136
Keywords: regression
Component: Untriaged → ImageLib
It would appear that bug 1089046 improves the situation for the "Tabs Closed" case. We saw a 40MB drop in explicit when that landed on 11/15. The following are all still regressed:
- Explicit: After TP5
- Explicit: After TP5 [+30s]
- Explicit: After TP5 [+30s, forced GC]
Seth, any idea what accounts for the remainder? Presumably something related to the fix you already landed...
Whiteboard: [MemShrink] → [MemShrink:P1]
(In reply to Nicholas Nethercote [:njn] from comment #29)
> Seth, any idea what accounts for the remainder? Presumably something related
> to the fix you already landed...

The current dev cycle finishes in a couple of days. This regression is bad enough that we need to consider backing it out if it doesn't get fixed ASAP.
(In reply to Nicholas Nethercote [:njn] from comment #29)
> Seth, any idea what accounts for the remainder? Presumably something related
> to the fix you already landed...

No, I don't. There's no obvious explanation as to why the code in bug 1084136 *or* bug 1089046 should affect whether we leak or don't leak window objects. Those bugs really only affect the notifications we send from ImageLib. It's likely that the notification changes are tickling a bug in some notification consumer.

I'll go through the different patches in that bug to try to narrow down when the problem was introduced. That should make it much easier to pinpoint the problem.
Flags: needinfo?(seth)
So following these steps:
(1) Minimize memory usage in an about:memory tab
(2) Load reddit.com in tab A
(3) Load reddpics.com in tab A
(4) Measure memory usage in the about:memory tab

I looked at a set of revisions and observed the cached/window entries:

changeset: 214508:30a89ed1abf3
6.09 MB (02.88%) ++ cached/window(http://www.reddit.com/)

changeset: 214518:a28e572e9919
6.00 MB (03.17%) ++ cached/window(http://www.reddit.com/)

changeset: 214528:f245578c4fa4
5.84 MB (03.02%) ++ cached/window(http://www.reddit.com/)

214508 is from well before bug 1084136. 214518 is the first patch in bug 1084136, and 214528 is the last.

My conclusion is that we still don't have STR that demonstrate the problem locally. =(
Eric, Nicholas, what's the best path forward here? As I mentioned in the previous comment, right now I don't know how to reproduce this locally.

I also realized that I don't have access to the VPN right now (something's wrong with my config; can't figure it out) and I probably won't until sometime tomorrow afternoon. If you see this before I get access to the VPN, I'd really appreciate it if you could trigger the first couple of levels of a binary search between revision 214518 and revision 214528. If we can figure out which part caused the regression, it will be a lot easier to fix it.
Flags: needinfo?(n.nethercote)
Flags: needinfo?(erahm)
Update: Eric has tried running AWSY locally with my patches and observed a ~23MB *reduction* in memory usage. So this thing must be timing dependent. Eric was kind enough to try to set up the bisect on AWSY, but apparently there's some issue with AWSY keeping it from running against specific commits. So we seem to be stuck, with no way to dig further into this.
Flags: needinfo?(n.nethercote)
Flags: needinfo?(erahm)
So johns let me know that we can still do this AWSY bisection via try builds. I'm going to push a sequence of try jobs now that will do the first couple of binary search steps. (Due to time limitations I think we're going to have to try some possibilities in parallel.)

For rev 214520: https://tbpl.mozilla.org/?tree=Try&rev=58a902dff876
For rev 214523: https://tbpl.mozilla.org/?tree=Try&rev=ae994a7f4675
For rev 214526: https://tbpl.mozilla.org/?tree=Try&rev=b2667ebc9625

When these are done, I'll trigger AWSY runs for them. Based on those results, we'll need to push two more try jobs/AWSY runs to narrow it down to the offending revision.
All those try jobs finished building, so I queued them up in an AWSY series. The results should appear here: https://areweslimyet.com/?series=bug1100253
Right now it looks very much to me like the culprit is part 2, which dealt with onload blocking. That's rev 214519, which corresponds to this try job in the AWSY series:

https://tbpl.mozilla.org/?tree=Try&rev=c15f5e5160db

(Note that the AWSY series doesn't list the parts in the correct order. Sigh.)

I will push an experimental patch to try to restore the notification order to how it was before part 2. Let's see how it affects the AWSY results.
OK, here's one such patch, which makes us now send UnblockOnload after all other notifications, as we did before. https://tbpl.mozilla.org/?tree=Try&rev=1f96124c1a15
Looks like you're getting closer, so I'll cancel the needinfo request that was made of me.
Eric, could you please add the try job in comment 39 to AWSY series bug1100253? I won't have access to the Mozilla VPN for a few days.
Flags: needinfo?(erahm)
Tests are now running for the try build in comment 39.
Flags: needinfo?(erahm)
(In reply to Eric Rahm [:erahm] from comment #42)
> Tests are now running for the try build in comment 39.

Thanks, Eric. It looks like that change alone was not sufficient to fix the problem. The three other possibilities I see at this point are:

(1) There is a bug in that patch that makes us somehow not send UnblockOnload after sending BlockOnload, which would certainly cause problems.

(2) There is a dependency in one of the observers on BlockOnload/UnblockOnload being sent *multiple times*, which is the thing that patch was trying to eliminate.

Case (1) would be most easily debugged by adding assertions. What happens if we assert during an AWSY? Is the output logged somewhere?

Case (2) really is a bug in whichever observer is causing the problem. We can narrow it down by modifying the patch to send BlockOnload/UnblockOnload multiple times, verifying that that fixes the problem, then adding code to ignore multiple BlockOnload/UnblockOnload calls in each observer individually until the problem reappears.
(Two other possibilities, apparently. =)
(In reply to Seth Fowler [:seth] from comment #43)
> Case (1) would be most easily debugged by adding assertions. What happens if
> we assert during an AWSY? Is the output logged somewhere?

I believe we have logs for each test run, just make sure you use release assertions.
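For illustration, here is a minimal, self-contained sketch of the kind of release-assertion guard case (1) calls for: a counter that must be balanced by a matching UnblockOnload for every BlockOnload, checked even in optimized builds so a violation would show up in the AWSY logs. The class, macro, and member names are hypothetical; this is not the actual Gecko code.

#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Stand-in for a release assertion: it checks even in optimized builds, so a
// violation surfaces in the test logs instead of compiling away.
#define RELEASE_ASSERT(cond, msg)                             \
  do {                                                        \
    if (!(cond)) {                                            \
      std::fprintf(stderr, "Assertion failure: %s\n", msg);   \
      std::abort();                                           \
    }                                                         \
  } while (0)

// Hypothetical tracker for onload block/unblock pairing.
class OnloadBlockTracker {
 public:
  void BlockOnload() { ++mBlockCount; }

  void UnblockOnload() {
    RELEASE_ASSERT(mBlockCount > 0,
                   "UnblockOnload received without a matching BlockOnload");
    --mBlockCount;
  }

  // Call when the observer is torn down; a nonzero count means onload was
  // blocked but never unblocked, which would keep the page alive.
  void AssertBalanced() const {
    RELEASE_ASSERT(mBlockCount == 0,
                   "BlockOnload was never followed by UnblockOnload");
  }

 private:
  std::uint32_t mBlockCount = 0;
};

int main() {
  OnloadBlockTracker tracker;
  tracker.BlockOnload();
  tracker.UnblockOnload();
  tracker.AssertBalanced();  // balanced: nothing fires
  tracker.UnblockOnload();   // unbalanced: aborts with a diagnostic message
  return 0;
}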
So this regression is still around; what's the status?
Flags: needinfo?(seth)
Right, no progress has been made recently because I ended up having no time during Mozlandia and then I had to focus on more serious crashes and rendering regressions. My plan is to come back to this tomorrow.
Flags: needinfo?(seth)
Alright, sorry for the long delay on coming back to this. I've pushed a try job that will investigate case (2) above, since I think it's the more likely case. Will add this to the AWSY series once the try job completes. https://tbpl.mozilla.org/?tree=Try&rev=57389a1f4c48
Actually, I'd like test results too, so I've pushed the same patch as in comment 48, but this time running tests. https://tbpl.mozilla.org/?tree=Try&rev=0101f616c516
The patch in comment 48 appears, indeed, to fix the regression - it causes an improvement of ~38MB on AWSY. It also causes some oranges! That's OK, though, because the intention is not to use this patch directly. We now know that some observer wants the onload block/unblock notifications sent multiple times. The next step is to investigate and determine which observer is causing the problem.
(In reply to Seth Fowler [:seth] from comment #50)
> The patch in comment 48 appears, indeed, to fix the regression - it causes
> an improvement of ~38MB on AWSY.

That sounds much better! Let me know if I can help out.
So there are only three implementations of imgIOnloadBlocker that actually do any onload blocking. I've prepared a patch stack that stops each of them from propagating the onload blocking to the document if more than one onload block/unblock notification is received. These are built on top of the patch from comment 48. Most likely, one of these patches will restore the regression, and that will clarify which imgIOnloadBlocker implementation is buggy.

For nsImageBoxFrame (which is XUL, so I doubt this is the culprit):
https://tbpl.mozilla.org/?tree=Try&rev=3f27019f4d37

For ImageLoader (which handles images loaded through CSS):
https://tbpl.mozilla.org/?tree=Try&rev=f4bd878bcfac

For nsImageLoadingContent (which handles images loaded through JavaScript and the DOM):
https://tbpl.mozilla.org/?tree=Try&rev=fe486e71e340
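To make the intent of those diagnostic patches concrete, here is a minimal sketch of an observer that forwards at most one block/unblock pair to its document and ignores duplicate notifications. The names are hypothetical; this is not the actual nsImageBoxFrame/ImageLoader/nsImageLoadingContent code, just an illustration of the guard being tested.

#include <cstdio>

// Hypothetical stand-in for the document's onload blocking interface.
class Document {
 public:
  void BlockOnload() { ++mBlockers; std::printf("onload blocked (%d)\n", mBlockers); }
  void UnblockOnload() { --mBlockers; std::printf("onload unblocked (%d)\n", mBlockers); }
 private:
  int mBlockers = 0;
};

// Hypothetical observer: only the first BlockOnload and its matching
// UnblockOnload are propagated to the document; duplicates are ignored.
class OnloadBlockingObserver {
 public:
  explicit OnloadBlockingObserver(Document* aDocument) : mDocument(aDocument) {}

  void OnBlockOnload() {
    if (mBlockedOnload) {
      return;  // duplicate notification: do not propagate
    }
    mBlockedOnload = true;
    mDocument->BlockOnload();
  }

  void OnUnblockOnload() {
    if (!mBlockedOnload) {
      return;  // never blocked (or already unblocked): nothing to forward
    }
    mBlockedOnload = false;
    mDocument->UnblockOnload();
  }

 private:
  Document* mDocument;
  bool mBlockedOnload = false;
};

int main() {
  Document doc;
  OnloadBlockingObserver observer(&doc);
  observer.OnBlockOnload();
  observer.OnBlockOnload();    // duplicate: ignored
  observer.OnUnblockOnload();  // exactly one unblock reaches the document
  return 0;
}

If the regression reappears with a guard like this in place, that observer evidently depended on receiving the duplicate notifications.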
So the big bump seems to come from ImageLoader (f4bd878bcfac). I'm kinda not surprised, as compared to nsImageLoadingContent, ImageLoader is not very robust. =\ So the next step is to try to figure out what's going wrong in ImageLoader. My hypothesis at this point is that without the duplicate block/unblock onload notifications, ImageLoader ends up blocking onload but never unblocking it somehow, which indirectly causes the regression in memory usage.
Thanks for the diligent investigation, Seth. Unfortunately I'm getting nervous about this... it sounds like it's more complicated than expected, the holiday break is coming up, and I can see this regression hitting Beta without having been fixed. (Not saying it's certain, but possible.) Is there a Plan B?
(In reply to Nicholas Nethercote [:njn] from comment #54)
> Thanks for the diligent investigation, Seth. Unfortunately I'm getting
> nervous about this... it sounds like it's more complicated than expected,
> the holiday break is coming up, and I can see this regression hitting Beta
> without having been fixed. (Not saying it's certain, but possible.) Is there
> a Plan B?

It's exactly as complicated as expected from my perspective, because I knew the bug was not in ImageLib, but in one of the observers. I'm still pretty confident that we can get this fixed soon. It's only today that we've identified the root cause. The timing of both Portlandia and the holiday break has unfortunately given us less usable time to fix this than in other release cycles, but (for me at least) there are still more than two weeks of working time until the next uplift.

Plan B would be to back out every patch that has landed in ImageLib in the last couple of months, pretty much. Not pretty, and it'd be preferable to avoid it, but it's an option. I don't think we need to have that discussion yet, though. If we did that, it should only be from 36 and not from 37.
Thanks for the update. I hope your optimism is well-founded :)
OK, so now that we've narrowed down the problem to ImageLoader, I took a look at the code and came up with two patches that might fix the problem.

One is a patch that tries to ensure that we always unblock onload after blocking it in ImageLoader:
https://tbpl.mozilla.org/?tree=Try&rev=fbf289da5419

A second ensures that we block all notifications when cloning an imgRequestProxy in ImageLoader (the existing code tries to do this, but fails badly):
https://tbpl.mozilla.org/?tree=Try&rev=102995acb7fb

Finally, here's a push with both patches at the same time:
https://tbpl.mozilla.org/?tree=Try&rev=bae1a82fc1b6

Once these pushes build I'll run them through AWSY; hopefully one of them fixes the problem.
Hmm... I'm unclear why TBPL acts as if it hasn't seen the hashes for the patches from comment 52 before. Just to make sure nothing funny's going on, I pushed just the patches from comment 52 as a baseline as well:

https://tbpl.mozilla.org/?tree=Try&rev=a2d88c1d056f
Hmm... I think something may have gone wrong with my last set of pushes. Let me try again.

One is a patch that tries to ensure that we always unblock onload after blocking it in ImageLoader:
https://tbpl.mozilla.org/?tree=Try&rev=4f7ee42cc7c9

A second ensures that we block all notifications when cloning an imgRequestProxy in ImageLoader (the existing code tries to do this, but fails badly):
https://tbpl.mozilla.org/?tree=Try&rev=1c9bfcefb452

Finally, here's a push with both patches at the same time:
https://tbpl.mozilla.org/?tree=Try&rev=ff90be1ba459

Baseline:
https://tbpl.mozilla.org/?tree=Try&rev=d89482846552
We've got one more week before uplift.
Assignee: nobody → seth
The patch in this try job is believed to resolve the problem, looks good on try, and is based on Aurora tip:

https://tbpl.mozilla.org/?tree=Try&rev=6e7d4c06b47

Unfortunately, AWSY is not accepting try job submissions right now, so until that's fixed I cannot verify that the issue is resolved.
Depends on: 1120149
(In reply to Seth Fowler [:seth] from comment #61)
> The patch in this try job is believed to resolve the problem, looks good on
> try, and is based on Aurora tip:
>
> https://tbpl.mozilla.org/?tree=Try&rev=6e7d4c06b47
>
> Unfortunately, AWSY is not accepting try job submissions right now, so until
> that's fixed I cannot verify that the issue is resolved.

It would appear that a 'c' was left off the end of that revision, I've queued up a run for 6e7d4c06b47c.
(In reply to Eric Rahm [:erahm] from comment #62)
> It would appear that a 'c' was left off the end of that revision, I've
> queued up a run for 6e7d4c06b47c.

Thanks for noticing that problem and taking care of it, Eric. It's interesting that the 'c' is also missing from the TBPL link, but the link still works - I didn't realize that could happen!

Based upon the results, I've gone ahead and pushed that patch to Aurora. (The work is taking place in bug 1120149.) It looks like it completely resolves the regression once tabs are closed. I'm not sure whether there's still a regression when tabs are open or whether it's just an artifact of the fact that we now wait 60 seconds before discarding surfaces, instead of (IIRC) 15 seconds as in the past.
Blocks: AWSY
(In reply to Seth Fowler [:seth] from comment #63)
> Based upon the results, I've gone ahead and pushed that patch to Aurora.
> (The work is taking place in bug 1120149.) It looks like it completely
> resolves the regression once tabs are closed. I'm not sure whether there's
> still a regression when tabs are open or whether it's just an artifact of
> the fact that we now wait 60 seconds before discarding surfaces, instead of
> (IIRC) 15 seconds as in the past.

Are we going to land the hack on m-c as well? If not I don't think we'll see a change in memory usage on AWSY so the regression will remain until we get a proper fix in.

I filed bug 1120607 to follow up on the possible regression from changing the amount of time before discarding surfaces.
Flags: needinfo?(seth)
(In reply to Eric Rahm [:erahm] from comment #64)
> Are we going to land the hack on m-c as well? If not I don't think we'll see
> a change in memory usage on AWSY so the regression will remain until we get
> a proper fix in.

Yes, but it needs to be rewritten again. Unfortunately that code has been in a constant state of flux lately. However, if I get it done soon, we should be able to use the same patch for Aurora and trunk, as they obviously haven't diverged much yet. I can't get to it immediately, as I have other fires to put out related to the merge, but I'll aim to get it done sometime next week.
Flags: needinfo?(seth)
I added annotations to AWSY for this regression.
Whiteboard: [MemShrink:P1] → [MemShrink:P1][awsy+]
So bug 1120149 has landed on trunk, and I'm requesting uplift for Aurora. This should be fixed on every affected branch. Can we resolve this now?
Status: NEW → UNCONFIRMED
Ever confirmed: false
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to Seth Fowler [:seth] from comment #67)
> So bug 1120149 has landed on trunk, and I'm requesting uplift for Aurora.
> This should be fixed on every affected branch.
>
> Can we resolve this now?

AWSY confirms, the fix looks good. I've added an annotation for the fix.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [MemShrink:P1][awsy+] → [MemShrink:P1][awsy+][awsy-]
Marking it as fixed thanks to the patch in bug 1120149.
Marking 37 as fixed as well, since bug 1120149 landed on 37.