Closed Bug 1230215 Opened 9 years ago Closed 9 years ago

3% Linux/Win damp regression on Fx-Team (v.45) on Dec 02, 2015 from push 15522bc2931a52ec435f22e233f747264b90f647

Categories

(DevTools :: Debugger, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jmaher, Assigned: jlong)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][e10s])

Talos has detected a Firefox performance regression from your commit 15522bc2931a52ec435f22e233f747264b90f647 in bug 1215959.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=15522bc2931a52ec435f22e233f747264b90f647&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#DAMP

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux64 -u none -t g2  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a damp

Making a decision:
As the patch author we need your feedback to help us handle this regression.


Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
a compare view of all affected platforms/tests:
https://treeherder.allizom.org/perf.html#/compare?originalProject=fx-team&originalRevision=15522bc2931a&newProject=fx-team&newRevision=0e47cb064701&filter=damp&showOnlyConfident=1&showUnreliablePlatforms=1

looking in details (linux64 e10s for example):
https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=fx-team&originalRevision=15522bc2931a&newProject=fx-team&newRevision=0e47cb064701&originalSignature=13dd9b8ab5c9d40a946e35a9dc085970caebce48&newSignature=13dd9b8ab5c9d40a946e35a9dc085970caebce48

you can see the big offenders are:
damp complicated.jsdebugger.reload.DAMP opt e10s
damp simple.jsdebugger.close.DAMP opt e10s 
damp simple.jsdebugger.reload.DAMP opt e10s 
damp simple.netmonitor.close.DAMP opt e10s 

There is a good win:
damp simple.jsdebugger.open.DAMP opt e10s 

the rest of the data in the compare view is more than likely attributed to noise.

:jlong, can you take a look at this and weigh in on why we might be seeing these changes as a result of your patch: https://hg.mozilla.org/integration/fx-team/rev/0e47cb064701?
Flags: needinfo?(jlong)
That change is a big refactoring of the debugger, so I'm not surprised there might be negative effects. I think I can solve the `reload` case, there's actually a very small section of code that affects it greatly and it can be simplified/removed (leftover code from before).

I'm not sure about the `close` case. I can take a look though.

Hardest part about this is to figure how to reproduce it locally and get familiar with damp. I've had limited experience with it.
Flags: needinfo?(jlong)
Assignee: nobody → jlong
Oh I think I actually know why `close` is taking longer. We actually wait for the debugger to finish async requests now, while before we didn't. The problem with not waiting was that async requests would come in after the UI is destroyed, and it would error. Previously there were literally just errors. So we sacrificed performance for correctness.

I developed a different way to handle this though later in the refactoring. Instead of waiting for them to finish, when they come in we just ignore them.

I think I can remove the code that waits for the requests to finish now. I'll look into it.
(In reply to James Long (:jlongster) from comment #3)
> Oh I think I actually know why `close` is taking longer. We actually wait
> for the debugger to finish async requests now, while before we didn't. The
> problem with not waiting was that async requests would come in after the UI
> is destroyed, and it would error. Previously there were literally just
> errors. So we sacrificed performance for correctness.
> 
> I developed a different way to handle this though later in the refactoring.
> Instead of waiting for them to finish, when they come in we just ignore them.
> 
> I think I can remove the code that waits for the requests to finish now.
> I'll look into it.

Waiting for async requests to finish when it wasn't before seems like a reasonable reason to regress close time.  With your other way of handling this, does it work properly when a toolbox is opened/closed a bunch of times rapidly so that there are lots of pending requests?
(In reply to Brian Grinstead [:bgrins] from comment #4)
> 
> Waiting for async requests to finish when it wasn't before seems like a
> reasonable reason to regress close time.  With your other way of handling
> this, does it work properly when a toolbox is opened/closed a bunch of times
> rapidly so that there are lots of pending requests?

Yep, those are different instances of the debugger so it wouldn't even know about previous async requests. Reloading really fast is actually a harder scenario because it is the same instance, but previous async requests may "bleed through". My solution even works there, though. In fact that's the very use case which made me solve this. See this code: https://github.com/mozilla/gecko-dev/blob/master/devtools/client/debugger/debugger-controller.js#L191-L206

Basically, every async request has an ID, and I keep a list of "open requests" as a simple array of IDs. That array is reset to empty whenever the page reloads or devtools is closed. When future requests finish it'll come through and be ignore because it's not a "current request".

I had to do this because you can't block a page navigation, while you can block the debugger shutdown. But it's actually a better solution because we don't have to block on it (we don't really care about the results anyway)
Here's a push without the code to wait on async requests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=10053727eabb

I seem to remember trying this before but still ran into some edge cases. We'll see what comes up.
Need to do this before implementing an optimization
Depends on: 1230299
Awesome, according to the try run in comment 6 I can just remove the code the blocks shutdown on async requests because we just ignore them now. I'm going to split off work into individual bugs which are now blocking this bug.
Depends on: 1230556
Here's a try run with patches from both bugs and `-t g2` so we can see if it fixes this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3cd0dd87d907

Talos: https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=3cd0dd87d907

I'll need some help understanding the data. Should we see a 3-5% improvement now by comparing it with mozilla-central, because it's comparing it with my slower code?
I retriggered jobs on the try run, found the base revision on fx-team and retriggered the g2 jobs there- in due time we will have enough data to have more confidence in what comes out:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=fx-team&originalRevision=c9c2015c63c5&newProject=try&newRevision=3cd0dd87d907&filter=damp&showOnlyConfident=1

to answer your question- yes, we should see an improvement on the damp test with your patch!  Lets see if that does the trick
Thanks Joel - looks like I need to fix some tests with those patches, so I'll have to make another try run.
if you can keep the parent rev the same, it would be easy to just switch the compare view to point to the newRevision
Ok this should have the same parent rev, tests pass locally and I'm pretty sure this should work: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e292c2b07e38
I'm probably going to go ahead and land those 2 bugs tonight regardless; they clean up the code and implement optimizations I wanted to make anyway.
Unfortunately it looks like all those tests runs had a bad parent commit. So we're going to have to change the parent rev.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=17d448ab398e
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=17d448ab398e
I am not seeing much of an improvement here after retriggers;
https://treeherder.mozilla.org/perf.html#/compare?originalProject=fx-team&originalRevision=b1821f8cf261&newProject=try&newRevision=17d448ab398e&filter=damp&showOnlyImportant=0

either way, I think we have cleaned up what we can.  This sounds like a good cleanup and there might not be much more that can realistically be done.
Ok. Looking at the original benchmarks, it looks like it's a consistent 3-5ms difference. It's really cool that we can detect that! But it's also a really small change, so I agree that we are ok to ignore it.

I am surprised that avoiding blocking on pending async requests during shutdown didn't change anything, but maybe the way the DAMP tests run there aren't any async requests going on anyway. 

The only other thing I can think of is that we are loading more JS files (some bigger ones like React) on startup. I wouldn't be surprised if that added 3-5ms. But that shouldn't affect reload or shutdown.

We're going to be making more changes to the debugger so we can profile startup/reload at some point and see if there's anything we can do to optimize it in general.
I would be fine closing this as wontfix or other topics.  I prefer to see some form of resolution or a rough plan (we need to do X in 2 months, etc.)  That is my preference, this is your bug :)

Thanks again for looking into this!
Ok, let's keep it for now. It's a bit hard to give a plan because the debugger is in a state of flux right now, and we are going to continue doing some major frontend refactors. Every pass of refactoring it's easier to rationalize about and profile and change the code to be fast.

Note that all of these numbers are about to change quite a bit. Bug 1132501 will attach the thread when the toolbox opens, not when the debugger is initialized, so the debugger open/reload events should be faster but the toolbox opening is going to take a hit. We've actually already don a ton of damp runs in that bug and it seems like opening the toolbox will be ~30ms slower, but it's a sorely needed architectural piece and we should have been doing it like that anyway. That 30ms just moves from the debugger opening to the toolbox opening.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.