Closed Bug 1162622 Opened 7 years ago Closed 7 years ago

Investigate slowdown in mochitest-devtools-chrome-1 since late April

Categories

(DevTools :: General, defect)

Priority: Not set
Severity: normal

Tracking

(firefox41 fixed)

RESOLVED FIXED
Firefox 41

People

(Reporter: jgriffin, Assigned: terrence)

References

Details

Attachments

(5 files)

The mochitest-devtools-chrome-1 job has been taking increasing amounts of time to run on linux32 debug since around April 27.

This manifests as "command timed out: 4800 seconds elapsed" errors in Treeherder, logged in bug 1137757.

We should use ActiveData to determine whether this increase in runtime is sudden or gradual, and whether it is across-the-board or whether it affects particular tests more than others.

Kyle, can you generate data for runtimes over time of individual tests in mochitest-devtools-chrome-1 on linux32 debug from about April 21 through May 4, so we can see if there are trends that would point us in the direction of a cause?
Flags: needinfo?(klahnakoski)
Yes, this can be done, but it will take a few days: the cluster collapsed under the weight of the data.
Assignee: nobody → klahnakoski
Depends on: 1162604
Flags: needinfo?(klahnakoski)
The new cluster is rebuilding, and the most recent days are still missing.

Here is the query I am using to do the pull.  The results must be imported, normalized, put into a pivot table, and charted.  I will do this once the rest of the days are indexed.

> {
> 	"from":"unittest",
> 	"select":{
> 		"name":"duration",
> 		"value":{"sub":["result.end_time","result.start_time"]},
> 		"aggregate":"min"
> 	},
> 	"edges":[
> 		"result.test",
> 		{
> 			"value":"build.date",
> 			"domain":{
> 				"type":"time",
> 				"min":"{{20 APR 2015}}",
> 				"max":"{{8 MAY 2015}}",
> 				"interval":"day"
> 			}
> 		}
> 	],
> 	"where":{"and":[
> 		{"eq":{
> 			"build.branch":"mozilla-inbound",
> 			"run.suite":"mochitest-devtools-chrome",
> 			"run.chunk":1,
> 			"build.platform":"linux",
> 			"build.type":"debug"
> 		}},
> 		{
> 			"gte":{"build.date":"{{20 APR 2015}}"},
> 			"lt":{"build.date":"{{8 MAY 2015}}"}
> 		}
> 	]},
> 	"limit":100000
> }
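As a sanity check of what the query computes — duration as result.end_time minus result.start_time, aggregated with "min" over the (test, day) edges — here is a minimal Python sketch of the same aggregation over hand-made rows (the row values are hypothetical, purely for illustration):

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical rows shaped like the fields the query touches:
# (result.test, result.start_time, result.end_time, build.date),
# with all times as epoch seconds.
rows = [
    ("browser_dbg_variables-view-02.js", 1000.0, 1012.5, 1429833600),
    ("browser_dbg_variables-view-02.js", 2000.0, 2011.0, 1429833600),
    ("browser_dbg_variables-view-02.js", 3000.0, 3018.0, 1429920000),
]

# Mirror the query: duration = end_time - start_time, aggregated
# with "min" over the (result.test, day-of-build.date) edges.
best = defaultdict(lambda: float("inf"))
for test, start, end, build_date in rows:
    day = datetime.fromtimestamp(build_date, tz=timezone.utc).date()
    best[(test, day)] = min(best[(test, day)], end - start)

for (test, day), duration in sorted(best.items()):
    print(test, day, duration)
```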
I thought we stored result.duration? Is there some reason we expect that to be unreliable in this case?
Do we?  Oh dear, it looks like you are correct.  `result.duration` will be a better measure.
Just eyeballing the data on Treeherder, it looks like a significant regression occurred in the Apr 23-25 timeframe.  Average runtimes for the job went from roughly 70 minutes to roughly 80 minutes.
Attached image Charting the slowdown
Here is a PNG of the slowdown.  A couple of days' data is still being indexed, but we can already see the slowdown is almost everywhere and impacts all five of the main directories: debugger, commandline, animationinspector, canvasdebugger, and framework.  I will zoom into the day with the problem.
Attached image Chart by Hour
Looking at just one test, browser/devtools/debugger/test/browser_dbg_variables-view-02.js, and its run duration by build.date hour, we get this chart.  Since some hours did not see a build, I filled them in with the last known value; this prevents visual noise from zeros.
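The fill-in-with-last-known-value step can be sketched in a few lines of plain Python, assuming the hourly buckets arrive as a list with None for hours that saw no build:

```python
def fill_forward(series):
    """Replace missing (None) hourly buckets with the last known value,
    so gaps don't chart as zeros. Leading gaps stay None."""
    last = None
    out = []
    for value in series:
        if value is not None:
            last = value
        out.append(last)
    return out

# Hourly durations (seconds) with two hours that saw no build:
print(fill_forward([12.0, None, 13.5, None, None, 21.0]))
# → [12.0, 12.0, 13.5, 13.5, 13.5, 21.0]
```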
Attached image Specific revisions
Listing the revisions and making a scatter plot of duration by build.date, we can see the same step.  build.date has known ordering issues, which show up as a little bouncing before the step.
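For what it's worth, the step visible in the scatter plot can also be located programmatically.  This is not part of the analysis above, just a rough sketch: scan every split index and pick the one that maximizes the gap between the mean durations before and after it.

```python
def find_step(values):
    """Return the index that best splits `values` into a 'before' and
    'after' segment, by maximizing the gap between segment means."""
    best_i, best_gap = None, 0.0
    for i in range(1, len(values)):
        before = sum(values[:i]) / i
        after = sum(values[i:]) / (len(values) - i)
        gap = abs(after - before)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

# Made-up per-push durations (minutes) with a step partway through:
durations = [12, 13, 12, 14, 13, 21, 22, 20, 21]
print(find_step(durations))  # index where the step-up begins
```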
Is this across all platforms?  Maybe a machine update is to blame for this?  Likewise a mozharness update could cause it.  Alternatively a real Firefox bug is causing it!  Food for thought.
See Also: → 1163420
(In reply to Joel Maher (:jmaher) from comment #9)
> is this across all platforms?  Maybe a machine update is to blame for this? 
> Likewise a mozharness update could cause it.  Alternatively a real firefox
> bug is causing it!  food for thought

There weren't any mozharness changes on m-c between Apr 17 and Apr 30.

Here are the buildbot changes in that window:  https://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&oldid=1070162

Here are the puppet changes:  https://hg.mozilla.org/build/puppet/shortlog

Nothing jumps out as a likely cause, although I could be missing something.

If we wanted to eliminate environment changes, we could check out a test slave, run the tests with an April 20th binary and then an April 30th one, and see whether the regression reproduces.
Even just running locally and reproducing the regression would be pretty convincing that it's a gecko change to me.
I've retriggered this job several times on a commit from Apr 22, at https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=ef1fb0e0911d

This should tell us whether the problem is related to something in-tree or not.  If this confirms an in-tree change, we can push to try from this tip with the latest mozharness to determine whether mozharness might be to blame.
(In reply to Jonathan Griffin (:jgriffin) from comment #12)
> I've retriggered this job several times on a commit from Apr 22, at
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> central&revision=ef1fb0e0911d
> 
> This should tell us whether the problem is related to something in-tree or
> not.  If this confirms an in-tree change, we can push to try from this tip
> with the latest mozharness to determine whether mozharness might be to blame.

Retriggers are back and show the lower duration, implying that the cause is in-tree.
I'm running a try push from Apr 22 against current mozharness, but I don't expect that to show a problem.

If not, it looks like the culprit is one of these changes, based on Kyle's charts:

Thu Apr 23 20:06:50 2015 +0000	365450b3d3ef	Eitan Isaacson — Bug 1155917 - Added volume argument to nsISpeechService.speak(). r=smaug

Thu Apr 23 20:05:45 2015 +0000	95fd54df106b	Terrence Cole — Bug 1156390 - Do CheckMarkedThing on internal edges as well as roots; r=sfink

Thu Apr 23 20:01:18 2015 +0000	408faa11152a	Nikhil Marathe — Bug 1154494 - Hit network only once. r=baku,bkelly

Eyeballing it, it looks like there was a similar regression on Windows 7 debug.  Kyle, can you run the same report for Windows 7 debug?
Flags: needinfo?(klahnakoski)
Attached image Windows7 debug
Win7 debug is not as clear-cut, but the regression is still there.  One test actually improved: browser/devtools/animationinspector/test/browser_animation_playerFronts_are_refreshed.js

In this sample I only looked at browser/devtools/animationinspector/test/*.  Durations are normalized to aid visualization, and missing hours were filled in with the last hour that had data.  I used the following query:

> {
>     "from":"unittest",
>     "select":{
>         "name":"duration",
>         "value":"result.duration",
>         "aggregate":"min"
>     },
>     "edges":[
>         "result.test",
>         {
>             "value":"build.date",
>             "domain":{
>                 "type":"time",
>                 "min":"{{22 APR 2015}}",
>                 "max":"{{25 APR 2015}}",
>                 "interval":"hour"
>             }
>         }
>     ],
>     "where":{"and":[
>         {"eq":{
>             "build.branch":"mozilla-inbound",
>             "run.suite":"mochitest-devtools-chrome",
>             "run.chunk":1,
>             "build.platform":"win32",
>             "build.type":"debug",
>             "machine.os":"win7-ix"
>         }},
>         {"prefix":{"result.test":"browser/devtools/animationinspector/test/"}},
>         {
>             "gte":{"build.date":"{{22 APR 2015}}"},
>             "lt":{"build.date":"{{25 APR 2015}}"}
>         }
>     ]},
>     "limit":100000
> }
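The exact normalization used isn't stated above; one plausible reading is min-max scaling each test's duration series into [0, 1] so tests with very different absolute runtimes can share one chart.  That is an assumption, but a sketch would look like:

```python
def normalize(durations):
    """Min-max scale a per-test duration series into [0, 1] so tests
    with very different absolute runtimes can share one chart."""
    lo, hi = min(durations), max(durations)
    if hi == lo:
        # A flat series carries no shape information; map it to zeros.
        return [0.0 for _ in durations]
    return [(d - lo) / (hi - lo) for d in durations]

print(normalize([10.0, 15.0, 20.0]))  # → [0.0, 0.5, 1.0]
```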
Flags: needinfo?(klahnakoski)
Terrence, it's looking like your patch https://hg.mozilla.org/integration/mozilla-inbound/rev/95fd54df106b may have caused the devtools chrome tests to run slower, at least on debug; see above.  Is this a possible side-effect of your patch?
Flags: needinfo?(terrence)
Flags: needinfo?(sphink)
Yes, my patch added some extra debugging checks, which turned out to be slower than intended. I'll back out the regressing change.
Flags: needinfo?(terrence)
Flags: needinfo?(sphink)
Here is a targeted backout that gets us back to the previous behavior. We should find a faster subset of the tests to run on every internal edge.
Assignee: klahnakoski → terrence
Status: NEW → ASSIGNED
Attachment #8604986 - Flags: review?(sphink)
Attachment #8604986 - Flags: review?(sphink) → review+
A try run so we can test if this solves the problem: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea61b59bfa85
The previous try run was on a busted tip; let's try again.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c8616116962
Finally, the tree opened when I was looking! Sorry it took so long to land this!
https://hg.mozilla.org/mozilla-central/rev/9f9797766a5d
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 41
Product: Firefox → DevTools