Closed Bug 1291815 Opened 4 years ago Closed 3 years ago

DAMP (Talos): Wait for requests to settle before closing toolbox

Categories

(DevTools :: General, defect, P1)

defect

Tracking

(firefox53 fixed)

RESOLVED FIXED
Firefox 53
Tracking Status
firefox53 --- fixed

People

(Reporter: jryans, Assigned: jryans)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

While investigating a Talos regression (bug 1289410), I noticed that the toolbox close portion does not wait for RDP requests to complete first.  This means there may be many tasks in progress processing the reload that just took place, and those could make the close numbers much higher than they would be in real usage of a particular tool.  It also leads various errors being logged when the toolbox closes with requests still in progress.

If we agree it's good to address this for more accurate close measurements, we can do so by adding a utility function to wait until all requests have settled.

Making this change will invalidate previous DAMP results.
I think we will need something like this to land the new console frontend (Bug 1304178).  All of my damp runs that flip the pref show regressions in other tools, which doesn't match up with what should be happening, since the frontend isn't opened on those other tools probes.

I have a push where I just set a timeout before proceeding with reloading the page and closing the tab, which causes a slight improvement on current damp times alone:

Results: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3889408efb2d&newProject=try&newRevision=f45881b2dbc0&framework=1&filter=damp&showOnlyImportant=0 and the changeset: https://hg.mozilla.org/try/rev/0d7e82b19167a33b794914aca11cb6dbe6039687.

But more importantly, it stabilizes the regression caused by preffing on the new frontend (note there are still console-specific improvements to do to get this regression down even further). Without the change is significantly worse than with the change, moreso than the small improvement we see above by just applying the change alone.  That makes me think we are definitely letting one test bleed into others.

Without: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3889408efb2d&newProject=try&newRevision=65e07684396b&framework=1&filter=damp&showOnlyImportant=0

With: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3889408efb2d&newProject=try&newRevision=f6a48435889e&framework=1&filter=damp&showOnlyImportant=0

Ryan, did you have an implementation in mind for waiting for RDP requests to settle?  I'm OK with invalidating previous DAMP results since we are obviously having some noise and inaccuracy in the measurement.
Blocks: 1304178
Flags: needinfo?(jryans)
I have a WIP approach in a branch from when I filed this, but I don't remember how close it was to being done...  I'll take another look tomorrow.
Assignee: nobody → jryans
Status: NEW → ASSIGNED
Flags: needinfo?(jryans)
Comment on attachment 8796332 [details]
Bug 1291815 - Wait for requests to settle between DAMP subtests.

https://reviewboard.mozilla.org/r/82220/#review80840

Just a quick inline note - haven't reviewed the actual settling code yet.

::: testing/talos/talos/tests/devtools/addon/content/damp.js:281
(Diff revision 1)
>      let subtests = {
>        webconsoleOpen: Task.async(function*() {
>          yield this.testSetup(url);
>          yield openToolboxAndLog(label + ".webconsole", "webconsole");
>          yield reloadPageAndLog(label + ".webconsole");
> +        yield this.waitForRequestsToSettle();

I'd prefer this logic is folded into the this.closeToolbox helper (before startRecordTimestamp is set) since I believe that will cover all individual usages here.

We could also get rid of the waitForRequestsToSettle function there since we already have a handle on `target`
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #4)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d8eb10444620
> https://treeherder.mozilla.org/perf.html#/
> comparechooser?newProject=try&newRevision=d8eb10444620

I went ahead and retriggered from the base revision on this push - here's the compare page for that one: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=2a3ee1d58982&newProject=try&newRevision=d8eb10444620&framework=1&showOnlyImportant=0
I'm seeing errors/timeouts on non-e10s runs with this applied: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d8eb10444620&selectedJob=28277414. I don't see any halfway done jobs in the log (no match on 'simple.webconsole.open.DAMP' for instance): https://archive.mozilla.org/pub/firefox/try-builds/jryans@gmail.com-d8eb10444620a242dc6f583ac1fa4a3fa2a72c85/try-linux64/try_ubuntu64_hw_test-g2-bm104-tests1-linux-build2028.txt.gz.

Joel, does this log tell you anything?  Maybe this is just a case where it didn't complete in time.  And if so, can we bump it up just for a try run at least to see the results?
Flags: needinfo?(jmaher)
See Also: → 1306780
I honestly think this is related to bug 1306780, most likely we have an addon signing issue
Flags: needinfo?(jmaher)
Should we attempt to proceed with landing here (after resolving review issues)?  

I guess it appears we can't know the e10s disabled impact ahead of time due to the try failure, which seems to be a new failure mode from last time I tried to change DAMP (for example in bug 1154874 I was able to successfully get e10s off and on results from DAMP in a try run from Aug. 23).
Flags: needinfo?(bgrinstead)
Hm, yes I guess so if the failure is just infrastructure related.  Going to forward the review to Alex.
Flags: needinfo?(bgrinstead)
Attachment #8796332 - Flags: review?(bgrinstead) → review?(poirot.alex)
Comment on attachment 8796332 [details]
Bug 1291815 - Wait for requests to settle between DAMP subtests.

https://reviewboard.mozilla.org/r/82220/#review82182

r+ for all but damp modifications. I don't know enough about it to have much opinion and don't really follow the issues you are facing on try.

::: devtools/shared/client/main.js:1185
(Diff revision 1)
> +    this._pendingRequests.forEach(requestsForActor => {
> +      requests = requests.concat(requestsForActor);
> +    });
> +    this._activeRequests.forEach(requestForActor => {
> +      requests = requests.concat(requestForActor);
> +    });

let requests = {...this._pendingRequests.keys(), ...this.activeRequest.keys()];
\o/ JS from 2016!

::: devtools/shared/client/main.js:1197
(Diff revision 1)
> +    // With protocol.js, each front can potentially have it's own pools containing child
> +    // fronts, forming a tree.  Descend through all the pools to locate all child fronts.
> +    while (poolsToVisit.length) {
> +      let pool = poolsToVisit.shift();
> +      fronts.add(pool);
> +      for (let child of pool.poolChildren()) {

I imagine if we have a tree, that should happen, but I don't know protocol.js very well to trust it.
Shouldn't we check if pool is already the set and continue if that's the case?

::: devtools/shared/client/main.js:1204
(Diff revision 1)
> +      }
> +    }
> +
> +    // For all fronts, extract the promise from any ongoing requests.
> +    for (let front of fronts) {
> +      let frontRequests = front._requests.map(({ deferred }) => deferred.promise);

If feel wrong to access to _requests from here. Would be better to expose an helper. May be not _requests itself, but a waitForRequestsToSettle on it?
Attachment #8796332 - Flags: review?(poirot.alex) → review+
(In reply to Brian Grinstead [:bgrins] from comment #5)
> Comment on attachment 8796332 [details]
> Bug 1291815 - Wait for requests to settle between DAMP subtests.
> 
> https://reviewboard.mozilla.org/r/82220/#review80840
> 
> Just a quick inline note - haven't reviewed the actual settling code yet.
> 
> ::: testing/talos/talos/tests/devtools/addon/content/damp.js:281
> (Diff revision 1)
> >      let subtests = {
> >        webconsoleOpen: Task.async(function*() {
> >          yield this.testSetup(url);
> >          yield openToolboxAndLog(label + ".webconsole", "webconsole");
> >          yield reloadPageAndLog(label + ".webconsole");
> > +        yield this.waitForRequestsToSettle();
> 
> I'd prefer this logic is folded into the this.closeToolbox helper (before
> startRecordTimestamp is set) since I believe that will cover all individual
> usages here.
> 
> We could also get rid of the waitForRequestsToSettle function there since we
> already have a handle on `target`

Okay, I've folded it into to closeToolbox as suggested.
Comment on attachment 8796332 [details]
Bug 1291815 - Wait for requests to settle between DAMP subtests.

https://reviewboard.mozilla.org/r/82220/#review82182

> let requests = {...this._pendingRequests.keys(), ...this.activeRequest.keys()];
> \o/ JS from 2016!

This doesn't seem quite right, at least for the the `_pendingRequests` case.  Each value is itself an entire array.  I am currently depending on concat's behavior of extracting the elements when passed an array.

I'll add a note about the types for each of these.

> I imagine if we have a tree, that should happen, but I don't know protocol.js very well to trust it.
> Shouldn't we check if pool is already the set and continue if that's the case?

Are you concerned it's actually a graph with cycles and not a tree, so we could return to a pool we've already visited?  I don't believe that's possible right now... if it were, I guess I would expect the algorithm to loop forever, but that doesn't happen in practice.

`poolChildren` does filter out the one case I am aware of, where something can be "self-owned" so thing A appears in a pool owned by A.

> If feel wrong to access to _requests from here. Would be better to expose an helper. May be not _requests itself, but a waitForRequestsToSettle on it?

Okay, I added methods to the front to avoid exposing internals.
Okay, this definitely does hang when e10s is disabled, need to investigate further.
Analysis is currently blocked by bug 1309294 in Talos log processing.
Depends on: 1309294
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #17)
> Analysis is currently blocked by bug 1309294 in Talos log processing.

Fixed now.
See Also: → 1317312
Running locally with ./mach talos-test --disable-e10s -a damp

I'm seeing it not proceed past the inspector open on the simple page with an error in the browser console:

Protocol error (noSuchActor): No such actor for ID: server1.conn1.child1/domnode33  inspector.js:188:7
	Inspector.prototype._handleRejectionIfNotDestroyed chrome://devtools/content/inspector/inspector.js:188:7
	Handler.prototype.process resource://gre/modules/Promise-backend.js:935:21
	this.PromiseWalker.walkerLoop resource://gre/modules/Promise-backend.js:813:7
	this.PromiseWalker.scheduleWalkerLoop resource://gre/modules/Promise-backend.js:744:11
	this.PromiseWalker.schedulePromise resource://gre/modules/Promise-backend.js:776:7
	this.PromiseWalker.completePromise resource://gre/modules/Promise-backend.js:711:7
	Front<.onPacket/< resource://devtools/shared/protocol.js:1295:9
	(Async: DevTools RDP)
	Front<.request resource://devtools/shared/protocol.js:1239:14
	generateRequestMethods/</frontProto[name] resource://devtools/shared/protocol.js:1377:14
	Inspector.prototype.onNewSelection chrome://devtools/content/inspector/inspector.js:782:7
	emit resource://devtools/shared/event-emitter.js:191:13
	Selection.prototype.setNodeFront resource://devtools/client/framework/selection.js:127:5
	Inspector.prototype._deferredOpen/< chrome://devtools/content/inspector/inspector.js:264:9
	once/handler resource://devtools/shared/event-emitter.js:135:11
	emit resource://devtools/shared/event-emitter.js:191:13
	Inspector.prototype._onMarkupFrameLoad chrome://devtools/content/inspector/inspector.js:1338:5
	Inspector.prototype._initMarkup chrome://devtools/content/inspector/inspector.js:1320:5
	Inspector.prototype._deferredOpen chrome://devtools/content/inspector/inspector.js:256:5
	Inspector.prototype.init< chrome://devtools/content/inspector/inspector.js:137:18
	TaskImpl.prototype._run resource://devtools/shared/task.js:311:39
	Handler.prototype.process resource://gre/modules/Promise-backend.js:932:23
	this.PromiseWalker.walkerLoop resource://gre/modules/Promise-backend.js:813:7
	this.PromiseWalker.scheduleWalkerLoop resource://gre/modules/Promise-backend.js:744:11
	this.PromiseWalker.schedulePromise resource://gre/modules/Promise-backend.js:776:7
	this.PromiseWalker.completePromise resource://gre/modules/Promise-backend.js:711:7
	Front<.onPacket/< resource://devtools/shared/protocol.js:1297:9
	(Async: DevTools RDP)
	Front<.request resource://devtools/shared/protocol.js:1239:14
	generateRequestMethods/</frontProto[name] resource://devtools/shared/protocol.js:1377:14
	WalkerFront<.querySelector< resource://devtools/shared/fronts/inspector.js:621:12
	Inspector.prototype._getDefaultNodeForSelection/< chrome://devtools/content/inspector/inspector.js:330:14
	Handler.prototype.process resource://gre/modules/Promise-backend.js:932:23
	this.PromiseWalker.walkerLoop resource://gre/modules/Promise-backend.js:813:7
	this.PromiseWalker.scheduleWalkerLoop resource://gre/modules/Promise-backend.js:744:11
	this.PromiseWalker.schedulePromise resource://gre/modules/Promise-backend.js:776:7
	this.PromiseWalker.completePromise resource://gre/modules/Promise-backend.js:711:7
	Front<.onPacket/< resource://devtools/shared/protocol.js:1297:9
	(Async: DevTools RDP)
	Front<.request resource://devtools/shared/protocol.js:1239:14
	generateRequestMethods/</frontProto[name] resource://devtools/shared/protocol.js:1377:14
	InspectorFront<.getPageStyle< resource://devtools/shared/fronts/inspector.js:976:12
	Inspector.prototype._getPageStyle chrome://devtools/content/inspector/inspector.js:290:12
	Inspector.prototype.init< chrome://devtools/content/inspector/inspector.js:130:11
	TaskImpl.prototype._run resource://devtools/shared/task.js:311:39
	Handler.prototype.process resource://gre/modules/Promise-backend.js:932:23
	this.PromiseWalker.walkerLoop resource://gre/modules/Promise-backend.js:813:7
	this.PromiseWalker.scheduleWalkerLoop resource://gre/modules/Promise-backend.js:744:11
	this.PromiseWalker.schedulePromise resource://gre/modules/Promise-backend.js:776:7
	this.PromiseWalker.completePromise resource://gre/modules/Promise-backend.js:711:7
	Front<.onPacket/< resource://devtools/shared/protocol.js:1297:9
	(Async: DevTools RDP)
	Front<.request resource://devtools/shared/protocol.js:1239:14
	generateRequestMethods/</frontProto[name] resource://devtools/shared/protocol.js:1377:14
	initCssProperties< resource://devtools/shared/fronts/css-properties.js:208:28
	TaskImpl.prototype._run resource://devtools/shared/task.js:311:39
	TaskImpl resource://devtools/shared/task.js:273:3
	createAsyncFunction/asyncFunction resource://devtools/shared/task.js:247:14
	Inspector.prototype.init< chrome://devtools/content/inspector/inspector.js:127:33
	TaskImpl.prototype._run resource://devtools/shared/task.js:311:39
	TaskImpl resource://devtools/shared/task.js:273:3
	createAsyncFunction/asyncFunction resource://devtools/shared/task.js:247:14
	open resource://devtools/client/inspector/panel.js:12:12
	Toolbox.prototype.loadTool/onLoad resource://devtools/client/framework/toolbox.js:1338:19
Looks like I can reproduce the issue with e10s off in my local branch after rebasing.

I'll try to investigate it now.
The interdiff is hard to read since it includes rebased code + other review comments.  What part was needed to fix the non-e10s hang?
(In reply to Brian Grinstead [:bgrins] from comment #25)
> The interdiff is hard to read since it includes rebased code + other review
> comments.  What part was needed to fix the non-e10s hang?

Yeah, rebasing is tricky for MozReview!  Here are the core changes as separate patches before I squashed them:

https://github.com/jryans/gecko-dev/commit/bf094537135a1659d9c9f60d59f348d19696bd04
https://github.com/jryans/gecko-dev/commit/c904e1f3949f4db8f9ab76231d703886f470b001
Looks like on average the suite speeds up around 5%.  Let's accept that as the 'new normal' since it should be more accurate and will hopefully limit noise between subtests
Yep, the results seem as expected overall: the close numbers reduce in many cases, especially Net Monitor, since we're waiting for tool traffic to settle before timing close.  A few numbers grew slightly, but that should more accurately reflect the work they are doing.
Pushed by jryans@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/25cf55049850
Wait for requests to settle between DAMP subtests. r=ochameau
https://hg.mozilla.org/mozilla-central/rev/25cf55049850
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 53
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.