[e10s] Expose child process thread hang stats to Javascript

RESOLVED FIXED in Firefox 45

Status

()

defect
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: azhang, Assigned: azhang)

Tracking

Trunk
mozilla47
Points:
---

Firefox Tracking Flags

(e10s+, firefox45 fixed, firefox46 fixed, firefox47 fixed)

Details

Attachments

(1 attachment, 3 obsolete attachments)

Posted patch child-hangs.patch (obsolete) — Splinter Review
Currently, parent process thread hang stats are available to extensions such as Statuser [1] via Services.telemetry.threadHangStats (in Services.jsm).

However, there's currently no support for doing the same thing for the child process.

This patch exposes the child process thread hang stats to the parent process via TelemetrySession.getChildThreadHangs(). Since communication between the processes should be asynchronous, the function returns a Promise:

    TelemetrySession.getChildThreadHangs().then((hangs) => {
        // here, each element of "hangs" has the same structure as Services.telemetry.threadHangStats
    });

This patch extends TelemetrySession and does not change any existing functionality. Essentially, when getChildThreadHangs() is called in the parent, it sends a message to the child, which then sends its thread hang stats back to the parent. The resulting data is used to resolve the promise.

Measurements on my own machine show that the promise takes ~8ms to resolve. This makes it quite suitable for monitoring hangs in realtime. Support for getChildThreadHangs() in Statuser will be added soon.

[1]: https://github.com/chutten/statuser
Attachment #8711911 - Flags: review?(vladan.bugzilla)
Comment on attachment 8711911 [details] [diff] [review]
child-hangs.patch

Review of attachment 8711911 [details] [diff] [review]:
-----------------------------------------------------------------

Let's see if I have this flow right:

1) TelemetrySession gets an API call for child threadHangs
2) Impl sends a "get me those hangs" message to each child process and holds onto the hanging Promise
3) Each child process reports back its hangs
4) The first child hangs report that comes back resolves all the waiting promises with the combined collection of child thread hangs it has received, then clears that collection of child hangs

I see two problems: the first is dealing with multiple API users, the second is dealing with multiple child processes.

1) If two promises come in, two requests go out, but if the second promise gets added before the first is resolved, the second promise will be resolved to an empty threadHangs object, no? this._childHangs will be emptied by the first resolve.

2) If there is more than one child process, that's more than one response. You don't seem to mux them together before resolving the promise or otherwise seem to handle this case.
Attachment #8711911 - Flags: feedback-
tracking-e10s: --- → +
Comment on attachment 8711911 [details] [diff] [review]
child-hangs.patch

I'll wait for chutten's feedback to be addressed first
Attachment #8711911 - Flags: review?(vladan.bugzilla)
Posted patch child-hangs.patch (obsolete) — Splinter Review
Changes since the last patch:

* Use a timeout to ensure that the promise is always resolved, even when the content processes hang.
* Fix requesting the hang stats multiple times when getChildThreadHangs() is called multiple times.
* Accumulate all the child process hang stats and return them all when all of the content processes have responded.
Attachment #8711911 - Attachment is obsolete: true
Attachment #8712757 - Flags: review?(chutten)
Comment on attachment 8712757 [details] [diff] [review]
child-hangs.patch

Review of attachment 8712757 [details] [diff] [review]:
-----------------------------------------------------------------

There's nothing we can do if the child dies between call and response, I'm just highlighting that the comment is a tad incomplete.

Tangent Rant: We're already using a feature-full IPC library, why do we not expose it to JS so we can have call-and-response? With Promises? They're built for exactly this. Gah. Then all this would be:

  var childrenPromise = Promise.all(children.map(child => child.sendAsyncWithReply(CHILD_THREAD_HANGS)));
  return Promise.race([...childrenPromise, new Promise((r, reject) => setTimeout(reject, TIMEOUT_MS))]);

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1714,5 @@
> +      // If we are, then the resolve function we registered above will receive the results without needing to request them again
> +      if (this._childThreadHangsResolveFunctions.length === 1) {
> +        // Request that the child send us (the parent) some thread hang stats
> +        this._childThreadHangs = []; // Clear the child hangs
> +        this._childCount = ppmm.childCount; // We have to cache the number of children we send messages to, in case the child count changes while waiting for messages to arrive

Could still backfire if a child dies between call and response.

@@ +1721,5 @@
> +        }
> +
> +        // Set up a timeout in case one of the content processes never responds
> +        this._childThreadHangsTimeout = setTimeout(() => {
> +          // Resolve all the promises that are waiting on these thread hang stats

Shouldn't the timeout reject the promises, not resolve them?
(In reply to Chris H-C :chutten from comment #4)

By which of course I meant:
>   return Promise.race([childrenPromise, new Promise((r, reject) =>
> setTimeout(reject, TIMEOUT_MS))]);

(( note the strip of the spread operator. childrenPromise isn't an iterable ))
> Could still backfire if a child dies between call and response.

There seem to be two possibilities if this happens:

1. A child dies before the last process responds. In this case, we could simply take the smaller of ppmm.childCount and this._childCount when checking if we've got everything, which would be the correct answer.
2. The dying child is the last process that needs to respond. In this case, we rely on the timeout to resolve the promise or we could use a timer to constantly poll for the value of childCount.

> Shouldn't the timeout reject the promises, not resolve them?

In addition to point 2 above, the function is supposed to get thread hang stats for all child processes. The way I see it, this means "get the stats from all child processes that will give us stats", so if one process is hanging we should still be able to get stats from the others. Is this reasonable?

To me, it seems like resolving the promise would make more sense.

Next steps:

I'll add a check for child processes dying before the last living child responds, and also a better commit message as suggested in IRC.
I would assume that the API would be "get me the stats for all current children" which means we should reject on timeout or if we detect the child count changing.

"Get me hangs from whoever wants to hand them out" is a bit wishy washy.

It depends on what an API consumer would want.

Honestly, I don't expect this to come up terribly often so I might be indulging in some bike shedding.
Posted patch child-hangs.patch (obsolete) — Splinter Review
Changes since last patch:

* More accurate comments for what the function's doing.
* More detailed commit message.
* Handle when there are no child processes better (before, we relied on the timeout, and now we detect no child processes and simply return an empty result).
* Handle the case when child processes die in between the promise being fully created, but not resolved. Note that we still need this._childCount to account for child processes being added.
Attachment #8712757 - Attachment is obsolete: true
Attachment #8712757 - Flags: review?(chutten)
Attachment #8712882 - Flags: review?(chutten)
Hmm, just saw your latest comment.

I might describe the current behaviour better as "give me stats for current child processes, likely excluding those that die or are added between the promise being created and the promise being resolved." (I say "likely" since a child process could send out data and then die, all before the promise is resolved).

As for "current" processes, I'm not sure if that's doable since it is possible that all child processes died and an equal number of new ones were spawned. What do you think?

FWIW, Statuser's child process hang stats diffing would be a bit simpler by always resolving, though it could be made to work either way.
Comment on attachment 8712882 [details] [diff] [review]
child-hangs.patch

Review of attachment 8712882 [details] [diff] [review]:
-----------------------------------------------------------------

Seems legit.

I like the commit message's completeness. The "What" section can probably be omitted in future commits as what the commit does should be apparent from the code without further explanation. The "How" section is excellent. The "Why" isn't so much "why are we doing this at all" because that there's a bug# in the commit (there's no bug# in this commit. Top line should be "bug# - one-liner r=reviewer") is enough reason. The "Why" is "why we did it this way instead of another way". 

The audience of a commit message is someone who's digging through history to answer, usually, if a particular side-effect of a change was deliberate. By writing how you did it and why you didn't do it in another way, you highlight the behaviours you actually wanted to include and warn the future dev against trying alternatives that proved to not work.
Attachment #8712882 - Flags: review?(chutten) → review+
Commit message changed. Is there anything else, or is this ready for checkin?
Attachment #8712882 - Attachment is obsolete: true
Attachment #8713282 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/8ee5a700e4a2
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Comment on attachment 8713282 [details] [diff] [review]
child-hangs.patch

Approval Request Comment
[Feature/regressing bug #]:
1241362

[User impact if declined]:
Extensions such as Statuser cannot report on child process BHR hangs in realtime.

[Describe test coverage new/current, TreeHerder]:
No changes.

[Risks and why]: 
Low risk; only adds a relatively self-contained function, two message types, and a bunch of comments to TelemetrySession.

[String/UUID change made/needed]:
No changes.
Attachment #8713282 - Flags: approval-mozilla-beta?
Attachment #8713282 - Flags: approval-mozilla-aurora?
Backed out to see if it fixes the intermittent hazard build failures like https://treeherder.mozilla.org/logviewer.html#?job_id=20736122&repo=mozilla-inbound
https://hg.mozilla.org/integration/mozilla-inbound/rev/c67d0eb738ed
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 8713282 [details] [diff] [review]
child-hangs.patch

Original patch backout, please resubmit an uplift request when it landed in m-c
Attachment #8713282 - Flags: approval-mozilla-beta?
Attachment #8713282 - Flags: approval-mozilla-beta-
Attachment #8713282 - Flags: approval-mozilla-aurora?
Attachment #8713282 - Flags: approval-mozilla-aurora-
As it's quite unlikely that this patch is part of the hazard build failures, I've submitted it for checkin again.
Assignee: nobody → azhang
Status: REOPENED → ASSIGNED
https://hg.mozilla.org/mozilla-central/rev/512496acf11d
Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Comment on attachment 8713282 [details] [diff] [review]
child-hangs.patch

Approval Request Comment
[Feature/regressing bug #]:
1241362

[User impact if declined]:
Extensions such as Statuser cannot report on child process BHR hangs in realtime.

[Describe test coverage new/current, TreeHerder]:
No changes.

[Risks and why]: 
Low risk; only adds a relatively self-contained function, two message types, and a bunch of comments to TelemetrySession.

[String/UUID change made/needed]:
No changes.
Attachment #8713282 - Flags: approval-mozilla-beta?
Attachment #8713282 - Flags: approval-mozilla-beta-
Attachment #8713282 - Flags: approval-mozilla-aurora?
Attachment #8713282 - Flags: approval-mozilla-aurora-
Attachment #8713282 - Flags: approval-mozilla-beta?
Attachment #8713282 - Flags: approval-mozilla-beta+
Attachment #8713282 - Flags: approval-mozilla-aurora?
Attachment #8713282 - Flags: approval-mozilla-aurora+
Should be in 45 beta 3.
Anthony, i see that the _childThreadHangs data you added here is not sent out with any Telemetry ping?
Can you add a comment to that effect to _childThreadHangs et al in a follow-up bug?
Flags: needinfo?(azhang)
Sure thing, would you mind giving the patch a quick once-over?

https://bugzilla.mozilla.org/show_bug.cgi?id=1250224
Flags: needinfo?(azhang)
You need to log in before you can comment on or make changes to this bug.