Closed
Bug 1242777
Opened 10 years ago
Closed 10 years ago
[e10s] Expose child process thread hang stats to Javascript
Categories
(Toolkit :: Telemetry, defect)
Toolkit
Telemetry
Tracking
()
RESOLVED
FIXED
mozilla47
People
(Reporter: azhang, Assigned: azhang)
Details
Attachments
(1 file, 3 obsolete files)
14.31 KB,
patch
|
chutten
:
review+
Sylvestre
:
approval-mozilla-aurora+
Sylvestre
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
Currently, parent process thread hang stats are available to extensions such as Statuser [1] via Services.telemetry.threadHangStats (in Services.jsm).
However, there's currently no support for doing the same thing for the child process.
This patch exposes the child process thread hang stats to the parent process via TelemetrySession.getChildThreadHangs(). Since communication between the processes should be asynchronous, the function returns a Promise:
TelemetrySession.getChildThreadHangs().then((hangs) => {
// here, each element of "hangs" has the same structure as Services.telemetry.threadHangStats
});
This patch extends TelemetrySession and does not change any existing functionality. Essentially, when getChildThreadHangs() is called in the parent, it sends a message to the child, which then sends its thread hang stats back to the parent. The resulting data is used to resolve the promise.
Measurements on my own machine show that the promise takes ~8ms to resolve. This makes it quite suitable for monitoring hangs in realtime. Support for getChildThreadHangs() in Statuser will be added soon.
[1]: https://github.com/chutten/statuser
Assignee | ||
Updated•10 years ago
|
Attachment #8711911 -
Flags: review?(vladan.bugzilla)
Comment 1•10 years ago
|
||
Comment on attachment 8711911 [details] [diff] [review]
child-hangs.patch
Review of attachment 8711911 [details] [diff] [review]:
-----------------------------------------------------------------
Let's see if I have this flow right:
1) TelemetrySession gets an API call for child threadHangs
2) Impl sends a "get me those hangs" message to each child process and holds onto the hanging Promise
3) Each child process reports back its hangs
4) The first child hangs report that comes back resolves all the waiting promises with the combined collection of child thread hangs it has received, then clears that collection of child hangs
I see two problems: the first is dealing with multiple API users, the second is dealing with multiple child processes.
1) If two promises come in, two requests go out, but if the second promise gets added before the first is resolved, the second promise will be resolved to an empty threadHangs object, no? this._childHangs will be emptied by the first resolve.
2) If there is more than one child process, that's more than one response. You don't seem to mux them together before resolving the promise or otherwise seem to handle this case.
Attachment #8711911 -
Flags: feedback-
![]() |
||
Updated•10 years ago
|
tracking-e10s:
--- → +
Comment 2•10 years ago
|
||
Comment on attachment 8711911 [details] [diff] [review]
child-hangs.patch
I'll wait for chutten's feedback to be addressed first
Attachment #8711911 -
Flags: review?(vladan.bugzilla)
Assignee | ||
Comment 3•10 years ago
|
||
Changes since the last patch:
* Use a timeout to ensure that the promise is always resolved, even when the content processes hang.
* Fix requesting the hang stats multiple times when getChildThreadHangs() is called multiple times.
* Accumulate all the child process hang stats and return them all when all of the content processes have responded.
Attachment #8711911 -
Attachment is obsolete: true
Attachment #8712757 -
Flags: review?(chutten)
Comment 4•10 years ago
|
||
Comment on attachment 8712757 [details] [diff] [review]
child-hangs.patch
Review of attachment 8712757 [details] [diff] [review]:
-----------------------------------------------------------------
There's nothing we can do if the child dies between call and response, I'm just highlighting that the comment is a tad incomplete.
Tangent Rant: We're already using a feature-full IPC library, why do we not expose it to JS so we can have call-and-response? With Promises? They're built for exactly this. Gah. Then all this would be:
var childrenPromise = Promise.all(children.map(child => child.sendAsyncWithReply(CHILD_THREAD_HANGS)));
return Promise.race([...childrenPromise, new Promise((r, reject) => setTimeout(reject, TIMEOUT_MS))]);
::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1714,5 @@
> + // If we are, then the resolve function we registered above will receive the results without needing to request them again
> + if (this._childThreadHangsResolveFunctions.length === 1) {
> + // Request that the child send us (the parent) some thread hang stats
> + this._childThreadHangs = []; // Clear the child hangs
> + this._childCount = ppmm.childCount; // We have to cache the number of children we send messages to, in case the child count changes while waiting for messages to arrive
Could still backfire if a child dies between call and response.
@@ +1721,5 @@
> + }
> +
> + // Set up a timeout in case one of the content processes never responds
> + this._childThreadHangsTimeout = setTimeout(() => {
> + // Resolve all the promises that are waiting on these thread hang stats
Shouldn't the timeout reject the promises, not resolve them?
Comment 5•10 years ago
|
||
(In reply to Chris H-C :chutten from comment #4)
By which of course I meant:
> return Promise.race([childrenPromise, new Promise((r, reject) =>
> setTimeout(reject, TIMEOUT_MS))]);
(( note the strip of the spread operator. childrenPromise isn't an iterable ))
Assignee | ||
Comment 6•10 years ago
|
||
> Could still backfire if a child dies between call and response.
There seem to be two possibilities if this happens:
1. A child dies before the last process responds. In this case, we could simply take the smaller of ppmm.childCount and this._childCount when checking if we've got everything, which would be the correct answer.
2. The dying child is the last process that needs to respond. In this case, we rely on the timeout to resolve the promise or we could use a timer to constantly poll for the value of childCount.
> Shouldn't the timeout reject the promises, not resolve them?
In addition to point 2 above, the function is supposed to get thread hang stats for all child processes. The way I see it, this means "get the stats from all child processes that will give us stats", so if one process is hanging we should still be able to get stats from the others. Is this reasonable?
To me, it seems like resolving the promise would make more sense.
Next steps:
I'll add a check for child processes dying before the last living child responds, and also a better commit message as suggested in IRC.
Comment 7•10 years ago
|
||
I would assume that the API would be "get me the stats for all current children" which means we should reject on timeout or if we detect the child count changing.
"Get me hangs from whoever wants to hand them out" is a bit wishy washy.
It depends on what an API consumer would want.
Honestly, I don't expect this to come up terribly often so I might be indulging in some bike shedding.
Assignee | ||
Comment 8•10 years ago
|
||
Changes since last patch:
* More accurate comments for what the function's doing.
* More detailed commit message.
* Handle when there are no child processes better (before, we relied on the timeout, and now we detect no child processes and simply return an empty result).
* Handle the case when child processes die in between the promise being fully created, but not resolved. Note that we still need this._childCount to account for child processes being added.
Attachment #8712757 -
Attachment is obsolete: true
Attachment #8712757 -
Flags: review?(chutten)
Attachment #8712882 -
Flags: review?(chutten)
Assignee | ||
Comment 9•10 years ago
|
||
Hmm, just saw your latest comment.
I might describe the current behaviour better as "give me stats for current child processes, likely excluding those that die or are added between the promise being created and the promise being resolved." (I say "likely" since a child process could send out data and then die, all before the promise is resolved).
As for "current" processes, I'm not sure if that's doable since it is possible that all child processes died and an equal number of new ones were spawned. What do you think?
FWIW, Statuser's child process hang stats diffing would be a bit simpler by always resolving, though it could be made to work either way.
Comment 10•10 years ago
|
||
Comment on attachment 8712882 [details] [diff] [review]
child-hangs.patch
Review of attachment 8712882 [details] [diff] [review]:
-----------------------------------------------------------------
Seems legit.
I like the commit message's completeness. The "What" section can probably be omitted in future commits as what the commit does should be apparent from the code without further explanation. The "How" section is excellent. The "Why" isn't so much "why are we doing this at all" because that there's a bug# in the commit (there's no bug# in this commit. Top line should be "bug# - one-liner r=reviewer") is enough reason. The "Why" is "why we did it this way instead of another way".
The audience of a commit message is someone who's digging through history to answer, usually, if a particular side-effect of a change was deliberate. By writing how you did it and why you didn't do it in another way, you highlight the behaviours you actually wanted to include and warn the future dev against trying alternatives that proved to not work.
Attachment #8712882 -
Flags: review?(chutten) → review+
Assignee | ||
Comment 11•10 years ago
|
||
Commit message changed. Is there anything else, or is this ready for checkin?
Attachment #8712882 -
Attachment is obsolete: true
Updated•10 years ago
|
Attachment #8713282 -
Flags: review+
Assignee | ||
Updated•10 years ago
|
Keywords: checkin-needed
Comment 12•10 years ago
|
||
Keywords: checkin-needed
Comment 13•10 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Assignee | ||
Comment 14•10 years ago
|
||
Comment on attachment 8713282 [details] [diff] [review]
child-hangs.patch
Approval Request Comment
[Feature/regressing bug #]:
1241362
[User impact if declined]:
Extensions such as Statuser cannot report on child process BHR hangs in realtime.
[Describe test coverage new/current, TreeHerder]:
No changes.
[Risks and why]:
Low risk; only adds a relatively self-contained function, two message types, and a bunch of comments to TelemetrySession.
[String/UUID change made/needed]:
No changes.
Attachment #8713282 -
Flags: approval-mozilla-beta?
Attachment #8713282 -
Flags: approval-mozilla-aurora?
Comment 15•10 years ago
|
||
backout bugherder landing |
Backed out to see if it fixes the intermittent hazard build failures like https://treeherder.mozilla.org/logviewer.html#?job_id=20736122&repo=mozilla-inbound
https://hg.mozilla.org/integration/mozilla-inbound/rev/c67d0eb738ed
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 16•10 years ago
|
||
Comment on attachment 8713282 [details] [diff] [review]
child-hangs.patch
Original patch backout, please resubmit an uplift request when it landed in m-c
Attachment #8713282 -
Flags: approval-mozilla-beta?
Attachment #8713282 -
Flags: approval-mozilla-beta-
Attachment #8713282 -
Flags: approval-mozilla-aurora?
Attachment #8713282 -
Flags: approval-mozilla-aurora-
Assignee | ||
Updated•10 years ago
|
Keywords: checkin-needed
Assignee | ||
Comment 17•10 years ago
|
||
As it's quite unlikely that this patch is part of the hazard build failures, I've submitted it for checkin again.
Updated•10 years ago
|
Assignee: nobody → azhang
Status: REOPENED → ASSIGNED
Comment 18•10 years ago
|
||
Keywords: checkin-needed
Comment 19•10 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 20•10 years ago
|
||
Comment on attachment 8713282 [details] [diff] [review]
child-hangs.patch
Approval Request Comment
[Feature/regressing bug #]:
1241362
[User impact if declined]:
Extensions such as Statuser cannot report on child process BHR hangs in realtime.
[Describe test coverage new/current, TreeHerder]:
No changes.
[Risks and why]:
Low risk; only adds a relatively self-contained function, two message types, and a bunch of comments to TelemetrySession.
[String/UUID change made/needed]:
No changes.
Attachment #8713282 -
Flags: approval-mozilla-beta?
Attachment #8713282 -
Flags: approval-mozilla-beta-
Attachment #8713282 -
Flags: approval-mozilla-aurora?
Attachment #8713282 -
Flags: approval-mozilla-aurora-
Updated•10 years ago
|
status-firefox45:
--- → affected
status-firefox46:
--- → affected
Updated•10 years ago
|
Attachment #8713282 -
Flags: approval-mozilla-beta?
Attachment #8713282 -
Flags: approval-mozilla-beta+
Attachment #8713282 -
Flags: approval-mozilla-aurora?
Attachment #8713282 -
Flags: approval-mozilla-aurora+
Comment 21•10 years ago
|
||
Should be in 45 beta 3.
Comment 22•10 years ago
|
||
bugherder uplift |
Comment 23•10 years ago
|
||
bugherder uplift |
Comment 24•10 years ago
|
||
Anthony, i see that the _childThreadHangs data you added here is not sent out with any Telemetry ping?
Can you add a comment to that effect to _childThreadHangs et al in a follow-up bug?
Flags: needinfo?(azhang)
Assignee | ||
Comment 25•10 years ago
|
||
Sure thing, would you mind giving the patch a quick once-over?
https://bugzilla.mozilla.org/show_bug.cgi?id=1250224
Flags: needinfo?(azhang)
You need to log in
before you can comment on or make changes to this bug.
Description
•