Closed Bug 1189066 Opened 9 years ago Closed 9 years ago

Base CPOW alerts on the number of CPOWs, rather than their duration

Categories

(Toolkit :: Performance Monitoring, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: Yoric, Unassigned)

References

Details

For the moment, we measure CPOW duration by wallclock time, which is very inaccurate. Bug 1188616 should improve the accuracy somewhat, but this will still rely on wallclock time.

We cannot use CPU time directly, because most of the duration of a CPOW is spent in a wait state that doesn't use CPU. What we can do, however, is the following:
- in the child, compute the CPU time elapsed (that's the time spent by the child doing stuff);
- in the parent, compute the CPU time elapsed (that's the time spent by the parent serializing/deserializing/sending/receiving).

The sum should be a good approximation of a wallclock-independent cost for the CPOW.
Blocks: 1185926
No longer blocks: 1188616
Depends on: 1188616
Bill, it's not quite clear to me which are the entry points into a CPOW in the child (in C++). Where could I add instrumentation to determine the time it takes for a child to entirely process a CPOW?
Flags: needinfo?(wmccloskey)
Why is wallclock time inaccurate? If the child does something that requires disk access or paging or whatever, we want to count that time against the CPOW. Also, most measures of CPU time that I know of have pretty low resolution and are somewhat expensive to compute.

However, here is where we answer CPOWs in the child:
http://mxr.mozilla.org/mozilla-central/source/js/ipc/WrapperAnswer.cpp

Note that this code also runs in the parent since CPOWs can go in both directions.
Flags: needinfo?(wmccloskey)
Well, wallclock counts paging, and scheduling, and system activity, and a number of other factors that are both beyond the control of the code and impossible to reproduce, so using wallclock as a measure of performance is begging for false positives. That's why I prefer measuring CPU activity.

You have a point that blocking I/O is something that we may want to measure. However, given that the duration of any blocking I/O is unbounded and depends almost entirely of external factors, I suspect that time (whether wallclock or otherwise) is not the right measure for blocking I/O at all. Perhaps a number of blocking I/O operations would be more appropriate. I'll keep that for a followup bug.

As for the accuracy and performance of CPU time measures, I seem to remember that we can get a resolution that's much lower than 1ms, for a cost that should be invisible in comparison with the cost of IPC. Things might get nasty if we have nested CPOWs, but I think we should generally be ok. Plus, it's a CPOW, so it's something that we want to remove in the first place :)

Thanks for the link, I'll dig into that file. Do I understand correctly that I should start my stopwatch when the constructor of `WrapperAnswer` is called and stop it when calling either `ok` or `fail` is called?

Also, where can I find the definition of the low-level messages exchanged?
Flags: needinfo?(wmccloskey)
Thinking more about this, it really seems like a mistake. The real costs of doing a CPOW are as follows:

1. Context switching (we have a separate I/O thread on both sides, so we do something like four context switches per IPC).
2. The child is busy when you send the CPOW, so the parent blocks for a while until the child returns to the event loop and processes the CPOW message.

I don't think the mechanism in comment 0 would account for either of those.
Flags: needinfo?(wmccloskey)
Interesting points. How do you suggest we measure these costs?

Note the case that worries me is not Telemetry (for this purpose, wallclock time would be appropriate, since we average out between users), but slow add-on alerts (for this purpose, we don't want to blame add-ons when it's the system itself that is busy and causing the slowdown).
Flags: needinfo?(wmccloskey)
I forgot to mention: we also don't want to display the alert when the computer is just coming back from sleep, or when the process is being paged back to physical RAM. Wallclock time is not good at that.
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #5)
> Interesting points. How do you suggest we measure these costs?
> 
> Note the case that worries me is not Telemetry (for this purpose, wallclock
> time would be appropriate, since we average out between users), but slow
> add-on alerts (for this purpose, we don't want to blame add-ons when it's
> the system itself that is busy and causing the slowdown).

For that, maybe we should just measure the number of CPOWs sent per time. I guess that's the only thing that's fair to the add-on.

For telemetry, though, I'd like to stick with what we have.
Flags: needinfo?(wmccloskey)
Sounds good to me.
Summary: Track total CPU cost of CPOWs, not wallclock time → Base CPOW alerts on the number of CPOWs, rather than their duration
In this case, we can close the bug.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.