1189066 - Base CPOW alerts on the number of CPOWs, rather than their duration

David Teller [:Yoric] - still alive but not very active

Reporter

Description

•

9 years ago

For the moment, we measure CPOW duration by wallclock time, which is very inaccurate. Bug 1188616 should improve the accuracy somewhat, but this will still rely on wallclock time.

We cannot use CPU time directly, because most of the duration of a CPOW is spent in a wait state that doesn't use CPU. What we can do, however, is the following:
- in the child, compute the CPU time elapsed (that's the time spent by the child doing stuff);
- in the parent, compute the CPU time elapsed (that's the time spent by the parent serializing/deserializing/sending/receiving).

The sum should be a good approximation of a wallclock-independent cost for the CPOW.

David Teller [:Yoric] - still alive but not very active

Reporter

Updated

•

9 years ago

Blocks: 1185926
No longer blocks: 1188616

Depends on: 1188616

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 1

•

9 years ago

Bill, it's not quite clear to me which are the entry points into a CPOW in the child (in C++). Where could I add instrumentation to determine the time it takes for a child to entirely process a CPOW?

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 2

•

9 years ago

Why is wallclock time inaccurate? If the child does something that requires disk access or paging or whatever, we want to count that time against the CPOW. Also, most measures of CPU time that I know of have pretty low resolution and are somewhat expensive to compute.

However, here is where we answer CPOWs in the child:
http://mxr.mozilla.org/mozilla-central/source/js/ipc/WrapperAnswer.cpp

Note that this code also runs in the parent since CPOWs can go in both directions.

Flags: needinfo?(wmccloskey)

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 3

•

9 years ago

Well, wallclock counts paging, and scheduling, and system activity, and a number of other factors that are both beyond the control of the code and impossible to reproduce, so using wallclock as a measure of performance is begging for false positives. That's why I prefer measuring CPU activity.

You have a point that blocking I/O is something that we may want to measure. However, given that the duration of any blocking I/O is unbounded and depends almost entirely of external factors, I suspect that time (whether wallclock or otherwise) is not the right measure for blocking I/O at all. Perhaps a number of blocking I/O operations would be more appropriate. I'll keep that for a followup bug.

As for the accuracy and performance of CPU time measures, I seem to remember that we can get a resolution that's much lower than 1ms, for a cost that should be invisible in comparison with the cost of IPC. Things might get nasty if we have nested CPOWs, but I think we should generally be ok. Plus, it's a CPOW, so it's something that we want to remove in the first place :)

Thanks for the link, I'll dig into that file. Do I understand correctly that I should start my stopwatch when the constructor of `WrapperAnswer` is called and stop it when calling either `ok` or `fail` is called?

Also, where can I find the definition of the low-level messages exchanged?

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 4

•

9 years ago

Thinking more about this, it really seems like a mistake. The real costs of doing a CPOW are as follows:

1. Context switching (we have a separate I/O thread on both sides, so we do something like four context switches per IPC).
2. The child is busy when you send the CPOW, so the parent blocks for a while until the child returns to the event loop and processes the CPOW message.

I don't think the mechanism in comment 0 would account for either of those.

Flags: needinfo?(wmccloskey)

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 5

•

9 years ago

Interesting points. How do you suggest we measure these costs?

Note the case that worries me is not Telemetry (for this purpose, wallclock time would be appropriate, since we average out between users), but slow add-on alerts (for this purpose, we don't want to blame add-ons when it's the system itself that is busy and causing the slowdown).

Flags: needinfo?(wmccloskey)

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 6

•

9 years ago

I forgot to mention: we also don't want to display the alert when the computer is just coming back from sleep, or when the process is being paged back to physical RAM. Wallclock time is not good at that.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 7

•

9 years ago

(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #5)
> Interesting points. How do you suggest we measure these costs?
> 
> Note the case that worries me is not Telemetry (for this purpose, wallclock
> time would be appropriate, since we average out between users), but slow
> add-on alerts (for this purpose, we don't want to blame add-ons when it's
> the system itself that is busy and causing the slowdown).

For that, maybe we should just measure the number of CPOWs sent per time. I guess that's the only thing that's fair to the add-on.

For telemetry, though, I'd like to stick with what we have.

Flags: needinfo?(wmccloskey)

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 8

•

9 years ago

Sounds good to me.

Summary: Track total CPU cost of CPOWs, not wallclock time → Base CPOW alerts on the number of CPOWs, rather than their duration

David Teller [:Yoric] - still alive but not very active

Reporter

Comment 9

•

9 years ago

In this case, we can close the bug.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → WONTFIX

Bugzilla

Quick Search

Base CPOW alerts on the number of CPOWs, rather than their duration

Categories

(Toolkit :: Performance Monitoring, defect)

Tracking

()

People

(Reporter: Yoric, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9