Closed Bug 1814718 Opened 1 year ago Closed 1 year ago

Record number of timers fired per Timer Thread wake-up

Categories

(Core :: XPCOM, task)

task

Tracking

()

RESOLVED FIXED
112 Branch
Tracking Status
firefox112 --- fixed

People

(Reporter: jlink, Assigned: jlink)

References

Details

Attachments

(3 files)

For the purpose of measuring the impact of https://bugzilla.mozilla.org/show_bug.cgi?id=1783405 it would useful to measure how many timers are fired each time the Timer Thread wakes up to do work. The intention of the above work is to cause a reduction in the number of times that the Timer Thread wakes up to accomplish a given amount of work by slightly increasing the amount of work that it does each time that it wakes up.

We need the telemetry to go in a few weeks before that change does in order to establish a baseline from real user behavior. The result that we expect to see is a small right-ward shift in the lower end of the distribution.

Assignee: nobody → jlink
No longer depends on: 1769255
Attachment #9316057 - Flags: data-review?(chutten)

Comment on attachment 9316057 [details]
1814718-data-collection-review-request.txt

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection can be controlled through the product's preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, :jlink is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.


Result: datareview+

Attachment #9316057 - Flags: data-review?(chutten) → data-review+
Pushed by fqueze@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/56cd09827bab
Added telemetry for number of timers fired per Timer Thread wake-up. r=smaug,florian

I'm baffled that this particular change here would cause a regression in those tests. A couple thoughts/questions came to mind:

  1. @smolnar: In the push with my change which has those failures those tests are consistently taking longer to run than in the push with my changes removed in which those tests don't fail. Other tests in that same suite do not show this time discrepancy. The failures are also only occurring on Windows 10 x64 2004 WebRender Shippable and, when they are failing, those tests are taking even longer than the same tests in the Windows 10 x64 2004 WebRender debug. Is it possible that those particular tests are running on a particular machine that is older/slower or that has something else going on with it to make it slow?
  2. @jrediger: There is not much going on in this change. The only "interesting" thing is really the call into glean to record the sample. Can you think of any conditions under which this could cause a performance problem? It seems to be only a couple tests and only on a highly-optimized configuration which is odd.
  3. Does anybody know how to trigger these tests in a push-to-try? I did one and tried to find and select everything that seemed related to Windows Shippable and, unless I'm missing something, it didn't run the "M-spi-nw" tests for that config (but they did run for some other configs).
Flags: needinfo?(smolnar)
Flags: needinfo?(jrediger)
Flags: needinfo?(jlink)

@Justin, the difference between test taking longer, is that in the first link there was a backstop push, which means all the tests are running. in the backout link not all the tests are running, hence the time difference. You can see the difference in what tests are running, by clicking on a test and checking it in the "Test Groups" tab, next to "Performance" tab.
The failure seems to be win shippable specific.
I have added manually on your test try the missing jobs, hopefully the needed test will run

Flags: needinfo?(smolnar)

(In reply to Justin Link from comment #6)

  1. @jrediger: There is not much going on in this change. The only "interesting" thing is really the call into glean to record the sample. Can you think of any conditions under which this could cause a performance problem? It seems to be only a couple tests and only on a highly-optimized configuration which is odd.

I looked at the patch. Do you have an idea how many samples would be accumulated there at once?
I realized that AccumulateSamples will allocate a (Rust) vec to hold those, and depending on how many that could get expensive easily.

Flags: needinfo?(jrediger)

I grabbed a quick profile of editor/libeditor/tests/test_undo_after_spellchecker_replaces_word.html without Justin's patch. During the test, the Timer thread of the content process is awakened 192758 in about 3.2s. I think Justin's patch would record a sample in Glean every time.
When profiling the test with the patch applied, I can't reproduce the failure, but I see 2 "FOG IPC Payload getting full" runnables on the child process running the test. I think this confirms we are recording a lot of data.

This seems to be due to AsyncSpellCheckTestHelper.jsm adding a repeating 0ms timer: https://searchfox.org/mozilla-central/rev/5ba1bdb6f8fbf8e8c91720b1e584c0081022f1bc/editor/AsyncSpellCheckTestHelper.jsm#93-94

This helper is used by the two tests mentioned in comment 5 and more: https://searchfox.org/mozilla-central/search?q=maybeOnSpellCheck&path=

I wake up to find a whole bunch of really useful feedback. I love it!

It seems like a potential easy "fix" here would be to have the TimerThread queue up its telemetry data and only flush it once enough has been captured. That should be able to mitigate most of the cost of adding this telemetry in this situation and won't take long to do.

Seeing timers used like this makes me kind of sad though. Am I the only one who thinks that repeatedly adding timers that need to be fired right away isn't a great way to use the Timer Thread? I'll need to think about how this type of usage is handled by the upcoming timer coalescing work (which is what this telemetry is intended to support).

I confess to asking myself how much of this bug is "Oh, Glean is cheap but not free. Once we get a full idea of the shape of this issue we'll need to examine how best to support this case." and how much of this bug is "Oh, new instrumentation managed to catch a problem before it even reached users! And once it's fixed we can re-evaluate the perf."

(In reply to Justin Link from comment #10)

Seeing timers used like this makes me kind of sad though. Am I the only one who thinks that repeatedly adding timers that need to be fired right away isn't a great way to use the Timer Thread?

Yeah, I refrained from suggesting immediately that we change the test helper to use ChromeUtils.idleDispatch instead of a 0ms timer and move on, mostly because I'm not sure if this would be hiding other similar cases. This might still be the way to move forward here.

Maybe this is just another reason to want to fix bug 1770658.

Here is push-to-try with my change to queue up glean submissions rather than send them every wake-up showing that the M-spi-nw tests are again passing on Windows 10 x64 2004 WebRender Shippable.

Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/de43589862f3
Added telemetry for number of timers fired per Timer Thread wake-up. r=smaug,florian
https://hg.mozilla.org/integration/autoland/rev/45ae4c850c9f
Buffer "timers fired per-wake-up" stats before sending to glean. r=smaug
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch
See Also: → 1817498
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: