mozilla::glean::impl::TimingDistributionMetric::Start causes hangs in nsLayoutUtils::PaintFrame
Categories
(Toolkit :: Telemetry, defect, P1)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox-esr78 | --- | unaffected |
| firefox-esr91 | --- | unaffected |
| firefox92 | --- | unaffected |
| firefox93 | --- | disabled |
| firefox94 | --- | fixed |
People
(Reporter: florian, Assigned: chutten)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: hang, perf, regression, Whiteboard: [bhr:timing_distribution])
Attachments
(7 files)
48 bytes, text/x-phabricator-request | Details | Review
42 bytes, text/x-github-pull-request | Details | Review
48 bytes, text/x-phabricator-request | Details | Review
48 bytes, text/x-phabricator-request | Details | Review
48 bytes, text/x-phabricator-request | Details | Review
48 bytes, text/x-phabricator-request | Details | Review
48 bytes, text/x-phabricator-request | Details | Review
Starting on the September 2nd Nightly build, this represents about 7.8% of the total parent process main thread hang time: https://fqueze.github.io/hang-stats/#date=20210902&row=0&filter=timing_distribution
Stack:
RtlAcquireSRWLockExclusive ntdll
glean::private::timing_distribution::{{impl}}::start(glean::private::timing_distribution::TimingDistributionMetric*) xul
fog::ffi::timing_distribution::fog_timing_distribution_start(unsigned int) xul
mozilla::glean::impl::TimingDistributionMetric::Start() const xul
static nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, mozilla::nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) xul
Reporter
Comment 1•3 years ago
This is currently the #2 most frequent hang.
Comment 2•3 years ago
Chris, any idea who should look at this? This looks like a pretty big issue if we want people to use Glean.
Comment 3•3 years ago
Adding ni? to Jeff as well, who might have thoughts on whether there's something we need to do right now to mitigate this.
Comment 5•3 years ago
This is presumably hanging on this lock? https://searchfox.org/mozilla-central/rev/a166f59fba89fc70ebfab287f4edb8e05ed4f6da/third_party/rust/glean/src/private/timing_distribution.rs#25
The question is who's holding it.
Florian, do we know which process the hang is happening in?
Reporter
Comment 6•3 years ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #5)
> Florian, do we know which process the hang is happening in?

The dashboard at https://fqueze.github.io/hang-stats/ only shows data for the parent process.
Comment 7•3 years ago
chutten, are there any threads/situations that would hold this lock for a long time?
Assignee
Comment 8•3 years ago
That lock isn't taken by anything outside that file, so we have:
- Construction
- accumulate_raw_samples_nanos (writes to the db)
- start (holds only long enough to store a timestamp in a hashmap)
- stop_and_accumulate (writes to the db)
My guess is the db is slow on the machines suffering the hang, so we need to tell it to delay those writes. Luckily we have a mechanism for that; we've just never needed it before. Lemme post a quick patch.
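For illustration only, here is a minimal sketch of the locking pattern described above (the names and map type are simplified stand-ins, not the actual Glean SDK code): start() only holds the lock long enough to insert a timestamp, but it still has to wait if another thread is inside a slow database write in stop_and_accumulate().

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Simplified stand-in for the per-metric state guarded by the lock.
struct TimingDistribution {
    // Maps a timer id to the moment start() was called.
    start_times: Mutex<HashMap<u64, Instant>>,
}

impl TimingDistribution {
    fn start(&self, id: u64) {
        // Cheap on its own: just store a timestamp...
        // ...but this blocks if another thread currently holds the lock.
        self.start_times.lock().unwrap().insert(id, Instant::now());
    }

    fn stop_and_accumulate(&self, id: u64) {
        let mut times = self.start_times.lock().unwrap();
        if let Some(start) = times.remove(&id) {
            // In the real SDK the sample is written to the database while
            // the lock is held; if that write is slow, concurrent start()
            // calls (e.g. from PaintFrame) wait on the lock.
            persist_sample(start.elapsed());
        }
    }
}

fn persist_sample(_sample: Duration) {
    // Placeholder for the (potentially slow) database write.
}

fn main() {
    let metric = TimingDistribution { start_times: Mutex::new(HashMap::new()) };
    metric.start(1);
    metric.stop_and_accumulate(1);
}
```

This is why a slow disk can turn a nominally cheap Start() call on the main thread into the hang seen in PaintFrame.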
Assignee
Comment 9•3 years ago
There are hangs in using TimingDistribution in Firefox Desktop, which is odd.
Current suspect is a slow db, so let's delay the io and see if it's true.
Updated•3 years ago
Comment 10•3 years ago
Comment 11•3 years ago
Backed out for causing telemetry failures in test_fog_background_update_ping
Backout link: https://hg.mozilla.org/integration/autoland/rev/61f59eea8c2d033efac5b05ab182890715003ab0
Assignee
Comment 12•3 years ago
Well heck, that means the order isn't guaranteed. That's almost certainly a bug.
Comment 13•3 years ago
Updated•3 years ago
Assignee
Comment 14•3 years ago
Hm, but the boolean and string ahead of it are present. If the batching were failing, those'd be gone too. Maybe there's a subtle timing interaction to do with the profile lock that background update takes? I need to add logging.
Alas, testing this locally isn't working. `./mach telemetry-tests-client toolkit/components/telemetry/tests/marionette/tests/client/test_fog_background_update_ping.py` barfs with marionette telling me to start a session first (on my local Linux machine). I'll re-bootstrap in case my deps need updating or something, but I'm not hopeful I won't be stuck debugging this using print statements on try (to cover my bets I've already submitted once).
Assignee
Comment 15•3 years ago
Ah. For posterity and anyone else running afoul of `test_fog_background_update_ping`: the reason the test won't run for me locally is because there is no Background Update on Linux (the platform I was trying to run the test on) (yet) and the test's `skip-if` isn't respected when you run the test directly on the command line. (Which I already knew and wrote docs about.)
Also, the try run with logs confirms that the ping being sent successfully contains absolutely everything but `uuid.background_update.client_id`:
{
  "ping_info": {
    "seq": 0,
    "start_time": "2021-09-09T14:02+00:00",
    "end_time": "2021-09-09T14:02+00:00"
  },
  "client_info": {
    "telemetry_sdk_build": "40.1.0",
    "first_run_date": "2021-09-09+00:00",
    "architecture": "x86",
    "os_version": "10.0",
    "os": "Windows",
    "app_build": "20210909133459",
    "app_channel": "default",
    "app_display_version": "94.0a1"
  },
  "metrics": {
    "string_list": {
      "background_update.reasons": ["app.update.auto=false", "app.update.background.enabled=false"],
      "background_update.states": ["NEVER_CHECKED"]
    },
    "string": {
      "background_update.final_state": "NEVER_CHECKED",
      "update.channel": "default"
    },
    "boolean": {
      "update.background_update": false,
      "update.auto_download": false,
      "update.can_usually_apply_updates": true,
      "update.can_usually_use_bits": true,
      "update.can_usually_stage_updates": true,
      "update.enabled": true,
      "update.service_enabled": true,
      "update.can_usually_check_for_updates": true,
      "background_update.exit_code_success": true
    }
  }
}
Next step: get this running and failing locally on my Windows machine so I can add more logs and poke around why there is no `client_id`. My read of the relevant code suggests that the client_id must be set or else an error would log or there'd be no ping sent at all... so there's something going on here. Maybe we're trying to send an unparseable uuid (but only when delaying ping-lifetime io? Whut.) so it fails silently, or maybe something else is going on.
Comment 16•3 years ago
Backout merged to central: https://hg.mozilla.org/mozilla-central/rev/61f59eea8c2d
Assignee
Comment 17•3 years ago
Alrighty, though I've managed to get it working on Windows, I haven't managed to get it to log from the background update process on Windows. But that's not just me: a failing run and a succeeding run both fail to have Glean log lines mentioning a "background-update" ping is being submitted.
"Failing run" is a run with `delay_ping_lifetime_io` set to `true`; "succeeding" is with it set to `false`.
MacOSX does log, though (failing run, succeeding run).
This logging problem is unfortunate (it means I can't debug this test on any of my machines) and should be looked into. I've filed bug 1730012 for that.
The difference in logging between a successful and failing run is a minor reordering in the segment we're looking at (the bits after the `===============` fences). There's nothing obviously going wrong in here: both get a valid UUID from the client_id state file, parse it, and send it to Glean. Why it doesn't appear in the ping that is later submitted is a true mystery.
I think I may be left with no alternative but instrumenting deeper and deeper into the stack, which will hit problems quickly when I have to edit the Glean crates themselves as I'll have to remember how to do that. Were there docs for that? Ah yes we have docs for that.
Struck by a sense of foreboding, I checked if this "ping doesn't get Telemetry's client_id in it" was present in the data. It is, but only at 0.04% (17k out of 41M pings since Sept 1). Overall it's pretty weak evidence that there's a timing issue in the wild that this test just happens to miss most of the time.
Next up for me is continuing to use try to give me log-filled mac runs with `delay_ping_lifetime_io` both on and off so I can hunt for discrepancies. Maybe there is something special about `uuid` metric types when it comes to ping lifetimes and I just can't see it in the code.
Assignee
Comment 18•3 years ago
Ah-ha! Found it.
Turns out that we clear various small "ping"-lifetime stores as part of init: `baseline`'s and the special internal `glean_client_info`. There's nothing wrong with this; we just make sure to call `clear_ping_lifetime_storage` with the appropriate `storage_name` and it'll make sure to only clear the part of `ping_lifetime_data` that corresponds to that... specific... storage_name....
When operating in `delay_ping_lifetime_io` mode, clearing any storage will clear all (delayed) ping-lifetime storage. Whoops.
Next Steps:
- Write a test that (in `delay_ping_lifetime_io` mode) records data in `store1`, clears `store2`, then verifies the data is still in `store1`. It should fail.
- Adjust `clear_ping_lifetime_storage` to iterate over `ping_lifetime_data` and only erase the stuff that starts with the storage key (roughly as in the sketch after this list).
- Test should pass, so check it in and get a patch release out for the SDK
- Update, vendor, and push the fixed Glean to m-c
- Turn on `delay_ping_lifetime_io` mode in m-c
- Reinstate `gfx/metrics.yaml` and the instrumentation from bug 1728423
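To make the second step concrete, here is a minimal sketch of the prefix-based clearing, using a plain BTreeMap stand-in for the delayed ping-lifetime store; the "storage_name#metric_id" key format is an assumption for illustration and may not match the SDK exactly.

```rust
use std::collections::BTreeMap;

// Stand-in for the in-memory ping-lifetime store used when
// delay_ping_lifetime_io is enabled (hypothetical key/value types).
fn clear_ping_lifetime_storage(
    ping_lifetime_data: &mut BTreeMap<String, String>,
    storage_name: &str,
) {
    // The buggy behaviour was equivalent to ping_lifetime_data.clear(),
    // wiping every store. The fix only drops entries whose key belongs
    // to the named store (assuming "storage_name#metric_id"-style keys).
    let prefix = format!("{}#", storage_name);
    ping_lifetime_data.retain(|key, _| !key.starts_with(&prefix));
}

fn main() {
    let mut data = BTreeMap::new();
    data.insert("store1#metric_a".to_string(), "kept".to_string());
    data.insert("store2#metric_b".to_string(), "cleared".to_string());

    clear_ping_lifetime_storage(&mut data, "store2");

    assert!(data.contains_key("store1#metric_a"));
    assert!(!data.contains_key("store2#metric_b"));
}
```

The `main` shows the behaviour the new test from step 1 checks for: data recorded in store1 survives clearing store2.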
Comment 19•3 years ago
Assignee
Comment 20•3 years ago
Depends on D124930
Comment 21•3 years ago
Comment 22•3 years ago
Backed out for causing linux x86 debug bustage
Backout link: https://hg.mozilla.org/integration/autoland/rev/f43509d3a1f8c84f0f0b013d4830b64bfd1babd9
Assignee
Comment 23•3 years ago
Wait a minute, isn't m-c built with rust 1.54.0?
Oh geez, it's 1.51.0? How? I thought bug 1723016 forced a minimum of 1.54.0 and `btree_retain` was stabilized on 1.53.0... :glandium, can you help me understand?
Comment 24•3 years ago
m-c is built with 1.55.0 but has a minimum supported rust version of 1.51 at the moment. If you need a bump to the minimum supported version, we can do that. Please file a bug in "Firefox Build System :: Toolchains".
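For context, a minimal illustration (not taken from the Glean SDK) of why the minimum supported rust version matters for the fix sketched above: BTreeMap::retain was only stabilized in Rust 1.53, so code targeting a 1.51 MSRV has to fall back to collecting the keys to drop and removing them afterwards.

```rust
use std::collections::BTreeMap;

fn main() {
    let mut data: BTreeMap<String, String> = BTreeMap::new();
    data.insert("store1#metric_a".into(), "kept".into());
    data.insert("store2#metric_b".into(), "cleared".into());

    let prefix = "store2#";

    // On Rust 1.53+ this is a one-liner:
    //     data.retain(|key, _| !key.starts_with(prefix));
    // For a 1.51 MSRV, collect the doomed keys first, then remove them.
    let doomed: Vec<String> = data
        .keys()
        .filter(|key| key.starts_with(prefix))
        .cloned()
        .collect();
    for key in &doomed {
        data.remove(key);
    }

    assert!(data.contains_key("store1#metric_a"));
    assert!(!data.contains_key("store2#metric_b"));
}
```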
Assignee
Comment 25•3 years ago
New steps:
1. Write a broken test that checks that we persist unsent ping-lifetime data between runs in the Rust Glean SDK
2. Fix that broken test by persisting on shutdown in the Rust SDK
3. Expose the persistence API on the Rust SDK for consumers (like FOG)
4. Release a new Glean SDK (likely v41.1.0)
5. Vendor the new Glean SDK
6. Call the persistence API from (3) in FOG opportunistically on idle to limit how much ping-lifetime data we might lose in the event of an unclean shutdown (like a crash). (( Testing this may be impossible or at least tricky. Need some way to uncleanly shut down a marionette'd browser. ))
7. Push with the other patches
8. Re-land bug 1728423
9. Double-check we don't show up in BHR anymore
Comment 26•3 years ago
badboy merged PR #1793: "Bug1729723 expose persist ping lifetime" in 7a7dc96.
Keeping the bug open so the next steps get addressed too
Comment 27•3 years ago
Glean v41.1.0 has been released, crossing off Step 4.
Assignee
Comment 28•3 years ago
Depends on D125456
Assignee
Comment 29•3 years ago
Depends on D125834
Comment 30•3 years ago
Comment 31•3 years ago
Backed out for causing build bustages
Backout link: https://hg.mozilla.org/integration/autoland/rev/fd03aa020bfd84a1d2dd11070ed1d428d61640f6
```
error: unused `Result` that must be used
[task 2021-09-16T15:02:52.226Z] 15:02:52 INFO - --> toolkit/components/glean/src/lib.rs:123:5
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - |
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - 123 | glean::persist_ping_lifetime_data();
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - |
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - = note: `-D unused-must-use` implied by `-D warnings`
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - = note: this `Result` may be an `Err` variant, which should be handled
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - error: could not compile `fog_control` due to previous error
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - Caused by:
[task 2021-09-16T15:02:52.231Z] 15:02:52 INFO - process didn't exit successfully: `CARGO=/builds/worker/fetches/rustc/bin/cargo CARGO_CRATE_NAME=fog_control
```
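One way to satisfy `-D unused-must-use` here is to handle the Result explicitly rather than dropping it; a sketch of that shape, with a stand-in for the real glean::persist_ping_lifetime_data() (the error type and the fix actually landed are assumptions):

```rust
// Stand-in for glean::persist_ping_lifetime_data(); the real function
// returns a Result whose error type belongs to the SDK, not String.
fn persist_ping_lifetime_data() -> Result<(), String> {
    Ok(())
}

fn main() {
    // Before (rejected under `-D unused-must-use`):
    //     persist_ping_lifetime_data();
    // After: acknowledge the Result, e.g. by logging the error.
    if let Err(e) = persist_ping_lifetime_data() {
        eprintln!("Failed to persist ping-lifetime data: {}", e);
    }
}
```

Explicitly ignoring it with `let _ = ...;` would also quiet the lint, at the cost of silently dropping errors.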
Assignee
Comment 32•3 years ago
Oh now this is odd, it should've broken when I built locally if that's the case.
Assignee
Comment 33•3 years ago
Easy enough to fix. Popped it to try just to be sure I didn't miss anything else: https://treeherder.mozilla.org/#/jobs?repo=try&revision=60427c78ab81bea8c8c988f7b656228e9aafc39d
Comment 34•3 years ago
Comment 35•3 years ago
Backed out for mochitest crashes on LateWriteChecks.cpp
Backout link: https://hg.mozilla.org/integration/autoland/rev/4f9a1f599b43c1748edf0ce1b42a25b5f825db7e
Log link: https://treeherder.mozilla.org/logviewer?job_id=351787536&repo=autoland&lineNumber=3020
Assignee
Comment 36•3 years ago
Now the number of backouts is starting to make me feel bad for bothering the sheriffs so much. Sorry!
dthayer, you're one of the people who's interested in shutdown. Can you help me understand when I should put Glean's "flush in-memory stuff to disk" call if not inside a `RunOnShutdown`? (( Specifically here: https://searchfox.org/mozilla-central/source/toolkit/components/glean/xpcom/FOG.cpp#46 ))
I guess I could set `aPhase` to be `XPCOMShutdown` instead of `XPCOMShutdownFinal`, which is its default... Lemme prep a patch.
Assignee
Comment 37•3 years ago
Depends on D125835
Comment 38•3 years ago
Updated•3 years ago
Comment 39•3 years ago
Backed out for causing reftest failures with block_on_dispatcher() and also web-test failures.
Assignee
Comment 40•3 years ago
It's the idle observer firing that's causing the `block_on_dispatcher` which causes the explosion for the reftest failures. (The web-test ones are a mystery at the moment.)
It seems as though the idle observer might be being notified after we called shutdown. I've pushed a patch to try to see if unregistering the idle observer on shutdown is the fix needed here: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=d3ae16ddf536f16b60ec1f36401a896a65823822
If this comes back green from reftest and web-test, we're clear here.
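The hazard being described is an ordering one: the idle notification can arrive after Glean's shutdown() has run, and calling persist_ping_lifetime_data() at that point panics. A toy sketch of that pattern follows (the actual fix was to unregister the idle observer on the C++ side, not this flag; the flag is purely illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-ins for the SDK calls; panicking after shutdown mirrors the
// behaviour described in bug 1731595.
static SHUT_DOWN: AtomicBool = AtomicBool::new(false);

fn persist_ping_lifetime_data() {
    assert!(
        !SHUT_DOWN.load(Ordering::SeqCst),
        "persisting after shutdown panics"
    );
    println!("persisting ping-lifetime data");
}

fn shutdown() {
    SHUT_DOWN.store(true, Ordering::SeqCst);
}

// What the idle callback must guarantee: never poke the SDK once shutdown
// has happened, whether by unregistering the observer or checking a flag.
fn on_idle() {
    if SHUT_DOWN.load(Ordering::SeqCst) {
        return;
    }
    persist_ping_lifetime_data();
}

fn main() {
    on_idle(); // persists
    shutdown();
    on_idle(); // skipped instead of panicking
}
```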
Assignee
Comment 41•3 years ago
This will ensure we do not call persist_ping_lifetime_data() on the Rust Glean SDK after we call shutdown(), which panics. See bug 1731595.
Depends on D125880
Comment 42•3 years ago
Comment 43•3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/542a3ee958a1
https://hg.mozilla.org/mozilla-central/rev/40ad4e7230cf
https://hg.mozilla.org/mozilla-central/rev/827b5a8d32bc
https://hg.mozilla.org/mozilla-central/rev/4a65e9435d88
https://hg.mozilla.org/mozilla-central/rev/8b8e872558b3
https://hg.mozilla.org/mozilla-central/rev/bd50b16faa8f
Updated•3 years ago
Updated•3 years ago