Closed Bug 1729723 Opened 2 years ago Closed 2 years ago

mozilla::glean::impl::TimingDistributionMetric::Start causes hangs in nsLayoutUtils::PaintFrame


(Toolkit :: Telemetry, defect, P1)




94 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox92 --- unaffected
firefox93 --- disabled
firefox94 --- fixed


(Reporter: florian, Assigned: chutten)


(Blocks 1 open bug, Regression)


(Keywords: hang, perf, regression, Whiteboard: [bhr:timing_distribution])


(7 files)

Starting with the September 2nd Nightly build, the stack below represents about 7.8% of the total parent-process main-thread hang time:


RtlAcquireSRWLockExclusive ntdll
glean::private::timing_distribution::{{impl}}::start(glean::private::timing_distribution::TimingDistributionMetric*) xul
fog::ffi::timing_distribution::fog_timing_distribution_start(unsigned int) xul
mozilla::glean::impl::TimingDistributionMetric::Start() const xul
static nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, mozilla::nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) xul

This is currently the #2 most frequent hang.

Chris, any idea who should look at this? This looks like a pretty big issue if we want people to use Glean.

Flags: needinfo?(chutten)

Adding ni? to Jeff as well who might have thoughts on if there's something we need to do right now to mitigate this.

Flags: needinfo?(jmuizelaar)

We can just revert this in the short term.

Flags: needinfo?(jmuizelaar)

This is presumably hanging on this lock?
The question is who's holding it.

Florian, do we know which process the hang is happening in?

Flags: needinfo?(florian)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #5)

Florian, do we know which process the hang is happening in?

The dashboard only shows data for the parent process.

Flags: needinfo?(florian)

chutten, are there any threads/situations that would hold this lock for a long time?

That lock isn't taken by anything outside that file, so we have:

  1. Construction
  2. accumulate_raw_samples_nanos (writes to the db)
  3. start (holds only long enough to store a timestamp in a hashmap)
  4. stop_and_accumulate (writes to the db)

My guess is the db is slow on the machines suffering the hang and so we need to tell it to delay. Luckily we have a mechanism for it, we've just never needed it before. Lemme post a quick patch.
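The contention in the list above can be sketched with a minimal stand-in (names and layout hypothetical, not the actual Glean internals): `start` holds the lock only long enough to store a timestamp, but it still queues behind the same mutex that `stop_and_accumulate` holds across its db write, so a slow db stalls `start` and, with it, the main-thread caller in PaintFrame.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

// Hypothetical stand-in for Glean's TimingDistribution internals:
// a single mutex guards both the in-flight timer map and the (slow) db write.
struct TimingDistribution {
    inner: Mutex<HashMap<u64, u64>>, // timer id -> start timestamp (ns)
}

impl TimingDistribution {
    fn new() -> Self {
        TimingDistribution { inner: Mutex::new(HashMap::new()) }
    }

    // Cheap: holds the lock only long enough to store a timestamp...
    fn start(&self, timer_id: u64, now_ns: u64) {
        self.inner.lock().unwrap().insert(timer_id, now_ns);
    }

    // ...but start() still blocks whenever another thread is in here,
    // because the db write happens under the same lock.
    fn stop_and_accumulate(&self, timer_id: u64, now_ns: u64) -> Option<u64> {
        let mut timers = self.inner.lock().unwrap();
        let started = timers.remove(&timer_id)?;
        // Simulated slow db write while the lock is held.
        std::thread::sleep(Duration::from_millis(5));
        Some(now_ns.saturating_sub(started))
    }
}
```

Delaying the "ping"-lifetime io (the mechanism mentioned above) keeps the slow write out of this critical section, which is the point of the patch.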

Flags: needinfo?(chutten)

There are hangs when using TimingDistribution in Firefox Desktop, which is odd.
The current suspect is a slow db, so let's delay the io and see if that's true.

Assignee: nobody → chutten
See Also: → 1728891
Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong

Backed out for causing telemetry failures in test_fog_background_update_ping

Backout link:

Push with failures

Failure log

Flags: needinfo?(chutten)

Well heck, that means the order isn't guaranteed. That's almost certainly a bug.

Flags: needinfo?(chutten)
Backout by
Backed out changeset 51552aef09e0 for causing telemetry failures in test_fog_background_update_ping. CLOSED TREE

Hm, but the boolean and string ahead of it are present. If the batching were failing, those'd be gone too. Maybe there's a subtle timing interaction to do with the profile lock that background update takes? I need to add logging.

Alas, testing this locally isn't working. ./mach telemetry-tests-client toolkit/components/telemetry/tests/marionette/tests/client/ barfs with marionette telling me to start a session first (on my local Linux machine). I'll re-bootstrap in case my deps need updating or something, but I'm not hopeful I won't be stuck debugging this using print statements on try (to cover my bases I've already submitted once).

Component: Graphics → Telemetry
Priority: -- → P1
Product: Core → Toolkit

Ah. For posterity and anyone else running afoul of test_fog_background_update_ping: the reason the test won't run for me locally is because there is no Background Update on Linux (the platform I was trying to run the test on) (yet) and the test's skip-if isn't respected when you run the test directly on the command-line. (Which I already knew and wrote docs about)

Also, the try run with logs confirms that the ping being sent successfully contains absolutely everything but uuid.background_update.client_id:

    "ping_info": {
        "seq": 0,
        "start_time": "2021-09-09T14:02+00:00",
        "end_time": "2021-09-09T14:02+00:00"
    },
    "client_info": {
        "telemetry_sdk_build": "40.1.0",
        "first_run_date": "2021-09-09+00:00",
        "architecture": "x86",
        "os_version": "10.0",
        "os": "Windows",
        "app_build": "20210909133459",
        "app_channel": "default",
        "app_display_version": "94.0a1"
    },
    "metrics": {
        "string_list": {
            "background_update.reasons": ["", "app.update.background.enabled=false"],
            "background_update.states": ["NEVER_CHECKED"]
        },
        "string": {
            "background_update.final_state": "NEVER_CHECKED",
            "": "default"
        },
        "boolean": {
            "update.background_update": false,
            "update.auto_download": false,
            "update.can_usually_apply_updates": true,
            "update.can_usually_use_bits": true,
            "update.can_usually_stage_updates": true,
            "update.enabled": true,
            "update.service_enabled": true,
            "update.can_usually_check_for_updates": true,
            "background_update.exit_code_success": true
        }
    }

Next step: get this running and failing locally on my Windows machine so I can add more logs and poke around why there is no client_id. My read of the relevant code suggests that the client_id must be set or else an error would log or there'd be no ping sent at all... so there's something going on here. Maybe we're trying to send an unparseable uuid (but only when delaying ping-lifetime io? Whut.) so it fails silently, or maybe something else is going on.

Alrighty, though I've managed to get it working on Windows I haven't managed to get it to log from the background update process on Windows. But that's not just me, a failing run and a succeeding run both fail to have Glean log lines mentioning a "background-update" ping is being submitted.

"failing run" is a run with delay_ping_lifetime_io set to true. "succeeding" is with it set to false.

MacOSX does log, though (failing run, succeeding run).

This logging problem is unfortunate (it means I can't debug this test on any of my machines) and should be looked into. I've filed bug 1730012 for that.

The difference in logging between a successful and failing run is a minor reordering in the segment we're looking at (the bits after the =============== fences). There's nothing obviously going wrong in here: both get a valid UUID from the client_id state file, parse it, and send it to Glean. Why it doesn't appear in the ping that is later submitted is a true mystery.

I think I may be left with no alternative but instrumenting deeper and deeper into the stack, which will hit problems quickly when I have to edit the Glean crates themselves as I'll have to remember how to do that. Were there docs for that? Ah yes we have docs for that.

Struck by a sense of foreboding, I checked if this "ping doesn't get Telemetry's client_id in it" was present in the data. It is, but only at 0.04% (17k out of 41M pings since Sept 1). Overall it's pretty weak evidence that there's a timing issue in the wild that this test just happens to miss most of the time.

Next up for me is continuing to use try to give me log-filled mac runs with both delay_ping_lifetime_io on and off so I can hunt for discrepancies. Maybe there is something special about uuid metric types when it comes to ping lifetimes and I just can't see it in the code.

Ah-ha! Found it.

Turns out that we clear various small "ping"-lifetime stores as part of init: baseline's and the special internal glean_client_info. There's nothing wrong with this, we just make sure to call clear_ping_lifetime_storage with the appropriate storage_name and it'll make sure to only clear the part of ping_lifetime_data that corresponds to that... specific... storage_name....

When operating in delay_ping_lifetime_io mode, clearing any storage will clear all (delayed) ping-lifetime storage. Whoops.
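In miniature (with a flat BTreeMap standing in for the real rkv-backed store, and a hypothetical "<storage_name>#<metric_id>" key scheme), the faulty clear looks like this:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for the delayed "ping"-lifetime store.
// Keys are "<storage_name>#<metric_id>" (scheme assumed for illustration).
struct DelayedStore {
    ping_lifetime_data: BTreeMap<String, String>,
}

impl DelayedStore {
    fn record(&mut self, storage_name: &str, metric: &str, value: &str) {
        self.ping_lifetime_data
            .insert(format!("{}#{}", storage_name, metric), value.to_string());
    }

    // The bug: ignores storage_name and clears *all* delayed data,
    // not just the entries belonging to the named storage.
    fn clear_ping_lifetime_storage_buggy(&mut self, _storage_name: &str) {
        self.ping_lifetime_data.clear();
    }
}
```

So when init clears glean_client_info's storage, the delayed background-update data (including the client_id that was already recorded) goes with it.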

Next Steps:

  1. Write a test that (in delay_ping_lifetime_io mode) records data in store1, clears store2, then verifies the data is still in store1. It should fail.
  2. Adjust clear_ping_lifetime_storage to iterate over ping_lifetime_data and only erase the stuff that starts with the storage key.
  3. Test should pass, so check it in and get a patch release out for the SDK
  4. Update, vendor, and push the fixed Glean to m-c
  5. Turn on delay_ping_lifetime_io mode in m-c
  6. Reinstate gfx/metrics.yaml and the instrumentation from bug 1728423
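Step 2 of the plan above, under a hypothetical flat-map stand-in for the store (keys as "<storage_name>#<metric_id>"; the real store is rkv-backed), amounts to a key-prefix filter:

```rust
use std::collections::BTreeMap;

// Erase only the entries belonging to the named storage, leaving the rest
// of the delayed ping-lifetime data intact. Key scheme assumed for
// illustration; the real Glean store's layout may differ.
fn clear_ping_lifetime_storage(
    ping_lifetime_data: &mut BTreeMap<String, String>,
    storage_name: &str,
) {
    let prefix = format!("{}#", storage_name);
    // BTreeMap::retain was stabilized in Rust 1.53.
    ping_lifetime_data.retain(|key, _| !key.starts_with(&prefix));
}
```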

Depends on D124930

Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Flags: needinfo?(chutten)

Wait a minute, isn't m-c built with rust 1.54.0?

Oh geez, it's 1.51.0? How? I thought bug 1723016 forced a minimum of 1.54.0 and btree_retain was stabilized on 1.53.0... :glandium, can you help me understand?

Flags: needinfo?(chutten) → needinfo?(mh+mozilla)

m-c is built with 1.55.0 but has a minimum supported rust version of 1.51 at the moment. If you need a bump to the minimum supported version, we can do that. Please file a bug in "Firefox Build System :: Toolchains".
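For reference, since btree_retain only stabilized in Rust 1.53 and m-c's minimum supported version was 1.51 at the time, a retain-free version of the same prefix-clear (same hypothetical "<storage_name>#<metric_id>" key scheme; not necessarily what shipped) would collect the doomed keys first:

```rust
use std::collections::BTreeMap;

// MSRV-1.51-compatible alternative to BTreeMap::retain: collect the keys
// that match the storage prefix, then remove them one by one.
fn clear_storage_msrv_1_51(
    ping_lifetime_data: &mut BTreeMap<String, String>,
    storage_name: &str,
) {
    let prefix = format!("{}#", storage_name);
    let doomed: Vec<String> = ping_lifetime_data
        .keys()
        .filter(|k| k.starts_with(&prefix))
        .cloned()
        .collect();
    for key in doomed {
        ping_lifetime_data.remove(&key);
    }
}
```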

Flags: needinfo?(mh+mozilla)
Depends on: 1730680

New steps:

  1. Write a broken test that checks that we persist unsent ping-lifetime data between runs in the Rust Glean SDK
  2. Fix that broken test by persisting on shutdown in the Rust SDK
  3. Expose the persistence API on the Rust SDK for consumers (like FOG)
  4. Release a new Glean SDK (likely v41.1.0)
  5. Vendor the new Glean SDK
  6. Call the persistence API from (3) in FOG opportunistically on idle to limit how much ping-lifetime data we might lose in the event of an unclean shutdown (like a crash). (( Testing this may be impossible or at least tricky. Need some way to uncleanly shutdown a marionette'd browser ))
  7. Push with the other patches
  8. Re-land bug 1728423
  9. Doublecheck we don't show up in BHR anymore

badboy merged PR #1793: "Bug1729723 expose persist ping lifetime" in 7a7dc96.

Keeping the bug open so the next steps get addressed too

Depends on D125456

Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Upgrade Glean to v41.1.0 r=janerik
Flush 'ping'-lifetime Glean data on idle in FOG r=janerik

Backed out for causing build bustages

Backout link:

Push with failures

Failure log

```
error: unused Result that must be used
[task 2021-09-16T15:02:52.226Z] 15:02:52 INFO - --> toolkit/components/glean/src/
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - |
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - 123 | glean::persist_ping_lifetime_data();
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - |
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - = note: `-D unused-must-use` implied by `-D warnings`
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - = note: this `Result` may be an `Err` variant, which should be handled
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - error: could not compile `fog_control` due to previous error
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - Caused by:
[task 2021-09-16T15:02:52.231Z] 15:02:52 INFO - process didn't exit successfully: `CARGO=/builds/worker/fetches/rustc/bin/cargo CARGO_CRATE_NAME=fog_control
```

Flags: needinfo?(chutten)

Oh now this is odd, it should've broken when I built locally if that's the case.

Flags: needinfo?(chutten)

Easy enough to fix. Popped it to try just to be sure I didn't miss anything else:
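The shape of the fix is just to consume the Result so `-D unused-must-use` is satisfied. Sketched with a hypothetical stand-in for glean::persist_ping_lifetime_data (the log only shows that the real call returns a Result; its error type is assumed here):

```rust
// Hypothetical stand-in for glean::persist_ping_lifetime_data(). The real
// function's error type isn't shown in the build log; String is assumed.
fn persist_ping_lifetime_data() -> Result<(), String> {
    Ok(())
}

// Consuming the Result satisfies the unused_must_use lint. Persistence is
// best-effort here (it runs again on the next idle), so logging the error
// and moving on is reasonable rather than propagating it.
fn flush_on_idle() -> bool {
    match persist_ping_lifetime_data() {
        Ok(()) => true,
        Err(e) => {
            eprintln!("Failed to persist ping-lifetime data: {}", e);
            false
        }
    }
}
```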

Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Upgrade Glean to v41.1.0 r=janerik
Flush 'ping'-lifetime Glean data on idle in FOG r=janerik

Now the number of backouts is starting to make me feel bad for bothering the sheriffs so much. Sorry!

dthayer, you're one of the people who are interested in shutdown. Can you help me understand where I should put Glean's "flush in-memory stuff to disk" call if not inside a RunOnShutdown? (( Specifically here: ))

I guess I could set aPhase to be XPCOMShutdown instead of XPCOMShutdownFinal which is its default... Lemme prep a patch.

Flags: needinfo?(chutten) → needinfo?(dothayer)
Blocks: 1731279
Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Upgrade Glean to v41.1.0 r=janerik
Flush 'ping'-lifetime Glean data on idle in FOG r=janerik
Move FOG shutdown up to xpcom-shutdown to avoid latewrites r=dthayer,janerik
Flags: needinfo?(dothayer)

Backed out for causing reftest failures with block_on_dispatcher() and also web-test failures.

Flags: needinfo?(chutten)

It's the idle observer firing that's causing the block_on_dispatcher(), which causes the explosion in the reftests. (The web-test failures are a mystery at the moment.)

It seems the idle observer may be notified after we've called shutdown. I've pushed a patch to try to see if unregistering the idle observer on shutdown is the fix needed here:

If this comes back green from reftest and web-test, we're clear here.

Flags: needinfo?(chutten)
See Also: → 1731595

This will ensure we do not call persist_ping_lifetime_data() on the Rust Glean
SDK after we call shutdown(), which panics. See bug 1731595.
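The guarantee the patch provides (never touch the SDK after shutdown()) can be sketched with an atomic flag; this is illustrative only, with hypothetical names, not the actual FOG code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Set once shutdown begins; checked before any late persist attempt.
static SHUTDOWN: AtomicBool = AtomicBool::new(false);

fn shutdown() {
    SHUTDOWN.store(true, Ordering::SeqCst);
}

// Returns false (skipping the persist) once shutdown has begun, so a
// straggling idle callback can never reach the SDK after shutdown() --
// the condition that panics in bug 1731595.
fn persist_if_running() -> bool {
    if SHUTDOWN.load(Ordering::SeqCst) {
        return false;
    }
    // ...here the real code would call glean::persist_ping_lifetime_data()...
    true
}

// Demonstrates the ordering: persists before shutdown, refuses after.
fn demo() -> (bool, bool) {
    let before = persist_if_running();
    shutdown();
    (before, persist_if_running())
}
```

Unregistering the observer (as the patch does) is stronger, since it removes the late callback entirely instead of turning it into a no-op.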

Depends on D125880

Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Upgrade Glean to v41.1.0 r=janerik
Flush 'ping'-lifetime Glean data on idle in FOG r=janerik
Move FOG shutdown up to xpcom-shutdown to avoid latewrites r=dthayer,janerik
Remove FOG's idle observer on shutdown r=TravisLong
See Also: → 1732941
Has Regression Range: --- → yes