Closed Bug 1729723 Opened 2 years ago Closed 2 years ago

mozilla::glean::impl::TimingDistributionMetric::Start causes hangs in nsLayoutUtils::PaintFrame


(Toolkit :: Telemetry, defect, P1)




94 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox92 --- unaffected
firefox93 --- disabled
firefox94 --- fixed


(Reporter: florian, Assigned: chutten)


(Blocks 1 open bug, Regression)


(Keywords: hang, perf, regression, Whiteboard: [bhr:timing_distribution])


(7 files)

Starting with the September 2nd Nightly build, the stack below represents about 7.8% of the total parent-process main-thread hang time:


RtlAcquireSRWLockExclusive ntdll
glean::private::timing_distribution::{{impl}}::start(glean::private::timing_distribution::TimingDistributionMetric*) xul
fog::ffi::timing_distribution::fog_timing_distribution_start(unsigned int) xul
mozilla::glean::impl::TimingDistributionMetric::Start() const xul
static nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, mozilla::nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) xul

This is currently the #2 most frequent hang.

Chris, any idea who should look at this? This looks like a pretty big issue if we want people to use Glean.

Flags: needinfo?(chutten)

Adding ni? to Jeff as well who might have thoughts on if there's something we need to do right now to mitigate this.

Flags: needinfo?(jmuizelaar)

We can just revert this in the short term.

Flags: needinfo?(jmuizelaar)

This is presumably hanging on this lock?
The question is who's holding it.

Florian, do we know which process the hang is happening in?

Flags: needinfo?(florian)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #5)

Florian, do we know which process the hang is happening in?

The dashboard only shows data for the parent process.

Flags: needinfo?(florian)

chutten, are there any threads/situations that would hold this lock for a long time?

That lock isn't taken by anything outside that file, so we have:

  1. Construction
  2. accumulate_raw_samples_nanos (writes to the db)
  3. start (holds only long enough to store a timestamp in a hashmap)
  4. stop_and_accumulate (writes to the db)

My guess is the db is slow on the machines suffering the hang and so we need to tell it to delay. Luckily we have a mechanism for it, we've just never needed it before. Lemme post a quick patch.
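The contention in the list above can be sketched with a minimal stand-in (names and layout hypothetical, not the actual Glean internals): `start` holds the lock only long enough to store a timestamp, but it still queues behind the same mutex that `stop_and_accumulate` holds across its db write, so a slow db stalls `start` and, with it, the main-thread caller in PaintFrame.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

// Hypothetical stand-in for Glean's TimingDistribution internals:
// a single mutex guards both the in-flight timer map and the (slow) db write.
struct TimingDistribution {
    inner: Mutex<HashMap<u64, u64>>, // timer id -> start timestamp (ns)
}

impl TimingDistribution {
    fn new() -> Self {
        TimingDistribution { inner: Mutex::new(HashMap::new()) }
    }

    // Cheap: holds the lock only long enough to store a timestamp...
    fn start(&self, timer_id: u64, now_ns: u64) {
        self.inner.lock().unwrap().insert(timer_id, now_ns);
    }

    // ...but start() still blocks whenever another thread is in here,
    // because the db write happens under the same lock.
    fn stop_and_accumulate(&self, timer_id: u64, now_ns: u64) -> Option<u64> {
        let mut timers = self.inner.lock().unwrap();
        let started = timers.remove(&timer_id)?;
        // Simulated slow db write while the lock is held.
        std::thread::sleep(Duration::from_millis(5));
        Some(now_ns.saturating_sub(started))
    }
}
```

Delaying the "ping"-lifetime io (the mechanism mentioned above) keeps the slow write out of this critical section, which is the point of the patch.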

Flags: needinfo?(chutten)

There are hangs when using TimingDistribution in Firefox Desktop, which is odd.
The current suspect is a slow db, so let's delay the io and see if that's true.

Assignee: nobody → chutten
See Also: → 1728891
Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong

Backed out for causing telemetry failures in test_fog_background_update_ping

Backout link:

Push with failures

Failure log

Flags: needinfo?(chutten)

Well heck, that means the order isn't guaranteed. That's almost certainly a bug.

Flags: needinfo?(chutten)
Backout by
Backed out changeset 51552aef09e0 for causing telemetry failures in test_fog_background_update_ping. CLOSED TREE

Hm, but the boolean and string ahead of it are present. If the batching were failing, those'd be gone too. Maybe there's a subtle timing interaction to do with the profile lock that background update takes? I need to add logging.

Alas, testing this locally isn't working. ./mach telemetry-tests-client toolkit/components/telemetry/tests/marionette/tests/client/ barfs with marionette telling me to start a session first (on my local Linux machine). I'll re-bootstrap in case my deps need updating or something, but I'm not hopeful I won't be stuck debugging this using print statements on try (to cover my bases I've already submitted once).

Component: Graphics → Telemetry
Priority: -- → P1
Product: Core → Toolkit

Ah. For posterity and anyone else running afoul of test_fog_background_update_ping: the reason the test won't run for me locally is because there is no Background Update on Linux (the platform I was trying to run the test on) (yet) and the test's skip-if isn't respected when you run the test directly on the command-line. (Which I already knew and wrote docs about)

Also, the try run with logs confirms that the ping being sent successfully contains absolutely everything but uuid.background_update.client_id:

    "ping_info": {
        "seq": 0,
        "start_time": "2021-09-09T14:02+00:00",
        "end_time": "2021-09-09T14:02+00:00"
    },
    "client_info": {
        "telemetry_sdk_build": "40.1.0",
        "first_run_date": "2021-09-09+00:00",
        "architecture": "x86",
        "os_version": "10.0",
        "os": "Windows",
        "app_build": "20210909133459",
        "app_channel": "default",
        "app_display_version": "94.0a1"
    },
    "metrics": {
        "string_list": {
            "background_update.reasons": ["", "app.update.background.enabled=false"],
            "background_update.states": ["NEVER_CHECKED"]
        },
        "string": {
            "background_update.final_state": "NEVER_CHECKED",
            "": "default"
        },
        "boolean": {
            "update.background_update": false,
            "update.auto_download": false,
            "update.can_usually_apply_updates": true,
            "update.can_usually_use_bits": true,
            "update.can_usually_stage_updates": true,
            "update.enabled": true,
            "update.service_enabled": true,
            "update.can_usually_check_for_updates": true,
            "background_update.exit_code_success": true
        }
    }

Next step: get this running and failing locally on my Windows machine so I can add more logs and poke around why there is no client_id. My read of the relevant code suggests that the client_id must be set or else an error would log or there'd be no ping sent at all... so there's something going on here. Maybe we're trying to send an unparseable uuid (but only when delaying ping-lifetime io? Whut.) so it fails silently, or maybe something else is going on.

Alrighty, though I've managed to get it working on Windows I haven't managed to get it to log from the background update process on Windows. But that's not just me, a failing run and a succeeding run both fail to have Glean log lines mentioning a "background-update" ping is being submitted.

"failing run" is a run with delay_ping_lifetime_io set to true. "succeeding" is with it set to false.

MacOSX does log, though (failing run, succeeding run).

This logging problem is unfortunate (it means I can't debug this test on any of my machines) and should be looked into. I've filed bug 1730012 for that.

The difference in logging between a successful and failing run is a minor reordering in the segment we're looking at (the bits after the =============== fences). There's nothing obviously going wrong in here: both get a valid UUID from the client_id state file, parse it, and send it to Glean. Why it doesn't appear in the ping that is later submitted is a true mystery.

I think I may be left with no alternative but instrumenting deeper and deeper into the stack, which will hit problems quickly when I have to edit the Glean crates themselves as I'll have to remember how to do that. Were there docs for that? Ah yes we have docs for that.

Struck by a sense of foreboding, I checked if this "ping doesn't get Telemetry's client_id in it" was present in the data. It is, but only at 0.04% (17k out of 41M pings since Sept 1). Overall it's pretty weak evidence that there's a timing issue in the wild that this test just happens to miss most of the time.

Next up for me is continuing to use try to give me log-filled mac runs with both delay_ping_lifetime_io on and off so I can hunt for discrepancies. Maybe there is something special about uuid metric types when it comes to ping lifetimes and I just can't see it in the code.

Ah-ha! Found it.

Turns out that we clear various small "ping"-lifetime stores as part of init: baseline's and the special internal glean_client_info. There's nothing wrong with this, we just make sure to call clear_ping_lifetime_storage with the appropriate storage_name and it'll make sure to only clear the part of ping_lifetime_data that corresponds to that... specific... storage_name....

When operating in delay_ping_lifetime_io mode, clearing any storage will clear all (delayed) ping-lifetime storage. Whoops.
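In miniature (with a flat BTreeMap standing in for the real rkv-backed store, and a hypothetical "<storage_name>#<metric_id>" key scheme), the faulty clear looks like this:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for the delayed "ping"-lifetime store.
// Keys are "<storage_name>#<metric_id>" (scheme assumed for illustration).
struct DelayedStore {
    ping_lifetime_data: BTreeMap<String, String>,
}

impl DelayedStore {
    fn record(&mut self, storage_name: &str, metric: &str, value: &str) {
        self.ping_lifetime_data
            .insert(format!("{}#{}", storage_name, metric), value.to_string());
    }

    // The bug: ignores storage_name and clears *all* delayed data,
    // not just the entries belonging to the named storage.
    fn clear_ping_lifetime_storage_buggy(&mut self, _storage_name: &str) {
        self.ping_lifetime_data.clear();
    }
}
```

So when init clears glean_client_info's storage, the delayed background-update data (including the client_id that was already recorded) goes with it.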

Next Steps:

  1. Write a test that (in delay_ping_lifetime_io mode) records data in store1, clears store2, then verifies the data is still in store1. It should fail.
  2. Adjust clear_ping_lifetime_storage to iterate over ping_lifetime_data and only erase the stuff that starts with the storage key.
  3. Test should pass, so check it in and get a patch release out for the SDK
  4. Update, vendor, and push the fixed Glean to m-c
  5. Turn on delay_ping_lifetime_io mode in m-c
  6. Reinstate gfx/metrics.yaml and the instrumentation from bug 1728423
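Step 2 of the plan above, under a hypothetical flat-map stand-in for the store (keys as "<storage_name>#<metric_id>"; the real store is rkv-backed), amounts to a key-prefix filter:

```rust
use std::collections::BTreeMap;

// Erase only the entries belonging to the named storage, leaving the rest
// of the delayed ping-lifetime data intact. Key scheme assumed for
// illustration; the real Glean store's layout may differ.
fn clear_ping_lifetime_storage(
    ping_lifetime_data: &mut BTreeMap<String, String>,
    storage_name: &str,
) {
    let prefix = format!("{}#", storage_name);
    // BTreeMap::retain was stabilized in Rust 1.53.
    ping_lifetime_data.retain(|key, _| !key.starts_with(&prefix));
}
```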

Depends on D124930

Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Flags: needinfo?(chutten)

Wait a minute, isn't m-c built with rust 1.54.0?

Oh geez, it's 1.51.0? How? I thought bug 1723016 forced a minimum of 1.54.0 and btree_retain was stabilized on 1.53.0... :glandium, can you help me understand?

Flags: needinfo?(chutten) → needinfo?(mh+mozilla)

m-c is built with 1.55.0 but has a minimum supported rust version of 1.51 at the moment. If you need a bump to the minimum supported version, we can do that. Please file a bug in "Firefox Build System :: Toolchains".
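For reference, since btree_retain only stabilized in Rust 1.53 and m-c's minimum supported version was 1.51 at the time, a retain-free version of the same prefix-clear (same hypothetical "<storage_name>#<metric_id>" key scheme; not necessarily what shipped) would collect the doomed keys first:

```rust
use std::collections::BTreeMap;

// MSRV-1.51-compatible alternative to BTreeMap::retain: collect the keys
// that match the storage prefix, then remove them one by one.
fn clear_storage_msrv_1_51(
    ping_lifetime_data: &mut BTreeMap<String, String>,
    storage_name: &str,
) {
    let prefix = format!("{}#", storage_name);
    let doomed: Vec<String> = ping_lifetime_data
        .keys()
        .filter(|k| k.starts_with(&prefix))
        .cloned()
        .collect();
    for key in doomed {
        ping_lifetime_data.remove(&key);
    }
}
```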

Flags: needinfo?(mh+mozilla)
Depends on: 1730680

New steps:

  1. Write a broken test that checks that we persist unsent ping-lifetime data between runs in the Rust Glean SDK
  2. Fix that broken test by persisting on shutdown in the Rust SDK
  3. Expose the persistence API on the Rust SDK for consumers (like FOG)
  4. Release a new Glean SDK (likely v41.1.0)
  5. Vendor the new Glean SDK
  6. Call the persistence API from (3) in FOG opportunistically on idle to limit how much ping-lifetime data we might lose in the event of an unclean shutdown (like a crash). (( Testing this may be impossible or at least tricky. Need some way to uncleanly shutdown a marionette'd browser ))
  7. Push with the other patches
  8. Re-land bug 1728423
  9. Doublecheck we don't show up in BHR anymore

badboy merged PR #1793: "Bug1729723 expose persist ping lifetime" in 7a7dc96.

Keeping the bug open so the next steps get addressed too

Depends on D125456

Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Upgrade Glean to v41.1.0 r=janerik
Flush 'ping'-lifetime Glean data on idle in FOG r=janerik

Backed out for causing build bustages

Backout link:

Push with failures

Failure log

```
error: unused Result that must be used
[task 2021-09-16T15:02:52.226Z] 15:02:52 INFO - --> toolkit/components/glean/src/
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - |
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - 123 | glean::persist_ping_lifetime_data();
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - |
[task 2021-09-16T15:02:52.227Z] 15:02:52 INFO - = note: `-D unused-must-use` implied by `-D warnings`
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - = note: this `Result` may be an `Err` variant, which should be handled
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - error: could not compile `fog_control` due to previous error
[task 2021-09-16T15:02:52.228Z] 15:02:52 INFO - Caused by:
[task 2021-09-16T15:02:52.231Z] 15:02:52 INFO - process didn't exit successfully: `CARGO=/builds/worker/fetches/rustc/bin/cargo CARGO_CRATE_NAME=fog_control
```

Flags: needinfo?(chutten)

Oh now this is odd, it should've broken when I built locally if that's the case.

Flags: needinfo?(chutten)

Easy enough to fix. Popped it to try just to be sure I didn't miss anything else:
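The shape of the fix is just to consume the Result so `-D unused-must-use` is satisfied. Sketched with a hypothetical stand-in for glean::persist_ping_lifetime_data (the log only shows that the real call returns a Result; its error type is assumed here):

```rust
// Hypothetical stand-in for glean::persist_ping_lifetime_data(). The real
// function's error type isn't shown in the build log; String is assumed.
fn persist_ping_lifetime_data() -> Result<(), String> {
    Ok(())
}

// Consuming the Result satisfies the unused_must_use lint. Persistence is
// best-effort here (it runs again on the next idle), so logging the error
// and moving on is reasonable rather than propagating it.
fn flush_on_idle() -> bool {
    match persist_ping_lifetime_data() {
        Ok(()) => true,
        Err(e) => {
            eprintln!("Failed to persist ping-lifetime data: {}", e);
            false
        }
    }
}
```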

Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Upgrade Glean to v41.1.0 r=janerik
Flush 'ping'-lifetime Glean data on idle in FOG r=janerik

Now the number of backouts is starting to make me feel bad for bothering the sheriffs so much. Sorry!

dthayer, you're one of the people who are interested in shutdown. Can you help me understand where I should put Glean's "flush in-memory stuff to disk" call if not inside a RunOnShutdown? (( Specifically here: ))

I guess I could set aPhase to be XPCOMShutdown instead of XPCOMShutdownFinal which is its default... Lemme prep a patch.

Flags: needinfo?(chutten) → needinfo?(dothayer)
Blocks: 1731279
Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Upgrade Glean to v41.1.0 r=janerik
Flush 'ping'-lifetime Glean data on idle in FOG r=janerik
Move FOG shutdown up to xpcom-shutdown to avoid latewrites r=dthayer,janerik
Flags: needinfo?(dothayer)

Backed out for causing reftest failures with block_on_dispatcher() and also web-test failures.

Flags: needinfo?(chutten)

It's the idle observer firing that's causing the block_on_dispatcher(), which causes the explosion in the reftests. (The web-test failures are a mystery at the moment.)

It seems the idle observer may be notified after we've called shutdown. I've pushed a patch to try to see if unregistering the idle observer on shutdown is the fix needed here:

If this comes back green from reftest and web-test, we're clear here.

Flags: needinfo?(chutten)
See Also: → 1731595

This will ensure we do not call persist_ping_lifetime_data() on the Rust Glean
SDK after we call shutdown(), which panics. See bug 1731595.
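The guarantee the patch provides (never touch the SDK after shutdown()) can be sketched with an atomic flag; this is illustrative only, with hypothetical names, not the actual FOG code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Set once shutdown begins; checked before any late persist attempt.
static SHUTDOWN: AtomicBool = AtomicBool::new(false);

fn shutdown() {
    SHUTDOWN.store(true, Ordering::SeqCst);
}

// Returns false (skipping the persist) once shutdown has begun, so a
// straggling idle callback can never reach the SDK after shutdown() --
// the condition that panics in bug 1731595.
fn persist_if_running() -> bool {
    if SHUTDOWN.load(Ordering::SeqCst) {
        return false;
    }
    // ...here the real code would call glean::persist_ping_lifetime_data()...
    true
}

// Demonstrates the ordering: persists before shutdown, refuses after.
fn demo() -> (bool, bool) {
    let before = persist_if_running();
    shutdown();
    (before, persist_if_running())
}
```

Unregistering the observer (as the patch does) is stronger, since it removes the late callback entirely instead of turning it into a no-op.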

Depends on D125880

Pushed by
Delay 'ping'-lifetime io in FOG r=TravisLong
Update Glean to 41.0.0 r=TravisLong
Upgrade Glean to v41.1.0 r=janerik
Flush 'ping'-lifetime Glean data on idle in FOG r=janerik
Move FOG shutdown up to xpcom-shutdown to avoid latewrites r=dthayer,janerik
Remove FOG's idle observer on shutdown r=TravisLong
See Also: → 1732941
Has Regression Range: --- → yes