Closed Bug 1715414 Opened 3 years ago Closed 3 years ago

telemetry | High frequency Assertion failure: mThreadLocalIndex != kBadThreadLocalIndex (BackgroundChild::Startup() was never called!), at /builds/worker/checkouts/gecko/ipc/glue/BackgroundImpl.cpp:378

Categories

(Core :: IPC, defect, P3)

Desktop
Linux
defect

Tracking

()

RESOLVED FIXED
94 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- fixed
firefox89 --- unaffected
firefox90 --- unaffected
firefox91 --- wontfix
firefox93 --- wontfix
firefox94 --- fixed

People

(Reporter: aosmond, Assigned: jld)

References

(Regression)

Details

(Keywords: assertion, intermittent-failure, regression)

Attachments

(2 files, 2 obsolete files)

Attached file live_backing.zip

+++ This bug was initially created as a clone of Bug #1644166 +++

This seems to be permafailing on https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=YjfU3do0SzyzEiXq761CMA.0&revision=c7230b53515ec3e0552294ced049f7ee9072e128&searchStr=linux%2Cwebrender%2Cdebug%2Ctelemetry

[task 2021-06-08T23:43:57.878Z] 23:43:57     INFO -  Assertion failure: mThreadLocalIndex != kBadThreadLocalIndex (BackgroundChild::Startup() was never called!), at /builds/worker/checkouts/gecko/ipc/glue/BackgroundImpl.cpp:383

We can't move these tests over until the intermittent gets fixed (or we get a pass for creating a permafail).

Summary: Intermittent Assertion failure: mThreadLocalIndex != kBadThreadLocalIndex (BackgroundChild::Startup() was never called!), at /builds/worker/checkouts/gecko/ipc/glue/BackgroundImpl.cpp:378 → telemetry | Intermittent Assertion failure: mThreadLocalIndex != kBadThreadLocalIndex (BackgroundChild::Startup() was never called!), at /builds/worker/checkouts/gecko/ipc/glue/BackgroundImpl.cpp:378

There are 151 total failures in the last 7 days on

  • windows10-64-qr debug
  • windows10-64 debug
  • windows10-32-qr debug
  • macosx1015-64-qr debug
  • linux1804-64 debug

Retriggers: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=OS%2CX%2C10.15%2CWebRender%2Cdebug%2Ctest-macosx1015-64-qr%2Fdebug-marionette-swr-e10s%2CMn-swr&tochange=d23b5390345dac48f1c6fc66482edbe85d1ea3ad&fromchange=c02d857cc89f0cb630535e3168f483765db90aa4

Randell, please take a look.

Flags: needinfo?(rjesup)
Regressed by: 1687843
Has Regression Range: --- → yes
Summary: telemetry | Intermittent Assertion failure: mThreadLocalIndex != kBadThreadLocalIndex (BackgroundChild::Startup() was never called!), at /builds/worker/checkouts/gecko/ipc/glue/BackgroundImpl.cpp:378 → telemetry | High frequency Assertion failure: mThreadLocalIndex != kBadThreadLocalIndex (BackgroundChild::Startup() was never called!), at /builds/worker/checkouts/gecko/ipc/glue/BackgroundImpl.cpp:378
Whiteboard: [stockwell needswork:owner] → [retriggered][stockwell needswork:owner]
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Whiteboard: [retriggered][stockwell disable-recommended] → [retriggered][stockwell fixed:backout]
Assignee: nobody → rjesup
Flags: needinfo?(rjesup)
Target Milestone: --- → 91 Branch

Jed, this bug is failing frequently on ESR91 and we're struggling to find someone who can help move the investigation forward. Can you please take a look and redirect as needed? Thanks.

Flags: needinfo?(jld)

I think what's going on here is that the idle scheduler is being used (in a content process) before PBackground is initialized, because the latter happens relatively late: from ContentChild::InitXPCOM called from RecvSetXPCOMProcessAttributes. So if thread scheduling is such that receiving the message is delayed enough for the process to consider itself idle before that point, then I can see how we'd crash like this.

But it's interesting that this is happening only on ESR91 and no other branches.

Redirecting needinfo to people who might have more detailed knowledge.

Flags: needinfo?(nika)
Flags: needinfo?(bugs)
Flags: needinfo?(jld)

ok, so obviously the API (GetOrCreateForCurrentThread) doesn't hint that it can be called only in some cases but not in others. It should just handle the problematic case in some reasonable way.

To fix this particular issue it should be enough to return TimeStamp() or TimeStamp::Now() from
https://searchfox.org/mozilla-central/rev/ad2ffab089e4e0c0fe99a1a046ab2b1c45546bdb/xpcom/threads/IdlePeriodState.cpp#102 when PBackground isn't ready yet. What would be the easier way to check that?

Flags: needinfo?(rjesup)
Flags: needinfo?(nika)
Flags: needinfo?(jld)
Flags: needinfo?(bugs)

I discussed this with the other IPC people and it was pointed out that the end of ContentProcess::Init should be late enough for everything in BackgroundChild::Startup to work and earlier than anything that involves event loops.

Also, the thread-local storage initialization could be separated from the rest and done much earlier if necessary, but I don't think that's the case here.

Flags: needinfo?(jld)

Hi Jed, who should take the next step here?

Flags: needinfo?(jld)

Nevermind, we'll just disable the test.

Flags: needinfo?(jld)

Disabling the one test didn't work, so let's skip the entire job on Linux debug instead.

I'll take this. If comment 14 / comment 16 are right, it should be simple to fix.

Assignee: rjesup → jld

Try run on the ESR91 branch (passes 40x), vs. the control group (fails in 8/10 test runs). I'll send this for review.

Previously we were staring PBackground in content processes in
response to receiving the SetXPCOMProcessAttributes IPC message, which
is sent immediately after the process is launched. Meanwhile, the
idle scheduler tries to use PBackground when the main thread considers
itself idle. But if thread scheduling is such that the content process
main thread becomes idle before the IPC I/O thread has received and
dispatched that message, then we have a problem (signaled by an assertion
failure).

This patch moves content process PBackground startup earlier, to the
end of ContentProcess::Init; that point is after enough of IPC and
XPCOM is started for it to work, but before we start spinning the main
thread event loop.

Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/671ba1530436
Start PBackground earlier in content processes. r=nika
Backout by ccozmuta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ff118c8d4a31
Backed out changeset 671ba1530436 for casuing bustages on ContentProcess.cpp. CLOSED TREE
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/82e09e54c635
Start PBackground earlier in content processes. r=nika
Flags: needinfo?(ryanvm)
Flags: needinfo?(ryanvm)
Target Milestone: 91 Branch → 94 Branch

Comment on attachment 9242150 [details]
Bug 1715414 - Start PBackground earlier in content processes.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Fixes intermittent test failures.
  • User impact if declined: This is relatively harmless in practice (the idle scheduler code looks like it fails gracefully, and I don't think any other PBackground users are currently affected), but the main motivation for uplift is to allow the failing tests that were disabled earlier in this bug to be re-enabled so we can catch any actual failures in those tests.
  • Fix Landed on Version: 94
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This just runs an existing small part of child process startup earlier (but still after the things it depends on; the code is simple enough to be reasonably sure of that), and it's been stable on 94 for a while.
  • String or UUID changes made by this patch: none
Flags: needinfo?(jld)
Attachment #9242150 - Flags: approval-mozilla-esr91?

Comment on attachment 9242150 [details]
Bug 1715414 - Start PBackground earlier in content processes.

Approved for 91.3esr, thanks.

Attachment #9242150 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+
Attachment #9241802 - Attachment is obsolete: true
Attachment #9241640 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: