Closed Bug 1657174 Opened 4 years ago Closed 4 years ago

Crash in [@ profiler_unregister_thread]

Categories

(Core :: Gecko Profiler, defect, P1)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
81 Branch
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox79 --- unaffected
firefox80 --- fixed
firefox81 --- fixed

People

(Reporter: kbrosnan, Assigned: mozbugz)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

This bug is for crash report bp-7077bc88-239c-4c11-9a3a-470e50200804.

Top 8 frames of crashing thread:

0 libxul.so profiler_unregister_thread tools/profiler/core/platform.cpp:5005
1 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:476
2 libnss3.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:201
3 libc.so libc.so@0x48811 
4 libc.so libc.so@0x487f7 
5 libc.so libc.so@0x1b369 
6 libc.so libc.so@0x487f7 
7 libnss3.so pt_recvfrom_cont nsprpub/pr/src/pthreads/ptio.c:987

This is the top crash for Fenix nightly. Looks like it was introduced on July 17.

Component: General → Gecko Profiler
Product: DevTools → Core
Version: unspecified → Trunk

So this is most probably the follow-up to bug 1651086, where we fixed crashes from multiple locked_register_thread() by allowing these multiple registrations. Now we're seeing the complementary multiple profiler_unregister_thread() calls.
The root cause (aka "regressed by" bug) is still unknown.

This can happen even for people not using the profiler -> P1 S2.

Like bug 1651086, I'll make a quick fix to allow these multiple de-registrations. But it would be good to investigate the real issue, I'll open a follow-up bug for that...

Assignee: nobody → gsquelart
Severity: -- → S2
Depends on: 1651086
Priority: -- → P1

Actually, looking at the code, this is a regression by bug 1651086, which handled multiple de-registrations, but left an assertion at the wrong place! 😅

No longer depends on: 1651086
Regressed by: 1651086
Has Regression Range: --- → yes

MOZ_RELEASE_ASSERT(registeredThread == TLSRegisteredThread::RegisteredThread(lock)); is failing on Android, probably as a follow-up to bug 1651086, which allowed multiple registrations (to fix a problem with yet-unknown root cause).
This assertion is too broad to help find the issue because at this point:

  • If FindCurrentThreadRegisteredThread() is not null, TLSRegisteredThread::RegisteredThread() should also not be null, and it should equal FindCurrentThreadRegisteredThread(), from the first succesful profiler_register_thread().
  • If FindCurrentThreadRegisteredThread() is null, TLSRegisteredThread::RegisteredThread() should be null as well, from the first profiler_unregister_thread().

So this patch splits this assertion into multiple more targeted ones to check the above, and also some extra assertions after de-registration, to ensure that both FindCurrentThreadRegisteredThread() and TLSRegisteredThread::RegisteredThread() are null.

After discussing this with :mstange, he suggested we split the failing assertion to have a better idea of what's happening.

This is only an exploratory patch, it will not fix the problem. So this bug should be kept open, and I'll monitor crashes over the next few days...

Keywords: leave-open
Pushed by gsquelart@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/63866b83d444 Added more targeted assertions around multiple thread de-registrations - r=mstange

Set release status flags based on info from the regressing bug 1651086

I'm confused: There are now 85 reports with the buildid (20200806033456) on which the exploratory patch landed, however when looking at these reports, they all link to a source without that patch. 🤔

I will keep monitoring, hopefully the next released version will have our patch...

(In reply to Gerald Squelart [:gerald] (he/him) from comment #9)

I'm confused: There are now 85 reports with the buildid (20200806033456) on which the exploratory patch landed

No, that build was built from https://hg.mozilla.org/mozilla-central/rev/6e35e01646d7c465893a172a0b4fb116c2293d2a, which is from m-c push 37674, and your patch landed in push 37675.

(In reply to Markus Stange [:mstange] from comment #10)

No, that build was built from [...] m-c push 37674, and your patch landed in push 37675.

Oooh, so close! Next one, then... 😅

Looks like we're still seeing this signature on Fenix with GV buildid 20200808213325.
Crash reason is MOZ_RELEASE_ASSERT(TLSRegisteredThread::RegisteredThread(lock))

Flags: needinfo?(gsquelart)

Thank you Julien.
Yes, we finally have reports with the exploratory patch. So we have a bit more information about what went wrong:

I can see almost all reports (e.g., this one) happen at MOZ_RELEASE_ASSERT(!TLSRegisteredThread::RegisteredThread(lock)); in the already-unregistered branch.
I'm guessing either:

  • This is the 2nd profiler_unregister_thread() on a thread, and somehow the TLS was not cleared; but it should have been reset in the previous call at line 5117.
  • This is a spurious profiler_unregister_thread() on a never-registered thread, in which case the TLS should be null anyway.

There was also a single report out of the ~50 I looked at, which failed at MOZ_RELEASE_ASSERT(TLSRegisteredThread::RegisteredThread(lock)); in the still-registered-as-expected branch. So we have a registered thread without a TLS, which is strange since we set the TLS before adding the thread to the register.

I will try to trace all paths through profiler_register_thread() and profiler_unregister_thread(), to see how this situation could happen.

It's also interesting that this almost only happens on Android, what could be so different from other platforms?

Flags: needinfo?(gsquelart)

I'd strongly expect it's some type of race; threading and scheduler models are quite different (and likely so even between linux desktop and android), so that's not that surprising that one platform is far more affected. There are ~100 windows assertions in crash-stats, though, so I think there's a fundamental assumption here that's wrong.

-> All since it asserts on windows too

OS: Android → All

And there are a few on Mac as well.

My first attempt at tracing register+unregister on paper didn't help, I'll continue investigating... Fresh eyes welcome!

Oh dear: I've just noticed that even though almost all reports point at line 5148 which does MOZ_RELEASE_ASSERT(!TLSRegisteredThread::RegisteredThread(lock)), the "MOZ_CRASH Reason" is actually MOZ_RELEASE_ASSERT(TLSRegisteredThread::RegisteredThread(lock)) (without '!') which is at line 5103! 🤦‍♂️
The 2nd unregistration problem was a red herring.

The real issue is that we have a still-registered thread, but the TLS is unexpectedly null.
Restarting my investigation...

Assertions are also clarified with messages, to better distinguish the same tests in different locations.

Assertions should now cover all cases:

  • NEW: After registering a thread in the profiler with CorePS::AppendRegisteredThread, the TLS should be set to that thread.
  • NEW: If profiler_register_thread is called again, the TLS should still be set to that thread.
  • When profiler_unregister_thread is first called, the TLS should still be set to that thread (that's the assertion currently trigering this bug 1657174),
  • NEW: When profiler_unregister_thread is first called and after we remove the thread with CorePS::RemoveRegisteredThread, the TLS should now be null.
  • If profiler_unregister_thread is called again (or with a never-registered thread), the TLS should be null.

This is a further exploratory patch for bug 1657174, this will not prevent crashes, but hopefully it should give a bit more information, at least a smaller range in which the possible presumed registration/TLS race happens.

Pushed by gsquelart@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/73ad83f70029 Added more (un)registration TLS assertions - r=mstange
Regressions: 1660458
No longer regressions: 1660458

I'm not seeing crash reports past 20200818092452, so I'm cautiously optimistic (and pleasantly surprised) that bug 1659901 may have in fact solved this bug! If that's correct, it means we had some misuses around TLS objects.

I will keep monitoring crash reports a while longer before declaring victory...

Still no new reports after Fenix 0.0a1 20200818092452. I'm declaring this bug fixed.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch

Comment on attachment 9168335 [details]
Bug 1657174 - Added more targeted assertions around multiple thread de-registrations - r?mstange

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Only assertions, needed to uplift bug 1659901

Note: There is another patch in thus bug, see attachment 9170931 [details].

  • String changes made/needed: None
Attachment #9168335 - Flags: approval-mozilla-release?
Attachment #9170931 - Flags: approval-mozilla-release?

Comment on attachment 9168335 [details]
Bug 1657174 - Added more targeted assertions around multiple thread de-registrations - r?mstange

Approved for GV80 for Fenix 80.1.3.

Attachment #9168335 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9170931 - Flags: approval-mozilla-release? → approval-mozilla-release+
See Also: → 1666246
Regressions: 1666246

The crashes are not gone, they just moved from profiler_unregister_thread to profiler_register_thread. Analysis continues in bug 1666246.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: