Closed Bug 1666246 Opened 4 years ago Closed 4 years ago

Crash in [@ profiler_register_thread]

Categories

(Core :: Gecko Profiler, defect, P1)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
83 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox80 --- wontfix
firefox81 + fixed
firefox82 + fixed
firefox83 + fixed

People

(Reporter: RyanVM, Assigned: mstange)

References

(Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

This is looking like super-high volume to me. We need to fix this ASAP.

Crash report: https://crash-stats.mozilla.org/report/index/9a0c82b5-429e-401a-8c0b-5e5b90200914

Top 4 frames of crashing thread:

0 libxul.so profiler_register_thread tools/profiler/core/platform.cpp:4981
1 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:430
2 libnss3.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:201
3 libc.so libc.so@0xe69e0 
Flags: needinfo?(gsquelart)
See Also: → 1657174

The links in the crash reports point to https://hg.mozilla.org/releases/mozilla-release/file/2fb97a7f50e8cbad1e3bb1737f5b86c1474fc36d/tools/profiler/core/platform.cpp#l4981 which is MOZ_RELEASE_ASSERT(CorePS::Exists()); but the MOZ_CRASH Reason in the reports is MOZ_RELEASE_ASSERT(TLSRegisteredThread::RegisteredThread(lock)) (TLS should be set when re-registering thread) which is an assert a few lines later: https://hg.mozilla.org/releases/mozilla-release/file/2fb97a7f50e8cbad1e3bb1737f5b86c1474fc36d/tools/profiler/core/platform.cpp#l5012

This is a new assert that was added in bug 1657174.

Regressed by: 1657174
Has Regression Range: --- → yes

It seems we made a mistake in this comment:

(In reply to Gerald Squelart [:gerald] (he/him) from bug 1657174 comment #21)

I'm not seeing crash reports past 20200818092452, so I'm cautiously optimistic (and pleasantly surprised) that bug 1659901 may have in fact solved this bug!

In fact, by the time that comment was written (August 31st), we had already collected just over 1000 crashes in the new assertion from that bug, on Nightly with build ID 20200820093209.
I think the mistake was that we were checking for crashes in profiler_unregister_thread, and those are in fact gone... because they all moved into profiler_register_thread!

Assignee: nobody → mstange.moz
Status: NEW → ASSIGNED

Continuing the analysis from bug 1657174:

(In reply to Gerald Squelart [:gerald] (he/him) from bug 1657174 comment #17)

The real issue is that we have a still-registered thread, but the TLS is unexpectedly null.

It could also be the other way round: We could have a fresh thread, with TLS being (expectedly) null, but FindCurrentThreadRegisteredThread returning unexpectedly non-null!
FindCurrentThreadRegisteredThread looks up the thread based on the thread ID (tid) from profiler_current_thread_id().
So we could unexpectedly find a thread in the following cases: If profiler_current_thread_id() returns a bad tid, or if two living threads have the same tid, or if a tid gets re-used but the old thread with that tid didn't properly unregister itself, or if the tids that are stored in the registered thread list have been overwritten with garbage.
All of those options sound unlikely to me, but at this point the assertions are tight enough that I don't see any other explanation.

I think we should land a fix that allows duplicate thread IDs in the registered thread list, and then land more instrumentation on Nightly to figure out what exactly is happening.

I still need to check which one of these hypotheses is most consistent with the previous assertion failures that we've been seeing. In the meantime, I'm going to post the patch that allows duplicate thread IDs and loosens the assertions considerably.

This removes some (apparently) over-aggressive assertions.
As a follow-up, we can add more instrumentation on Nightly to get to the root of the problem.

Severity: -- → S2
Flags: needinfo?(gsquelart)
Priority: -- → P1

(In reply to Markus Stange [:mstange] from comment #3)

or if a tid gets re-used but the old thread with that tid didn't properly unregister itself

I'm going all-in on this hypothesis. It is consistent with all the crashes we've seen in the regression-cause-chain of this bug.

Let's say Thread A has tid 537, registers itself (step 1), fails to unregister itself, and dies. Then thread B gets created, also with tid 537, and registers itself (step 2), eventually unregisters itself (step 3), and then also dies.

When thread A dies, it leaves an entry with tid 537 in the list of registered threads. Thread B starts out with an empty TLS but finds an entry for its tid in the list of registered threads.

Bug 1651086 / crash report / crashing code

This is a crash in step 2, where the new thread tries to register itself but takes issue with the leftover entry in the list.

Bug 1657174 / crash report / crashing code

This is a crash in step 3. In step 2, after the patch from bug 1651086, we now think that thread B is actually thread A and is trying to register itself again. So we don't create a new RegisteredThread object and we don't change what's stored in TLS. This means that thread B's TLS stays empty.
Then we reach step 3. We find thread A's leftover RegisteredThread object in the list, and then crash because TLS is still empty.

This bug / crash report / crashing code

This is a crash in step 2 again. In step 2, we think that thread B is actually thread A and is trying to register itself again. Then we hit a new check from bug 1657174 and crash because TLS is still empty.


So, I think the remaining questions are:

  • What kind of thread is thread A, and why doesn't it unregister itself?
  • If we hit step 2, should we forcefully unregister thread A? I'd say yes. The current patch doesn't do that, and we can potentially accumulate lots of RegisteredThread entries for dead threads, with potentially duplicate tids.
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/4df42e22a94f
Allow duplicate ThreadIds in the profiler's registered thread list. r=gerald

Comment on attachment 9176983 [details]
Bug 1666246 - Allow duplicate ThreadIds in the profiler's registered thread list. r=gerald

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes on Android
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The updated code still has inherent risks due to its MOZ_RELEASE_ASSERTs, but I think it's a low-risk patch related to what what there before, because it removes one likely cause of crashes, and the remaining assertions are similar to (not worse that) the previous ones. And we still do need these assertions, because later code relies on these assumptions and otherwise would likely crash as well.
  • String changes made/needed:
Attachment #9176983 - Flags: approval-mozilla-beta?
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch

Comment on attachment 9176983 [details]
Bug 1666246 - Allow duplicate ThreadIds in the profiler's registered thread list. r=gerald

Approved for 82.0b2 so we can get this into Fenix Beta ASAP for possible dot release consideration.

Attachment #9176983 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Crash Signature: [@ profiler_register_thread] → [@ profiler_register_thread] [@ locked_register_thread]

No signs of this crash or newly-morphed signatures on Fenix builds containing this fix. I think we got it :-). Please nominate this for release approval when you get a chance.

Crash Signature: [@ profiler_register_thread] [@ locked_register_thread] → [@ profiler_register_thread] [@ locked_register_thread]
Flags: needinfo?(mstange.moz)

Comment on attachment 9176983 [details]
Bug 1666246 - Allow duplicate ThreadIds in the profiler's registered thread list. r=gerald

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes on Android
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: bug 1660458
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): We have evidence from Nightly that it works. We also have a model of what was going wrong, and the patch mostly just changes asserts around, in such a way that they don't trigger in the troublesome case.
  • String changes made/needed:
Flags: needinfo?(mstange.moz)
Attachment #9176983 - Flags: approval-mozilla-release?

This has conflicts with release due to bug 1660458 not being uplifted to 81. Are we better off uplifting that as well or rebasing this patch around it?

Flags: needinfo?(mstange.moz)

In bug 1660458 comment 12, I thought it was not worth uplifting, because it was not really needed apart from slightly reducing the installer size.
I believe it would be fairly easy to change the patch here to work without bug 1660458: Change IsTLSInited() to Init() may be enough, but I haven't tried, so there may be more than that.

Now, bug 1660458 landed 3 weeks ago in 82 without problems so far (that I can see), so I think it would be safe to uplift as well, if that made the whole process simpler.

So we could go either way; I'd have a slight preference for uplifting bug 1660458 at this point.

I'll let Markus weigh in as well...

I think it's better to uplift both patches. Should be less risky since it won't create a new untested code path.

Flags: needinfo?(mstange.moz)

Comment on attachment 9176983 [details]
Bug 1666246 - Allow duplicate ThreadIds in the profiler's registered thread list. r=gerald

Approved for 81.0.1 and Fenix 81.1.2.

Attachment #9176983 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: