Closed Bug 1451891 Opened 2 years ago Closed 2 years ago

More crashes in __fixunsdfsi

Categories

(Firefox for Android :: General, defect, P1, critical)

ARM
Android
defect

Tracking

()

VERIFIED FIXED
Firefox 63
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox59 --- unaffected
firefox60 --- unaffected
firefox61 + fixed
firefox62 --- verified
firefox63 --- verified

People

(Reporter: mccr8, Assigned: jchen)

References

Details

(Keywords: crash, regression, Whiteboard: [geckoview:p2])

Crash Data

Attachments

(1 file)

I'm still seeing crashes with this signature from bug 1447607 in the April 4 Android Nightlies. It is still the top crash, though the volume is greatly reduced. It is "only" 27% of all Android crashes.

Here's an example:
  bp-978a316e-ec20-4fc6-9ef2-ee23f0180405

I don't know if this bug should be reopened or a new bug filed, so I'll just file a new bug and it can be duped over as appropriate.
This is still a top crash, so I'm nominating it for tracking.
David, is there somebody who can investigate this while snorp is on PTO? Thanks.
Flags: needinfo?(dbolter)
This could be related to ElfLoader not using thread-local-storage to store its errors, as raised by :snorp at https://bugzilla.mozilla.org/show_bug.cgi?id=1447607#c37.  Unfortunately, the lower level dlerror()'s returned strings aren't just C string literals that permanently exist in the process's address space, but stuff sprintf'ed into a per-thread buffer.  If the thread goes away, the buffer is no longer legal memory to read.

Of course, timing-wise, it does seem somewhat unlikely that this is directly caused by a race.  More likely is that the TLS issue lets mismatched "cause error that sets the error; call dlerror()" pairs become a problem.  Looking at the wrapped dlsym, it looks like there's a hazard still, namely at https://searchfox.org/mozilla-central/rev/2ce99e8054b0ff6ed1adf484aeaacacf2fea084c/mozglue/linker/ElfLoader.cpp#86 this block:

  if (handle != RTLD_DEFAULT && handle != RTLD_NEXT) {
    LibHandle *h = reinterpret_cast<LibHandle *>(handle);
    return h->GetSymbolPtr(symbol);
  }

If GetSymbolPtr could possibly return null, then we can expect someone to call dlerror() after calling this and get whatever lastError is, but nowhere in the branch of this wrapped-dlsym did we set lastError.  And lastError gets set all over the place in cases that don't seem to guarantee that the caller will invoke dlerror().


Making lastError something that's stored in TLS is probably the simplest option.  A more hacky option is to maintain a singleton char buffer on the ElfLoader itself and strncpy whatever dlerror() returns into that at the time, since that ensures per-thread pointers never leave the thread.  Further auditing of ElfLoader also sounds good.
Flags: needinfo?(dbolter) → needinfo?(snorp)
I don't see any crashes with a build ID later than the 4th, so maybe we're ok now?
Flags: needinfo?(snorp)
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
This appears to be back again in 62 nightly, at least I spotted one signature in crash stats - https://bit.ly/2JGevrN.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
A dozen of these in the June 27th build, from a number of different install times, making it the #2 Android, after a crash that was all from a single install time.
__fixunsdfsi is the top crash in Fennec Nightly 63.0a1.
I found this crash today testing 62 Beta 7 following the steps:

Case A:
1." Always restore" off;
2. From Recently closed I restored all tabs;
3. Close the browser and opened again.

Case B:
1. Switching between languages. 
2. Go to Mozilla -> About Firefox beta;
3. Close the browser and re-open.
Note: from 16 languages tested, I had 8 crashes.
Crash also reproduced with the following steps and setup:

Sony Xperia Z3 (Android 5.1.1)
Firefox Nightly 63.0a1 (2018-07-10)

1. Select Tools > Downloads
2. Close Firefox using its Quit option or by Force Stopping it (Android Settings > Apps > Firefox > Force Stop).
3. Reopen Firefox, downloads tab restored
Whiteboard: [geckoview]
Marcia, James says __fixunsdfsi is a generic function and Socorro should skip __fixunsdfsi stack frames to find a useful function name for the crash signature. Who should I ask about add __fixunsdfsi to Socorro's skip list?
Flags: needinfo?(mozillamarcia.knous)
Whiteboard: [geckoview] → [geckoview:p2]
(In reply to Chris Peterson [:cpeterson] from comment #10)
> Marcia, James says __fixunsdfsi is a generic function and Socorro should
> skip __fixunsdfsi stack frames to find a useful function name for the crash
> signature. Who should I ask about add __fixunsdfsi to Socorro's skip list?

You can file a bug in the Socorro::Signature component. It also only takes a few minutes to do the pull request yourself. Bug 1471379 is an example.
Depends on: 1475482
(In reply to Andrew McCreight [:mccr8] from comment #11)
> You can file a bug in the Socorro::Signature component. It also only takes a
> few minutes to do the pull request yourself.

Thanks! I filed bug 1475482.
Flags: needinfo?(mozillamarcia.knous)
We still have a lot of crashes in pr_FindSymbolInLib on beta 62. Is that also something we should add to the skip list?
Flags: needinfo?(nchen)
It looks like `DLLErrorInternal` and `pr_FindSymbolInLib` are actually involved in this crash so I guess they shouldn't go on the skip list. I'm looking at writing a speculative fix.
Assignee: nobody → nchen
Flags: needinfo?(nchen)
Comment on attachment 8994681 [details]
Bug 1451891 - Fix race conditions in __wrap_dlerror;

Actually this is a race condition and there is a better, proper fix.
Attachment #8994681 - Flags: review?(jorendorff)
Comment on attachment 8994681 [details]
Bug 1451891 - Fix race conditions in __wrap_dlerror;

https://reviewboard.mozilla.org/r/259198/#review266506

Also, why does having a null or incorrect dlerror message cause a crash? DLLInternalError appears handle the null error case:

https://searchfox.org/mozilla-central/rev/bdfd20ef30d521b57d5b6feeda71325e8b4cad66/nsprpub/pr/src/linking/prlink.c#118,125,135

::: mozglue/linker/ElfLoader.cpp:99
(Diff revision 2)
> -  void* sym = dlsym(handle, symbol);
> -  ElfLoader::Singleton.lastError = dlerror();
> +  ElfLoader::Singleton.lastError = nullptr; // Use system dlerror.
> +  return dlsym(handle, symbol);

Could there still be a race where some other thread sets lastError to a custom/non-system error pointer in between setting `lastError = nullptr` here and __wrap_dlerror fetching lastError.exchange(nullptr)?
Null is fine, but a dlerror message from another thread is potentially bad, because the thread the message came from could have exited. In that case the buffer that the dlerror message points to would have been freed, and reading the dlerror message would result in a crash.
Comment on attachment 8994681 [details]
Bug 1451891 - Fix race conditions in __wrap_dlerror;

https://reviewboard.mozilla.org/r/259198/#review266576
Attachment #8994681 - Flags: review?(mh+mozilla) → review+
Pushed by nchen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c95b51e4a996
Fix race conditions in __wrap_dlerror; r=glandium
https://hg.mozilla.org/mozilla-central/rev/c95b51e4a996
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 63
Need uplift based on Nightly results
Flags: needinfo?(nchen)
Comment on attachment 8994681 [details]
Bug 1451891 - Fix race conditions in __wrap_dlerror;

Approval Request Comment
[Feature/Bug causing the regression]: N/A
[User impact if declined]: Random crashes
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Not really
[Why is the change risky/not risky?]: Patch makes the affected code more thread-safe but should have no effect on functionality
[String changes made/needed]: None
Flags: needinfo?(nchen)
Attachment #8994681 - Flags: approval-mozilla-beta?
Comment on attachment 8994681 [details]
Bug 1451891 - Fix race conditions in __wrap_dlerror;

Crashes stopped on nightly since the landing of the patch 4 days ago and no regression reported, approving for the next Beta.
Attachment #8994681 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Device:
 - Samsung Galaxy Note 8 (Android 8.0)

Verified this following the steps provided in comment 8 and comment 9 and FF did not crash.
Status: RESOLVED → VERIFIED
Hardware: Unspecified → ARM
You need to log in before you can comment on or make changes to this bug.