Status

()

defect
P1
critical
VERIFIED FIXED
Last year
Last year

People

(Reporter: calixte, Assigned: snorp)

Tracking

({crash, regression})

Trunk
Firefox 61
Unspecified
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox-esr60 unaffected, firefox59 unaffected, firefox60 fixed, firefox61blocking verified)

Details

(crash signature)

Attachments

(1 attachment)

This bug was filed from the Socorro interface and is
report bp-6b824cd7-0d0e-49a0-8f61-c56030180321.
=============================================================

Top 10 frames of crashing thread:

0 libc.so libc.so@0x19fce 
1 libnss3.so __fixunsdfsi 
2 libnss3.so pr_FindSymbolInLib nsprpub/pr/src/linking/prlink.c:1180
3 libnss3.so __fixunsdfsi 
4 libnss3.so PR_FindSymbol nsprpub/pr/src/linking/prlink.c:1216
5 libxul.so js::ctypes::Library::Declare js/src/ctypes/Library.cpp:342
6 libxul.so js::NativeObject::copyDenseElements js/src/vm/NativeObject-inl.h:156
7 libxul.so js::ctypes::Library::Close js/public/Value.h
8 libxul.so js::InternalCallOrConstruct js/src/vm/JSContext-inl.h:290
9 libmozglue.so arena_t::MallocSmall memory/build/Mutex.h:79

=============================================================

There are 61 crashes in nightly 61 with buildid 20180321040529.
:glandium, could you investigate please ?
Flags: needinfo?(mh+mozilla)
Over 2341 crashes in the build from March 21 - the spike seems to have gone down since then.
I was just able to reproduce this on my Pixel device, but simply typing in the URL bar. Working on better STR.
Michael: Any idea what might have caused this spike? https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=bfb7edfd0436db388bb9e103b8ad817fc50bfdcf&tochange=e636edf00e6fbdc3206c9df4a1548ae38b3d13fa appears to be the possible regression range based on the Build ID.
Flags: needinfo?(michael.l.comella)
This looks like a platform issue: Snorp, do you know what's up?
Flags: needinfo?(michael.l.comella) → needinfo?(snorp)
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #2)
> I was just able to reproduce this on my Pixel device, but simply typing in
> the URL bar. Working on better STR.

The second time I crashed I was typing in the URL bar and then backspaced a few times. It is not always reproducible though.
Same regression range as bug 1448106. Are they possibly related?
It does seem related somehow. Investigating some more...
Flags: needinfo?(snorp)
4751 crashes in the build from the 3/22 - so we have two builds (where we generated over 4K in crashes.

Some comments:
*It actually seems to happen more often when I don't have the app open at all
*App continuously crashes upon opening separate apps apart from the the browser about 4-5 times at a time several times a day. I love the app, but it really is annoying to have to report on these errors that aren't being fixed at least on my end with the device I'm using.  
*happened when I switched to Firefox from other app
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #3)
> Michael: Any idea what might have caused this spike?
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=bfb7edfd0436db388bb9e103b8ad817fc50bfdcf&tochange=e636
> edf00e6fbdc3206c9df4a1548ae38b3d13fa appears to be the possible regression
> range based on the Build ID.

As far as I can see, the crash rate already exploded with the previous build (20180321040529), so the range would be https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=bfb7edfd0436db388bb9e103b8ad817fc50bfdcf&tochange=3d21d31141dc5e2d2aff89203458125a3cce6c64.
(In reply to Jan Henning [:JanH] from comment #9)
> (In reply to Marcia Knous [:marcia - needinfo? me] from comment #3)
> > Michael: Any idea what might have caused this spike?
> > https://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=bfb7edfd0436db388bb9e103b8ad817fc50bfdcf&tochange=e636
> > edf00e6fbdc3206c9df4a1548ae38b3d13fa appears to be the possible regression
> > range based on the Build ID.
> 
> As far as I can see, the crash rate already exploded with the previous build
> (20180321040529), so the range would be
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=bfb7edfd0436db388bb9e103b8ad817fc50bfdcf&tochange=3d21
> d31141dc5e2d2aff89203458125a3cce6c64.

Thanks Jan - lots of the comments mention a background process causing the crash, or Fennec crashing in the background. Anything in that set looks as if we touched code in that area?
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #7)
> It does seem related somehow. Investigating some more...

I'm using an original Google Pixel and have encountered both bug 1447607 and bug 1448106 on current latest nightly, within minutes of each other.

report 4e50a496-bd94-493c-8113-2e6e40180324
report df4d542e-ee62-4c29-96bd-ccb5b0180324
report aab091a6-c906-4d9a-8813-9636b0180324

Can't tell you much more, since it seemed pretty random and I don't really use my phone that much.

Maybe unrelated, but I also have a 100% reproducible bug if anyone is interested.

bug 1448211
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #2)
> I was just able to reproduce this on my Pixel device, but simply typing in
> the URL bar. Working on better STR.

Are you using an original Google Pixel? I wonder if you can also reproduce a bug I get on my Google Pixel.

bug 1448211
This is the #1 topcrash in the Android nightly 20180322100140, with 7609
crashes reported.
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #10)
> (In reply to Jan Henning [:JanH] from comment #9)
> > As far as I can see, the crash rate already exploded with the previous build
> > (20180321040529), so the range would be
> > https://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=bfb7edfd0436db388bb9e103b8ad817fc50bfdcf&tochange=3d21
> > d31141dc5e2d2aff89203458125a3cce6c64.
> 
> Thanks Jan - lots of the comments mention a background process causing the
> crash, or Fennec crashing in the background. Anything in that set looks as
> if we touched code in that area?

Somebody with better debugging skills and more time than me needs to look further into this, but I happened to run into what is presumably this crash while I had the Android Studio debugger attached for some other problem. The backtrace there isn't particularly helpful - all I get is
> * thread #55, name = 'DOM Worker', stop reason = signal SIGSEGV: invalid address (fault address: 0x8df00df4)
>   * frame #0: 0xb6cce096 libc.so`strlen + 54

With the patch for bug 1448693 applied, DumpJSStack() however gives the following output, which might provide some further clues:
> 0 declareFFI(lib = [object Library], symbol = "copyfile", abi = ctypes.default_abi, returnType = [object Object], [object Object], [object Object], [object Object], [object Object]) ["resource://gre/modules/workers/require.js line 141 > Function line 3 > eval":1138]
> 1 get() ["resource://gre/modules/workers/require.js line 141 > Function line 3 > eval":1013]
>     this = [object Object]
> 2 anonymous(exports = [object DedicatedWorkerGlobalScope]) ["resource://gre/modules/osfile/osfile_unix_front.jsm":481]
> 3 <TOP LEVEL> ["resource://gre/modules/osfile/osfile_unix_front.jsm":23]
> 4 <TOP LEVEL> ["resource://gre/modules/osfile.jsm":27]
> 5 anonymous(exports = [object DedicatedWorkerGlobalScope]) ["resource://gre/modules/osfile/osfile_async_worker.js":24]
> 6 <TOP LEVEL> ["resource://gre/modules/osfile/osfile_async_worker.js":13]
(In reply to Jan Henning [:JanH] from comment #14)
> (In reply to Marcia Knous [:marcia - needinfo? me] from comment #10)
> > (In reply to Jan Henning [:JanH] from comment #9)
> > > As far as I can see, the crash rate already exploded with the previous build
> > > (20180321040529), so the range would be
> > > https://hg.mozilla.org/mozilla-central/
> > > pushloghtml?fromchange=bfb7edfd0436db388bb9e103b8ad817fc50bfdcf&tochange=3d21
> > > d31141dc5e2d2aff89203458125a3cce6c64.
> > 
> > Thanks Jan - lots of the comments mention a background process causing the
> > crash, or Fennec crashing in the background. Anything in that set looks as
> > if we touched code in that area?
> 
> Somebody with better debugging skills and more time than me needs to look
> further into this, but I happened to run into what is presumably this crash
> while I had the Android Studio debugger attached for some other problem. The
> backtrace there isn't particularly helpful - all I get is
> > * thread #55, name = 'DOM Worker', stop reason = signal SIGSEGV: invalid address (fault address: 0x8df00df4)
> >   * frame #0: 0xb6cce096 libc.so`strlen + 54
> 
> With the patch for bug 1448693 applied, DumpJSStack() however gives the
> following output, which might provide some further clues:
> > 0 declareFFI(lib = [object Library], symbol = "copyfile", abi = ctypes.default_abi, returnType = [object Object], [object Object], [object Object], [object Object], [object Object]) ["resource://gre/modules/workers/require.js line 141 > Function line 3 > eval":1138]
> > 1 get() ["resource://gre/modules/workers/require.js line 141 > Function line 3 > eval":1013]
> >     this = [object Object]
> > 2 anonymous(exports = [object DedicatedWorkerGlobalScope]) ["resource://gre/modules/osfile/osfile_unix_front.jsm":481]
> > 3 <TOP LEVEL> ["resource://gre/modules/osfile/osfile_unix_front.jsm":23]
> > 4 <TOP LEVEL> ["resource://gre/modules/osfile.jsm":27]
> > 5 anonymous(exports = [object DedicatedWorkerGlobalScope]) ["resource://gre/modules/osfile/osfile_async_worker.js":24]
> > 6 <TOP LEVEL> ["resource://gre/modules/osfile/osfile_async_worker.js":13]

Thanks Jan. At least now we know why ctypes is showing up. The question is why PR_FindSymbolInLib() suddenly crashes. Glandium, have there been any linker changes that could cause this?
There haven't been changes to the linker since mid-february.
Flags: needinfo?(mh+mozilla)
And copyfile is not a symbol available on Android.
FWIW, I'm affected of this bug. Let me know if I can be of any help, in case this needs more investigation.
I actually got this crash myself over the weekend. Nightly was in the background when it occurred, so it seems likely that it was the Service Worker path we see above.

I'm going to try to artificially trigger a failed PR_FindSymbolInLib() to see if I can repro that way.
Crash Signature: [@ __fixunsdfsi] → [@ __fixunsdfsi] [@ pr_FindSymbolInLib]
See Also: → 1448106
This and bug 1448106 are currently accounting for 95% of all Fennec Nightly crashes (over 37000 in the last week). This needs to be someone's top priority until the culprit is either found and backed out or a fix can be landed.
Flags: needinfo?(dbolter)
Yeah I didn't have any luck reproducing locally by using PR_FindSymbolInLib() in a similar way. I'm wondering if it's somehow specific to Service Workers.
If we had reliable STR that would really help, since maybe we would have a shot at bisecting....
Crash Signature: [@ __fixunsdfsi] [@ pr_FindSymbolInLib] → [@ __fixunsdfsi] [@ pr_FindSymbolInLib] [@ DLLErrorInternal] [@ strlen | DLLErrorInternal]
:asuth they suspect SW may be involved in some way, can you give your 2 cents here? It's a top crasher for them. Don't take over the bug but give any input you can.
Flags: needinfo?(bugmail)
Not ServiceWorkers related, that's a ChromeWorker for the OSFile logic using jsctypes.  The crash is in a strlen() call, all the __fixunsdfsi stack signatures are just because we don't have symbols for most libc.so hashes.  I assume the ones we have are due to someone running Ted's script at https://wiki.mozilla.org/Breakpad:Symbols#Android_system_symbols.  If we can't find a public repo of symbol tables for other places, maybe we should make a community push to get more people to run the script.

The strlen(error) call is either being given a completely random `error` pointer or a non-NUL-terminated one so that it has some chance of reading into illegal memory.  We're presumably in either the USE_DLFCN or HAVE_STRERROR ifdef case and one of dlerror() or strerror() is betraying us.  I'll look a little more into the branch taken there.
Flags: needinfo?(bugmail)
[triage] Critical: top nightly crasher.

Let's try to an assignee for this bug.
Flags: needinfo?(sdaswani)
Priority: -- → P1
It looks possible that what's happening is:
- per the registers and disassembly of our android binary, it looks like we gave strlen() a garbage pointer because its r0 and r4 are both the crash address (and dlerror returned into r4, which we assigned into r4 for the call).
- dlerror() in Android's bionic libc uses thread-local-storage (TLS) to store dlerror results.  When you call it, it returns the current value and overwrites what was there with NULL.  It's not clear to me that the TLS value is zeroed as part of thread startup, so if it's our first call to dlerror() and no one wrote a valid or NULL pointer into there, we could end up with a garbage pointer address.
- bionic's dl* functions wrap underlying linker functions.  If they fail, they set an error string.  In many cases, failure is returning a null pointer, like for dlopen.  However...
- For dlsym, bool do_dlsym(..., OUTPARAM result) is invoked, with the OUTPARAM result being returned but dlsym's bool return value determining whether the link's error message is propagated to the dlerror error message infra.
- do_dlsym is a big function.  So it's possible its return value could indicate success, but return a null pointer, which will cause us to invoke dlerror() and get the uninitialized value.

I could be wrong about this, more investigation is needed.  However, if I'm right, a simple workaround is to invoke dlerror() once and ignore its return value before we use it for anything.  This will ensure the TLS slot is properly initialized to null.  I'll look a little more...
David and James, I'm going to remove my NI as I assume this is a 'platform' bug. If it's a 'Front End' bug and you want my 'Front End' engineers to work on it, NI me back.
Flags: needinfo?(sdaswani) → needinfo?(snorp)
(In reply to Andrew Sutherland [:asuth] from comment #27)
> 
> I could be wrong about this, more investigation is needed.  However, if I'm
> right, a simple workaround is to invoke dlerror() once and ignore its return
> value before we use it for anything.  This will ensure the TLS slot is
> properly initialized to null.  I'll look a little more...

Ah, interesting! One thing to note, however, is that we wrap dl* with our own functions in ElfLoader.cpp[0]. AFAICT it looks like our own dlerror() could also be returning a garbage pointer.


[0] https://searchfox.org/mozilla-central/source/mozglue/linker/ElfLoader.cpp#70
Flags: needinfo?(snorp)
Ah, that makes more sense!  Because bionic's pthread's __init_tls method uses mmap with MAP_ANONYMOUS to allocate the per-thread slots, so all TLS slots should indeed be zero-initialized.  Plus, it mprotects both sides of the memory region so someone would have to almost intentionally be trashing that TLS slot.  Also, the do_dlsym path indeed does seem to be safe.
One obvious problem in ElfLoader.cpp is that LastError is not set to dlerror when dlsym returns null.
And yeah, it looks like the ElfLoader constructor[1] really wants to initialize lastError since it's POD and not a smart pointer or anything:
  ElfLoader() : expect_shutdown(true), lastError(nullptr)
instead of 
  ElfLoader() : expect_shutdown(true)

Given that _wrap_dlsym()[2] doesn't do anything if it ends up punting to dlsym() and dlsym() returns null.  It seems like __wrap_dlsym really wants to check the return value of dlsym() and propagate dlerror() in the result of an error.

1: https://searchfox.org/mozilla-central/rev/49cc27555d5b7ed2eb06baf156693dd578daa06f/mozglue/linker/ElfLoader.h#457
2: https://searchfox.org/mozilla-central/rev/49cc27555d5b7ed2eb06baf156693dd578daa06f/mozglue/linker/ElfLoader.cpp#78
Assignee: nobody → snorp
Clearing my NI - James will post a patch.
Flags: needinfo?(dbolter)
Is it possible to know what is the patch (if one) which introduced this regression ?
(In reply to Calixte Denizet (:calixte) from comment #35)
> Is it possible to know what is the patch (if one) which introduced this
> regression ?

If the problem Andrew found is the cause of this bug, then we have likely been getting lucky most of the time, and some recent change in the memory layout made it so we are no longer so lucky. AFAICT, the bug has been here for 5+ years.
Another thing I noticed is that ElfLoader doesn't use TLS to store the error, which means the whole thing is pretty racy. Presumably that will not cause any problem other than getting a null or incorrect error, which should not cause a crash, but it's something that we may want to fix at some point.
Comment on attachment 8963228 [details]
Bug 1447607 - Correctly init and update ElfLoader::Singleton::lastError

https://reviewboard.mozilla.org/r/232108/#review237916

::: mozglue/linker/ElfLoader.cpp:91
(Diff revision 1)
>    if (handle != RTLD_DEFAULT && handle != RTLD_NEXT) {
>      LibHandle *h = reinterpret_cast<LibHandle *>(handle);
>      return h->GetSymbolPtr(symbol);
>    }
> -  return dlsym(handle, symbol);
> +
> +  dlerror();

The same should be done in SystemElf::GetSymbolPtr, SystemElf::Load and SystemElf::~SystemElf. I'm also tempted to say we should somehow wrap RTLD_DEFAULT and RTLD_NEXT in a SystemElf and use SystemElf::GetSymbolPtr, which would have the side effect of giving us a debug log for those too.
Attachment #8963228 - Flags: review?(mh+mozilla)
Duplicate of this bug: 1448106
Comment on attachment 8963228 [details]
Bug 1447607 - Correctly init and update ElfLoader::Singleton::lastError

https://reviewboard.mozilla.org/r/232108/#review237984

::: mozglue/linker/ElfLoader.cpp:91
(Diff revision 2)
>    if (handle != RTLD_DEFAULT && handle != RTLD_NEXT) {
>      LibHandle *h = reinterpret_cast<LibHandle *>(handle);
>      return h->GetSymbolPtr(symbol);
>    }
> -  return dlsym(handle, symbol);
> +
> +  dlerror();

Sorry for the confusion, this defensive dlerror() call is not necessary, as bionic's slots do get initialized to 0.  The underlying bionic lib is sound, I just hadn't realized this whole ElfLoader.cpp thing existed and that it provided an interposed dlerror() implementation, so I went looking deeper in the stack for the problem.

I'll try and see if there's some way to make searchfox better indicate the situation where dlopen()/malloc()/whatever() looks like the system implementation you assume it would be, but in fact it's a wrapping moz implementation.
(In reply to Andrew Sutherland [:asuth] from comment #41)
> Comment on attachment 8963228 [details]
> Bug 1447607 - Correctly init and update ElfLoader::Singleton::lastError
> 
> https://reviewboard.mozilla.org/r/232108/#review237984
> 
> ::: mozglue/linker/ElfLoader.cpp:91
> (Diff revision 2)
> >    if (handle != RTLD_DEFAULT && handle != RTLD_NEXT) {
> >      LibHandle *h = reinterpret_cast<LibHandle *>(handle);
> >      return h->GetSymbolPtr(symbol);
> >    }
> > -  return dlsym(handle, symbol);
> > +
> > +  dlerror();
> 
> Sorry for the confusion, this defensive dlerror() call is not necessary, as
> bionic's slots do get initialized to 0.  The underlying bionic lib is sound,
> I just hadn't realized this whole ElfLoader.cpp thing existed and that it
> provided an interposed dlerror() implementation, so I went looking deeper in
> the stack for the problem.

Oh, hmm. I was thinking we'd still want this in order to clear an existing error, but dl* don't do this for you so maybe we shouldn't either.
Comment on attachment 8963228 [details]
Bug 1447607 - Correctly init and update ElfLoader::Singleton::lastError

https://reviewboard.mozilla.org/r/232108/#review238112

::: mozglue/linker/ElfLoader.cpp:80
(Diff revision 2)
>  }
>  
>  void *
>  __wrap_dlsym(void *handle, const char *symbol)
>  {
> +  ElfLoader::Singleton.lastError = nullptr;

All branches are setting lastError, so this shouldn't be necessary.

::: mozglue/linker/ElfLoader.cpp:91
(Diff revision 2)
>    if (handle != RTLD_DEFAULT && handle != RTLD_NEXT) {
>      LibHandle *h = reinterpret_cast<LibHandle *>(handle);
>      return h->GetSymbolPtr(symbol);
>    }
> -  return dlsym(handle, symbol);
> +
> +  dlerror();

per asuth's comment, please remove the empty dlerror() calls.

::: mozglue/linker/ElfLoader.cpp:364
(Diff revision 2)
>    if (path && path[0] == '/' && (access(path, F_OK) == -1)){
>      DEBUG_LOG("dlopen(\"%s\", 0x%x) = %p", path, flags, (void *)nullptr);
>      return nullptr;
>    }

You'll want to do something to lastError in this case.
Attachment #8963228 - Flags: review?(mh+mozilla)
Duplicate of this bug: 1450312
glandium, I'm on PTO next week, so please commit this if you r+. Thanks!
Flags: needinfo?(mh+mozilla)
Comment on attachment 8963228 [details]
Bug 1447607 - Correctly init and update ElfLoader::Singleton::lastError

https://reviewboard.mozilla.org/r/232108/#review238584
Attachment #8963228 - Flags: review?(mh+mozilla) → review+
I can't autoland because of the open issue that I can't close because I haven't authored them, and I'm not the patch author. Andrew, can you close your issue and autoland?
Flags: needinfo?(mh+mozilla) → needinfo?(bugmail)
I pushed the "fixed" button.  Did that do it?
Flags: needinfo?(bugmail)
Re-reading more closely, I've now pushed the land button too and auto-land seems like it's doing something.
Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/3dcf849314ce
Correctly init and update ElfLoader::Singleton::lastError r=glandium
https://hg.mozilla.org/mozilla-central/rev/3dcf849314ce
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → Firefox 61
Duplicate of this bug: 1450846
Comment on attachment 8963228 [details]
Bug 1447607 - Correctly init and update ElfLoader::Singleton::lastError

Approval Request Comment
[Feature/Bug causing the regression]: None, has existed for 5+ years!
[User impact if declined]: Intermittent startup crashes
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: Straightforward error handling fix
[String changes made/needed]: None
Attachment #8963228 - Flags: approval-mozilla-release?
Attachment #8963228 - Flags: approval-mozilla-beta?
From irc discussion with snorp. This may affect the crash in bug 1450793. 

I'll have a look back at the crash info for beta 60 in Google Play to see if we have clear data that we might see reflected if we uplift this and other work to beta 12.
Comment on attachment 8963228 [details]
Bug 1447607 - Correctly init and update ElfLoader::Singleton::lastError

Looks like it definitely made an impact on Nightly with no known regressions thus far. Worth taking in our next Fennec beta, whether that's a b12 build this week or b13 next week.
Attachment #8963228 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I'm not seeing any obvious impact on beta after beta 12. It's still too early to tell for beta 13.
Comment on attachment 8963228 [details]
Bug 1447607 - Correctly init and update ElfLoader::Singleton::lastError

60 is now on m-r.
Attachment #8963228 - Flags: approval-mozilla-release? → approval-mozilla-release-
I think we can say we're out of the woods here now :)
Status: RESOLVED → VERIFIED
Depends on: 1475482
You need to log in before you can comment on or make changes to this bug.