crash in nsTimerImpl::PostTimerEvent()

RESOLVED WORKSFORME

Status

Core
XPCOM
--
critical
RESOLVED WORKSFORME
5 years ago
5 years ago

People

(Reporter: Bebe, Unassigned)

Tracking

({crash, reproducible})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [b2g-crash], crash signature)

Attachments

(2 attachments)

(Reporter)

Description

5 years ago
Created attachment 774623
Logcat from the crash

This bug was filed from the Socorro interface and is report bp-7b31028d-6007-4b2d-96b2-de6c72130712.
 ============================================================= 

I can reproduce this on:


Using a Gaia-UI Automation test:
https://github.com/mozilla/gaia-ui-tests/blob/master/gaiatest/tests/videoplayer/test_play_youtube_video.py

The test opens a browser and navigates to a YouTube video: http://m.youtube.com/watch?v=5MzuGWFIfio

At some point the phone crashes and reboots.
(Reporter)

Comment 1

5 years ago
I can reproduce this on:
Gecko  http://hg.mozilla.org/mozilla-central/rev/b44898282f21
Gaia  d94ed01a27125ea8dc91b9f16805411e2d2cc708
BuildID 20130712030200
Version 25.0a1

Comment 2

5 years ago
This is the mobile YouTube site, right? Can you reproduce this outside of automation?

Comment 3

5 years ago
I have triggered this crash using other browser and Marketplace tests. I don't think that YouTube test specifically is causing it, but it is a pretty reliable way to trigger it.

We'll spend some time to try and trigger it manually.
FWIW, I can't reproduce this crash by watching the video at http://m.youtube.com/watch?v=5MzuGWFIfio manually.

Updated

5 years ago
OS: Android → Gonk (Firefox OS)
Hardware: All → ARM
Whiteboard: [b2g-crash]

Updated

5 years ago
Crash Signature: [@ nsTimerImpl::PostTimerEvent()] → [@ nsTimerImpl::PostTimerEvent()] [@ pthread_mutex_lock | PR_Lock | nsTimerImpl::PostTimerEvent() ]

Updated

5 years ago
Duplicate of this bug: 894028

Updated

5 years ago
Blocks: 884399

Comment 6

5 years ago
I have been able to reproduce similar effect with a clone (hosted locally) of http://people.mozilla.com/~npierron/sunspider/hosted/ when run with the automation used by arewefastyet.

I run an "adb reboot", and now I constantly have this bug.
(Reporter)

Comment 7

5 years ago
This is still crashing our automation builds.
(Reporter)

Comment 8

5 years ago
Created attachment 776380
new logcat
(Reporter)

Comment 10

5 years ago
I was also able to reproduce this by loading:
http://firefoxos.123done.org

If it doesn't crash the first time, just close the browser app and reopen the link.

Comment 11

5 years ago
Florin, can you post the details of the build on which you are replicating this?
(Reporter)

Comment 12

5 years ago
Gecko  http://hg.mozilla.org/mozilla-central/rev/5976b9c673f8
Gaia  f2e2403873bcd83b046ff2b9baf61e8db6224496
BuildID 20130716030201
Version 25.0a1

Comment 13

5 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #6)
> I have been able to reproduce similar effect with a clone (hosted locally)
> of http://people.mozilla.com/~npierron/sunspider/hosted/ when run with the
> automation used by arewefastyet.
> 
> I run an "adb reboot", and now I constantly have this bug.

The script that I used [1] was doing:

        browser.go_to_url(self._start_page)
        # time.sleep(2)  # first of the two sleeps that made the crash go away
        browser.switch_to_content()
        # time.sleep(2)  # second of the two sleeps
        self.wait_for_element_present(*self._start_now_locator)
        link = self.marionette.find_element(*self._start_now_locator)
        self.assertTrue(link.text == 'Start Now!', '...')

And the error went away after adding the 2 sleeps.  From the screen, I was able to see the page being loaded and even the style sheet being applied before b2g's crash.

[1] https://github.com/nbp/gaia-ui-tests/blob/bench/gaiatest/tests/browser/benchmarks/test_bench_sunspider.py

Comment 14

5 years ago
I've also replicated this 6 times this morning using the steps from comment #10, on the same build as in comment #12.

I've also replicated it using nbp's STR and it seems to be the same bug.

Comment 15

5 years ago
Moving to the correct component, which would really have helped getting this prioritized. The crash address of 0xf56 is surprising, because it's unlikely to be a null-pointer deref but I also would not expect any heap allocations this low in the address space.

And we're missing the line number in nsTimerImpl::PostTimerEvent() because it's supposedly in Mutex.h (Mutex::Lock is the function).
Component: Gaia → XPCOM
Product: Boot2Gecko → Core

Comment 16

5 years ago
Also, it would very much help if people who are experiencing this could continue to post a few more crash report IDs.

Comment 18

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #15)
> Moving to the correct component, which would really have helped getting this
> prioritized. The crash address of 0xf56 is surprising, because it's unlikely
> to be a null-pointer deref but I also would not expect any heap allocations
> this low in the address space.

The crash address varies, as far as I can tell, but they're all small values, seemingly below 0xffff.  This can be caused by any number of things, for example, clobbering the top 16 bits of the address somehow?
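
As a rough illustration of the "clobbered top 16 bits" idea, the sketch below masks a made-up 32-bit pointer down to its low half. The pointer value 0x4A3C0F56 and the helper names are hypothetical, chosen only so that the result matches the 0xf56 address mentioned in comment 15; nothing here is taken from the crash reports.

```python
ADDRESS_MASK_LOW16 = 0xFFFF


def clobber_top_16_bits(address):
    """Return a 32-bit address with its upper 16 bits zeroed."""
    return address & ADDRESS_MASK_LOW16


def is_small_address(address):
    """True for non-zero addresses no larger than 0xffff."""
    return 0 < address <= ADDRESS_MASK_LOW16


# Hypothetical heap pointer; only its low half (0x0f56) mirrors comment 15.
hypothetical_pointer = 0x4A3C0F56
crash_address = clobber_top_16_bits(hypothetical_pointer)
print(hex(crash_address), is_small_address(crash_address))
```

This only shows that a plausible heap pointer would produce such a small value if its upper half were lost; it does not say that this is what actually happened.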

> And we're missing the line number in nsTimerImpl::PostTimerEvent() because
> it's supposedly in Mutex.h (Mutex::Lock is the function).

Which mutex is this?  If I'm reading this code correctly, we should not be trying to hold a mutex in PostTimerEvent.

Has somebody tried to capture this crash under a debugger?

Comment 19

5 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #18)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #15) 
> > And we're missing the line number in nsTimerImpl::PostTimerEvent() because
> > it's supposedly in Mutex.h (Mutex::Lock is the function).
> 
> Which mutex is this?  If I'm reading this code correctly, we should not be
> trying to hold a mutex in PostTimerEvent.
> 
> Has somebody tried to capture this crash under a debugger?

Could be the mutex in TimerThread::RemoveTimer: http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#336

Comment 20

5 years ago
(In reply to Josh Matthews [:jdm] from comment #19)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #18)
> > (In reply to Benjamin Smedberg  [:bsmedberg] from comment #15) 
> > > And we're missing the line number in nsTimerImpl::PostTimerEvent() because
> > > it's supposedly in Mutex.h (Mutex::Lock is the function).
> > 
> > Which mutex is this?  If I'm reading this code correctly, we should not be
> > trying to hold a mutex in PostTimerEvent.
> > 
> > Has somebody tried to capture this crash under a debugger?
> 
> Could be the mutex in TimerThread::RemoveTimer:
> http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#336

Yes, and that would indeed crash if gThread is bogus.  The only way I can imagine that to happen would be if we were shutting down and nsTimerImpl::Shutdown got called as we were running PostTimerEvent.  It seems like we don't protect accesses to gThread across threads, so this seems racy at first glance.

Comment 21

5 years ago
Thread 0 in the crash reports doesn't show any evidence that we are shutting down.

Comment 22

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #21)
> Thread 0 in the crash reports doesn't show any evidence that we are shutting
> down.

Hmm, yeah, you're right.

Comment 23

5 years ago
One idea which we can try to narrow down this crash is for somebody who can reproduce this in a local build to add the following code to the end of TimerThread::~TimerThread():

memset(this, 0, sizeof(*this));

and see if the crash is then converted to a near-null crash.  If that happens, then my theory would be that we have a refcount imbalance which causes gThread to get destroyed before we NS_RELEASE it...
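
The effect of that memset can be modelled in a few lines of Python with ctypes. This is only a sketch: `FakeTimerThread`, its fields, the pointer value and the 0x10 field offset are all invented for illustration, and the object here stays alive, so it models the zeroing rather than the use-after-free itself.

```python
import ctypes


class FakeTimerThread(ctypes.Structure):
    """Hypothetical stand-in for TimerThread; the fields are invented."""

    _fields_ = [
        ("refcnt", ctypes.c_uint32),
        ("lock", ctypes.c_void_p),  # pretend pointer to the underlying lock
    ]


def poison(obj):
    """Zero every byte of obj, like memset(this, 0, sizeof(*this))."""
    ctypes.memset(ctypes.addressof(obj), 0, ctypes.sizeof(obj))


def lock_pointer(obj):
    """Read the lock pointer as an integer (ctypes reports NULL as None)."""
    return obj.lock or 0


stale = FakeTimerThread(refcnt=1, lock=0x4A3C0F56)  # made-up pointer value
poison(stale)

# A stale user would now load a null lock pointer, so touching a field at a
# small offset inside the lock lands near address zero.
hypothetical_field_offset = 0x10
fault_address = lock_pointer(stale) + hypothetical_field_offset
print(hex(fault_address))
```

If the real crash turned into a near-null one in the same way, that would suggest the caller is reading a TimerThread that has already been destroyed.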

Comment 24

5 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #23)
> One idea which we can try to narrow down this crash is for somebody who can
> reproduce this in a local build to add the following code to the end of
> TimerThread::~TimerThread():
> 
> memset(this, 0, sizeof(*this));
> 
> and see if the crash is then converted to a near-null crash.  If that
> happens, then my theory would be that we have a refcount imbalance which
> causes gThread to get destroyed before we NS_RELEASE it...

If someone could build a patch with what you are suggesting and send it to try, then we can try test out what you are suggesting in comment 23.

Comment 25

5 years ago
I have a 100% reproducible case with my Unagi running master (today's build) on http://www.browserscope.org/, and would be happy to run a custom build, as Jason suggests in comment 24.

Comment 26

5 years ago
(In reply to comment #24)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #23)
> > One idea which we can try to narrow down this crash is for somebody who can
> > reproduce this in a local build to add the following code to the end of
> > TimerThread::~TimerThread():
> > 
> > memset(this, 0, sizeof(*this));
> > 
> > and see if the crash is then converted to a near-null crash.  If that
> > happens, then my theory would be that we have a refcount imbalance which
> > causes gThread to get destroyed before we NS_RELEASE it...
> 
> If someone could build a patch with what you are suggesting and send it to try,
> then we can try test out what you are suggesting in comment 23.

Do you know which trychooser syntax I should be using for your device?
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #26)
> (In reply to comment #24)
> > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #23)
> > > One idea which we can try to narrow down this crash is for somebody who can
> > > reproduce this in a local build to add the following code to the end of
> > > TimerThread::~TimerThread():
> > > 
> > > memset(this, 0, sizeof(*this));
> > > 
> > > and see if the crash is then converted to a near-null crash.  If that
> > > happens, then my theory would be that we have a refcount imbalance which
> > > causes gThread to get destroyed before we NS_RELEASE it...
> > 
> > If someone could build a patch with what you are suggesting and send it to try,
> > then we can try test out what you are suggesting in comment 23.
> 
> Do you know which trychooser syntax I should be using for your device?

try: -b do -p unagi -u none -t none
As a point of information - when you provide unagi in the trychooser syntax and the build is successful, then your try build should appear here:

https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/try-builds/

Comment 30

5 years ago
I downloaded the build, flashed it, crashed using my earlier steps, and got this crash report:

https://crash-stats.mozilla.com/report/index/3355c694-c9d0-47e8-9cca-0bc322130718

Comment 31

5 years ago
I'll need to process that using the try symbols to get any data; I'll do that tomorrow.
Flags: needinfo?(benjamin)

Comment 32

5 years ago
(In reply to Stephen Donner [:stephend] from comment #30) 
> https://crash-stats.mozilla.com/report/index/3355c694-c9d0-47e8-9cca-0bc322130718
The abort message is: "not reached: file JavaScriptTypes.cpp, line 145".

Comment 33

5 years ago
Another simple manual STR on unagi master:
1. launch browser
2. open tab tray
3. select settings
4. select About firefox
5. select FAQ

It will crash with this signature.

Updated

5 years ago
Keywords: reproducible

Comment 34

5 years ago
Does this reproduce on desktop-B2G builds? Given that I don't have a device, and debugging on-device is painful anyway, I'd love to have this on desktop-windows or desktop-linux B2G builds.
Flags: needinfo?(benjamin)

Comment 35

5 years ago
Issue still occurring on:
Unagi v1.2.0 Mozilla RIL
Build ID: 20130718030209
Gecko: http://hg.mozilla.org/mozilla-central/rev/f26e4c26ce4a
Gaia: 4ec7c428f6a63a44f888ea6f6ade0385c89ae305
Platform Version: 25.0a1

1) Open browser
2) Type website (e.g. cnn.com)
3) Observe

Result:
Device Crashes

Build Link: https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/nightly/mozilla-central-unagi/2013/07/2013-07-18-03-02-09/
Crash Report: https://crash-stats.mozilla.com/report/index/dfde895a-49a7-40a6-b7d9-7123f2130718
blocking-b2g: --- → koi?

Comment 36

5 years ago
(In reply to comment #35)
> Issue still occurring on:
> Unagi v1.2.0 Mozilla RIL
> Build ID: 20130718030209
> Gecko: http://hg.mozilla.org/mozilla-central/rev/f26e4c26ce4a
> Gaia: 4ec7c428f6a63a44f888ea6f6ade0385c89ae305
> Platform Version: 25.0a1
> 
> 1) Open browser
> 2) Type website (i.e. cnn.com)
> 3) Observe
> 
> Result:
> Device Crashes
> 
> Build Link:
> https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/nightly/mozilla-central-unagi/2013/07/2013-07-18-03-02-09/
> Crash Report:
> https://crash-stats.mozilla.com/report/index/dfde895a-49a7-40a6-b7d9-7123f2130718

This seems like a different bug, please file it separately (probably in Core::Graphics.)  Thanks!

Comment 37

5 years ago
Reference bug 895629 for comments 35 & 36

Comment 38

5 years ago
This was also affecting the performance FPS tests, however since the nightly builds on July 22nd it has not replicated.

Comment 39

5 years ago
(In reply to Dave Hunt (:davehunt) from comment #38)
> This was also affecting the performance FPS tests, however since the nightly
> builds on July 22nd it has not replicated.

Sounds like we can pull this off the tracker bug then if it can no longer be replicated.
No longer blocks: 884399

Comment 40

5 years ago
Shall we call this WFM then?
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → WORKSFORME

Updated

5 years ago
blocking-b2g: koi? → ---