Closed Bug 964537 Opened 11 years ago Closed 11 years ago

crash in strlen | __vfprintf

Categories

(Core :: IPC, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla31
blocking-b2g 1.4+
Tracking Status
firefox30 --- wontfix
firefox31 --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: astole, Assigned: dhylands)

References

Details

(Keywords: crash, regression, Whiteboard: [b2g-crash])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-96c5eada-2b77-4dff-9c39-ccff52140127. ============================================================= Hit this issue while waiting for the marketplace sign in screen to load. No exact STR. Will provide more information if more precise STR are found. Environmental Variables: Device: Buri v1.4 Master M-C Mozilla RIL BuildID: 20140127072047 Gaia: 75ea7d07cdb590722634016410e12819faf82e5a Gecko: 53376ef850fc Version: 29.0a1 Firmware Version: v1.2-device.cfg
Component: General → IPC
Whiteboard: [b2g-crash]
Frame Module Signature Source 0 libc.so strlen bionic/libc/arch-arm/bionic/strlen.c 1 libc.so __vfprintf bionic/libc/stdio/vfprintf.c 2 libc.so vsnprintf bionic/libc/stdio/vsnprintf.c 3 liblog.so __android_log_print system/core/liblog/logd_write.c 4 libxul.so mozilla::hal_impl::OomVictimLogger::Observe(nsISupports*, char const*, char16_t const*) hal/gonk/GonkHal.cpp 5 libxul.so nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) xpcom/ds/nsObserverList.cpp 6 libxul.so nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) xpcom/ds/nsObserverService.cpp 7 libxul.so mozilla::dom::ContentParent::ActorDestroy(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason) dom/ipc/ContentParent.cpp 8 libxul.so mozilla::dom::PContentParent::DestroySubtree(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason) /builds/slave/b2g_m-cen_ham_ntly-00000000000/build/objdir-gecko/ipc/ipdl/PContentParent.cpp 9 libxul.so mozilla::dom::PContentParent::OnChannelError() /builds/slave/b2g_m-cen_ham_ntly-00000000000/build/objdir-gecko/ipc/ipdl/PContentParent.cpp 10 libxul.so mozilla::dom::ContentParent::OnChannelError() dom/ipc/ContentParent.cpp 11 libxul.so mozilla::ipc::MessageChannel::NotifyMaybeChannelError() ipc/glue/MessageChannel.cpp 12 libxul.so mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError() ipc/glue/MessageChannel.cpp 13 libxul.so RunnableMethod<WebCore::ReverbConvolver, void (WebCore::ReverbConvolver::*)(), Tuple0>::Run() ipc/chromium/src/base/tuple.h 14 libxul.so MessageLoop::RunTask(Task*) ipc/chromium/src/base/message_loop.cc 15 libxul.so MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) ipc/chromium/src/base/message_loop.cc 16 libxul.so MessageLoop::DoWork() ipc/chromium/src/base/message_loop.cc 17 libxul.so mozilla::ipc::DoWorkRunnable::Run() ipc/glue/MessagePump.cpp 18 libxul.so nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp 19 libxul.so NS_ProcessNextEvent(nsIThread*, bool) xpcom/glue/nsThreadUtils.cpp 20 libxul.so mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp 21 libxul.so MessageLoop::RunInternal() ipc/chromium/src/base/message_loop.cc 22 libxul.so MessageLoop::Run() ipc/chromium/src/base/message_loop.cc 23 libxul.so nsBaseAppShell::Run() widget/xpwidgets/nsBaseAppShell.cpp 24 libxul.so nsAppStartup::Run() toolkit/components/startup/nsAppStartup.cpp 25 libxul.so XREMain::XRE_mainRun() toolkit/xre/nsAppRunner.cpp 26 libxul.so XREMain::XRE_main(int, char**, nsXREAppData const*) toolkit/xre/nsAppRunner.cpp 27 libxul.so XRE_main toolkit/xre/nsAppRunner.cpp 28 b2g main b2g/app/nsBrowserApp.cpp 29 libc.so __libc_init bionic/libc/bionic/libc_init_dynamic.c 30 @0xb00045a9 31 b2g NS_CStringSetData More reports: https://crash-stats.mozilla.com/report/list?product=B2G&signature=strlen+|+__vfprintf#tab-reports
Sarah - Can you have someone work on getting STR on this bug?
Flags: needinfo?(sparsons)
QA Contact: mvaughan
Unfortunately I am unable to reproduce this issue on the 01/27/14 or 02/03/14 Master (1.4) builds. Also, the reporter said that he has only seen this once and could not reproduce this crash again. There is something interesting I am seeing on the 1/27 build, but not on the 2/03 build, that could've possibly played a part in the crash. On the 1/27 build, the Persona sign-in window does not appear when tapping on the 'Sign In / Sign Up' button. Instead a throbber will appear over the button and then nothing else will happen. If I press the hardware Home button to minimize Marketplace, the Persona window will then appear. In the 2/03 build, the Persona window immediately appears as expected. I will continue to keep an eye out for this issue and I will leave steps-wanted in case someone else has the ability to find them.
Flags: needinfo?(sparsons)
strlen|__vfprintf normally smells like a null string or a garbage string, but in this case the log strings are hardcoded in OomVictimLogger::Observe. That function has seen some recent changes, so I'll need to double check that the code I'm looking at is what was actually in the crash dump.
We ought to be able to figure this out just by inspecting the disassembly, but I don't have a b2g build environment in front of me. dhylands can you take a look? http://mxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#1121 All the crashes are on address 0x2, as if that's what the printf was trying to use for %s. Did that variadic macro go wrong? It's suspicious that 0x2 is REG_NOSUB that was on the stack just prior.
Flags: needinfo?(dhylands)
The OOM Victim Logger should only get invoked when an OOM occurs. I suspect that partially explains why the STR is somewhat elusive. I'll take a look at the code and see if I can figure out what's causing the error. The stack crawls seem to be missing line numbers.
Flags: needinfo?(dhylands)
Does anybody know how to get the libxul.so and symbols that's associated with a particular crash dump? For example: if I look at: https://crash-stats.mozilla.com/report/index/dcafda5e-bc23-4c48-b3fc-e58d12140318 It gives a buildID of 20140310093315 I can't find that build over here: https://pvtbuilds.mozilla.org/pvt/mozilla.org/b2gotoro/nightly/mozilla-central-hamachi/2014/03/ I tried a hamachi build with master, and the OomVictimLogger seems to be working properly: OomLogger E [Kill]: select 1943 (Communications), adj 11, size 7233, to kill OomLogger E [Kill]: select 2067 (Messages), adj 12, size 6218, to kill OomLogger E [Kill]: send sigkill to 2067 (Messages), adj 12, size 6218 Looking at the source code, I think that the likely failure is occuring right here: http://dxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#1187 i.e. strchr returns NULL and the +2 makes it go to 2. I'll code up a patch to make this more robust. It would really useful to get the following output from a phone which is having the problem: adb shell dmesg | tail -10 I'm interested in the format of the dmesg (it doesn't actually matter what message content is). That's what OomVictimLogger is parsing, and its quite possible that the kernel on the phones that are seeing this problem are outputting something that we didn't expect.
Adding needinfo for Andrew since he reported the problem and can probably provide the dmesg info for me. I'm supposedly running the same firmware version. It's also possible that due to the way the logging occurs (its a circular buffer) that we can get malformed messages under just the wrong conditions. For reference, my output (with some tabs removed to make it look better in the bug) from dmesg looks like: <7>[688717.273124] ygg **** msm_batt_worker 1720 battery_voltage 4201 <7>[688837.593234] ygg **** msm_batt_worker 1720 battery_voltage 4200 <7>[688957.913076] ygg **** msm_batt_worker 1720 battery_voltage 4201 <7>[689078.233032] ygg **** msm_batt_worker 1720 battery_voltage 4201 <7>[689198.553382] ygg **** msm_batt_worker 1720 battery_voltage 4200 <7>[689318.873162] ygg **** msm_batt_worker 1720 battery_voltage 4200 <7>[689439.193234] ygg **** msm_batt_worker 1720 battery_voltage 4200 <7>[689559.513042] ygg **** msm_batt_worker 1720 battery_voltage 4200 <7>[689679.832974] ygg **** msm_batt_worker 1720 battery_voltage 4200 <7>[689800.153132] ygg **** msm_batt_worker 1720 battery_voltage 4200
Flags: needinfo?(astole)
Ah, I completely misread the bracing on Observe and didn't see the other log statements. The strchr+2 does seem more likely.
Putting you down as a reviewer, since by the mere fact that you've looked at the code it makes you the second most knowledgable person of this particular function :)
Attachment #8393098 - Flags: review?(dmajor)
Comment on attachment 8393098 [details] [diff] [review] Fix intermittent access through a null pointer Review of attachment 8393098 [details] [diff] [review]: ----------------------------------------------------------------- Remind me to stop reading code :)
Attachment #8393098 - Flags: review?(dmajor) → review+
Assignee: nobody → dhylands
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Looks like this has been resolved fixed so removing the needinfo.
Flags: needinfo?(astole)
Blocks: 939972
Blocks: 939372
No longer blocks: 939972
Comment on attachment 8393098 [details] [diff] [review] Fix intermittent access through a null pointer NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings. Bug 1026749 appears to be a dup of this bug, so I'm being proactive and requesting approval on this bug. [Approval Request Comment] Bug caused by (feature/regressing bug #): 939372 User impact if declined: B2G main process crashes. Testing completed: Landed on master 3 months ago. All OOM testing winds up testing code in this patch. Risk to taking this patch (and alternatives if risky): Extremely low (has been on master for 3 months with no further reported problems) String or UUID changes made by this patch: None
Attachment #8393098 - Flags: approval-mozilla-b2g30?
Dave -- could this be causing stack corruptions which is one of the suspicions in all these JS related crashes: bug 1026316?
Flags: needinfo?(dhylands)
I don't think so. The buggy code (in this bug) always tries to dereference memory location 0x2 which is invalid and causes the segfault. The bug originates from: > line_begin = strchr(line_begin, ']') + 2; The buffer is explicitly null terminated here: http://dxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#1228 So strchr will always return a valid pointer or NULL. Let's suppose that the ']' was just before the '\0'. The +2 would cause line_begin to skip over the '\0'. line_begin is only ever read (as part of the OOM_LOG statement), so the worst case would be a segfault when it eventually runs off the end of a page, and if it encountered a '\0' somewhere along the way you might get lots of crap logged into logcat (Note: following this logic also covers the case where somehow the explicit null termination didn't happen and strchr just goes wandering off the end of the buffer). So the other thing to consider is what might happen if __android_log_print is passed a really long string (we skip over the '\0' and find a '\0' some large number of bytes later. __android_log_print (system/core/liblog/logd_write.c:214) does a vsnprintf with sized buffer. So passing in a too long string shouldn't cause any buffer overflows here. So I think that covers all of the cases I can think of, and I don't see any of them writing anywhere, just reading.
Flags: needinfo?(dhylands)
blocking-b2g: --- → 1.4+
Keywords: steps-wanted
Attachment #8393098 - Flags: approval-mozilla-b2g30?
I really don't buy that all of those bugs are duplicates of this bug.
I've taken another look at the code, and I now see how the memory corruption could occur. In comment 18, there is a possibility that the ']' is at the end of the buffer, just before the '\0'. When 2 is added to line_begin, this pushes it beyond the end of the buffer. The loop then continues, and sets line_end to the newline past line_begin: http://dxr.mozilla.org/mozilla-central/source/hal/gonk/GonkHal.cpp#1233 Then line 1235 replaces the newline with a '\0', and this causes the memory corruption.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: