Closed Bug 1367850 Opened 2 years ago Closed 2 years ago
REFTEST PROCESS-CRASH | http://10
.252 | application crashed [@ libc .73 .233:33649/tests/dom/media/test/crashtests/1236639 .html .so + 0x4027e]
17.29 KB, patch
|Details | Diff | Splinter Review|
17.35 KB, patch
|Details | Diff | Splinter Review|
I'm not completely sure this isn't a dupe. https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=4ac208f317b37f7ed2e669d5ca23064ebc979103&filter-classifiedState=unclassified&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-searchStr=autophone&selectedJob=97735888 https://treeherder.mozilla.org/logviewer.html#?job_id=97735888&repo=mozilla-beta&lineNumber=870 REFTEST PROCESS-CRASH | http://10.252.73.233:33649/tests/dom/media/test/crashtests/1236639.html | application crashed [@ libc.so + 0x4027e] Crash dump filename: /tmp/tmpCviVP9/34e795c8-c451-2540-c2ae-f92c6142f83a.dmp Operating system: Android 0.0.0 Linux 3.10.73-g9741316 #1 SMP PREEMPT Thu Nov 5 02:17:00 UTC 2015 armv8l CPU: arm ARMv1 ARM part(0x4100d030) features: half,thumb,fastmult,vfpv2,edsp,neon,vfpv3,vfpv4,idiva,idivt 8 CPUs GPU: UNKNOWN Crash reason: SIGSEGV Crash address: 0xe5e5e5e5 Process uptime: not available Thread 0 (crashed) 0 libc.so + 0x4027e r0 = 0xe5e5e5e5 r1 = 0x00000000 r2 = 0xd1e37728 r3 = 0xe02cb24c r4 = 0xe5e5e5e5 r5 = 0xffb07e10 r6 = 0xffb07e14 r7 = 0xd1e37728 r8 = 0xffb07e00 r9 = 0xbe414080 r10 = 0xffb07e08 r12 = 0xef37ad5c fp = 0xdf8c3c70 sp = 0xffb07dc0 lr = 0xef2acac3 pc = 0xf6fe727e Found by: given as instruction pointer in context 1 libnss3.so!PR_Lock [ptsynch.c:4ac208f317b3 : 177 + 0x5] sp = 0xffb07dd0 pc = 0xef2acac3 Found by: stack scanning 2 libxul.so!mozilla::OffTheBooksMutex::Lock [BlockingResourceBase.cpp:4ac208f317b3 : 382 + 0x5] r4 = 0xd1e37728 sp = 0xffb07dd8 pc = 0xdd4ea083 Found by: call frame info 3 libxul.so!mozilla::AndroidBridge::RunDelayedUiThreadTasks [Mutex.h:4ac208f317b3 : 210 + 0x5] This is not bug 1354853 which is resolved on beta and has a different stack. I'm going to use bug 1290569 to star this but hide the exploitablity.
Looks like a UAF in some Android widget code. Any ideas, Jim?
Component: Audio/Video: Playback → General
Product: Core → Firefox for Android
Flags: needinfo?(nchen) → needinfo?(rbarker)
Didn't mean to cancel the ni? Is there more info somewhere other than that stack trace? Just guessing with out more info, since this is the last test to run could this crash be on shutting down fennec? I believe that is what the log shows? I would assume RunDelayedUiThreadTasks is getting invoked after shutdown? I guess I can look as see if there is a way to prevent that? But under normal operation, since android just kills fennec and it normally does not shutdown cleanly, I don't see this getting hit often. I ran locally multiple times and was unable to reproduce it.
Just the logcat beginning at https://treeherder.mozilla.org/logviewer.html#?job_id=97735888&repo=mozilla-beta&lineNumber=7466 Unfortunately we were still getting our logcat messages trimmed by chatty and I don't see anything. Maybe you will. I haven't seen this again since it happened.
Looking at the code I'm fairly certain this is a race condition at shutdown. I'm thinking it may be a stale cache value on the UI thread? This call is guarded by two variables but neither one is protected by a lock nor are they atomic. Since both values are set in the main thread I would guess it is possible this guard fails at shutdown? I can make the boolean atomic and see if this still happens?
That sounds like it won't hurt and can only help. I'll keep an eye out and if we don't see it again in a while (months?) we can resolve it.
Ugly fix but it is the only solution I could come up with. Problem: The while loop in AndroidBridge::RunDelayedUiThreadTasks locks and unlock the Mutex mUiTaskQueueLock as it runs tasks in the UI thread. The AndroidUiThread destroys it self in this while loop with a task. The task is synchronous. Once the task has run, the main thread proceeds to destroy the AndroidBridge along with the mUiTaskQueueLock. When the MutexAutoLock and MutexAutoUnlock objects in AndroidBridge::RunDelayedUiThreadTasks try to lock/unlock the Mutex when they are destroyed, it is possible they are operating on a dangling pointer since the Mutex was already destroyed in the main thread. This is basically a race. Solution: When the AndroidUiThread gets destroyed, flush the queue and have the MutexAutoLock and MutexAutoUnlock forget the Mutex so they do not try and manipulate a deleted Mutex and return immediately. This probably doesn't happen very often in the wild since most users never shutdown fennec.
Since this is a racey crash on exit, is it a security issue?
Looks like the patch doesn't break anything on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0afff8555dc2f36f462ee6e03277af6d81054638
What do you think about moving `mUiTaskQueueLock` out of `AndroidBridge`, and just not free the lock when we shut down? We already do something similar in `nsAppShell` , where the lock gets initialized on startup but is never deleted on shutdown.  https://dxr.mozilla.org/mozilla-central/rev/bdb2387396b4a74dfefb7c983733eed3625e906a/widget/android/nsAppShell.cpp#89
(In reply to Jim Chen [:jchen] [:darchons] from comment #9) > What do you think about moving `mUiTaskQueueLock` out of `AndroidBridge`, > and just not free the lock when we shut down? We already do something > similar in `nsAppShell` , where the lock gets initialized on startup but > is never deleted on shutdown. > Perhaps just removing AndroidBridge::PostTaskToUiThread() and moving RunDelayedUiThreadTasks, mUiTaskQueue, and its lock to AndroidUiThread makes the most sense. There is only one external call to AndroidBridge::PostTaskToUiThread so I could just change that to mozilla::GetAndroidUiThread()->Dispatch(). Then I think the mUiTaskQueue Mutex could be deleted safely in AndroidUiThread.
Comment on attachment 8873644 [details] [diff] [review] 0001-Bug-1367850-Flush-UI-task-queue-after-AndroidUiThread-has-been-shutdown-r-jchen-17060116-21a60aa0e51b.patch Clearing review until I can create a new patch based on feedback.
Attachment #8873917 - Flags: review?(nchen) → review+
Comment on attachment 8873917 [details] [diff] [review] 0001-Bug-1367850-Move-Android-UI-thread-runnable-queue-from-AndroidBridge-to-AndroidUiThread-r-jchen-17060211-74bc139734bf.patch [Security approval request comment] How easily could an exploit be constructed based on the patch? I don't see how an exploit could be created. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? I'm not sure there is a security problem. Which older supported branches are affected by this flaw? introduced in 53. If not all supported branches, which bug introduced the flaw? Bug 1319850 Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? No, I don't think it is an exploitable flaw since it only happens when Fennec is shutdown which really doesn't happen much in the wild. How likely is this patch to cause regressions; how much testing does it need? I think it is unlikely to cause a regression.
Attachment #8873917 - Flags: sec-approval?
So I'm pretty sure this is the same bug as Bug 1354853. However I don't believe the patch for that bug actually fixed the problem.
New patch seems to work on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=914d1267e20faf5cd69dd6407264787515653194
This has missed Firefox 54. We've already done our last beta. I'm giving sec-approval for checkin on June 27 but not before then. That's two weeks into the next cycle. At that point, we'll want it onBeta (55) as well so please nominate a patch when you do the trunk checkin.
Attachment #8873917 - Flags: sec-approval? → sec-approval+
Approval Request Comment [Feature/Bug causing the regression]: Adding support for nsThread to Android UI thread [User impact if declined]: Possible use after free on shutdown [Is this code covered by automated tests?]: no [Has the fix been verified in Nightly?]: no [Needs manual test from QE? If yes, steps to reproduce]: no [List of other uplifts needed for the feature/fix]: none [Is the change risky?]: no [Why is the change risky/not risky?]: Just prevents mutex from being freed prematurely [String changes made/needed]: none
Attachment #8881541 - Flags: approval-mozilla-beta?
Randall pushed this to inbound earlier today. https://hg.mozilla.org/integration/mozilla-inbound/rev/a6f8452cbd4b
Whiteboard: [checkin on 6/27]
Attachment #8881541 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Depends on: 1376670
This got backed out in https://hg.mozilla.org/mozilla-central/rev/9af23c413a1f8d337b19b4f8450e241e91b71136 for causing bug 1376670.
I'm looking into now.
I've run my new patch through try four times and have not seen the try failures. Do I just push my patch again to inbound or do I need approval? Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=43321b4ce35e93cae47e5b893c554b83379b228d
You don't need to get approval again. Push when ready :)
Autophone doesn't invoke from -u all. It couldn't handle all of the load from everyone running -u all. To run the test which failed here use try: -b do -p android-api-15 -u autophone-crashtest-dom-media -t none. I manually invoked the test on your try build. You can see the appropriate test as Cdm1. https://treeherder.mozilla.org/#/jobs?repo=try&revision=43321b4ce35e93cae47e5b893c554b83379b228d&filter-searchStr=Cdm1 If you want to rerun your test without submitting a new try build, just retrigger it. I retriggered it for a total of 4 times.
Does beta need an updated patch as well?
(In reply to Julien Cristau [:jcristau] from comment #27) > Does beta need an updated patch as well? Yes, I was waiting to see if it gets backed out again since I wasn't 100% confident that I fixed the issue. There is an API change so it won't apply cleanly. I'll rebase and update the beta patch.
This is the updated version of the patch currently in mc, fixed for Beta.
Attachment #8881541 - Attachment is obsolete: true
Comment on attachment 8883640 [details] [diff] [review] V2 of Rebased patch for Beta carrying over beta approval to the new version of this patch
Attachment #8883640 - Flags: approval-mozilla-beta+
(In reply to Randall Barker [:rbarker] from comment #18) > [Is this code covered by automated tests?]: no > [Has the fix been verified in Nightly?]: no > [Needs manual test from QE? If yes, steps to reproduce]: no Setting qe-verify- based on Randall's assessment on manual testing needs.
(In reply to Andrei Vaida [:avaida], Desktop Release QA – please ni? me from comment #32) > (In reply to Randall Barker [:rbarker] from comment #18) > > [Is this code covered by automated tests?]: no > > [Has the fix been verified in Nightly?]: no > > [Needs manual test from QE? If yes, steps to reproduce]: no > > Setting qe-verify- based on Randall's assessment on manual testing needs. Verifying it will be difficult. The original bug would intermittently occur on shutdown which general only happens in tests since normal Fennec operation is for the OS to just kill the app. Additionally, if you use an add-on to exit, if the shutdown is taking too long, the OS will kill the app anyway. The best you can probably do is shutdown Fennec a lot with an add-on and ensure that it doesn't crash releasing the mutex after invoking a runnalbe in the UI thread. It wouldn't surprise me if there were other crashes on shutdown since testing shutdown in Fennec is problematic.
You need to log in before you can comment on or make changes to this bug.