Closed Bug 1141775 Opened 7 years ago Closed 7 years ago
Wifi Scanner::Get Access Points From WLAN hang | crash in shutdownhang | Wait For Single Object Ex | Wait For Single Object | PR _Wait | ns Thread::Process Next Event(bool, bool*) | NS _Process Next Event(ns IThread*, bool) | ns Thread::Shutdown()
This bug was filed from the Socorro interface and is report bp-59eeb10e-867a-432b-b37b-f5dec2150305. ============================================================= This is the #10 topcrash in 36.0.1 and is still showing up significantly in 37.0b3. 38 and 39 don't look affected which could be from fixes already in, or could be because we don't have the same user population on the nightly and aurora channels. Crashing thread: 0 xul.dll mozilla::`anonymous namespace'::RunWatchdog(void*) toolkit/components/terminator/nsTerminator.cpp 1 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c 2 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c 3 msvcr120.dll _callthreadstartex f:\dd\vctools\crt\crtw32\startup\threadex.c:376 4 msvcr120.dll msvcr120.dll@0x2c000 5 kernel32.dll BaseThreadInitThunk 6 ntdll.dll __RtlUserThreadStart 7 ntdll.dll _RtlUserThreadStart
The stack from the "crashing thread" isn't that helpful for those, you need to expand the thread and look at the second listed one, which should have the frames you actually see in the signature. From what I've seen, we have different stacks with the same signature, though.
I found STR by testing the bug 1145111. 1) Load the testcase https://bugzilla.mozilla.org/attachment.cgi?id=8580021 2) When the geoloc notificatiom pops up, click on "Always Share Location" 3) Wait for ~30/45 sec during the demo is running (see console log) 4) Close Firefox and wait for ~65 sec (I measured). Result: FF crashes after 65 sec. https://crash-stats.mozilla.com/report/index/2b51bfaf-64f3-491b-abf9-3e28e2150320 If you restart FF immediately (2-3 sec) after closing it, you will see the message "Firefox is already running, but is not responding."
I found a regression range with my STR: good=2014-11-15 bad=2014-11-16 http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=acbd7b68fa8c&tochange=a52bf59965a0 My guess goes to suspected bug 1044020. I tested and FF38/39 are affected too. David, any idea about the crash?
I can reproduce the bug. I actually couldn't restart Firefox at all (due to .parentLock), which is pretty bad, but that part is probably another bug. The ~65 seconds delay is normal for such a crash, that's approximately the delay after which we decide that the main thread died at some point during shutdown and we need to crash. The culprit is most likely the main thread doing a PR_Wait. No clue why, but the shutdown terminator doesn't look suspect for that crash.
As this is a shutdown crash and we don't yet have a fix, I think we're going to have to live with this in 37. Looking at the crash stats, I expect that this will be as significant in 37 as it was in 38. This bug should be prioritized highly. yoric - Do you intend to take this bug or do we need to find someone on the necko team to take it?
(In reply to Loic from comment #2) > I found STR by testing the bug 1145111. > > 1) Load the testcase https://bugzilla.mozilla.org/attachment.cgi?id=8580021 > 2) When the geoloc notificatiom pops up, click on "Always Share Location" > 3) Wait for ~30/45 sec during the demo is running (see console log) > 4) Close Firefox and wait for ~65 sec (I measured). > > > Result: FF crashes after 65 sec. > https://crash-stats.mozilla.com/report/index/2b51bfaf-64f3-491b-abf9- > 3e28e2150320 > If you restart FF immediately (2-3 sec) after closing it, you will see the > message "Firefox is already running, but is not responding." It seems stuck in WinWifiScanner::GetAccessPointsFromWLAN which is run on 5 separate threads.
> yoric - Do you intend to take this bug or do we need to find someone on the > necko team to take it? I have no domain knowledge of the issue, so I'd rather leave someone with a better grasp of WinWifiScanner handle the situation. If we find out that we need to port the code to AsyncShutdown, I'll try and help.
The work that I did here was sort of a one-off project so I don't fully recall the details of how this is all put together, but it looks like `WlanRegisterNotification` is stuck trying to unregister for notifications. The documentation  mentions that this can become an issue if you call `WlanRegisterNotification` from a callback, which we don't do. Also, neither `OnScanComplete` nor `InterfaceScanCallbackData::OnInterfaceScanComplete` appears in any of the call stacks in the crash report so it doesn't seem like that's our problem. We only ever call `WlanRegisterNotification` to unregister for events in one place , so it shouldn't be too hard to track down what's happening if we can consistently reproduce. It looks like `WinWifiScanner::GetAccessPointsFromWLAN` was not designed to be called from multiple threads, and really it should be impossible for multiple Wifi scans to occur simultaneously. We should probably add some logic to abort a Wifi scan if one is already taking place (this likely makes sense on all platforms). If it turns out that `WlanRegisterNotification` can get stuck trying to unregister even without a callback running and without any interference from other threads trying to perform simultaneous Wifi scans, then we might try aborting our Wifi scan if the scans for individual interfaces don't complete in time . If we do this and we don't trust `WlanRegisterNotification` to unregister our notifications for us, we'll have to use `WlanCloseHandle` instead. I don't have a Window installation anymore so it'll be pretty difficult for me to work on this. Is anyone else able to pick this up? I can provide guidance as needed.  https://msdn.microsoft.com/en-us/library/windows/desktop/ms706771%28v=vs.85%29.aspx  https://hg.mozilla.org/mozilla-central/annotate/4d2d97b3ba34/netwerk/wifi/win_wifiScanner.cpp#l145  https://hg.mozilla.org/mozilla-central/annotate/4d2d97b3ba34/netwerk/wifi/win_wifiScanner.cpp#l140
Jason - Can you help find someone to work on this top crash during the 38 time frame?
(In reply to Honza Bambas (:mayhemer) from comment #10) > Any potential relation to > https://groups.google.com/forum/#!topic/mozilla.dev.platform/uTNODUOFi3E ? Err... different API.
Summary: crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | nsThread::Shutdown() → WinWifiScanner::GetAccessPointsFromWLAN hang | crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | nsThread::Shutdown()
Theory: nsWifiMonitor.StartWatching(L1 [aka listener 1]) - mThread is null, makes a new thread.. that thread calls nsWifiMonitor.Run, which calls DoScan which makes a new WinWifiScanner (which opens the dll and opens a single handle), and then calls GetAccessPointsFromWLAN which calls into the windows library - those calls naturally take a while to execute on the new thread. So this is the stack we see on the hangs. Notably nsWifiMontior.mReentrantMonitor is not held during GetAccessPointsFrom WLAN nsWifiMonitor.StopWatching(L1) - removes L1, marks KeepGoing false and sets mThread to nullptr. But mThread is still running and will be doing GetAccessPointsFromWLAN for a bit nsWifiMonitor.StartWatching(L2) - mThread is null again so it makes a new thread.. that thread calls .Run() which calls DoScan() as before. So we've got a new thread, but the same nwWifiMonitor - this time mWinWifiMonitor is not null so it reuses the old dll handle and then calls GetAccessPointsFromWLAN using that handle for the windows library calls. But the L1 code is still running on the original thread in the same library with the same handle I suspect that makes stuff hang. The SettingsCallback function of the NetworkGeoLocationProvider ( https://mxr.mozilla.org/mozilla-central/source/dom/system/NetworkGeolocationProvider.js#299 ) looks like it does this quick stop/start pattern that would trigger this. Tim, do you think that makes sense? I'm sure I can come up with a theoretical fix, I may need help verifying it.
Loic, this is wfm but you seem to be able to reproduce from comment 2. I have what seems like a fix available, and I have smoke tested it, can you try and verify? https://treeherder.mozilla.org/#/jobs?repo=try&revision=a236168c6d30
(In reply to Patrick McManus [:mcmanus] from comment #13) > Loic, this is wfm but you seem to be able to reproduce from comment 2. I > have what seems like a fix available, and I have smoke tested it, can you > try and verify? > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=a236168c6d30 I tried the current Nightly, it crashed. I tried your build 2 times (http://email@example.com/try-win32/firefox-40.0a1.en-US.win32.zip), no crash after closing the browser.
loic - awesome! thanks.
(In reply to Patrick McManus [:mcmanus] from comment #12) > Theory: > > nsWifiMonitor.StartWatching(L1 [aka listener 1]) > - mThread is null, makes a new thread.. that thread calls > nsWifiMonitor.Run, which calls DoScan which makes a new WinWifiScanner > (which opens the dll and opens a single handle), and then calls > GetAccessPointsFromWLAN which calls into the windows library - those calls > naturally take a while to execute on the new thread. So this is the stack we > see on the hangs. Notably nsWifiMontior.mReentrantMonitor is not held during > GetAccessPointsFrom WLAN > > nsWifiMonitor.StopWatching(L1) > - removes L1, marks KeepGoing false and sets mThread to nullptr. But > mThread is still running and will be doing GetAccessPointsFromWLAN for a bit > > nsWifiMonitor.StartWatching(L2) > - mThread is null again so it makes a new thread.. that thread calls .Run() > which calls DoScan() as before. So we've got a new thread, but the same > nwWifiMonitor - this time mWinWifiMonitor is not null so it reuses the old > dll handle and then calls GetAccessPointsFromWLAN using that handle for the > windows library calls. But the L1 code is still running on the original > thread in the same library with the same handle I suspect that makes stuff > hang. > > The SettingsCallback function of the NetworkGeoLocationProvider ( > https://mxr.mozilla.org/mozilla-central/source/dom/system/ > NetworkGeolocationProvider.js#299 ) looks like it does this quick stop/start > pattern that would trigger this. > > Tim, do you think that makes sense? I'm sure I can come up with a > theoretical fix, I may need help verifying it. That sounds pretty plausible to me!
Comment on attachment 8587963 [details] [diff] [review] one wifi monitor thread Approval Request Comment [Feature/regressing bug #]:985970 [User impact if declined]: shutdown hangs and inability of firefox to quickly restart [Describe test coverage new/current, TreeHerder]:this impacts geolocation and that has existing treeherder regression coverage [Risks and why]: the only real risk here is that I don't have geolocation domain knowledge for testing it wrt regressions (though I don't anticipate any) based on the code change - perhaps wesley hardman from bug 985970 can verify? I'm confident the reported crash is resolved. [String/UUID change made/needed]: none This patch only applies to WinWifiScanner crashes in shutdownhang - I suspect there are multiple completely unrelated causes of shutdownhang.
Comment on attachment 8587963 [details] [diff] [review] one wifi monitor thread All shutdown hangs fixes are good to take... Especially that early in the cycle. Should be in 38 beta 2 or 3.
I couldn't reproduce using the steps from comment 2, but accordingly to Socorro there are still crashes with this signature after the fix landed with on 40.0a1 - 20150409030206 and 39.0a2 - 20150409004007.
(In reply to Petruta Rasa [QA] [:petruta] from comment #24) > I couldn't reproduce using the steps from comment 2, but accordingly to > Socorro there are still crashes with this signature after the fix landed > with on 40.0a1 - 20150409030206 and 39.0a2 - 20150409004007. can you link to particular crash stack reports? It really helps.
https://crash-stats.mozilla.com/report/index/fdedff38-ddab-4abe-8734-d7b862150410 https://crash-stats.mozilla.com/report/index/6f443f52-d591-4ad7-a8b7-e5c8c2150410 https://crash-stats.mozilla.com/report/index/af1f6dc6-7904-4e99-84a9-0299a2150410 https://crash-stats.mozilla.com/report/index/fa610c4e-3805-4a46-99b7-be9d92150410 https://crash-stats.mozilla.com/report/index/3a51c8c9-96bc-42a6-b01f-5fe062150409 https://crash-stats.mozilla.com/report/index/493e6b21-b777-4c3d-b891-e6ba62150410 https://crash-stats.mozilla.com/report/index/d3cc947e-5945-4c92-9e21-e57e82150410 https://crash-stats.mozilla.com/report/index/9c1e3bb3-9582-4f00-86bc-0172c2150409
I'm pretty sure there are multiple different issues and stacks that lead to this same signature and that we only fixed one of them here. If w3e see a reduction due to this patch, then we're good. Other cases should be filed separately.
I looked at all the reports in comment 26 and I don't think the wifiscanner (this bug) is part of any of them. There are certainly other sources of shutdown hangs still to be addressed in different bugs. Only one of them https://crash-stats.mozilla.com/report/index/fa610c4e-3805-4a46-99b7-be9d92150410#allthreads has the wifiscanner thread at all, and afaict the shutdown event has not been observed for the wifi scanner yet in that case because the main thread is hung trying to shutdown the url-classifier service.
a bug for the classifier shutdown: https://bugzilla.mozilla.org/show_bug.cgi?id=1154386
Thanks Robert and Patrick. Marking as verified based on above comments.
Hi, Still have this bug with FFv39. Unable to use FF because it crash randomly and not always permit to send reports. Last crash reports are: https://crash-stats.mozilla.com/report/index/4398a094-63af-4057-b5db-500642150714 https://crash-stats.mozilla.com/report/index/9645f32b-5255-43a3-a01e-52eb02150708 https://crash-stats.mozilla.com/report/index/a9ccd443-3eec-4237-86e8-5ee802150706 https://crash-stats.mozilla.com/report/index/ab2bb92b-f581-4e01-ae69-2e4f52150701 could someone help?
(In reply to nhgastro from comment #31) > Hi, > Still have this bug with FFv39. > thanks for the crash reports - but its the wrong bug. This bug is about the wifi scanner which is not in your reports (and seems to be verified fixed). This might overlap with 1183477 - I'll put a note in there.
You need to log in before you can comment on or make changes to this bug.