WinWifiScanner::GetAccessPointsFromWLAN hang | crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | nsThread::Shutdown()

VERIFIED FIXED in Firefox 38

Status

()

defect
--
critical
VERIFIED FIXED
4 years ago
4 years ago

People

(Reporter: lizzard, Assigned: mcmanus)

Tracking

({crash, regression, reproducible})

36 Branch
mozilla40
x86
Windows NT
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify +

Firefox Tracking Flags

(firefox36 wontfix, firefox37+ wontfix, firefox38+ verified, firefox39+ verified, firefox40 verified)

Details

(crash signature)

Attachments

(1 attachment)

This bug was filed from the Socorro interface and is 
report bp-59eeb10e-867a-432b-b37b-f5dec2150305.
=============================================================
This is the #10 topcrash in 36.0.1 and is still showing up significantly in 37.0b3.   

38 and 39 don't look affected which could be from fixes already in, or could be because we don't have the same user population on the nightly and aurora channels. 

Crashing thread:

0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog(void*) 	toolkit/components/terminator/nsTerminator.cpp
1 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
2 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c
3 	msvcr120.dll 	_callthreadstartex 	f:\dd\vctools\crt\crtw32\startup\threadex.c:376
4 	msvcr120.dll 	msvcr120.dll@0x2c000 	
5 	kernel32.dll 	BaseThreadInitThunk 	
6 	ntdll.dll 	__RtlUserThreadStart 	
7 	ntdll.dll 	_RtlUserThreadStart

Comment 1

4 years ago
The stack from the "crashing thread" isn't that helpful for those, you need to expand the thread and look at the second listed one, which should have the frames you actually see in the signature. From what I've seen, we have different stacks with the same signature, though.

Comment 2

4 years ago
I found STR by testing the bug 1145111.

1) Load the testcase https://bugzilla.mozilla.org/attachment.cgi?id=8580021 
2) When the geoloc notificatiom pops up, click on "Always Share Location"
3) Wait for ~30/45 sec during the demo is running (see console log)
4) Close Firefox and wait for ~65 sec (I measured).


Result: FF crashes after 65 sec.
https://crash-stats.mozilla.com/report/index/2b51bfaf-64f3-491b-abf9-3e28e2150320
If you restart FF immediately (2-3 sec) after closing it, you will see the message "Firefox is already running, but is not responding."

Comment 3

4 years ago
I found a regression range with my STR:
good=2014-11-15
bad=2014-11-16
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=acbd7b68fa8c&tochange=a52bf59965a0

My guess goes to suspected bug 1044020.

I tested and FF38/39 are affected too.

David, any idea about the crash?
Flags: needinfo?(dteller)
Version: 37 Branch → 36 Branch
I can reproduce the bug. I actually couldn't restart Firefox at all (due to .parentLock), which is pretty bad, but that part is probably another bug.

The ~65 seconds delay is normal for such a crash, that's approximately the delay after which we decide that the main thread died at some point during shutdown and we need to crash.

The culprit is most likely the main thread doing a PR_Wait. No clue why, but the shutdown terminator doesn't look suspect for that crash.
Flags: needinfo?(dteller)
As this is a shutdown crash and we don't yet have a fix, I think we're going to have to live with this in 37. Looking at the crash stats, I expect that this will be as significant in 37 as it was in 38. This bug should be prioritized highly.

yoric - Do you intend to take this bug or do we need to find someone on the necko team to take it?
(In reply to Loic from comment #2)
> I found STR by testing the bug 1145111.
> 
> 1) Load the testcase https://bugzilla.mozilla.org/attachment.cgi?id=8580021 
> 2) When the geoloc notificatiom pops up, click on "Always Share Location"
> 3) Wait for ~30/45 sec during the demo is running (see console log)
> 4) Close Firefox and wait for ~65 sec (I measured).
> 
> 
> Result: FF crashes after 65 sec.
> https://crash-stats.mozilla.com/report/index/2b51bfaf-64f3-491b-abf9-
> 3e28e2150320
> If you restart FF immediately (2-3 sec) after closing it, you will see the
> message "Firefox is already running, but is not responding."

It seems stuck in WinWifiScanner::GetAccessPointsFromWLAN which is run on 5 separate threads.
Blocks: 985970
Flags: needinfo?(tabraldes)
> yoric - Do you intend to take this bug or do we need to find someone on the
> necko team to take it?

I have no domain knowledge of the issue, so I'd rather leave someone with a better grasp of WinWifiScanner handle the situation. If we find out that we need to port the code to AsyncShutdown, I'll try and help.
Flags: needinfo?(dteller)
The work that I did here was sort of a one-off project so I don't fully recall the details of how this is all put together, but it looks like `WlanRegisterNotification` is stuck trying to unregister for notifications. The documentation [1] mentions that this can become an issue if you call `WlanRegisterNotification` from a callback, which we don't do. Also, neither `OnScanComplete` nor `InterfaceScanCallbackData::OnInterfaceScanComplete` appears in any of the call stacks in the crash report so it doesn't seem like that's our problem. We only ever call `WlanRegisterNotification` to unregister for events in one place [2], so it shouldn't be too hard to track down what's happening if we can consistently reproduce.

It looks like `WinWifiScanner::GetAccessPointsFromWLAN` was not designed to be called from multiple threads, and really it should be impossible for multiple Wifi scans to occur simultaneously. We should probably add some logic to abort a Wifi scan if one is already taking place (this likely makes sense on all platforms).

If it turns out that `WlanRegisterNotification` can get stuck trying to unregister even without a callback running and without any interference from other threads trying to perform simultaneous Wifi scans, then we might try aborting our Wifi scan if the scans for individual interfaces don't complete in time [3]. If we do this and we don't trust `WlanRegisterNotification` to unregister our notifications for us, we'll have to use `WlanCloseHandle` instead.

I don't have a Window installation anymore so it'll be pretty difficult for me to work on this. Is anyone else able to pick this up? I can provide guidance as needed.

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/ms706771%28v=vs.85%29.aspx
[2] https://hg.mozilla.org/mozilla-central/annotate/4d2d97b3ba34/netwerk/wifi/win_wifiScanner.cpp#l145
[3] https://hg.mozilla.org/mozilla-central/annotate/4d2d97b3ba34/netwerk/wifi/win_wifiScanner.cpp#l140
Flags: needinfo?(tabraldes)
Jason - Can you help find someone to work on this top crash during the 38 time frame?
Flags: needinfo?(jduell.mcbugs)
Comment hidden (obsolete)
(In reply to Honza Bambas (:mayhemer) from comment #10)
> Any potential relation to
> https://groups.google.com/forum/#!topic/mozilla.dev.platform/uTNODUOFi3E ?

Err... different API.
Flags: needinfo?(tabraldes)

Updated

4 years ago
Assignee: nobody → mcmanus
Flags: needinfo?(jduell.mcbugs)
Assignee

Updated

4 years ago
Summary: crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | nsThread::Shutdown() → WinWifiScanner::GetAccessPointsFromWLAN hang | crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | nsThread::Shutdown()
Assignee

Comment 12

4 years ago
Theory:

nsWifiMonitor.StartWatching(L1 [aka listener 1])
 - mThread is null, makes a new thread.. that thread calls nsWifiMonitor.Run, which calls DoScan which makes a new WinWifiScanner (which opens the dll and opens a single handle), and then calls GetAccessPointsFromWLAN which calls into the windows library - those calls naturally take a while to execute on the new thread. So this is the stack we see on the hangs. Notably nsWifiMontior.mReentrantMonitor is not held during GetAccessPointsFrom WLAN

nsWifiMonitor.StopWatching(L1)
 - removes L1, marks KeepGoing false and sets mThread to nullptr. But mThread is still running and will be doing GetAccessPointsFromWLAN for a bit

nsWifiMonitor.StartWatching(L2)
 - mThread is null again so it makes a new thread.. that thread calls .Run() which calls DoScan() as before. So we've got a new thread, but the same nwWifiMonitor - this time mWinWifiMonitor is not null so it reuses the old dll handle and then calls GetAccessPointsFromWLAN using that handle for the windows library calls. But the L1 code is still running on the original thread in the same library with the same handle I suspect that makes stuff hang.

The SettingsCallback function of the NetworkGeoLocationProvider ( https://mxr.mozilla.org/mozilla-central/source/dom/system/NetworkGeolocationProvider.js#299 ) looks like it does this quick stop/start pattern that would trigger this.

Tim, do you think that makes sense? I'm sure I can come up with a theoretical fix, I may need help verifying it.
Flags: needinfo?(tabraldes)
Assignee

Comment 13

4 years ago
Loic, this is wfm but you seem to be able to reproduce from comment 2. I have what seems like a fix available, and I have smoke tested it, can you try and verify?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a236168c6d30
Flags: needinfo?(epinal99-bugzilla2)
Assignee

Comment 14

4 years ago
Attachment #8587963 - Flags: review?(hurley)

Comment 15

4 years ago
(In reply to Patrick McManus [:mcmanus] from comment #13)
> Loic, this is wfm but you seem to be able to reproduce from comment 2. I
> have what seems like a fix available, and I have smoke tested it, can you
> try and verify?
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=a236168c6d30

I tried the current Nightly, it crashed.
I tried your build 2 times (http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mcmanus@ducksong.com-a236168c6d30/try-win32/firefox-40.0a1.en-US.win32.zip), no crash after closing the browser.
Flags: needinfo?(epinal99-bugzilla2)
Assignee

Comment 16

4 years ago
loic - awesome! thanks.
(In reply to Patrick McManus [:mcmanus] from comment #12)
> Theory:
> 
> nsWifiMonitor.StartWatching(L1 [aka listener 1])
>  - mThread is null, makes a new thread.. that thread calls
> nsWifiMonitor.Run, which calls DoScan which makes a new WinWifiScanner
> (which opens the dll and opens a single handle), and then calls
> GetAccessPointsFromWLAN which calls into the windows library - those calls
> naturally take a while to execute on the new thread. So this is the stack we
> see on the hangs. Notably nsWifiMontior.mReentrantMonitor is not held during
> GetAccessPointsFrom WLAN
> 
> nsWifiMonitor.StopWatching(L1)
>  - removes L1, marks KeepGoing false and sets mThread to nullptr. But
> mThread is still running and will be doing GetAccessPointsFromWLAN for a bit
> 
> nsWifiMonitor.StartWatching(L2)
>  - mThread is null again so it makes a new thread.. that thread calls .Run()
> which calls DoScan() as before. So we've got a new thread, but the same
> nwWifiMonitor - this time mWinWifiMonitor is not null so it reuses the old
> dll handle and then calls GetAccessPointsFromWLAN using that handle for the
> windows library calls. But the L1 code is still running on the original
> thread in the same library with the same handle I suspect that makes stuff
> hang.
> 
> The SettingsCallback function of the NetworkGeoLocationProvider (
> https://mxr.mozilla.org/mozilla-central/source/dom/system/
> NetworkGeolocationProvider.js#299 ) looks like it does this quick stop/start
> pattern that would trigger this.
> 
> Tim, do you think that makes sense? I'm sure I can come up with a
> theoretical fix, I may need help verifying it.

That sounds pretty plausible to me!
Flags: needinfo?(tabraldes)

Updated

4 years ago
Attachment #8587963 - Flags: review?(hurley) → review+
https://hg.mozilla.org/mozilla-central/rev/4b51391dc2a1
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Assignee

Comment 20

4 years ago
Comment on attachment 8587963 [details] [diff] [review]
one wifi monitor thread

Approval Request Comment
[Feature/regressing bug #]:985970
[User impact if declined]: shutdown hangs and inability of firefox to quickly restart
[Describe test coverage new/current, TreeHerder]:this impacts geolocation and that has existing treeherder regression coverage
[Risks and why]: the only real risk here is that I don't have geolocation domain knowledge for testing it wrt regressions (though I don't anticipate any) based on the code change - perhaps wesley hardman from bug 985970 can verify? I'm confident the reported crash is resolved.
[String/UUID change made/needed]: none

This patch only applies to WinWifiScanner crashes in shutdownhang - I suspect there are multiple completely unrelated causes of shutdownhang.
Attachment #8587963 - Flags: approval-mozilla-beta?
Attachment #8587963 - Flags: approval-mozilla-aurora?
Comment on attachment 8587963 [details] [diff] [review]
one wifi monitor thread

All shutdown hangs fixes are good to take... Especially that early in the cycle. Should be in 38 beta 2 or 3.
Attachment #8587963 - Flags: approval-mozilla-beta?
Attachment #8587963 - Flags: approval-mozilla-beta+
Attachment #8587963 - Flags: approval-mozilla-aurora?
Attachment #8587963 - Flags: approval-mozilla-aurora+
Flags: qe-verify+
I couldn't reproduce using the steps from comment 2, but accordingly to Socorro there are still crashes with this signature after the fix landed with on 40.0a1 - 20150409030206 and 39.0a2 - 20150409004007.
Assignee

Comment 25

4 years ago
(In reply to Petruta Rasa [QA] [:petruta] from comment #24)
> I couldn't reproduce using the steps from comment 2, but accordingly to
> Socorro there are still crashes with this signature after the fix landed
> with on 40.0a1 - 20150409030206 and 39.0a2 - 20150409004007.

can you link to particular crash stack reports? It really helps.

Comment 27

4 years ago
I'm pretty sure there are multiple different issues and stacks that lead to this same signature and that we only fixed one of them here. If w3e see a reduction due to this patch, then we're good. Other cases should be filed separately.
Assignee

Comment 28

4 years ago
I looked at all the reports in comment 26 and I don't think the wifiscanner (this bug) is part of any of them. There are certainly other sources of shutdown hangs still to be addressed in different bugs.

Only one of them https://crash-stats.mozilla.com/report/index/fa610c4e-3805-4a46-99b7-be9d92150410#allthreads has the wifiscanner thread at all, and afaict the shutdown event has not been observed for the wifi scanner yet in that case because the main thread is hung trying to shutdown the url-classifier service.
Assignee

Comment 29

4 years ago
a bug for the classifier shutdown:
https://bugzilla.mozilla.org/show_bug.cgi?id=1154386
Thanks Robert and Patrick. Marking as verified based on above comments.
Assignee

Comment 32

4 years ago
(In reply to nhgastro from comment #31)
> Hi,
> Still have this bug with FFv39.
> 

thanks for the crash reports - but its the wrong bug. This bug is about the wifi scanner which is not in your reports (and seems to be verified fixed).

This might overlap with 1183477 - I'll put a note in there.
You need to log in before you can comment on or make changes to this bug.