Closed Bug 634084 Opened 14 years ago Closed 13 years ago

Assertion failure: ready > 0 Crash [@ _PR_MD_PR_POLL PR_Poll nsSocketTransportService::Poll(int, unsigned int*) nsSocketTransportService::DoPollIteration(int)]

Categories

(Core :: Networking, defect)

x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 674147
mozilla7

People

(Reporter: bc, Assigned: mayhemer)

References

Details

(Keywords: crash)

Attachments

(6 files, 2 obsolete files)

The crash automation has been getting Assertion failure: ready > 0, at c:/work/mozilla/builds/2.0.0/mozilla/nsprpub/pr/src/md/windows/w32poll.c:322 on 32 bit Windows 7 for the following urls (reproduced today). Note that I haven't seen this with 32 bit builds on 64bit Windows 7. http://drlunswe.blogspot.com/ https://addons.mozilla.org/pt-br/firefox/addon/switch-to-tab-no-more/ https://addons.mozilla.org/en-US/firefox/addon/lavafox-v1-green/ https://addons.mozilla.org/es-ES/firefox/addon/spanish-mexico-language-pack/ https://addons.mozilla.org/firefox/addon/6366 https://addons.mozilla.org/pt-br/firefox/addon/video-downloadhelper/ https://addons.mozilla.org/ru/firefox/addon/3006 https://services.addons.mozilla.org/en-US/firefox/discovery/addon/add-on-compatibility-reporter/ https://services.addons.mozilla.org/ru/firefox/discovery/addon/1865/ https://addons.mozilla.org/en-US/firefox/addon/2464 This appears to occur during shutdown after loading the page. Example crash reports Operating system: Windows NT 6.1.7600 CPU: x86 GenuineIntel family 6 model 44 stepping 2 2 CPUs Crash reason: EXCEPTION_BREAKPOINT Crash address: 0x759c22a1 Thread 8 (crashed) 0 KERNELBASE.dll + 0x122a1 eip = 0x759c22a1 esp = 0x0451c51c ebp = 0x0451c520 ebx = 0x003b8150 esi = 0x00000000 edi = 0x00000000 eax = 0x00000000 ecx = 0xd692d71c edx = 0x736c1d48 efl = 0x00000212 Found by: given as instruction pointer in context 1 nspr4.dll!_PR_MD_PR_POLL [w32poll.c : 322 + 0x1b] eip = 0x735968d0 esp = 0x0451c528 ebp = 0x0451f5bc Found by: previous frame's frame pointer 2 nspr4.dll!PR_Poll [prio.c : 173 + 0x10] eip = 0x73586f74 esp = 0x0451f5c4 ebp = 0x0451f5d0 Found by: call frame info 3 xul.dll!nsSocketTransportService::Poll(int,unsigned int *) [nsSocketTransportService2.cpp : 358 + 0x11] eip = 0x6f9f5ca2 esp = 0x0451f5d8 ebp = 0x0451f600 Found by: call frame info 4 xul.dll!nsSocketTransportService::DoPollIteration(int) [nsSocketTransportService2.cpp : 673 + 0xf] eip = 0x6f9f6987 esp = 0x0451f608 ebp = 0x0451f650 Found by: call frame info 5 xul.dll!nsSocketTransportService::OnProcessNextEvent(nsIThreadInternal *,int,unsigned int) [nsSocketTransportService2.cpp : 552 + 0xc] eip = 0x6f9f650d esp = 0x0451f658 ebp = 0x0451f660 Found by: call frame info Operating system: Windows NT 6.1.7600 CPU: x86 GenuineIntel family 6 model 44 stepping 2 1 CPU Crash reason: EXCEPTION_BREAKPOINT Crash address: 0x7541194b Thread 8 (crashed) 0 KERNELBASE.dll + 0x3194b eip = 0x7541194b esp = 0x0326c7c0 ebp = 0x0326c7c4 ebx = 0x003a50c8 esi = 0x00000000 edi = 0x00000000 eax = 0x00000000 ecx = 0x37b8f996 edx = 0x6f0f1d48 efl = 0x00000202 Found by: given as instruction pointer in context 1 nspr4.dll!_PR_MD_PR_POLL [w32poll.c : 322 + 0x1b] eip = 0x6efc68d0 esp = 0x0326c7cc ebp = 0x0326f860 Found by: previous frame's frame pointer 2 nspr4.dll!PR_Poll [prio.c : 173 + 0x10] eip = 0x6efb6f74 esp = 0x0326f868 ebp = 0x0326f874 Found by: call frame info 3 xul.dll!nsSocketTransportService::Poll(int,unsigned int *) [nsSocketTransportService2.cpp : 358 + 0x11] eip = 0x66065ca2 esp = 0x0326f87c ebp = 0x0326f8a4 Found by: call frame info 4 xul.dll!nsSocketTransportService::DoPollIteration(int) [nsSocketTransportService2.cpp : 673 + 0xf] eip = 0x66066987 esp = 0x0326f8ac ebp = 0x0326f8f4 Found by: call frame info 5 xul.dll!nsSocketTransportService::OnProcessNextEvent(nsIThreadInternal *,int,unsigned int) [nsSocketTransportService2.cpp : 552 + 0xc] eip = 0x6606650d esp = 0x0326f8fc ebp = 0x0326f904 Found by: call frame info On Windows XP the crash automation has seen ABORT: imgCacheEntry release isn't thread-safe! with a related stack. See Bug 529949 https://addons.mozilla.org/en-US/firefox/addon/prospector-home-dash/ Assertion failure: ready > 0, at c:/work/mozilla/builds/2.0.0/mozilla/nsprpub/pr/src/md/windows/w32poll.c:322 ###!!! ASSERTION: nsScriptCacheCleaner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file c:/work/mozilla/builds/2.0.0/mozilla/content/base/src/nsFrameMessageManager.cpp, line 729 ###!!! ABORT: imgCacheEntry release isn't thread-safe!: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file c:\work\mozilla\builds\2.0.0\mozilla\modules\libpr0n\src\imgLoader.h, line 85 (minidump_stackwalk crashed and was unable to provide a stack) https://addons.mozilla.org/pt-br/firefox/addon/switch-to-tab-no-more/ Assertion failure: ready > 0, at c:/work/mozilla/builds/2.0.0/mozilla/nsprpub/pr/src/md/windows/w32poll.c:322 nsStringStats => mAllocCount: 16373 => mReallocCount: 1694 => mFreeCount: 8758 -- LEAKED 7615 !!! => mShareCount: 23106 => mAdoptCount: 1643 => mAdoptFreeCount: 1629 -- LEAKED 14 !!! ###!!! ASSERTION: nsScriptCacheCleaner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file c:/work/mozilla/builds/2.0.0/mozilla/content/base/src/nsFrameMessageManager.cpp, line 729 ###!!! ABORT: imgCacheEntry release isn't thread-safe!: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file c:\work\mozilla\builds\2.0.0\mozilla\modules\libpr0n\src\imgLoader.h, line 85 Operating system: Windows NT 5.1.2600 Service Pack 3 CPU: x86 GenuineIntel family 6 model 44 stepping 2 1 CPU Crash reason: EXCEPTION_BREAKPOINT Crash address: 0x7c90120e Thread 6 (crashed) 0 ntdll.dll + 0x120e eip = 0x7c90120e esp = 0x0264cc64 ebp = 0x0264cc68 ebx = 0x017ec2c0 esi = 0x00000000 edi = 0x00000000 eax = 0x00000000 ecx = 0x21d847c8 edx = 0x003b3d38 efl = 0x00000202 Found by: given as instruction pointer in context 1 nspr4.dll!_PR_MD_PR_POLL [w32poll.c : 322 + 0x1b] eip = 0x00c468d0 esp = 0x0264cc70 ebp = 0x0264fd04 Found by: previous frame's frame pointer 2 nspr4.dll!PR_Poll [prio.c : 173 + 0x10] eip = 0x00c36f74 esp = 0x0264fd0c ebp = 0x0264fd18 Found by: call frame info 3 xul.dll!nsSocketTransportService::Poll(int,unsigned int *) [nsSocketTransportService2.cpp : 358 + 0x11] eip = 0x106d5ca2 esp = 0x0264fd20 ebp = 0x0264fd48 Found by: call frame info 4 xul.dll!nsSocketTransportService::DoPollIteration(int) [nsSocketTransportService2.cpp : 673 + 0xf] eip = 0x106d6987 esp = 0x0264fd50 ebp = 0x0264fd98 Found by: call frame info 5 xul.dll!nsSocketTransportService::OnProcessNextEvent(nsIThreadInternal *,int,unsigned int) [nsSocketTransportService2.cpp : 552 + 0xc] eip = 0x106d650d esp = 0x0264fda0 ebp = 0x0264fda8 Found by: call frame info Note the following does appear in the console for the addons.mozilla.org crashes: api-secure.recaptcha.net : server does not support RFC 5746, see CVE-2009-3555 I haven't been able to reproduce these crashes locally with Windows XP or Vista and I don't have Windows 7 locally. I have occasionally experienced shutdown hangs, bug haven't been able to learn much from them. Note that each of these urls came from the nightly crash dumps which means our users are experiencing some kind of crash behavior. For the addons related crashes, I expect they crash when restarting after installing the addon.
FYI, I see Assertion failure: ready > 0 on the following domains: addons.mozilla.org, blogs.ubc.ca, boh.com, drlunswe.blogspot.com, forum.smallworlds.com, listen.grooveshark.com, saladecristo.webnode.it, services.addons.mozilla.org, thatguywiththeglasses.com, www.biggby.com, www.diziizleyelim.com, www.fightlinker.com, www.girlsaskguys.com, www.htcsense.com, www.izmitliyizbiz.com, www.kvfan.net, www.macromedia.com, www.minecraft.net, www.podpocalypse.com, www.premiumissuer.com, www.pwrshow.com, www.scribd.com, www.tnawrestling.com, www.vidyoara.com, www.wizard101.com
Attached file example crash report
I see this on literally hundreds of urls on Windows during crash testing. Operating system: Windows NT 6.1.7601 Service Pack 1 CPU: x86 GenuineIntel family 6 model 44 stepping 2 2 CPUs Crash reason: EXCEPTION_BREAKPOINT Crash address: 0x76b722a1 Thread 8 (crashed) 0 KERNELBASE.dll + 0x122a1 eip = 0x76b722a1 esp = 0x0446cacc ebp = 0x0446cad0 ebx = 0x005663c8 esi = 0x00000000 edi = 0x00000000 eax = 0x00000000 ecx = 0x35dce649 edx = 0x72071d48 efl = 0x00000212 Found by: given as instruction pointer in context 1 nspr4.dll!_PR_MD_PR_POLL [w32poll.c : 322 + 0x1b] eip = 0x72fc68d0 esp = 0x0446cad8 ebp = 0x0446fb6c Found by: previous frame's frame pointer 2 nspr4.dll!PR_Poll [prio.c : 173 + 0x10] eip = 0x72fb6f74 esp = 0x0446fb74 ebp = 0x0446fb80 Found by: call frame info 3 xul.dll!nsSocketTransportService::Poll(int,unsigned int *) [nsSocketTransportService2.cpp : 355 + 0x11] eip = 0x69dd02d2 esp = 0x0446fb88 ebp = 0x0446fbb0 Found by: call frame info 4 xul.dll!nsSocketTransportService::DoPollIteration(int) [nsSocketTransportService2.cpp : 668 + 0xf] eip = 0x69dd1227 esp = 0x0446fbb8 ebp = 0x0446fbf0 Found by: call frame info nsprpub/pr/src/md/windows/w32poll.c 284 /* 285 ** Now to unravel the select sets back into the client's poll 286 ** descriptor list. Is this possibly an area for pissing away 287 ** a few cycles or what? 288 */ 289 if (ready > 0) 290 { 291 ready = 0; 292 for (pd = pds, epd = pd + npds; pd < epd; pd++) 293 { 294 PRInt16 out_flags = 0; 295 if ((NULL != pd->fd) && (0 != pd->in_flags)) 296 { 297 SOCKET osfd; 298 bottom = PR_GetIdentitiesLayer(pd->fd, PR_NSPR_IO_LAYER); 299 PR_ASSERT(NULL != bottom); 300 301 osfd = (SOCKET) bottom->secret->md.osfd; 302 303 if (FD_ISSET(osfd, &rd)) 304 { 305 if (pd->out_flags & _PR_POLL_READ_SYS_READ) 306 out_flags |= PR_POLL_READ; 307 if (pd->out_flags & _PR_POLL_WRITE_SYS_READ) 308 out_flags |= PR_POLL_WRITE; 309 } 310 if (FD_ISSET(osfd, &wt)) 311 { 312 if (pd->out_flags & _PR_POLL_READ_SYS_WRITE) 313 out_flags |= PR_POLL_READ; 314 if (pd->out_flags & _PR_POLL_WRITE_SYS_WRITE) 315 out_flags |= PR_POLL_WRITE; 316 } 317 if (FD_ISSET(osfd, &ex)) out_flags |= PR_POLL_EXCEPT; 318 } 319 pd->out_flags = out_flags; 320 if (out_flags) ready++; 321 } This assert is the one firing 322 PR_ASSERT(ready > 0); 323 } 324 else if (ready == SOCKET_ERROR) 325 { 326 err = WSAGetLastError(); 327 if (err == WSAENOTSOCK) 328 { 329 /* Find the bad fds */ 330 int optval; 331 int optlen = sizeof(optval); 332 ready = 0; 333 for (pd = pds, epd = pd + npds; pd < epd; pd++) 334 { 335 pd->out_flags = 0; 336 if ((NULL != pd->fd) && (0 != pd->in_flags)) 337 { 338 bottom = PR_GetIdentitiesLayer(pd->fd, PR_NSPR_IO_LAYER); 339 if (getsockopt(bottom->secret->md.osfd, SOL_SOCKET, 340 SO_TYPE, (char *) &optval, &optlen) == -1) 341 { 342 PR_ASSERT(WSAGetLastError() == WSAENOTSOCK); 343 if (WSAGetLastError() == WSAENOTSOCK) 344 { 345 pd->out_flags = PR_POLL_NVAL; 346 ready++; 347 } 348 } 349 } 350 } 351 PR_ASSERT(ready > 0); 352 } Either the loop isn't executed at all or out_flags is never set inside the loop. What is the effect if _PR_MD_PR_POLL incorrectly returns 0 when it should return non-zero? Does it warrant a fatal assertion?
Severity: normal → critical
Summary: Assertion failure: ready > 0 @ KERNELBASE.dll@0x122a1 _PR_MD_PR_POLL PR_Poll nsSocketTransportService::Poll(int, unsigned int*) nsSocketTransportService::DoPollIteration(int) → Assertion failure: ready > 0 Crash [@ _PR_MD_PR_POLL PR_Poll nsSocketTransportService::Poll(int, unsigned int*) nsSocketTransportService::DoPollIteration(int)]
Hi Bob, Thanks for updating the bug. I just want to check if you're seeing a recent change in behavior, or this has remained more or less the same since first reported. (I want to make sure d527ceffd806 from 6/15 is not a factor - it shouldn't be.)
No, not really a change. It has been pretty constant in the automation but I haven't been able to reproduce locally though I didn't try this weekend. It might need a fast internet connection.
I just wanted to let you know this is one of those cases where I keep seeing the same fatal assertion over and over. I've seen this so far on 623 different urls from 63 domains including: https://addons.mozilla.org/ https://developer.mozilla.org/ https://services.addons.mozilla.org/ The cost to the automation crash testing is that there may be other more important issues being hidden by this fatal assertion.
Bob, to be clear, this is still happening after bug 658580 has been landed? Do we have any estimation on the regression range? Is there a way for us to run the crash automation as e.g. a try server push? If this can be regularly reproduced we may add more assertions to figure out what condition and where fails.
(In reply to comment #6) > Bob, to be clear, this is still happening after bug 658580 has been landed? yes. > Do we have any estimation on the regression range? Not really but it exists in the Gecko 2.0 branch so it is not recent. > Is there a way for us to run the crash automation as e.g. a try server push? > No. > If this can be regularly reproduced we may add more assertions to figure out > what condition and where fails. It is pretty common in the automation. As soon as my morning windows builds finish I'll try to reproduce locally again. More assertions to narrow it down would be nice.
I might have a patch with added assertions today, but more likely tomorrow. Patrick, would you be willing to review it quickly?
(In reply to comment #8) > I might have a patch with added assertions today, but more likely tomorrow. > > Patrick, would you be willing to review it quickly? yes.. sorry I was out-of-'office' yesterday. this would be good to understand.
Attached patch debug assertions, v1 (obsolete) — Splinter Review
Attachment #541448 - Flags: review?(mcmanus)
Comment on attachment 541448 [details] [diff] [review] debug assertions, v1 This is fine. My only suggestion would be that this new assert: >+#ifdef DEBUG [.] >+ PR_ASSERT(ready == has_been_set); >+#endif is pretty much guaranteed to fire in the same situations Bob already reports and it doesn't provide as much info as the ones a few lines later on that you just added (and it will obscure them by firing first). maybe a NS_WARNING instead?
Attachment #541448 - Flags: review?(mcmanus) → review+
The patch completely killed the try run. I tried to run it locally, I could reproduce with 100%. But this is unfortunately not this bug caught, it just shows PR_ASSERT(ready == has_been_set) check is wrong. select Win32 function (and linux as well, actually) doesn't return the number of sockets that has set at least one flag, but it returns the number of flags set on all sockets. So, if we get READ and WRITE flags on a single socket (i.e. contained in both read and write sets), select returns 2.
Attachment #541448 - Attachment is obsolete: true
Attachment #541701 - Flags: review?(mcmanus)
Attachment #541448 - Flags: review+
Comment on attachment 541701 [details] [diff] [review] debug assertions, v2 Green.
Comment on attachment 541701 [details] [diff] [review] debug assertions, v2 Review of attachment 541701 [details] [diff] [review]: ----------------------------------------------------------------- instead of checking this in, is it possible for bob to just run the try build? If not, then let's check it in.
Attachment #541701 - Flags: review?(mcmanus) → review+
It's easier for me if you go ahead and check it in if you don't mind.
Thanks for the review, Patrick. Landed on inbound: http://hg.mozilla.org/integration/mozilla-inbound/rev/235d9fac4881
Whiteboard: [inbound]
Assignee: nobody → honzab.moz
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla7
This isn't really fixed. The patch was a diagnostic to help determine the root issue.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bob, thanks. Ehsan, should I mark the patch or the bug somehow next time to make it more visible there was just a debugging patch landed instead of a real fix if this happens more often?
Comment on attachment 541701 [details] [diff] [review] debug assertions, v2 Yes, you can mark it like this. Sorry for marking the bug as fixed by mistake.
Attachment #541701 - Flags: checkin+
With the new patch I got: No ready socket but should be at least one, dumping: fd=009A04A0, in=1, out=0, osfd=500 / rd:0 wt:0, ex:0 fd=066E1A90, in=5, out=0, osfd=3cc / rd:0 wt:0, ex:0 fd=069AD768, in=5, out=0, osfd=300 / rd:0 wt:0, ex:0 fd=06A8C908, in=5, out=0, osfd=308 / rd:0 wt:0, ex:0 fd=069E9AB8, in=5, out=0, osfd=330 / rd:0 wt:0, ex:0 fd=06A87100, in=5, out=0, osfd=328 / rd:0 wt:0, ex:0 fd=069C7C30, in=5, out=0, osfd=320 / rd:0 wt:0, ex:0 fd=068499B0, in=5, out=0, osfd=318 / rd:0 wt:0, ex:0 fd=06A8DD78, in=5, out=0, osfd=310 / rd:0 wt:0, ex:0 Assertion failure: ready > 0, at c:/work/mozilla/builds/nightly/mozilla/nsprpub/pr/src/md/windows/w32poll.c:355
Bob, can you attach the complete log reference please?
Attached file log download-statusbar (obsolete) —
Attached file log save-images
Honza, I still hit Assertion failure: ready > 0 at nsprpub/pr/src/md/windows/w32poll.c:355 with the same kind of dump as in comment 22. Is there anything else we can do?
Honza, this fatal assertion is a real drag on crash testing. I've been hitting this thing for almost 6 months. It prevents the testing of other, possibly more important, crashes. Can we just make this a non-fatal assertion?
Honza: do you still need your NSPR patch "debug assertions, v2" (attachment 541701 [details] [diff] [review]) in mozilla-central? It was overwritten by the latest NSPR update (http://hg.mozilla.org/mozilla-central/rev/d0c53c6ed070). If you need the patch, let's do it properly: 1. Create a nsprpub/patches directory. 2. Add nsprpub/patches/README to document all the NSPR patches. 3. Add the patch files to nsprpub/patches. 4. Add a step to the NSPR/NSS update procedure at https://developer.mozilla.org/en/Updating_NSPR_or_NSS_in_mozilla-central to check the nsprpub/patches directory, similar to Step 4 for the security/patches directory.
Thank you. I don't need the patch, feel free to overwrite it. What I need is to find time and try to reproduce/fix locally. In comment 30 Bob asked about ability to temporarily remove the assertion from the windows NSPR polling function because it breaks the tests quit often and prevents more important stuff get tested. Wan-teh, would you agree with that?
Honza: PR_ASSERT is a fatal assertion. You can replace it with PR_LOG, or you can imitate the "debug-patch" in bug 662557 and control the PR_ASSERT with an environment variable.
Wan-teh: in other words, you don't agree. Bob: I'll try to find time ASAP to figure this out.
Status: REOPENED → ASSIGNED
I know the cause. Could you please try and apply patch from bug 674147? If crashes stop, then it proves the SSL thread switches the NSPR layer socket during the time we wait in _MD_SELECT. With it it also switches the osdf handle and it is then not found in the resulting list of select(). Another reason to remove the ssl thread. I discovered this during debug tests of SPDY patch 0.
Confirmed in the debugger that we put the nsSSLIOLayerHelpers::mSharedPollableEvent osfd to select() lists but after return from select() the NSPR layer is switched back to the original socket with a different osfd. I think this is dup of or dep on bug 674147.
bsmith: it isn't clear to me what patches I need to apply from bug 674147.
Bob, I guess Honza means that he wants you to try out things with that WIP patch to see if the crashes stop. Don't be surprised if that patch breaks other things.
Depends on: 674147
Bug 674147 landed. Please let us know if we hit this assertion again.
I think this is fixed. I've only seen this on beta (55 times) and aurora (8 times) since Dec 6.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: