Closed Bug 634084 Opened 13 years ago Closed 13 years ago

Assertion failure: ready > 0 Crash [@ _PR_MD_PR_POLL PR_Poll nsSocketTransportService::Poll(int, unsigned int*) nsSocketTransportService::DoPollIteration(int)]

Categories

(Core :: Networking, defect)

x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 674147
mozilla7

People

(Reporter: bc, Assigned: mayhemer)

References

Details

(Keywords: crash)

Attachments

(6 files, 2 obsolete files)

The crash automation has been getting

Assertion failure: ready > 0, at c:/work/mozilla/builds/2.0.0/mozilla/nsprpub/pr/src/md/windows/w32poll.c:322

on 32 bit Windows 7 for the following urls (reproduced today). Note that I haven't seen this with 32 bit builds on 64bit Windows 7.

http://drlunswe.blogspot.com/
https://addons.mozilla.org/pt-br/firefox/addon/switch-to-tab-no-more/
https://addons.mozilla.org/en-US/firefox/addon/lavafox-v1-green/
https://addons.mozilla.org/es-ES/firefox/addon/spanish-mexico-language-pack/
https://addons.mozilla.org/firefox/addon/6366
https://addons.mozilla.org/pt-br/firefox/addon/video-downloadhelper/
https://addons.mozilla.org/ru/firefox/addon/3006
https://services.addons.mozilla.org/en-US/firefox/discovery/addon/add-on-compatibility-reporter/
https://services.addons.mozilla.org/ru/firefox/discovery/addon/1865/
https://addons.mozilla.org/en-US/firefox/addon/2464

This appears to occur during shutdown after loading the page.

Example crash reports

Operating system: Windows NT
                  6.1.7600 
CPU: x86
     GenuineIntel family 6 model 44 stepping 2
     2 CPUs

Crash reason:  EXCEPTION_BREAKPOINT
Crash address: 0x759c22a1

Thread 8 (crashed)
 0  KERNELBASE.dll + 0x122a1
    eip = 0x759c22a1   esp = 0x0451c51c   ebp = 0x0451c520   ebx = 0x003b8150
    esi = 0x00000000   edi = 0x00000000   eax = 0x00000000   ecx = 0xd692d71c
    edx = 0x736c1d48   efl = 0x00000212
    Found by: given as instruction pointer in context
 1  nspr4.dll!_PR_MD_PR_POLL [w32poll.c : 322 + 0x1b]
    eip = 0x735968d0   esp = 0x0451c528   ebp = 0x0451f5bc
    Found by: previous frame's frame pointer
 2  nspr4.dll!PR_Poll [prio.c : 173 + 0x10]
    eip = 0x73586f74   esp = 0x0451f5c4   ebp = 0x0451f5d0
    Found by: call frame info
 3  xul.dll!nsSocketTransportService::Poll(int,unsigned int *) [nsSocketTransportService2.cpp : 358 + 0x11]
    eip = 0x6f9f5ca2   esp = 0x0451f5d8   ebp = 0x0451f600
    Found by: call frame info
 4  xul.dll!nsSocketTransportService::DoPollIteration(int) [nsSocketTransportService2.cpp : 673 + 0xf]
    eip = 0x6f9f6987   esp = 0x0451f608   ebp = 0x0451f650
    Found by: call frame info
 5  xul.dll!nsSocketTransportService::OnProcessNextEvent(nsIThreadInternal *,int,unsigned int) [nsSocketTransportService2.cpp : 552 + 0xc]
    eip = 0x6f9f650d   esp = 0x0451f658   ebp = 0x0451f660
    Found by: call frame info


Operating system: Windows NT
                  6.1.7600 
CPU: x86
     GenuineIntel family 6 model 44 stepping 2
     1 CPU

Crash reason:  EXCEPTION_BREAKPOINT
Crash address: 0x7541194b

Thread 8 (crashed)
 0  KERNELBASE.dll + 0x3194b
    eip = 0x7541194b   esp = 0x0326c7c0   ebp = 0x0326c7c4   ebx = 0x003a50c8
    esi = 0x00000000   edi = 0x00000000   eax = 0x00000000   ecx = 0x37b8f996
    edx = 0x6f0f1d48   efl = 0x00000202
    Found by: given as instruction pointer in context
 1  nspr4.dll!_PR_MD_PR_POLL [w32poll.c : 322 + 0x1b]
    eip = 0x6efc68d0   esp = 0x0326c7cc   ebp = 0x0326f860
    Found by: previous frame's frame pointer
 2  nspr4.dll!PR_Poll [prio.c : 173 + 0x10]
    eip = 0x6efb6f74   esp = 0x0326f868   ebp = 0x0326f874
    Found by: call frame info
 3  xul.dll!nsSocketTransportService::Poll(int,unsigned int *) [nsSocketTransportService2.cpp : 358 + 0x11]
    eip = 0x66065ca2   esp = 0x0326f87c   ebp = 0x0326f8a4
    Found by: call frame info
 4  xul.dll!nsSocketTransportService::DoPollIteration(int) [nsSocketTransportService2.cpp : 673 + 0xf]
    eip = 0x66066987   esp = 0x0326f8ac   ebp = 0x0326f8f4
    Found by: call frame info
 5  xul.dll!nsSocketTransportService::OnProcessNextEvent(nsIThreadInternal *,int,unsigned int) [nsSocketTransportService2.cpp : 552 + 0xc]
    eip = 0x6606650d   esp = 0x0326f8fc   ebp = 0x0326f904
    Found by: call frame info

On Windows XP the crash automation has seen ABORT: imgCacheEntry release isn't thread-safe! with a related stack. See Bug 529949 

https://addons.mozilla.org/en-US/firefox/addon/prospector-home-dash/

Assertion failure: ready > 0, at c:/work/mozilla/builds/2.0.0/mozilla/nsprpub/pr/src/md/windows/w32poll.c:322

###!!! ASSERTION: nsScriptCacheCleaner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file c:/work/mozilla/builds/2.0.0/mozilla/content/base/src/nsFrameMessageManager.cpp, line 729
###!!! ABORT: imgCacheEntry release isn't thread-safe!: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file c:\work\mozilla\builds\2.0.0\mozilla\modules\libpr0n\src\imgLoader.h, line 85

(minidump_stackwalk crashed and was unable to provide a stack)

https://addons.mozilla.org/pt-br/firefox/addon/switch-to-tab-no-more/

Assertion failure: ready > 0, at c:/work/mozilla/builds/2.0.0/mozilla/nsprpub/pr/src/md/windows/w32poll.c:322
nsStringStats
 => mAllocCount:          16373
 => mReallocCount:         1694
 => mFreeCount:            8758  --  LEAKED 7615 !!!
 => mShareCount:          23106
 => mAdoptCount:           1643
 => mAdoptFreeCount:       1629  --  LEAKED 14 !!!
###!!! ASSERTION: nsScriptCacheCleaner not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file c:/work/mozilla/builds/2.0.0/mozilla/content/base/src/nsFrameMessageManager.cpp, line 729
###!!! ABORT: imgCacheEntry release isn't thread-safe!: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file c:\work\mozilla\builds\2.0.0\mozilla\modules\libpr0n\src\imgLoader.h, line 85

Operating system: Windows NT
                  5.1.2600 Service Pack 3
CPU: x86
     GenuineIntel family 6 model 44 stepping 2
     1 CPU

Crash reason:  EXCEPTION_BREAKPOINT
Crash address: 0x7c90120e

Thread 6 (crashed)
 0  ntdll.dll + 0x120e
    eip = 0x7c90120e   esp = 0x0264cc64   ebp = 0x0264cc68   ebx = 0x017ec2c0
    esi = 0x00000000   edi = 0x00000000   eax = 0x00000000   ecx = 0x21d847c8
    edx = 0x003b3d38   efl = 0x00000202
    Found by: given as instruction pointer in context
 1  nspr4.dll!_PR_MD_PR_POLL [w32poll.c : 322 + 0x1b]
    eip = 0x00c468d0   esp = 0x0264cc70   ebp = 0x0264fd04
    Found by: previous frame's frame pointer
 2  nspr4.dll!PR_Poll [prio.c : 173 + 0x10]
    eip = 0x00c36f74   esp = 0x0264fd0c   ebp = 0x0264fd18
    Found by: call frame info
 3  xul.dll!nsSocketTransportService::Poll(int,unsigned int *) [nsSocketTransportService2.cpp : 358 + 0x11]
    eip = 0x106d5ca2   esp = 0x0264fd20   ebp = 0x0264fd48
    Found by: call frame info
 4  xul.dll!nsSocketTransportService::DoPollIteration(int) [nsSocketTransportService2.cpp : 673 + 0xf]
    eip = 0x106d6987   esp = 0x0264fd50   ebp = 0x0264fd98
    Found by: call frame info
 5  xul.dll!nsSocketTransportService::OnProcessNextEvent(nsIThreadInternal *,int,unsigned int) [nsSocketTransportService2.cpp : 552 + 0xc]
    eip = 0x106d650d   esp = 0x0264fda0   ebp = 0x0264fda8
    Found by: call frame info

Note the following does appear in the console for the addons.mozilla.org crashes:

api-secure.recaptcha.net : server does not support RFC 5746, see CVE-2009-3555

I haven't been able to reproduce these crashes locally with Windows XP or Vista and I don't have Windows 7 locally. I have occasionally experienced shutdown hangs, bug haven't been able to learn much from them.

Note that each of these urls came from the nightly crash dumps which means our users are experiencing some kind of crash behavior. For the addons related crashes, I expect they crash when restarting after installing the addon.
FYI, I see Assertion failure: ready > 0 on the following domains:

addons.mozilla.org, blogs.ubc.ca, boh.com, drlunswe.blogspot.com, forum.smallworlds.com, listen.grooveshark.com, saladecristo.webnode.it, services.addons.mozilla.org, thatguywiththeglasses.com, www.biggby.com, www.diziizleyelim.com, www.fightlinker.com, www.girlsaskguys.com, www.htcsense.com, www.izmitliyizbiz.com, www.kvfan.net, www.macromedia.com, www.minecraft.net, www.podpocalypse.com, www.premiumissuer.com, www.pwrshow.com, www.scribd.com, www.tnawrestling.com, www.vidyoara.com, www.wizard101.com
Attached file example crash report
I see this on literally hundreds of urls on Windows during crash testing.

Operating system: Windows NT
                  6.1.7601 Service Pack 1
CPU: x86
     GenuineIntel family 6 model 44 stepping 2
     2 CPUs

Crash reason:  EXCEPTION_BREAKPOINT
Crash address: 0x76b722a1

Thread 8 (crashed)
 0  KERNELBASE.dll + 0x122a1
    eip = 0x76b722a1   esp = 0x0446cacc   ebp = 0x0446cad0   ebx = 0x005663c8
    esi = 0x00000000   edi = 0x00000000   eax = 0x00000000   ecx = 0x35dce649
    edx = 0x72071d48   efl = 0x00000212
    Found by: given as instruction pointer in context
 1  nspr4.dll!_PR_MD_PR_POLL [w32poll.c : 322 + 0x1b]
    eip = 0x72fc68d0   esp = 0x0446cad8   ebp = 0x0446fb6c
    Found by: previous frame's frame pointer
 2  nspr4.dll!PR_Poll [prio.c : 173 + 0x10]
    eip = 0x72fb6f74   esp = 0x0446fb74   ebp = 0x0446fb80
    Found by: call frame info
 3  xul.dll!nsSocketTransportService::Poll(int,unsigned int *) [nsSocketTransportService2.cpp : 355 + 0x11]
    eip = 0x69dd02d2   esp = 0x0446fb88   ebp = 0x0446fbb0
    Found by: call frame info
 4  xul.dll!nsSocketTransportService::DoPollIteration(int) [nsSocketTransportService2.cpp : 668 + 0xf]
    eip = 0x69dd1227   esp = 0x0446fbb8   ebp = 0x0446fbf0
    Found by: call frame info

nsprpub/pr/src/md/windows/w32poll.c

284     /*
    285     ** Now to unravel the select sets back into the client's poll
    286     ** descriptor list. Is this possibly an area for pissing away
    287     ** a few cycles or what?
    288     */
    289     if (ready > 0)
    290     {
    291         ready = 0;
    292         for (pd = pds, epd = pd + npds; pd < epd; pd++)
    293         {
    294             PRInt16 out_flags = 0;
    295             if ((NULL != pd->fd) && (0 != pd->in_flags))
    296             {
    297                 SOCKET osfd;
    298                 bottom = PR_GetIdentitiesLayer(pd->fd, PR_NSPR_IO_LAYER);
    299                 PR_ASSERT(NULL != bottom);
    300 
    301                 osfd = (SOCKET) bottom->secret->md.osfd;
    302 
    303                 if (FD_ISSET(osfd, &rd))
    304                 {
    305                     if (pd->out_flags & _PR_POLL_READ_SYS_READ)
    306                         out_flags |= PR_POLL_READ;
    307                     if (pd->out_flags & _PR_POLL_WRITE_SYS_READ)
    308                         out_flags |= PR_POLL_WRITE;
    309                 } 
    310                 if (FD_ISSET(osfd, &wt))
    311                 {
    312                     if (pd->out_flags & _PR_POLL_READ_SYS_WRITE)
    313                         out_flags |= PR_POLL_READ;
    314                     if (pd->out_flags & _PR_POLL_WRITE_SYS_WRITE)
    315                         out_flags |= PR_POLL_WRITE;
    316                 } 
    317                 if (FD_ISSET(osfd, &ex)) out_flags |= PR_POLL_EXCEPT;
    318             }
    319             pd->out_flags = out_flags;
    320             if (out_flags) ready++;
    321         }

This assert is the one firing

    322         PR_ASSERT(ready > 0);

    323     }
    324     else if (ready == SOCKET_ERROR)
    325     {
    326         err = WSAGetLastError();
    327         if (err == WSAENOTSOCK)
    328         {
    329             /* Find the bad fds */
    330             int optval;
    331             int optlen = sizeof(optval);
    332             ready = 0;
    333             for (pd = pds, epd = pd + npds; pd < epd; pd++)
    334             {
    335                 pd->out_flags = 0;
    336                 if ((NULL != pd->fd) && (0 != pd->in_flags))
    337                 {
    338                     bottom = PR_GetIdentitiesLayer(pd->fd, PR_NSPR_IO_LAYER);
    339                     if (getsockopt(bottom->secret->md.osfd, SOL_SOCKET,
    340                         SO_TYPE, (char *) &optval, &optlen) == -1)
    341                     {
    342                         PR_ASSERT(WSAGetLastError() == WSAENOTSOCK);
    343                         if (WSAGetLastError() == WSAENOTSOCK)
    344                         {
    345                             pd->out_flags = PR_POLL_NVAL;
    346                             ready++;
    347                         }
    348                     }
    349                 }
    350             }
    351             PR_ASSERT(ready > 0);
    352         }

Either the loop isn't executed at all or out_flags is never set inside the loop. What is the effect if _PR_MD_PR_POLL incorrectly returns 0 when it should return non-zero? Does it warrant a fatal assertion?
Severity: normal → critical
Summary: Assertion failure: ready > 0 @ KERNELBASE.dll@0x122a1 _PR_MD_PR_POLL PR_Poll nsSocketTransportService::Poll(int, unsigned int*) nsSocketTransportService::DoPollIteration(int) → Assertion failure: ready > 0 Crash [@ _PR_MD_PR_POLL PR_Poll nsSocketTransportService::Poll(int, unsigned int*) nsSocketTransportService::DoPollIteration(int)]
Hi Bob,

Thanks for updating the bug. I just want to check if you're seeing a recent change in behavior, or this has remained more or less the same since first reported. (I want to make sure d527ceffd806 from 6/15 is not a factor - it shouldn't be.)
No, not really a change. It has been pretty constant in the automation but I haven't been able to reproduce locally though I didn't try this weekend. It might need a fast internet connection.
I just wanted to let you know this is one of those cases where I keep seeing the same fatal assertion over and over. I've seen this so far on 623 different urls from 63 domains including:

https://addons.mozilla.org/
https://developer.mozilla.org/
https://services.addons.mozilla.org/

The cost to the automation crash testing is that there may be other more important issues being hidden by this fatal assertion.
Bob, to be clear, this is still happening after bug 658580 has been landed?  Do we have any estimation on the regression range?  Is there a way for us to run the crash automation as e.g. a try server push?

If this can be regularly reproduced we may add more assertions to figure out what condition and where fails.
(In reply to comment #6)
> Bob, to be clear, this is still happening after bug 658580 has been landed? 

yes.

> Do we have any estimation on the regression range?

Not really but it exists in the Gecko 2.0 branch so it is not recent.

>  Is there a way for us to run the crash automation as e.g. a try server push?
> 

No.

> If this can be regularly reproduced we may add more assertions to figure out
> what condition and where fails.

It is pretty common in the automation. As soon as my morning windows builds finish I'll try to reproduce locally again. More assertions to narrow it down would be nice.
I might have a patch with added assertions today, but more likely tomorrow.  

Patrick, would you be willing to review it quickly?
(In reply to comment #8)
> I might have a patch with added assertions today, but more likely tomorrow.  
> 
> Patrick, would you be willing to review it quickly?

yes.. sorry I was out-of-'office' yesterday.

this would be good to understand.
Attached patch debug assertions, v1 (obsolete) — Splinter Review
Pushed to try: http://tbpl.mozilla.org/?tree=Try&rev=276ead1213ee
Attachment #541448 - Flags: review?(mcmanus)
Comment on attachment 541448 [details] [diff] [review]
debug assertions, v1

This is fine.

My only suggestion would be that this new assert:

>+#ifdef DEBUG
[.]
>+        PR_ASSERT(ready == has_been_set);
>+#endif

is pretty much guaranteed to fire in the same situations Bob already reports and it doesn't provide as much info as the ones a few lines later on that you just added (and it will obscure them by firing first). maybe a NS_WARNING instead?
Attachment #541448 - Flags: review?(mcmanus) → review+
The patch completely killed the try run.  I tried to run it locally, I could reproduce with 100%.  But this is unfortunately not this bug caught, it just shows PR_ASSERT(ready == has_been_set) check is wrong.  select Win32 function (and linux as well, actually) doesn't return the number of sockets that has set at least one flag, but it returns the number of flags set on all sockets.  So, if we get READ and WRITE flags on a single socket (i.e. contained in both read and write sets), select returns 2.
Let's try this one: http://tbpl.mozilla.org/?tree=Try&rev=1e940833c028
Attachment #541448 - Attachment is obsolete: true
Attachment #541701 - Flags: review?(mcmanus)
Attachment #541448 - Flags: review+
Comment on attachment 541701 [details] [diff] [review]
debug assertions, v2

Green.
Comment on attachment 541701 [details] [diff] [review]
debug assertions, v2

Review of attachment 541701 [details] [diff] [review]:
-----------------------------------------------------------------

instead of checking this in, is it possible for bob to just run the try build? If not, then let's check it in.
Attachment #541701 - Flags: review?(mcmanus) → review+
It's easier for me if you go ahead and check it in if you don't mind.
Thanks for the review, Patrick.

Landed on inbound: http://hg.mozilla.org/integration/mozilla-inbound/rev/235d9fac4881
Whiteboard: [inbound]
http://hg.mozilla.org/mozilla-central/rev/235d9fac4881
Assignee: nobody → honzab.moz
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla7
This isn't really fixed. The patch was a diagnostic to help determine the root issue.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bob, thanks.

Ehsan, should I mark the patch or the bug somehow next time to make it more visible there was just a debugging patch landed instead of a real fix if this happens more often?
Comment on attachment 541701 [details] [diff] [review]
debug assertions, v2

Yes, you can mark it like this.  Sorry for marking the bug as fixed by mistake.
Attachment #541701 - Flags: checkin+
With the new patch I got:

No ready socket but should be at least one, dumping: 
        fd=009A04A0, in=1, out=0, osfd=500 / rd:0 wt:0, ex:0
        fd=066E1A90, in=5, out=0, osfd=3cc / rd:0 wt:0, ex:0
        fd=069AD768, in=5, out=0, osfd=300 / rd:0 wt:0, ex:0
        fd=06A8C908, in=5, out=0, osfd=308 / rd:0 wt:0, ex:0
        fd=069E9AB8, in=5, out=0, osfd=330 / rd:0 wt:0, ex:0
        fd=06A87100, in=5, out=0, osfd=328 / rd:0 wt:0, ex:0
        fd=069C7C30, in=5, out=0, osfd=320 / rd:0 wt:0, ex:0
        fd=068499B0, in=5, out=0, osfd=318 / rd:0 wt:0, ex:0
        fd=06A8DD78, in=5, out=0, osfd=310 / rd:0 wt:0, ex:0
Assertion failure: ready > 0, at c:/work/mozilla/builds/nightly/mozilla/nsprpub/pr/src/md/windows/w32poll.c:355
Bob, can you attach the complete log reference please?
Attached file log download-statusbar (obsolete) —
Attached file log save-images
Honza, I still hit Assertion failure: ready > 0 at nsprpub/pr/src/md/windows/w32poll.c:355 with the same kind of dump as in comment 22. Is there anything else we can do?
Honza, this fatal assertion is a real drag on crash testing. I've been hitting this thing for almost 6 months. It prevents the testing of other, possibly more important, crashes. Can we just make this a non-fatal assertion?
Honza: do you still need your NSPR patch "debug assertions, v2"
(attachment 541701 [details] [diff] [review]) in mozilla-central?  It was overwritten by
the latest NSPR update (http://hg.mozilla.org/mozilla-central/rev/d0c53c6ed070).

If you need the patch, let's do it properly:
1. Create a nsprpub/patches directory.
2. Add nsprpub/patches/README to document all the NSPR patches.
3. Add the patch files to nsprpub/patches.
4. Add a step to the NSPR/NSS update procedure at
https://developer.mozilla.org/en/Updating_NSPR_or_NSS_in_mozilla-central
to check the nsprpub/patches directory, similar to Step 4 for
the security/patches directory.
Thank you.  I don't need the patch, feel free to overwrite it.

What I need is to find time and try to reproduce/fix locally.

In comment 30 Bob asked about ability to temporarily remove the assertion from the windows NSPR polling function because it breaks the tests quit often and prevents more important stuff get tested.

Wan-teh, would you agree with that?
Honza: PR_ASSERT is a fatal assertion.  You can replace it
with PR_LOG, or you can imitate the "debug-patch" in
bug 662557 and control the PR_ASSERT with an environment
variable.
Wan-teh: in other words, you don't agree.

Bob: I'll try to find time ASAP to figure this out.
Status: REOPENED → ASSIGNED
I know the cause.  Could you please try and apply patch from bug 674147?

If crashes stop, then it proves the SSL thread switches the NSPR layer socket during the time we wait in _MD_SELECT.  With it it also switches the osdf handle and it is then not found in the resulting list of select().

Another reason to remove the ssl thread.

I discovered this during debug tests of SPDY patch 0.
Confirmed in the debugger that we put the nsSSLIOLayerHelpers::mSharedPollableEvent osfd to select() lists but after return from select() the NSPR layer is switched back to the original socket with a different osfd.

I think this is dup of or dep on bug 674147.
bsmith: it isn't clear to me what patches I need to apply from bug 674147.
Bob, I guess Honza means that he wants you to try out things with that WIP patch to see if the crashes stop. Don't be surprised if that patch breaks other things.
Depends on: 674147
Bug 674147 landed. Please let us know if we hit this assertion again.
I think this is fixed. I've only seen this on beta (55 times) and aurora (8 times) since Dec 6.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: