Closed Bug 738812 Opened 12 years ago Closed 12 years ago

Intermittent TEST-UNEXPECTED-FAIL | browser/docshell/test/browser/browser_bug435325.js or content/events/test/test_bug336682_1.html | Exited with code 1 during test run [@ nsHttpConnectionMgr::ReadTimeoutTick]

Categories

(Core :: Networking: HTTP, defect)

14 Branch
x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
firefox12 --- unaffected
firefox13 --- fixed
firefox14 --- fixed

People

(Reporter: mbrubeck, Assigned: mcmanus)

References

Details

(Keywords: crash, intermittent-failure, Whiteboard: [test which aborts the suite][qa-])

Crash Data

Attachments

(1 file)

TEST-START | chrome://mochitests/content/browser/docshell/test/browser/browser_bug435325.js
++DOCSHELL 0x4abbfe0 == 30 [id = 3080]
++DOMWINDOW == 98 (0x47249e8) [serial = 8031] [outer = (nil)]
++DOMWINDOW == 99 (0x80c42c8) [serial = 8032] [outer = 0x4724970]
WARNING: NS_ENSURE_TRUE(mMutable) failed: file /builds/slave/m-in-lnx64-dbg/build/netwerk/base/src/nsSimpleURI.cpp, line 293
###!!! ABORT: wrong thread: 'PR_GetCurrentThread() == gSocketThread', file /builds/slave/m-in-lnx64-dbg/build/netwerk/protocol/http/nsHttpConnectionMgr.cpp, line 1987
std::vector<unsigned short, std::allocator<unsigned short> >::resize(unsigned long, unsigned short) (/home/cltbld/talos-slave/test/build/firefox/libxul.so)
XRE_AddStaticComponent (/home/cltbld/talos-slave/test/build/firefox/libxul.so)
XRE_AddStaticComponent (/home/cltbld/talos-slave/test/build/firefox/libxul.so)
XRE_AddStaticComponent (/home/cltbld/talos-slave/test/build/firefox/libxul.so)
std::vector<unsigned int, std::allocator<unsigned int> >::resize(unsigned long, unsigned int) (/home/cltbld/talos-slave/test/build/firefox/libxul.so)
XRE_AddStaticComponent (/home/cltbld/talos-slave/test/build/firefox/libxul.so)
PR_Select (/home/cltbld/talos-slave/test/build/firefox/libnspr4.so)
UNKNOWN  (/lib64/libpthread.so.0)
clone+0x0000006D  (/lib64/libc.so.6)
###!!! ABORT: wrong thread: 'PR_GetCurrentThread() == gSocketThread', file /builds/slave/m-in-lnx64-dbg/build/netwerk/protocol/http/nsHttpConnectionMgr.cpp, line 1987
++DOMWINDOW == 100 (0x95af7f8) [serial = 8033] [outer = 0x4724970]
--DOCSHELL 0x3acc480 == 29 [id = 3057]
WARNING: shutting down early because of crash!: file /builds/slave/m-in-lnx64-dbg/build/dom/plugins/ipc/PluginModuleChild.cpp, line 743
WARNING: plugin process _exit()ing: file /builds/slave/m-in-lnx64-dbg/build/dom/plugins/ipc/PluginModuleChild.cpp, line 708
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/docshell/test/browser/browser_bug435325.js | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:18:44.526552
INFO | automation.py | Reading PID log: /tmp/tmp9pz-ORpidlog
==> process 2409 launched child process 2493
INFO | automation.py | Checking for orphan process with PID: 2493
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64-debug/1332518773/firefox-14.0a1.en-US.linux-x86_64.crashreporter-symbols.zip
PROCESS-CRASH | chrome://mochitests/content/browser/docshell/test/browser/browser_bug435325.js | application crashed (minidump found)
Crash dump filename: /tmp/tmpmjazKd/minidumps/20c714cb-9e3b-2518-6fcdd29b-0ee15e90.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64
     family 6 model 23 stepping 10
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x0

Thread 3 (crashed)
 0  libmozalloc.so!TouchBadMemory [mozalloc_abort.cpp:93fc7bf5e5be : 68 + 0x0]
    rbx = 0xd3172d78   r12 = 0x02f46790   r13 = 0xd3172d78   r14 = 0x02f46790
    r15 = 0x120d18da   rip = 0x1470a0d9   rsp = 0x02f46750   rbp = 0x02f46770
    Found by: given as instruction pointer in context
 1  libmozalloc.so!mozalloc_abort [mozalloc_abort.cpp:93fc7bf5e5be : 89 + 0x4]
    rbx = 0xd3172d78   r12 = 0x02f46790   r13 = 0xd3172d78   r14 = 0x02f46790
    r15 = 0x120d18da   rip = 0x1470a115   rsp = 0x02f46760   rbp = 0x02f46770
    Found by: call frame info
 2  libxul.so!NS_DebugBreak_P [nsDebugImpl.cpp:93fc7bf5e5be : 388 + 0x4]
    rbx = 0x02f46b80   r12 = 0x02f46790   r13 = 0xd3172d78   r14 = 0x02f46790
    r15 = 0x120d18da   rip = 0x11a715ab   rsp = 0x02f46780   rbp = 0x02f46bc0
    Found by: call frame info
 3  libxul.so!nsHttpConnectionMgr::ReadTimeoutTick [nsHttpConnectionMgr.cpp:93fc7bf5e5be : 1987 + 0x32]
    rbx = 0x00ec3fe0   r12 = 0xf4001920   r13 = 0x120f4618   r14 = 0x00000003
    r15 = 0x00000000   rip = 0x10cb0dd5   rsp = 0x02f46bd0   rbp = 0x02f46be0
    Found by: call frame info
 4  libxul.so!nsHttpConnectionMgr::Observe [nsHttpConnectionMgr.cpp:93fc7bf5e5be : 274 + 0x7]
    rbx = 0x00ec3fe0   r12 = 0xf4001920   r13 = 0x120f4618   r14 = 0x00000003
    r15 = 0x00000000   rip = 0x10cb30bb   rsp = 0x02f46bf0   rbp = 0x02f46c20
    Found by: call frame info
 5  libxul.so!nsTimerImpl::Fire [nsTimerImpl.cpp:93fc7bf5e5be : 516 + 0x15]
    rbx = 0xf4001920   r12 = 0x00ec3fe0   r13 = 0x00000003   r14 = 0x00000003
    r15 = 0x00000000   rip = 0x11a6cb60   rsp = 0x02f46c30   rbp = 0x02f46cd0
    Found by: call frame info
 6  libxul.so!nsTimerEvent::Run [nsTimerImpl.cpp:93fc7bf5e5be : 591 + 0x10]
    rbx = 0x00a4fe28   r12 = 0x0000cae2   r13 = 0x00000000   r14 = 0xffffffff
    r15 = 0x02f46dff   rip = 0x11a6ccca   rsp = 0x02f46ce0   rbp = 0x02f46d20
    Found by: call frame info
 7  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:93fc7bf5e5be : 657 + 0xd]
    rbx = 0x00c3c1e0   r12 = 0x02f46d00   r13 = 0x00000000   r14 = 0xffffffff
    r15 = 0x02f46dff   rip = 0x11a68c8b   rsp = 0x02f46d30   rbp = 0x02f46de0
    Found by: call frame info
 8  libxul.so!NS_ProcessPendingEvents_P [nsThreadUtils.cpp:93fc7bf5e5be : 195 + 0xd]
    rbx = 0x00c3c1e0   r12 = 0x02f46dff   r13 = 0x4091a679   r14 = 0xffffffff
    r15 = 0x00000003   rip = 0x11a272c2   rsp = 0x02f46df0   rbp = 0x02f46e30
    Found by: call frame info
 9  libxul.so!nsThread::ThreadFunc [nsThread.cpp:93fc7bf5e5be : 308 + 0xa]
    rbx = 0x02f46e40   r12 = 0x00c3c1e0   r13 = 0x00c3c200   r14 = 0x02f46e50
    r15 = 0x00000003   rip = 0x11a67e9f   rsp = 0x02f46e40   rbp = 0x02f46ea0
Forgot to paste the log link:
https://tbpl.mozilla.org/php/getParsedLog.php?id=10320105&tree=Mozilla-Inbound
Rev3 Fedora 12x64 mozilla-inbound debug test mochitest-other on 2012-03-23 09:47:31 PDT for push 93fc7bf5e5be
In content/events/test/test_bug336682_1.html:

https://tbpl.mozilla.org/php/getParsedLog.php?id=10343028&tree=Mozilla-Inbound
Summary: Intermittent TEST-UNEXPECTED-FAIL | browser/docshell/test/browser/browser_bug435325.js | Exited with code 1 during test run [@ nsHttpConnectionMgr::ReadTimeoutTick] → Intermittent TEST-UNEXPECTED-FAIL | browser/docshell/test/browser/browser_bug435325.js or content/events/test/test_bug336682_1.html | Exited with code 1 during test run [@ nsHttpConnectionMgr::ReadTimeoutTick]
honza, at first glance this is really baffling.. I'm wondering if you have any ideas?

it is asserting that readtimeouttick is not being called on the socket thread.. but the timer is created on the socket thread and only ever init'd on the socket thread.. (those asserts don't go off, so we can be sure that happened ok).

indeed the stack trace is some other unknown thread - not the socket or main thread.

the test doesn't look like its trying to shut down or anything like that.
Just to update, as I've had an off-zilla inquiry. This is on my radar for next week.

I'll also say when I first saw this signature I expected it to be randomly distributed across the testbase and its pretty interesting (and promising for a fix) that it is normally associated with a couple particular tests.
Blocks: 739762
I thought I had this figured out today, but try disagrees with me :) However that failed run does have lots of debug added to it which I can use to keep working.

I'm only making this comment so people know I'm not treating this as ok-perma-orange
through some debugging annotations and many try runs of mochi-other on linux-64 I have learned that at the time of the assert the socket thread is complete but the connection manager has not yet shutdown (and has not yet received the begin-shutdown notification) which leaves the tick is still active
Attached patch patch 0Splinter Review
its possible for the socket transport service to shutdown before the connection manager. It actually does this all the time if you go offline.

In that case the readtimeouttick continued to try to tick until it determined that there were no active connections, when it disabled itself. but that was too late to do safely if the socket thread was already gone (as the number of active connections is designed to only be accessed on the socket thread).

The patch changes things to disable the timer tick any time the active connection count drops to zero instead of cleaning it up out of the tick itself. In the case with a long running active connection when the transport service goes away first (the tick is still removed by the connection manager shutdown routine if it goes away first) the transport actually detaches all active sockets which drives the count to 0 and disables the tick.
Assignee: nobody → mcmanus
Status: NEW → ASSIGNED
Attachment #613409 - Flags: review?(honzab.moz)
Blocks: 739029
Comment on attachment 613409 [details] [diff] [review]
patch 0

Review of attachment 613409 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab with a comment.

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +254,5 @@
>      }
>  }
>  
> +bool
> +nsHttpConnectionMgr::ConditionallyStopReadTimeoutTick()

This can be just void.  ConditionallyStopPruneDeadConnectionsTimer is the same.  It's confusing when a method does the thing but also returns a flag whether conditions to do the thing has been met (like the caller should do thing it self).  You are not using the result anywhere.
Attachment #613409 - Flags: review?(honzab.moz) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/69e021051aa5

> > +bool
> > +nsHttpConnectionMgr::ConditionallyStopReadTimeoutTick()
> 
> This can be just void. 

I made it void.
Blocks: 744788
https://hg.mozilla.org/mozilla-central/rev/69e021051aa5
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment on attachment 613409 [details] [diff] [review]
patch 0

This is one of a two patch series (bug 744788 and bug 738812) that should be accepted or declined for aurora atomically.

We've only seen this orange on FF14, but it can happen on FF13. The reason it is more prominent on 14 is the timer event is much more frequent there due to the pipelining changes introduced on 14, but the same timer is used by spdy on 13.

[Approval Request Comment]
Regression caused by (bug #): 728113
User impact if declined: potential crash risk during offline or shutdown
Testing completed (on m-c, etc.): m-c - orange cleared up.
Risk to taking this patch (and alternatives if risky): fairly low.
String changes made by this patch: none.
Attachment #613409 - Flags: approval-mozilla-aurora?
Comment on attachment 613409 [details] [diff] [review]
patch 0

[Triage comment]
Low risk, clears up what sounds like intermittent orange on 13, please go ahead with landing.
Attachment #613409 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Whiteboard: [orange][test which aborts the suite] → [orange][test which aborts the suite][qa-]
Whiteboard: [orange][test which aborts the suite][qa-] → [test which aborts the suite][qa-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: