Closed Bug 654201 Opened 13 years ago Closed 13 years ago

ASSERTION: Invalid start index: 'count == 0 || start < Length()' browser_bug581076.js browser_searching.js browser_bug618502.js browser_uninstalling.js browser_globalinformations.js

Categories

(Core :: Networking: HTTP, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: philikon, Assigned: mcmanus)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [test which aborts the suite])

Attachments

(1 file, 5 obsolete files)

Based on the assertions I suspect this is a http issue:

###!!! ASSERTION: Invalid start index: 'count == 0 || start < Length()', file ../../../dist/include/nsTArray.h, line 832
nsTArray<nsHttpTransaction*,nsTArrayDefaultAllocator>::RemoveElementsAt [nsTArray.h:833]
nsTArray<nsHttpTransaction*,nsTArrayDefaultAllocator>::RemoveElementAt [nsTArray.h:840]
nsHttpConnectionMgr::ProcessPendingQForEntry [netwerk/protocol/http/nsHttpConnectionMgr.cpp:588]
nsHttpConnectionMgr::OnMsgProcessPendingQ [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1024]
nsHttpConnectionMgr::OnMsgReclaimConnection [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1110]
nsHttpConnectionMgr::nsHalfOpenSocket::OnOutputStreamReady [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1505]
nsSocketOutputStream::OnSocketReady [netwerk/base/src/nsSocketTransport2.cpp:515]
nsSocketTransport::OnSocketReady [netwerk/base/src/nsSocketTransport2.cpp:1530]
nsSocketTransportService::DoPollIteration [netwerk/base/src/nsSocketTransportService2.cpp:740]
nsSocketTransportService::OnProcessNextEvent [netwerk/base/src/nsSocketTransportService2.cpp:606]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:584]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsSocketTransportService::Run [netwerk/base/src/nsSocketTransportService2.cpp:637]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:618]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:272]
_pt_root [nsprpub/pr/src/pthreads/ptthread.c:190]
libSystem.B.dylib + 0x32155
libSystem.B.dylib + 0x32012
###!!! ASSERTION: Invalid length: 'start + count <= Length()', file ../../../dist/include/nsTArray.h, line 833
nsTArray<nsHttpTransaction*,nsTArrayDefaultAllocator>::RemoveElementsAt [nsTArray.h:834]
nsTArray<nsHttpTransaction*,nsTArrayDefaultAllocator>::RemoveElementAt [nsTArray.h:840]
nsHttpConnectionMgr::ProcessPendingQForEntry [netwerk/protocol/http/nsHttpConnectionMgr.cpp:588]
nsHttpConnectionMgr::OnMsgProcessPendingQ [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1024]
nsHttpConnectionMgr::OnMsgReclaimConnection [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1110]
nsHttpConnectionMgr::nsHalfOpenSocket::OnOutputStreamReady [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1505]
nsSocketOutputStream::OnSocketReady [netwerk/base/src/nsSocketTransport2.cpp:515]
nsSocketTransport::OnSocketReady [netwerk/base/src/nsSocketTransport2.cpp:1530]
nsSocketTransportService::DoPollIteration [netwerk/base/src/nsSocketTransportService2.cpp:740]
nsSocketTransportService::OnProcessNextEvent [netwerk/base/src/nsSocketTransportService2.cpp:606]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:584]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsSocketTransportService::Run [netwerk/base/src/nsSocketTransportService2.cpp:637]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:618]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:272]
_pt_root [nsprpub/pr/src/pthreads/ptthread.c:190]
libSystem.B.dylib + 0x32155
libSystem.B.dylib + 0x32012
###!!! ASSERTION: Release()ing something that hasn't been Acquire()ed: 'chainFront && CallStack::kNone != mDDEntry->mAcquisitionContext', file /builds/slave/services-cen-osx-dbg/build/obj-firefox/xpcom/build/BlockingResourceBase.cpp, line 193
mozilla::BlockingResourceBase::Release [obj-firefox/xpcom/build/BlockingResourceBase.cpp:195]
mozilla::Mutex::Unlock [obj-firefox/xpcom/build/BlockingResourceBase.cpp:269]
mozilla::Monitor::Unlock [Monitor.h:76]
mozilla::MonitorAutoUnlock::MonitorAutoUnlock [Monitor.h:172]
TimerThread::Run [xpcom/threads/TimerThread.cpp:290]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:618]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:272]
_pt_root [nsprpub/pr/src/pthreads/ptthread.c:190]
libSystem.B.dylib + 0x32155
libSystem.B.dylib + 0x32012
WARNING: Resource acquired at calling context
: file /builds/slave/services-cen-osx-dbg/build/obj-firefox/xpcom/build/BlockingResourceBase.cpp, line 200
mozilla::Mutex::Lock [obj-firefox/xpcom/build/BlockingResourceBase.cpp:258]
mozilla::Monitor::Lock [Monitor.h:71]
mozilla::MonitorAutoUnlock::~MonitorAutoUnlock [Monitor.h:177]
TimerThread::Run [xpcom/threads/TimerThread.cpp:320]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:618]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:272]
_pt_root [nsprpub/pr/src/pthreads/ptthread.c:190]
libSystem.B.dylib + 0x32155
libSystem.B.dylib + 0x32012
WARNING:
is being released in non-LIFO order; why?: file /builds/slave/services-cen-osx-dbg/build/obj-firefox/xpcom/build/BlockingResourceBase.cpp, line 202
Blocks: 438871
Whiteboard: [orange]
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1304369910.1304373507.6518.gz&fulltext=1
Rev3 MacOSX Leopard 10.5.8 tracemonkey debug test mochitest-other on 2011/05/02 13:58:30 

TEST-PASS | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | Should be displaying the correct UI
TEST-INFO | before wait for focus -- loaded: uninitialized active window: ([object ChromeWindow @ 0x5bf3c90 (native @ 0x5bf0838)]) chrome://browser/content/browser.xul focused window: ([object Window @ 0x465acbb0 (native @ 0x469a6038)]) about:addons desired window: ([object Window @ 0x465acbb0 (native @ 0x469a6038)]) about:addons child window: ([object Window @ 0x465acbb0 (native @ 0x469a6038)]) about:addons docshell visible: true
TEST-INFO | must wait for load
TEST-INFO | already focused
TEST-INFO | maybe run tests <load:false, focus:true> -- loaded: uninitialized active window: ([object ChromeWindow @ 0x5bf3c90 (native @ 0x5bf0838)]) chrome://browser/content/browser.xul focused window: ([object Window @ 0x465acbb0 (native @ 0x469a6038)]) about:addons desired window: ([object Window @ 0x465acbb0 (native @ 0x469a6038)]) about:addons child window: ([object Window @ 0x465acbb0 (native @ 0x469a6038)]) about:addons docshell visible: true
WARNING: NS_ENSURE_TRUE(mState == STATE_TRANSFERRING) failed: file /builds/slave/tm-osx-dbg/build/netwerk/base/src/nsSocketTransport2.cpp, line 1886
++DOMWINDOW == 404 (0x46942888) [serial = 6793] [outer = 0x466bf2e0]
TEST-INFO | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | Console message: [JavaScript Warning: "CSP: Directive "script-src https://services.addons.mozilla.org https://static.addons.mozilla.net https://www.google.com https://www.paypalobjects.com" violated by https://addons.mozilla.org/en-US/jsi18n//build:7d7e85b"]
TEST-INFO | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | Console message: [JavaScript Warning: "CSP: Directive "script-src https://services.addons.mozilla.org https://static.addons.mozilla.net https://www.google.com https://www.paypalobjects.com" violated by https://addons.mozilla.org/en-US/firefox/addons/buttons.js/build:7d7e85b"]
TEST-INFO | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | Console message: [JavaScript Warning: "CSP: Directive "script-src https://services.addons.mozilla.org https://static.addons.mozilla.net https://www.google.com https://www.paypalobjects.com" violated by https://addons.mozilla.org/en-US/jsi18n//build:7d7e85b"]
TEST-INFO | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | Console message: [JavaScript Warning: "Error in parsing value for 'max-width'.  Declaration dropped." {file: "https://static.addons.mozilla.net/media/css/zamboni/discovery-pane-min.css?build=7d7e85b" line: 1}]
TEST-INFO | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | Console message: [JavaScript Warning: "Unknown property 'zoom'.  Declaration dropped." {file: "https://static.addons.mozilla.net/media/css/zamboni/discovery-pane-min.css?build=7d7e85b" line: 1}]
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80530008: file /builds/slave/tm-osx-dbg/build/dom/src/storage/nsDOMStorage.cpp, line 1280
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80530008: file /builds/slave/tm-osx-dbg/build/dom/src/storage/nsDOMStorage.cpp, line 1280
TEST-INFO | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | Console message: [JavaScript Warning: "CSP: Directive "script-src https://services.addons.mozilla.org https://static.addons.mozilla.net https://www.google.com https://www.paypalobjects.com" violated by https://addons.mozilla.org/en-US/firefox/addons/buttons.js/build:7d7e85b"]
###!!! ASSERTION: Invalid start index: 'count == 0 || start < Length()', file ../../../dist/include/nsTArray.h, line 832
nsTArray<nsHttpTransaction*,nsTArrayDefaultAllocator>::RemoveElementsAt [nsTArray.h:833]
nsTArray<nsHttpTransaction*,nsTArrayDefaultAllocator>::RemoveElementAt [nsTArray.h:840]
nsHttpConnectionMgr::ProcessPendingQForEntry [netwerk/protocol/http/nsHttpConnectionMgr.cpp:588]
nsHttpConnectionMgr::OnMsgProcessPendingQ [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1024]
nsHttpConnectionMgr::OnMsgReclaimConnection [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1110]
nsHttpConnectionMgr::nsHalfOpenSocket::OnOutputStreamReady [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1505]
nsSocketOutputStream::OnSocketReady [netwerk/base/src/nsSocketTransport2.cpp:515]
nsSocketTransport::OnSocketReady [netwerk/base/src/nsSocketTransport2.cpp:1530]
nsSocketTransportService::DoPollIteration [netwerk/base/src/nsSocketTransportService2.cpp:740]
nsSocketTransportService::OnProcessNextEvent [netwerk/base/src/nsSocketTransportService2.cpp:606]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:584]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsSocketTransportService::Run [netwerk/base/src/nsSocketTransportService2.cpp:637]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:618]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:272]
_pt_root [nsprpub/pr/src/pthreads/ptthread.c:190]
libSystem.B.dylib + 0x32155
libSystem.B.dylib + 0x32012
###!!! ASSERTION: Invalid length: 'start + count <= Length()', file ../../../dist/include/nsTArray.h, line 833
nsTArray<nsHttpTransaction*,nsTArrayDefaultAllocator>::RemoveElementsAt [nsTArray.h:834]
nsTArray<nsHttpTransaction*,nsTArrayDefaultAllocator>::RemoveElementAt [nsTArray.h:840]
nsHttpConnectionMgr::ProcessPendingQForEntry [netwerk/protocol/http/nsHttpConnectionMgr.cpp:588]
nsHttpConnectionMgr::OnMsgProcessPendingQ [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1024]
nsHttpConnectionMgr::OnMsgReclaimConnection [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1110]
nsHttpConnectionMgr::nsHalfOpenSocket::OnOutputStreamReady [netwerk/protocol/http/nsHttpConnectionMgr.cpp:1505]
nsSocketOutputStream::OnSocketReady [netwerk/base/src/nsSocketTransport2.cpp:515]
nsSocketTransport::OnSocketReady [netwerk/base/src/nsSocketTransport2.cpp:1530]
nsSocketTransportService::DoPollIteration [netwerk/base/src/nsSocketTransportService2.cpp:740]
nsSocketTransportService::OnProcessNextEvent [netwerk/base/src/nsSocketTransportService2.cpp:606]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:584]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsSocketTransportService::Run [netwerk/base/src/nsSocketTransportService2.cpp:637]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:618]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:250]
nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:272]
_pt_root [nsprpub/pr/src/pthreads/ptthread.c:190]
libSystem.B.dylib + 0x32155
libSystem.B.dylib + 0x32012
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:40:05.018135
INFO | automation.py | Reading PID log: /var/folders/Xr/Xr--yJnSEY0U11ET5NZuMU+++TM/-Tmp-/tmpZbm47Bpidlog
PROCESS-CRASH | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_searching.js | application crashed (minidump found)
Operating system: Mac OS X
                  10.5.8 9L31a
CPU: x86
     GenuineIntel family 6 model 23 stepping 10
     2 CPUs

Crash reason:  EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE
Crash address: 0x2b

Thread 0 (crashed)
 0  XUL!imgFrame::UnlockImageData [imgFrame.cpp:b67b035200ef : 691 + 0x3]
    eip = 0x02ac0e34   esp = 0xbfffc1a0   ebp = 0xbfffc1d8   ebx = 0x02ac0e30
    esi = 0x00000000   edi = 0x00000000   eax = 0xffffffff   ecx = 0x00000000
    edx = 0x049da4d8   efl = 0x00210286
    Found by: given as instruction pointer in context
 1  XUL!mozilla::imagelib::RasterImage::InternalAddFrame [RasterImage.cpp:b67b035200ef : 826 + 0xa]
    eip = 0x02ab4605   esp = 0xbfffc1e0   ebp = 0xbfffc258   ebx = 0x02ab438f
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
 2  XUL!mozilla::imagelib::RasterImage::AppendFrame [RasterImage.cpp:b67b035200ef : 880 + 0x67]
    eip = 0x02ab4dcb   esp = 0xbfffc260   ebp = 0xbfffc2a8   ebx = 0x02ab4cb6
    esi = 0x4679dd48   edi = 0x00000000
    Found by: call frame info
 3  XUL!mozilla::imagelib::nsPNGDecoder::CreateFrame [nsPNGDecoder.cpp:b67b035200ef : 120 + 0x51]
    eip = 0x02adeed5   esp = 0xbfffc2b0   ebp = 0xbfffc308   ebx = 0x02adee82
    esi = 0x4679dd48   edi = 0x00000000
    Found by: call frame info
 4  XUL!mozilla::imagelib::nsPNGDecoder::info_callback [nsPNGDecoder.cpp:b67b035200ef : 633 + 0x34]
    eip = 0x02adf50d   esp = 0xbfffc310   ebp = 0xbfffc3c8   ebx = 0x02adf08f
    esi = 0x4679e000   edi = 0x0004a252
    Found by: call frame info
 5  XUL!MOZ_PNG_push_have_info [pngpread.c:b67b035200ef : 1900 + 0x17]
    eip = 0x040fe23c   esp = 0xbfffc3d0   ebp = 0xbfffc3e8   ebx = 0x040fc0d7
    esi = 0x4679e000   edi = 0x0004a252
    Found by: call frame info
 6  XUL!MOZ_PNG_push_read_chunk [pngpread.c:b67b035200ef : 435 + 0x11]
    eip = 0x040fc7a7   esp = 0xbfffc3f0   ebp = 0xbfffc428   ebx = 0x040fc0d7
    esi = 0x4679e000   edi = 0x0004a252
    Found by: call frame info
 7  XUL!MOZ_PNG_proc_some_data [pngpread.c:b67b035200ef : 64 + 0x11]
    eip = 0x040fbfac   esp = 0xbfffc430   ebp = 0xbfffc458   ebx = 0x02ade761
    esi = 0x4679e000   edi = 0x0004a252
    Found by: call frame info
 8  XUL!MOZ_PNG_process_data [pngpread.c:b67b035200ef : 41 + 0x11]
    eip = 0x040fbf3d   esp = 0xbfffc460   ebp = 0xbfffc478   ebx = 0x02ade761
    esi = 0x4679e000   edi = 0x0004a252
    Found by: call frame info
 9  XUL!mozilla::imagelib::nsPNGDecoder::WriteInternal [nsPNGDecoder.cpp:b67b035200ef : 354 + 0x25]
    eip = 0x02ade9e1   esp = 0xbfffc480   ebp = 0xbfffc4b8   ebx = 0x02ade761
    esi = 0x4679e000   edi = 0x0004a252
    Found by: call frame info
10  XUL!mozilla::imagelib::Decoder::Write [Decoder.cpp:b67b035200ef : 103 + 0x1f]
    eip = 0x02aada33   esp = 0xbfffc4c0   ebp = 0xbfffc4e8   ebx = 0x02aad9c2
    esi = 0x00000000   edi = 0x0004a252
    Found by: call frame info
11  XUL!mozilla::imagelib::RasterImage::WriteToDecoder [RasterImage.cpp:b67b035200ef : 2264 + 0x25]
    eip = 0x02ab0a1c   esp = 0xbfffc4f0   ebp = 0xbfffc538   ebx = 0x02ab0952
    esi = 0x00000000   edi = 0x0004a252
    Found by: call frame info
12  XUL!mozilla::imagelib::RasterImage::DecodeSomeData [RasterImage.cpp:b67b035200ef : 2567 + 0x30]
    eip = 0x02ab0c16   esp = 0xbfffc540   ebp = 0xbfffc588   ebx = 0x02ab0b1d
    esi = 0x00000000   edi = 0x0004a252
    Found by: call frame info
13  XUL!mozilla::imagelib::imgDecodeWorker::Run [RasterImage.cpp:b67b035200ef : 2681 + 0x11]
    eip = 0x02ab0dcf   esp = 0xbfffc590   ebp = 0xbfffc618   ebx = 0x02ab0c38
    esi = 0x163db16e   edi = 0x0004a252
    Found by: call frame info
14  XUL!mozilla::imagelib::RasterImage::AddSourceData [RasterImage.cpp:b67b035200ef : 1259 + 0x1d]
    eip = 0x02ab15dc   esp = 0xbfffc620   ebp = 0xbfffc698   ebx = 0x02ab12fe
    esi = 0x3d0d9004   edi = 0x469efbac
    Found by: call frame info
15  XUL!mozilla::imagelib::RasterImage::WriteToRasterImage [RasterImage.cpp:b67b035200ef : 2758 + 0x18]
    eip = 0x02ab175b   esp = 0xbfffc6a0   ebp = 0xbfffc6c8   ebx = 0x03ee6810
    esi = 0x3d0d9004   edi = 0x469efbac
    Found by: call frame info
16  XUL!nsPipeInputStream::ReadSegments [nsPipe3.cpp:b67b035200ef : 798 + 0x2f]
    eip = 0x03ee6951   esp = 0xbfffc6d0   ebp = 0xbfffc728   ebx = 0x03ee6810
    esi = 0x3d0d9004   edi = 0x469efbac
    Found by: call frame info
17  XUL!imgRequest::OnDataAvailable [imgRequest.cpp:b67b035200ef : 1152 + 0x42]
    eip = 0x02ad5424   esp = 0xbfffc730   ebp = 0xbfffca68   ebx = 0x02ad45db
    esi = 0x03ee6802   edi = 0x02999f56
    Found by: call frame info
18  XUL!ProxyListener::OnDataAvailable [imgLoader.cpp:b67b035200ef : 2020 + 0x3e]
    eip = 0x02ac3b85   esp = 0xbfffca70   ebp = 0xbfffcaa8   ebx = 0x0299ed9a
    esi = 0x00000000   edi = 0x469c45d0
    Found by: call frame info
19  XUL!nsHttpChannel::OnDataAvailable [nsHttpChannel.cpp:b67b035200ef : 4198 + 0x60]
    eip = 0x0299f0bb   esp = 0xbfffcab0   ebp = 0xbfffcb28   ebx = 0x0299ed9a
    esi = 0x00000000   edi = 0x469c45d0
    Found by: call frame info
Component: Add-ons Manager → Networking: HTTP
Product: Toolkit → Core
QA Contact: add-ons.manager → networking.http
Summary: Test failure in chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser-window/browser_bug581076.js | application timed out → ASSERTION: Invalid start index: 'count == 0 || start < Length()' browser_bug581076.js browser_searching.js
Whiteboard: [orange] → [orange][test which aborts the suite]
Version: unspecified → Trunk
I'm not familiar with either the services-central or tracemonkey branches, but from the stack I suspect this is a dup of 653530 .. did these branches contain changeset 9b12c34434f4 ? (the fix for that bug)
The services-central instance was on a merge from mozilla-central up to http://hg.mozilla.org/mozilla-central/rev/4c616579cdfa, and the tracemonkey instance was on a merge from mozilla-central up to http://hg.mozilla.org/mozilla-central/rev/bf1db7f29686, so m-c as of 04:33 and 10:44 this morning, both well after 9b12c34434f4.
i see what's wrong. grrr.
(In reply to comment #5)
> i see what's wrong.

nope. It was me that was wrong.
I added some debug and started running mochitest-other on osx-32-debug on try and eventually came up with a stack trace that shows the culprit. (below).

Basically the processpendingQ code is testing the idle persistent connections to make sure they are still alive, looking for one to dispatch the pending entry on. It appears that touching an SSL idle persistent connection with IsAlive() is enough to make it synchronously start the socket thread event loop again (with IsAlive() on the stack). Any code that isn't re-entrant with respect to IsAlive() is going to have a problem. 

In this case mPendingQ is being changed (possibly the same transaction dispatched twice, possibly a bounds exception).. I can also see a problem with the idle connection list - it is being iterated in processPendingQ, but the second event loop could certainly modify that as this stack shows.

IsAlive() basically does a non-blocking network read to see if there is an EOF marker there.


 0  libmozalloc.dylib!TouchBadMemory [mozalloc_abort.cpp:273c0fc2f4be : 64 + 0x5]
    eip = 0x00015f43   esp = 0xb02875d0   ebp = 0xb02875d8   ebx = 0x00015f66
    esi = 0x00419fb0   edi = 0x42c9a570   eax = 0x00000000   ecx = 0x00015f3d
    edx = 0x00000000   efl = 0x00010282
    Found by: given as instruction pointer in context
 1  libmozalloc.dylib!mozalloc_abort [mozalloc_abort.cpp:273c0fc2f4be : 85 + 0x4]
    eip = 0x00015f9f   esp = 0xb02875e0   ebp = 0xb02875f8   ebx = 0x00015f66
    esi = 0x00419fb0   edi = 0x42c9a570
    Found by: call frame info
 2  XUL!Abort [nsDebugImpl.cpp:273c0fc2f4be : 388 + 0xa]
    eip = 0x03f27c15   esp = 0xb0287600   ebp = 0xb0287618   ebx = 0x03f27f17
    esi = 0x00419fb0   edi = 0x42c9a570
    Found by: call frame info
 3  XUL!NS_DebugBreak_P [nsDebugImpl.cpp:273c0fc2f4be : 345 + 0xd]
    eip = 0x03f281be   esp = 0xb0287620   ebp = 0xb0287a48   ebx = 0x03f27f17
    esi = 0x00419fb0   edi = 0x42c9a570
    Found by: call frame info
 4  XUL!assertPending [nsHttpConnectionMgr.cpp:273c0fc2f4be : 341 + 0x3d]
    eip = 0x0297bd5d   esp = 0xb0287a50   ebp = 0xb0287a78   ebx = 0x0297bd1e
    esi = 0x00419fb0   edi = 0x42c9a570
    Found by: call frame info
 5  XUL!nsHttpConnectionMgr::nsHalfOpenSocket::OnOutputStreamReady [nsHttpConnectionMgr.cpp:273c0fc2f4be : 1525 + 0x4]
    eip = 0x02980ce8   esp = 0xb0287a80   ebp = 0xb0287b38   ebx = 0x02980713
    esi = 0x00419fb0   edi = 0x42c9a570
    Found by: call frame info
 6  XUL!nsSocketOutputStream::OnSocketReady [nsSocketTransport2.cpp:273c0fc2f4be : 515 + 0x1f]
    eip = 0x028de8bb   esp = 0xb0287b40   ebp = 0xb0287b78   ebx = 0x028de754
    esi = 0x028dec50   edi = 0x03f177f6
    Found by: call frame info
 7  XUL!nsSocketTransport::OnSocketReady [nsSocketTransport2.cpp:273c0fc2f4be : 1527 + 0x17]
    eip = 0x028ded20   esp = 0xb0287b80   ebp = 0xb0287bc8   ebx = 0x028dec5c
    esi = 0x028dec50   edi = 0x03f177f6
    Found by: call frame info
 8  XUL!nsSocketTransportService::DoPollIteration [nsSocketTransportService2.cpp:273c0fc2f4be : 742 + 0x2e]
    eip = 0x028e66d8   esp = 0xb0287bd0   ebp = 0xb0287c38   ebx = 0x028e62f6
    esi = 0x028dec50   edi = 0x03f177f6
    Found by: call frame info
 9  XUL!nsSocketTransportService::OnProcessNextEvent [nsSocketTransportService2.cpp:273c0fc2f4be : 603 + 0x12]
    eip = 0x028e6934   esp = 0xb0287c40   ebp = 0xb0287c78   ebx = 0x03f167a8
    esi = 0x0043ef48   edi = 0x03f177f6
    Found by: call frame info
10  XUL!nsThread::ProcessNextEvent [nsThread.cpp:273c0fc2f4be : 582 + 0x61]
    eip = 0x03f1693b   esp = 0xb0287c80   ebp = 0xb0287d28   ebx = 0x03f167a8
    esi = 0x0043ef48   edi = 0x03f177f6
    Found by: call frame info
11  XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:273c0fc2f4be : 250 + 0x1f]
    eip = 0x03e9e50a   esp = 0xb0287d30   ebp = 0xb0287d78   ebx = 0x03e9e494
    esi = 0x00419fb0   edi = 0x03f177f6
    Found by: call frame info
12  XUL!nsProxyEventObject::CallMethod [nsProxyEventObject.cpp:273c0fc2f4be : 260 + 0x12]
    eip = 0x03f24266   esp = 0xb0287d80   ebp = 0xb0287e08   ebx = 0x03f23ec4
    esi = 0x00419fb0   edi = 0x03f177f6
    Found by: call frame info
13  XUL!PrepareAndDispatch [xptcstubs_unixish_x86.cpp:273c0fc2f4be : 93 + 0x31]
    eip = 0x03f3e2a8   esp = 0xb0287e10   ebp = 0xb0287f18   ebx = 0x03f3e03f
    esi = 0x00000010   edi = 0x03f23eb6
    Found by: call frame info
14  XUL!nsXPTCStubBase::Stub16 [xptcstubsdef.inc : 14 + 0x25]
    eip = 0x03f39215   esp = 0xb0287f20   ebp = 0xb0287f68   ebx = 0x03f391ee
    esi = 0x3f617430   edi = 0xb0287f30
    Found by: call frame info
15  XUL!nsNSSSocketInfo::EnsureDocShellDependentStuffKnown [nsNSSIOLayer.cpp:273c0fc2f4be : 398 + 0x3c]
    eip = 0x039febdb   esp = 0xb0287f70   ebp = 0xb0288018   ebx = 0x039fe9e9
    esi = 0x3f617430   edi = 0x03f391e0
    Found by: call frame info
16  XUL!nsNSSSocketInfo::GetPreviousCert [nsNSSIOLayer.cpp:273c0fc2f4be : 847 + 0xa]
    eip = 0x039fef6a   esp = 0xb0288020   ebp = 0xb0288068   ebx = 0x039fef14
    esi = 0x00000000   edi = 0x00000001
    Found by: call frame info
17  XUL!HandshakeCallback [nsNSSCallbacks.cpp:273c0fc2f4be : 919 + 0x31]
    eip = 0x039e99b5   esp = 0xb0288070   ebp = 0xb02883b8   ebx = 0x039e93e9
    esi = 0x00000000   edi = 0x00000001
    Found by: call frame info
18  libssl3.dylib!ssl3_HandleFinished [ssl3con.c : 8503 + 0x1f]
    eip = 0x001f1b72   esp = 0xb02883c0   ebp = 0xb0288418   ebx = 0x001f139c
    esi = 0x00000000   edi = 0x00000001
    Found by: call frame info
19  libssl3.dylib!ssl3_HandleHandshakeMessage [ssl3con.c : 8659 + 0x1f]
    eip = 0x001f22e2   esp = 0xb0288420   ebp = 0xb0288498   ebx = 0x001f1b91
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
20  libssl3.dylib!ssl3_HandleHandshake [ssl3con.c : 8727 + 0x20]
    eip = 0x001f250e   esp = 0xb02884a0   ebp = 0xb02884d8   ebx = 0x001f2332
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
21  libssl3.dylib!ssl3_HandleRecord [ssl3con.c : 9066 + 0x11]
    eip = 0x001f3075   esp = 0xb02884e0   ebp = 0xb02885d8   ebx = 0x001f2784
    esi = 0x00000301   edi = 0x00000001
    Found by: call frame info
22  libssl3.dylib!ssl3_GatherCompleteHandshake [ssl3gthr.c : 209 + 0x1d]
    eip = 0x001f41ca   esp = 0xb02885e0   ebp = 0xb0288628   ebx = 0x001f40ed
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
23  libssl3.dylib!ssl_GatherRecord1stHandshake [sslcon.c : 1258 + 0x12]
    eip = 0x001f6e71   esp = 0xb0288630   ebp = 0xb0288678   ebx = 0x001f6dec
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
24  libssl3.dylib!ssl_Do1stHandshake [sslsecur.c : 151 + 0x10]
    eip = 0x0020247c   esp = 0xb0288680   ebp = 0xb02886a8   ebx = 0x00202237
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
25  libssl3.dylib!ssl_SecureRecv [sslsecur.c : 1141 + 0xa]
    eip = 0x0020436f   esp = 0xb02886b0   ebp = 0xb0288708   ebx = 0x002041ee
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
26  libssl3.dylib!ssl_Recv [sslsock.c : 1591 + 0x25]
    eip = 0x0020bd8b   esp = 0xb0288710   ebp = 0xb0288748   ebx = 0x0020bcd4
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
27  XUL!nsSSLThread::requestRecvMsgPeek [nsSSLThread.cpp:273c0fc2f4be : 208 + 0x2b]
    eip = 0x039e1597   esp = 0xb0288750   ebp = 0xb02887a8   ebx = 0x039e133a
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
28  XUL!PSMRecv [nsNSSIOLayer.cpp:273c0fc2f4be : 2180 + 0x26]
    eip = 0x039fcb94   esp = 0xb02887b0   ebp = 0xb02887f8   ebx = 0x039fcaea
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
29  libnspr4.dylib!PR_Recv [priometh.c:273c0fc2f4be : 220 + 0x2b]
    eip = 0x000484ac   esp = 0xb0288800   ebp = 0xb0288828   ebx = 0x02979d0d
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
30  XUL!nsSocketTransport::IsAlive [nsSocketTransport2.cpp:273c0fc2f4be : 1852 + 0x29]
    eip = 0x028df1ae   esp = 0xb0288830   ebp = 0xb0288898   ebx = 0x02979d0d
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
31  XUL!nsHttpConnection::IsAlive [nsHttpConnection.cpp:273c0fc2f4be : 255 + 0x22]
    eip = 0x02979049   esp = 0xb02888a0   ebp = 0xb02888d8   ebx = 0x02979d0d
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
32  XUL!nsHttpConnection::CanReuse [nsHttpConnection.cpp:273c0fc2f4be : 220 + 0x3b]
    eip = 0x02979d4a   esp = 0xb02888e0   ebp = 0xb0288918   ebx = 0x02979d0d
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
33  XUL!nsHttpConnectionMgr::GetConnection [nsHttpConnectionMgr.cpp:273c0fc2f4be : 716 + 0xa]
    eip = 0x0297f377   esp = 0xb0288920   ebp = 0xb0288958   ebx = 0x0297f2d7
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
34  XUL!nsHttpConnectionMgr::ProcessPendingQForEntry [nsHttpConnectionMgr.cpp:273c0fc2f4be : 604 + 0x26]
    eip = 0x0297fb68   esp = 0xb0288960   ebp = 0xb02889b8   ebx = 0x0297fa2b
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
35  XUL!nsHttpConnectionMgr::OnMsgProcessPendingQ [nsHttpConnectionMgr.cpp:273c0fc2f4be : 1057 + 0x17]
    eip = 0x02980396   esp = 0xb02889c0   ebp = 0xb0288a08   ebx = 0x0298030e
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
36  XUL!nsHttpConnectionMgr::OnMsgReclaimConnection [nsHttpConnectionMgr.cpp:273c0fc2f4be : 1142 + 0x19]
    eip = 0x029806d4   esp = 0xb0288a10   ebp = 0xb0288a98   ebx = 0x029803fd
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
37  XUL!nsHttpConnectionMgr::nsHalfOpenSocket::OnOutputStreamReady [nsHttpConnectionMgr.cpp:273c0fc2f4be : 1552 + 0x32]
    eip = 0x02980ecd   esp = 0xb0288aa0   ebp = 0xb0288b58   ebx = 0x02980713
    esi = 0x42cfa9f0   edi = 0x42c9b6c0
    Found by: call frame info
38  XUL!nsSocketOutputStream::OnSocketReady [nsSocketTransport2.cpp:273c0fc2f4be : 515 + 0x1f]
    eip = 0x028de8bb   esp = 0xb0288b60   ebp = 0xb0288b98   ebx = 0x028de754
    esi = 0x028dec50   edi = 0x00000000
    Found by: call frame info
39  XUL!nsSocketTransport::OnSocketReady [nsSocketTransport2.cpp:273c0fc2f4be : 1527 + 0x17]
    eip = 0x028ded20   esp = 0xb0288ba0   ebp = 0xb0288be8   ebx = 0x028dec5c
    esi = 0x028dec50   edi = 0x00000000
    Found by: call frame info
40  XUL!nsSocketTransportService::DoPollIteration [nsSocketTransportService2.cpp:273c0fc2f4be : 742 + 0x2e]
    eip = 0x028e66d8   esp = 0xb0288bf0   ebp = 0xb0288c58   ebx = 0x028e62f6
    esi = 0x028dec50   edi = 0x00000000
    Found by: call frame info
41  XUL!nsSocketTransportService::OnProcessNextEvent [nsSocketTransportService2.cpp:273c0fc2f4be : 607 + 0x12]
    eip = 0x028e6949   esp = 0xb0288c60   ebp = 0xb0288c98   ebx = 0x03f167a8
    esi = 0x0043ef48   edi = 0x00000000
    Found by: call frame info
42  XUL!nsThread::ProcessNextEvent [nsThread.cpp:273c0fc2f4be : 582 + 0x61]
    eip = 0x03f1693b   esp = 0xb0288ca0   ebp = 0xb0288d48   ebx = 0x03f167a8
    esi = 0x0043ef48   edi = 0x00000000
    Found by: call frame info
43  XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:273c0fc2f4be : 250 + 0x1f]
    eip = 0x03e9e50a   esp = 0xb0288d50   ebp = 0xb0288d98   ebx = 0x03e9e494
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
44  XUL!nsSocketTransportService::Run [nsSocketTransportService2.cpp:273c0fc2f4be : 649 + 0x12]
    eip = 0x028e6138   esp = 0xb0288da0   ebp = 0xb0288de8   ebx = 0x028e6019
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
45  XUL!nsThread::ProcessNextEvent [nsThread.cpp:273c0fc2f4be : 618 + 0x18]
    eip = 0x03f16b26   esp = 0xb0288df0   ebp = 0xb0288e98   ebx = 0x03f167a8
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
46  XUL!NS_ProcessNextEvent_P [nsThreadUtils.cpp:273c0fc2f4be : 250 + 0x1f]
    eip = 0x03e9e50a   esp = 0xb0288ea0   ebp = 0xb0288ee8   ebx = 0x03e9e494
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
47  XUL!nsThread::ThreadFunc [nsThread.cpp:273c0fc2f4be : 273 + 0x12]
    eip = 0x03f174f9   esp = 0xb0288ef0   ebp = 0xb0288f48   ebx = 0x03f173ff
    esi = 0x00000000   edi = 0x00000000
    Found by: call frame info
48  libnspr4.dylib!_pt_root [ptthread.c:273c0fc2f4be : 187 + 0x10]
    eip = 0x0006cca9   esp = 0xb0288f50   ebp = 0xb0288f78   ebx = 0x0006cb93
    esi = 0xb0289000   edi = 0x00000000
    Found by: call frame info
49  libSystem.B.dylib + 0x32154
    eip = 0x9296b155   esp = 0xb0288f80   ebp = 0xb0288fc8   ebx = 0x9296b028
    esi = 0xb0289000   edi = 0x00000000
    Found by: call frame info
50  libSystem.B.dylib + 0x32011
    eip = 0x9296b012   esp = 0xb0288fd0   ebp = 0xb0288fec
    Found by: previous frame's frame pointer
Assignee: nobody → mcmanus
Status: NEW → UNCONFIRMED
Ever confirmed: false
adding brian and honza to the cc as they have more ssl understanding than I do.. do either of you have thoughts on comment 8?
Yeah, this is a regression I was a bit worried about when you started to use PR_Available form the http code.  

I'll try to figure out what we can do about this.
Assignee: mcmanus → honzab.moz
Status: UNCONFIRMED → NEW
Ever confirmed: true
I guess the event loop is running because the ensuredocshell() routine uses synchronous proxies to the main thread. whoa.

AFAIK IsAlive() would have triggered this for a long time although maybe somewhat less painfully in other code rearrangements.. including firefox 4.

e.g. look at the mozilla-2.0 code which calls CanReuse()->IsAlive() out of nsHttpConnectionMgr::GetConnection() .. that holds an index reference to mIdleConns across the call which might certainly be invalid after the event loop runs resulting in a similar bounds exception problem or the deletion of the wrong object. The same goes for the pruneDeadConnections code..
honza,

Do you think the "ssl-recv can spin the event loop" rises to the level of a separate bug? It certainly seems very likely to cause other bugs.

meanwhile, something as simple as this probably gets us back to the state of things as they are in 4.0.. what do you think of that as a stop gap? (untested)

diff --git a/netwerk/protocol/http/nsHttpConnectionMgr.cpp b/netwerk/protocol/http/nsHttpConnectionMgr.cpp
--- a/netwerk/protocol/http/nsHttpConnectionMgr.cpp
+++ b/netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ -579,7 +579,14 @@ nsHttpConnectionMgr::ProcessPendingQForE
             if (conn)
                 break;
         }
-        if (conn) {
+
+        // It is possible that the GetConnection() call invoked CanReuse()
+        // on an idle SSL persistent connection. That combination can result
+        // in the socket thread event loop being run synchronously on the stack,
+        // which in turn means ent->mPending may have changed. To fix that
+        // enure the index value we are using.
+        i = ent->mPendingQ.IndexOf(trans);
+        if (conn && (i >= 0)) {
             LOG(("  dispatching pending transaction...\n"));
 
             // remove pending transaction
I misslooked the stack in comment 8.  I thought this was a regression from calling PR_Available.  But it's not the reason.

Before patch for HTTP acceleration, ProcessPendingQForEntry was called (indirectly) from ~nsConnectionHandle dtor or in vary rare case from nsHttpConnection::OnSocketReadable.  Both places being called on the socket thread, any events involved in the chain also posted to the socket thread only.  So, this problem is not a regression.

Why we see this now (happening just more often?) is that we store all new transactions to a pending queue now and establish half-opens for them.  Those soon success and dispatches transactions for real.  Before the patch, transaction were dispatched immediately, so this bug happened much more rarely, if even so.


What can we do:

1) Change nsNSSSocketInfo::EnsureDocShellDependentStuffKnown to not proxy and say that PR_Read must not spin the event loop (out of scope of this bug, but I'd like to do it later anyway)


2) Patch from comment 12 is not enough from following reasons:
- it must keep ref to the object held in |trans| to prevent it from being released during call to GetConnection
- the iteration must use direct mPendingQ.Count() check instead of caching it in the |count| variable
- GetConnection calls CreateTransport that might create a new half-open for a transaction that doesn't necessarily have to be in the pendingQ anymore (personally I play with a thought to break the relation transaction/half-open soon anyway)
- other possible problems I cannot anticipate now


3) [I vote for this one] Change the code to first search for an idle connection if there is a positive number of pending transactions that are not already referred by their respective half-opens ; the goal is to first call CanReuse before we iterate mPendingQ.  So, my idea for the design is:
- let's split GetConnection method as follows
- have a new FindConnection to only look for idle connections, but do not create new ones (half opens)
- have a new method CreateConnection that creates a new half open and bound it with a transaction
- GetConnection will call these two new methods
- in ProcessPendingQForEntry: first call FindConnection, i.e. gather an idle conn and remove it from the list, regardless of trans->Caps() & NS_HTTP_ALLOW_KEEPALIVE
- if found, search for a transaction that trans->Caps() & NS_HTTP_ALLOW_KEEPALIVE and is not referred by a half-open already and dispatch it with the connection found by FindConnection; we are done
- (if such a transaction has not been found, return the connection back to the idle list)
- otherwise, take any transaction that is not referred by a half-open already and create a new half-open (call CreateConnection)
- FindConnection also have to check that the connection on which it just called CanReuse is still present in the idle list; if not, ignore that connection since it has already been dispatched to a transaction


This seems to me clearest, Patrick, please check the idea and let me know what you think.
>Change the code to first search for an idle connection
>if there is a positive number of pending transactions that are not already
>referred by their respective half-opens ; the goal is to first call CanReuse
>before we iterate mPendingQ

This should stand as:

Change the code to first search for an idle connection ; the goal is to first call CanReuse before we iterate mPendingQ
(In reply to comment #14)
> >Change the code to first search for an idle connection

Hypothesis A : any use of IsAlive() can run the event loop.

Assuming this is true, I don't understand how we can safely search for an idle connection at all assuming the search uses CanReuse()->IsAlive()

for (int i=0; i< idleConns.Count(); i++)
 if (idleConns[i].CanReuse()) {
  dispatchOn(idleConns[i]);
  break;
}

That's not any safer if isAlive() can run the event loop - the event loop can do anything, including modifying idleConns. Changing mPending is the effect we are seeing, but it's not the only possibility. Indeed, the pseudo code above could theoretically return TRUE for canreuse of some conn and at the same time have removed it from the idle list (and have put some other transaction on it).

It can probably even deadlock the ssl connection if isalive() runs the event loop which calls isalive() on the same ssl conn again maybe due to something like a transaction cancel.

If Hypothesis A is true, I don't see any real fix other than option 1.

Hypothesis B: the only uses of IsAlive() that really cause the event loop to run in practice are checks on idle SSL connections that have not yet completed the initial SSL handshake. In practice these are idle SSL connections that have not yet had an HTTP transaction run on them.

That's what is really different about the syn retry code - because we have separated the concept of matching connections and transactions from the process of starting connections these connections are in the idle pool looking for a pending transaction, when there never was a "used 0 times" connection in the idle pool before. Comment 8 is clearly showing IsAlive() being tested on such a connection causing the event loop to run when stuff is needed from the main thread to complete the handshake.

If that is the case, the direct thing to do is just work around the problem in IsAlive().. identify connections that meet that criteria (very easy to do) and just skip the mSocketTransport->IsAlive() test for them, making the assumption that they are alive. The test is basically an optimization anyhow - Connections that have not been used before are timed out quite quickly anyhow (5 seconds) and if they should happen to be dead when we use them then they will generate errors that we already cope with as a "ships passing in the night" race detection anyhow (i.e. the FIN could have been sent by the server but not read by us at the time we run IsAlive()).. landing bug 604796 would further help (it would reduce the cases of the isalive() check at dispatch time identifying a closed connection to be a very rare event).

I like this better than the long series of steps of option 3. I think this is especially true because the way the code is structured now is doing the right thing assuming isalive() doesn't run the event loop. And there are quite a few other pre-existing cases in netwerk that make that assumption.

I suspect Hypothesis B gets us back to the danger level of Firefox-4 (i.e. the danger appears theoretical.) and therefore is a realistic workaround while we work on option 1.
> 1) Change nsNSSSocketInfo::EnsureDocShellDependentStuffKnown to not proxy and
> say that PR_Read must not spin the event loop (out of scope of this bug, but
> I'd like to do it later anyway)
> 

I think this is a "must do", even if we do something else faster. Honza, shall I open the bug or would you like to? :)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1304564530.1304568333.21811.gz&fulltext=1
Summary: ASSERTION: Invalid start index: 'count == 0 || start < Length()' browser_bug581076.js browser_searching.js → ASSERTION: Invalid start index: 'count == 0 || start < Length()' browser_bug581076.js browser_searching.js browser_bug618502.js
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1304603564.1304607399.25958.gz&fulltext=1#err0
Summary: ASSERTION: Invalid start index: 'count == 0 || start < Length()' browser_bug581076.js browser_searching.js browser_bug618502.js → ASSERTION: Invalid start index: 'count == 0 || start < Length()' browser_bug581076.js browser_searching.js browser_bug618502.js browser_uninstalling.js
(In reply to comment #15)

I agree with your hypothesis B.  As I understand you suggest to remove the IsAlive check for connections that has never been used?  If so, please go forward and create a patch.  I'll review it quickly.

(In reply to comment #16)
> > 1) Change nsNSSSocketInfo::EnsureDocShellDependentStuffKnown to not proxy 
> I think this is a "must do", even if we do something else faster. Honza, shall
> I open the bug or would you like to? :)

Let me think of this first a bit.  Then I'll submit the bug my self.  Thanks.
(In reply to comment #20)
> (In reply to comment #15)
> 
> I agree with your hypothesis B.  As I understand you suggest to remove the
> IsAlive check for connections that has never been used?  

Yes - but just for SSL.

> If so, please go
> forward and create a patch.  I'll review it quickly.

I will do it tonight. Thanks!
patch as discussed from comment 20.

I have run this patch successfully, along with the assertions added to reproduce the trace in comment 8, on an OS X debug build for 7 iterations on try with mochitest-other. Without the fix that combination was generating a failure for me about 1 out of 3 times.

I have also sent the patch off for a full try run - results are pending.
Attachment #530486 - Flags: review?(honzab.moz)
note the new stack in comment 18 is a different stack, but is consistent with hypothesis B. In that stack processPending isn't being used at all - the issue is with calling canreuse->isalive() on an idle conn during the prune idle connections callback which again leads to the proxying in ensuredocshell(). So this makes me fairly confident in the approach of the patch in comment 22.
Attachment #530486 - Flags: review?(honzab.moz) → review-
(In reply to comment #22)
> Created attachment 530486 [details] [diff] [review] [review]
> skip isalive for unused ssl connections v1
> 
> I have also sent the patch off for a full try run - results are pending.

That patch makes /tests/security/ssl/stricttransportsecurity/test_sts_privatebrowsing.html fail fairly often in mochitest-5..

I suspect it is probably because I used mIsReused which only gets set on transaction completion instead of on dispatch.. I guess I need to bite the bullet and add another flag. I'll experiment with try and update the patch.
(In reply to comment #26)
> mfinkle%mozilla.com
> http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Aurora/1304767144.
> 1304770073.24827.gz
> Rev3 MacOSX Leopard 10.5.8 mozilla-aurora debug test mochitest-other on
> 2011/05/07 04:19:04
>
I don't think that this incident is due to this bug. The test case is the same, but nothing about the stacks is similar.
honza, I'm sorry about the delay here - but I was caught up in some other failures I saw when running this test case again and again and again :)

I'll post the fixes as two patches.
This is basically the approach agreed in comment 21.

However, totally skipping IsAlive() for non-dispatched SSL connections wasn't quite sufficient. That regressed security/ssl/stricttransportsecurity/test_sts_privatebrowsing.html in mochitest-5 largely for the same root causes as found in bug 480619 (see comment 45 in that bug)

This can be pretty easily accommodated by instead of skipping isalive() changing the call to msg_rcv(PEEK) inside it to simply pr_poll() for the non-dispatched ssl connection. That will give us the necessary checks on the SSL status without actually worrying about calling recv and running the event loop.
Attachment #530486 - Attachment is obsolete: true
Attachment #531677 - Flags: review?(honzab.moz)
While running the tests involved in this bug I sometimes would get a failure in browser/components/privatebrowsing/test/browser/browser_privatebrowsing_certexceptionsui.js from browser-chrome.

The root cause of that was out of date security callbacks that were not reset when an extra idle persistent connection was used for a different transaction than the one that initiated it.

The fix is simple.
Attachment #531680 - Flags: review?(honzab.moz)
the try server gently reminded me that I need to proxy release the callback reference when it changes just as a I do in the dtor. That's the only difference. Try is still running with the update, but things look good so far.
Attachment #531680 - Attachment is obsolete: true
Attachment #531680 - Flags: review?(honzab.moz)
Attachment #531793 - Flags: review?(honzab.moz)
(In reply to comment #31)
> Try is still running with the update, but things look good so
> far.

Everything is green so we should expedite both these fixes so they get at least a little nightly testing before aurora-6.

There is a whitespace change to one line in the second patch - I'll fix that before committing anything.
I just tacked to biesi:

- we should use PR_Poll always instead of PR_Read
- this can also tell us there are still pending data to read (PR_POLL_READ on output)
- it is OK to change the IsAlive signature

So, we could also fix bug 631801 with it!

We definitely need good tests for changes like these to be sure we do not break the transaction restart logic...
Comment on attachment 531677 [details] [diff] [review]
workaround-isalive-runs-eventloop.2

Review of attachment 531677 [details] [diff] [review]:
-----------------------------------------------------------------

Good idea with PR_Poll.

::: netwerk/base/public/nsISocketTransport.idl
@@ +161,5 @@
>      /**
> +     * This flag indicates that IsAlive() method should not
> +     * read, even non-destructively, from the network. Some protocol
> +     * implementations may deadlock if called without this.
> +     */

I don't see a reason for a new flag here.  You set it only for the time nsSocketTransport::IsAlive is called and I don't see another use to it.  The interface is not frozen and I got confirmed from biesi we can modify it.  Also, this patch is a workaround for very specific issue with a new SSL socket PR_Read event loop re-spin that has and needs to be fixed in near future.  We would then remove this flag again.

Let's just pass this flag as an argument to IsAlive().

::: netwerk/base/src/nsSocketTransport2.cpp
@@ +1847,5 @@
>      }
>  
>      // XXX do some idle-time based checks??
>  
> +    if (mConnectionFlags & nsISocketTransport::PASSIVE_ISALIVE) {

As discussed above, let's turn this to a method argument.

@@ +1853,4 @@
>  
> +        PRPollDesc desc;
> +        desc.fd = mFD;
> +        desc.in_flags = PR_POLL_EXCEPT | PR_POLL_READ;

Is it really needed to poll also for READ when you want only EXCEPT and ERROR on out?

@@ +1856,5 @@
> +        desc.in_flags = PR_POLL_EXCEPT | PR_POLL_READ;
> +        desc.out_flags = 0;
> +        if ((PR_Poll(&desc, 1, 0) == 1) &&
> +            (desc.out_flags & (PR_POLL_EXCEPT | PR_POLL_ERR))) {
> +            *result = PR_FALSE;

You probably also want to check for PR_POLL_NVAL.  Not sure about PR_POLL_HUP but probably as well.

Is anywhere defined what happens when you poll on a socket that is already being polled (with a timeout) on a different thread?  This might happen after we land your patch for bug 604796 and this code gets invoked on a thread different then socket transport thread.  Here is a good place to add ABORT_IF_FALSE with check for the socket thread.
Attachment #531677 - Flags: review?(honzab.moz)
(In reply to comment #34)
> Comment on attachment 531677 [details] [diff] [review] [review]
> workaround-isalive-runs-eventloop.2

> Is it really needed to poll also for READ when you want only EXCEPT and
> ERROR on out?
> 

yes, READ is required. I started without it but the poll implementation optimizes the relevant ssl health checks away if there are no fd's queried for read or write.

> Is anywhere defined what happens when you poll on a socket that is already
> being polled (with a timeout) on a different thread? 

poll, being level triggered, is going to be fine with that. The implementation generally just looks at the recv queue in the kernel and sets the poll bit based on that. I've done things like that in the past - as long as everything is non blocking we'll be ok.
Updated based on review comments - this one now changes the IsAlive() interface, and comes with an sr?
Attachment #532054 - Flags: superreview?(cbiesinger)
Attachment #532054 - Flags: review?(honzab.moz)
Attachment #531677 - Attachment is obsolete: true
Comment on attachment 532054 [details] [diff] [review]
workaround-isalive-runs-eventloop.3

Review of attachment 532054 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/public/nsISocketTransport.idl
@@ +102,2 @@
>       */
> +    boolean isAlive(in boolean aPassive);

Please don't add this argument. Just always use PR_Poll instead.

::: netwerk/base/src/nsSocketTransport2.cpp
@@ +1848,5 @@
>  
>      // XXX do some idle-time based checks??
>  
> +    if (aPassive) {
> +        *result = PR_TRUE;                        /* presume true */

I'm not sure this should return true if PR_Poll returns -1

@@ +1853,4 @@
>  
> +        PRPollDesc desc;
> +        desc.fd = mFD;
> +        desc.in_flags = PR_POLL_READ | PR_POLL_EXCEPT |

Why pass PR_POLL_READ? You're not using that flag.

@@ +1853,5 @@
>  
> +        PRPollDesc desc;
> +        desc.fd = mFD;
> +        desc.in_flags = PR_POLL_READ | PR_POLL_EXCEPT |
> +            PR_POLL_HUP | PR_POLL_NVAL;

NVAL is not used in in_flags according to https://developer.mozilla.org/en/PR_Poll

Same for HUP:
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/include/prio.h#1939
Attachment #532054 - Flags: superreview?(cbiesinger) → superreview-
(In reply to comment #37)
> Comment on attachment 532054 [details] [diff] [review] [review]
> workaround-isalive-runs-eventloop.3
> 

> Please don't add this argument. Just always use PR_Poll instead.

Doing that is a regression for the non-ssl case. recv is testing for EOF - a condition that poll() cannot detect.

> 
> Why pass PR_POLL_READ? You're not using that flag.

as noted in comment 35, it is necessary in order to get the necessary ssl checks that were added as part of bug 480619 (see comment 45 in that bug). And in a sense I really am using that flag - I am logically testing the read side of that socket for exceptions and errors. I just don't want to actually do a read.
> > +    if (aPassive) {
> > +        *result = PR_TRUE;                        /* presume true */
> 
> I'm not sure this should return true if PR_Poll returns -1
> 
my reasoning is that a failure of the PR_Poll call doesn't tell me anything about the status of a particular FD - so it seems unwise to asusme the FD is dead because of it. Any FD specific failure is returned in the PRFileDesc.
update patch based on sr- in comment 37, modulo my explanations in comment 38 and comment 39
Attachment #532054 - Attachment is obsolete: true
Attachment #532054 - Flags: review?(honzab.moz)
Attachment #532082 - Flags: superreview?(cbiesinger)
Attachment #532082 - Flags: review?(honzab.moz)
try server is green with both of these patches applied.
Comment on attachment 532082 [details] [diff] [review]
workaround-isalive-runs-eventloop.4

Review of attachment 532082 [details] [diff] [review]:
-----------------------------------------------------------------

I agree with Patrick's arguments here.

r=honzab with a small change:

::: netwerk/base/src/nsSocketTransport2.cpp
@@ +1858,2 @@
>  
> +        if ((PR_Poll(&desc, 1, 0) == 1) &&

So, it really looks like on some platforms poll (better say select() on windows) is not guaranteed to be accessible concurrently from different treads. [1]

Please add NS_ABORT_IF_FALSE with check for the socket thread before we call PR_Poll.

[1] http://tangentsoft.net/wskfaq/intermediate.html, section 3.12
Attachment #532082 - Flags: review?(honzab.moz) → review+
Comment on attachment 531793 [details] [diff] [review]
out of date security callbacks v2

Review of attachment 531793 [details] [diff] [review]:
-----------------------------------------------------------------

This is an important change and needs to be moved to a separate bug and have at least some STR on the record.

r=honzab

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +166,4 @@
>      // take ownership of the transaction
>      mTransaction = trans;
>      mIsActivated = PR_TRUE;
> +    

Just remove this whitespace change, but you have already mentioned that, I think.
Attachment #531793 - Flags: review?(honzab.moz) → review+
Attachment #531793 - Attachment is obsolete: true
I just pushed a fix on behalf of Patrick over on bug 656989 comment 3, but then noticed that the commit message had the wrong bug number (this bug).

So, anyone looking for the bug associated with this cset...
  http://hg.mozilla.org/projects/cedar/rev/6d0e02e42e34
...and arriving at this bug will want to hop over to bug 656989 comment 3.
(In reply to comment #38)
> > Why pass PR_POLL_READ? You're not using that flag.
> 
> as noted in comment 35, it is necessary in order to get the necessary ssl
> checks that were added as part of bug 480619 (see comment 45 in that bug).

Shouldn't the SSL poll implementation be fixed, then?

> And in a sense I really am using that flag - I am logically testing the read
> side of that socket for exceptions and errors. I just don't want to actually
> do a read.

That's not how poll() works... the read flag means "is data available for reading". it does not mean what you're written.
Comment on attachment 532082 [details] [diff] [review]
workaround-isalive-runs-eventloop.4

So in the interest of getting a topcrasher fixed I'll mark sr+ but I am not happy with having the read flag here. It is very confusing to readers. At least:

- File a bug to fix the poll implementation
- Add a comment that it is needed because our SSL poll implementation is bad, referencing that bug number.
Attachment #532082 - Flags: superreview?(cbiesinger) → superreview+
(In reply to comment #47)

> - File a bug to fix the poll implementation

bug 658138
Blocks: 651635
OS: Mac OS X → All
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test mochitest-other on 2011/05/19 06:49:10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1305812950.1305816753.13101.gz
Summary: ASSERTION: Invalid start index: 'count == 0 || start < Length()' browser_bug581076.js browser_searching.js browser_bug618502.js browser_uninstalling.js → ASSERTION: Invalid start index: 'count == 0 || start < Length()' browser_bug581076.js browser_searching.js browser_bug618502.js browser_uninstalling.js browser_globalinformations.js
http://hg.mozilla.org/mozilla-central/rev/f1d79c22fd71
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
This seems to have caused a Tp4 regression on Win XP and 7.

Regression :( Tp4 (%CPU) increase 8.32% on XP Firefox
-----------------------------------------------------
    Previous: avg 52.138 stddev 0.987 of 30 runs up to revision f594c196fac7
    New     : avg 56.477 stddev 0.446 of 5 runs since revision f1d79c22fd71
    Change  : +4.339 (8.32% / z=4.398)
    Graph   : http://mzl.la/ljrnVR

Changeset range: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f594c196fac7&tochange=f1d79c22fd71

Regression :( Tp4 (%CPU) increase 8.27% on Win7 Firefox
-------------------------------------------------------
    Previous: avg 60.215 stddev 1.474 of 30 runs up to revision f594c196fac7
    New     : avg 65.194 stddev 0.390 of 5 runs since revision f1d79c22fd71
    Change  : +4.979 (8.27% / z=3.378)
    Graph   : http://mzl.la/lduVoO

Changeset range: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f594c196fac7&tochange=f1d79c22fd71
Assignee: honzab.moz → mcmanus
Depends on: 658498
This probably broke mailnews which also uses IsAlive
Blocks: 658503
(In reply to comment #52)
> This seems to have caused a Tp4 regression on Win XP and 7.
> 

it is possible that without this patch the code was doing some speculative SSL work which was speeding things up - but it is exactly that work that was causing the deadlock. And that work wasn't being done in FF 4.

IN some ways it is the flip side to:

http://hg.mozilla.org/mozilla-central/rev/9b12c34434f4

which caused an improvement in the same area.

It isn't clear to me at all why win tp4 is so sensitive to connection rates (there is another bug that shows that too) which should be very low latency in the test harness, but I think its pretty clear that something in there was hitting these bugs (and this definitely is a bug that can cause timeouts, early errors, or hangs - this code is definitely needed).
(In reply to comment #54)
> (In reply to comment #52)
> > This seems to have caused a Tp4 regression on Win XP and 7.
> > 
> 
> it is possible that without this patch the code was doing some speculative
> SSL work which was speeding things up - but it is exactly that work that was
> causing the deadlock. And that work wasn't being done in FF 4.

If I understand the report correctly (it's quite possible that I don't), the regression isn't about time but CPU load.
Indeed.  The Tp4 regression report shows that on Windows only CPU usage went up 8% or so (while overall time remained unchanged).
* that was my mistake confusing the cpu and time metrics. The other changeset I referenced resulted in a 4% improvement in time. (it is also part of the feature)

* the point of the feature, for which this patch is a incremental fix, is to trade a bit more work for improvements in time. Its only in SSL that this will turn out to be significant cpu, and SSL also has the most to gain (because of higher latency handshakes). Our tests, being localhost based, are going to undermeasure the improvements in time because those will scale up with rtt. So in the grand scheme of things this is consistent - though it is weird that it shows up in this patch.

* something about tp4 on windows only seems extremely sensitive to connection changes - that bears investigation. We want to separate those effects out of the normal tp4 and get them into the network perf test nick hurley is working on  (which should also be able to quantify the performance improvements of changes like this over various types of networks).

* the only real change this bugfix makes is that for some early states of an SSL connect isalive() is tested with pr_poll instead of going right to a recv(MSG_PEEK). I would speculate that poll isn't caching some bit of computation that a subsequent recv() (and perhaps other poll() calls) redoes. I think the right thing to do is file a follow-on bug to investigate that and maybe I can get help from brian smith or another ssl guru in digging into it.
Filing a followup or two sounds good.  Please do!
Note that mozilla-central will be merged to mozilla-aurora next week for Firefox 6. So as long as you don't fully understood what's going on, you might want to consider a backout too.
Blocks: 658580
Depends on: 658799
No longer blocks: 658580
Depends on: 661995
This changes a public interface that is being used by at least a couple of add-ons. Adding dev-doc-needed. It should be included in Firefox 6 for Developers.
Keywords: dev-doc-needed
(In reply to comment #60)
> This changes a public interface that is being used by at least a couple of
> add-ons. Adding dev-doc-needed. It should be included in Firefox 6 for
> Developers.

This might soon be backed out.  See bug 658580.
Making the new parameter optional would mean this had no impact on add-on compatibility.
I see that this change has been backed out (#658580) in FF7 but will it be removed from the FF6 beta or will I have to write a workaround for my add-on to achieve FF6 compatibility?
FF6 has been restored to the original api. (everything matches now).
Depends on: 658580
Is there any word on this change going back in at some point? It's on my doc radar but as far as I can tell is not actually in the tree right now.
(In reply to Eric Shepherd [:sheppy] from comment #65)
> Is there any word on this change going back in at some point?

no, it was fixed a different way.
So does that mean this no longer impacts developer documentation, or is there something I need to know?
(In reply to Eric Shepherd [:sheppy] from comment #67)
> So does that mean this no longer impacts developer documentation, or is
> there something I need to know?

From the pov of actual releases there has been no interface change. So as long as you didn't make an update for the beta stream (which did have a change that got reverted) there is nothing to do.
Whiteboard: [orange][test which aborts the suite] → [test which aborts the suite]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: