Closed Bug 673438 Opened 8 years ago Closed Last year

crash [@ nsMsgProtocol::CloseSocket()] | [@ nsPop3Protocol::ProcessProtocolState] caused by code changes to enable detecting the server dropping the connection on an auth failure

Categories

(MailNews Core :: Networking: POP, defect, critical)

x86
All
defect
Not set
critical

Tracking

(thunderbird11-, thunderbird12-, thunderbird13-, thunderbird14-)

RESOLVED WORKSFORME
Tracking Status
thunderbird11 - ---
thunderbird12 - ---
thunderbird13 - ---
thunderbird14 - ---

People

(Reporter: wsmwk, Unassigned)

References

Details

(4 keywords, Whiteboard: [regression:tb7?][not reproducible][bug 701533 more common])

Crash Data

Attachments

(2 files)

crash [@ nsMsgProtocol::CloseSocket()] |  [@ nsPop3Protocol::ProcessProtocolState]
also @0x0 | nsMsgProtocol::CloseSocket()


bf9fa1c7-0b9c-4df1-8fa2-a07292110606 (dgbole) TB7.0a
EXCEPTION_ACCESS_VIOLATION_EXEC
0x5c32700
0		@0x5c32700	
1	xul.dll	nsMsgProtocol::CloseSocket	mailnews/base/util/nsMsgProtocol.cpp:343
2	xul.dll	nsPop3Protocol::ProcessProtocolState	mailnews/local/src/nsPop3Protocol.cpp:4159
3	xul.dll	nsMsgProtocol::OnDataAvailable	mailnews/base/util/nsMsgProtocol.cpp:387
4	xul.dll	nsInputStreamPump::OnStateTransfer	netwerk/base/src/nsInputStreamPump.cpp:510
5	xul.dll	nsInputStreamPump::OnInputStreamReady	netwerk/base/src/nsInputStreamPump.cpp:400
6	xul.dll	nsOutputStreamReadyEvent::Run	xpcom/io/nsStreamUtils.cpp:114
7	xul.dll	nsThread::ProcessNextEvent	xpcom/threads/nsThread.cpp:618
8	xul.dll	NS_ProcessNextEvent_P	objdir-tb/mozilla/xpcom/build/nsThreadUtils.cpp:245
9	xul.dll	mozilla::ipc::MessagePump::Run	ipc/glue/MessagePump.cpp:110 


There is also a variation that does not have pop3 in the stack
bp-84a5678c-9ab4-4e32-99cb-69d7b2110705
EXCEPTION_ILLEGAL_INSTRUCTION
0xaf9d88
0	thunderbird.exe	nsMsgProtocol::CloseSocket	mailnews/base/util/nsMsgProtocol.cpp:310
1	thunderbird.exe	nsMailboxProtocol::CloseSocket	mailnews/local/src/nsMailboxProtocol.cpp:728
2	thunderbird.exe	nsMsgProtocol::OnStopRequest	mailnews/base/util/nsMsgProtocol.cpp:467
3	thunderbird.exe	nsMailboxProtocol::OnStopRequest	mailnews/local/src/nsMailboxProtocol.cpp:381
4	thunderbird.exe	nsInputStreamPump::OnStateStop	netwerk/base/src/nsInputStreamPump.cpp:578
5	thunderbird.exe	nsInputStreamPump::OnInputStreamReady	netwerk/base/src/nsInputStreamPump.cpp:403
6	xpcom_core.dll	nsOutputStreamReadyEvent::Run	xpcom/io/nsStreamUtils.cpp:112
7	xpcom_core.dll	nsThread::ProcessNextEvent	xpcom/threads/nsThread.cpp:527
regression, or something like, because 99% of crashes are v6 and newer.  Echoed by this comment in bp-2807f25f-1f64-49e8-851c-e497a2110818
"this is the 4th or 5th time Thunderbird has crashed since it's most recent update ... Prior to this update, Thunderbird had never crashed on my system."

EXCEPTION_ACCESS_VIOLATION_EXEC
0x200020
0		@0x200020	
1	xul.dll	nsMsgProtocol::CloseSocket	mailnews/base/util/nsMsgProtocol.cpp:343
2	xul.dll	nsPop3Protocol::ProcessProtocolState	mailnews/local/src/nsPop3Protocol.cpp:4159
3	xul.dll	nsMsgProtocol::OnDataAvailable	mailnews/base/util/nsMsgProtocol.cpp:387
4	xul.dll	nsInputStreamPump::OnStateTransfer	netwerk/base/src/nsInputStreamPump.cpp:510
5	xul.dll	nsInputStreamPump::OnInputStreamReady	netwerk/base/src/nsInputStreamPump.cpp:400
Keywords: regression, topcrash
a pop3 protocol log might be useful here. There's no obvious fix (e.g., a null check). m_transport should be valid, so strans should also be valid.
requested logs from two reporters
From the log sent by a reporter, it looks like the OnStopRequest is coming after we've destroyed the pop3 sink, in the crash case, but not in the non-crashing case.  So, in general, it seems to be coming later in the crash case. I'll investigate a bit more.
I suspect the regression might have to do with the code changes to enable detecting the server dropping the connection on an auth failure. From the protocol logs, it seems as if the failure case is when the server does not close the connection promptly in response to us sending the QUIT command. I'll see if I can duplicate that with the fake server.
Hmm, fakeserver never closes the socket, but relies on the client to do so. I'll try tweaking it to close the socket, perhaps after a slight delay.
Hmm, looking at the logs more closely, we're going from the send quit state to the error done state in the failing case. We're not even getting to the wait for response case, which I think means the server is dropping the connection immediately. Or, the SendData command is failing, when we send QUIT.
david, do we need more from bachastain?
I've requested a try server build with a little bit of extra pop3 logging:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/bienvenu@nventure.com-d0fc8b936460

It's basically our TB 7 beta build with a bit more logging to verify that SendData is failing. If someone who can reproduce this issue can try one of these builds when it comes out sometime later today (I'll update the bug when the build is ready), and log the failure, that would be helpful, thx.
Of course, the windows try server build failed - requested a new one - http://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/bienvenu@nventure.com-100eeec6abea - will update bug if that succeeds.
Attached patch add more loggingSplinter Review
I'd like to get this onto the trunk and into aurora builds since try server is failing me on windows...
Assignee: nobody → dbienvenu
Attachment #562761 - Flags: review?(mbanner)
Attachment #562761 - Flags: review?(mbanner) → review+
tomorrow's nightly trunk build should have the extra logging - http://hg.mozilla.org/comm-central/rev/3e550b22c449
bienvenu, do we want a separate bug for this? ... 

on IRC you stated "we're pumping events from the nsPop3Protocol::Error function, which closes the socket out from under us" regarding bp-bab59133-4143-4e6f-9535-30e922110831 

I've not found any other examples of this 47 line long stack. However, the top frames do roughly match up to a much shorter stack in the likes of bp-54433f4f-73b2-4e53-9ec8-ce33a2111017 and bp-2f00801d-b02c-4736-8bc5-89d1f2111019  (this stack is a low percentage for this signature - only 2 of 12 in a small sample from today).
so 80% of the stacks are like comment 1 ("the regression comment"), where frame 2 is
nsPop3Protocol::ProcessProtocolState 	mailnews/local/src/nsPop3Protocol.cpp:4160 

rather than 
nsPop3Protocol::OnStopRequest 	mailnews/local/src/nsPop3Protocol.cpp:999
in the case of comment 13. So comment 13 perhaps does need its own bug report
(In reply to Wayne Mery (:wsmwk) from comment #1)
> regression, or something like, because 99% of crashes are v6 and newer. 

fallout from Bug 617721??
(startup) crash [@ nsPop3Protocol::LoadUrl(nsIURI*, nsISupports*)] and [@ nsPop3Protocol::LoadUrl] (Mac)
(In reply to Wayne Mery (:wsmwk) from comment #15)

> fallout from Bug 617721??
> (startup) crash [@ nsPop3Protocol::LoadUrl(nsIURI*, nsISupports*)] and [@
> nsPop3Protocol::LoadUrl] (Mac)

Highly doubtful. That only affects the start of the loading of the url, and only in certain error cases. The CloseSocket crash is after the whole url has run.
topcrash still.
nicking this issue might go a long way toward fixing other topcrash signatures

bp-6a08de55-dc15-4919-a796-0f3062111015 (sione, has gone back to version 3, plus a refreshed PC)
bp-bab59133-4143-4e6f-9535-30e922110831 (matthew, crashes stopped in v7)
bp-??  (dmthagar) I've asked for a protocol log
bp-0257b213-1839-4692-a15d-682562110913 (André, waiting for feedback)
bp-7d8b4cea-c474-4d61-947c-05e222120107 (rorscan, asked for protocol log)

Happens also for trunk users bp-c6c0f0e4-5a38-407e-bdd7-e64cd2120102 version 10

As do other signatures for pop users, like rorscan who also sees 
mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast
arena_dalloc_small | arena_dalloc | free | NS_DestroyXPTCallStub_P
@0x0 | nsASDOMWindowEnumerator::GetNext(nsISupports**)
mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast
nsASDOMWindowEnumerator::GetNext(nsISupports**) 
in 
bp-5f240d4d-9133-43f1-8c9f-dce192111231 12/31/201110:08
bp-15704f27-8fc0-4a81-be39-96f9a2111218 12/18/20115:22
bp-786b43c5-f325-4af1-8ba5-5e72f2111205 12/5/20118:30
bp-07ad904c-04f8-4d7b-a41f-e468a2111112 11/12/20117:38
Whiteboard: [regression in v7?]
Blocks: 701533
Blocks: 692981
Duplicate of this bug: 723882
This is at risk for getting fixed in version 11.

#3 crash for version 10, based on combined crash count of all signatures
Crash Signature: [@ nsMsgProtocol::CloseSocket()] [@ @0x0 | nsMsgProtocol::CloseSocket()] → [@ nsMsgProtocol::CloseSocket()] [@ @0x0 | nsMsgProtocol::CloseSocket()] [@ nsRefPtr<nsTypedSelection>::~nsRefPtr<nsTypedSelection>() | `anonymous namespace''::ServerSocketListenerProxy::Release()]
This didn't make 11, but we'll see if there's anything we can do for 12.
Flags: wanted-thunderbird+
Duplicate of this bug: 692981
Duplicate of this bug: 676826
per bienvenu "I'd need is access to a server & setup that caused this bug because I can't reproduce it here. "

no testcase from any reporters yet :(
Keywords: testcase-wanted
I've requested a try server build with this patch, which adds a bit of logging when the protocol object is destroyed, and moves the CloseSocket call earlier in the sequence, to see if that has an effect on the crash location. I'll post a link to the try server build (which will be off the beta branch) when it's available. This is probably not a fix per se, because if the protocol object is getting deleted out from under itself, there's some sort of ref-counting or ownership model problem. But a protocol log from someone seeing the crash from a build with this patch would be very helpful.
builds should show up here - http://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/bienvenu@nventure.com-2524c134a535/ - only linux ones are there so far.
If we can get some testing on the try server builds, it'd be really useful.
(In reply to Mark Banner (:standard8) from comment #26)
> If we can get some testing on the try server builds, it'd be really useful.

unfortunately out of a ffew dozen contacts over the past few months, mostly people either can't reproduce at all or arent' responding.  

For Sione, who is the most prolific crasher, David needs access to his server - which isn't happening.

I'm about ready to give up on this hunt. Perhaps it would be best to get debug info actually IN the dumps reported to crash-stats, so we don't have to do these intensive goose chases.
(In reply to Wayne Mery (:wsmwk) from comment #27)
> (In reply to Mark Banner (:standard8) from comment #26)
> > If we can get some testing on the try server builds, it'd be really useful.
> 
> unfortunately out of a ffew dozen contacts over the past few months, mostly
> people either can't reproduce at all or arent' responding.

Can't reproduce with the try build? Or can't reproduce in general?  
> 
> For Sione, who is the most prolific crasher, David needs access to his
> server - which isn't happening.
Having him run the try server build would be fine.
> 
> I'm about ready to give up on this hunt. Perhaps it would be best to get
> debug info actually IN the dumps reported to crash-stats, so we don't have
> to do these intensive goose chases.

I'm happy to land the patch as is and see if it helps. But I don't know how to add logging information to dumps reported to crash-stats.
(In reply to David :Bienvenu from comment #28)
> (In reply to Wayne Mery (:wsmwk) from comment #27)
> (In reply to Mark
> Banner (:standard8) from comment #26)
> > If we can get some testing on the
> try server builds, it'd be really useful.
> 
> unfortunately out of a ffew
> dozen contacts over the past few months, mostly
> people either can't
> reproduce at all or arent' responding.

Can't reproduce with the try build?
> Or can't reproduce in general?  
> 
> For Sione, who is the most prolific
> crasher, David needs access to his
> server - which isn't happening.
Having
> him run the try server build would be fine.
> 
> I'm about ready to give up
> on this hunt. Perhaps it would be best to get
> debug info actually IN the
> dumps reported to crash-stats, so we don't have
> to do these intensive
> goose chases.

I'm happy to land the patch as is and see if it helps. But I
> don't know how to add logging information to dumps reported to crash-stats.
(Ignore comment 29 please )

(In reply to David :Bienvenu from comment #28)
> (In reply to Wayne Mery (:wsmwk) from comment #27)
> > (In reply to Mark Banner (:standard8) from comment #26)
> > > If we can get some testing on the try server builds, it'd be really useful.
> > 
> > unfortunately out of a ffew dozen contacts over the past few months, mostly
> > people either can't reproduce at all or arent' responding.
> 
> Can't reproduce with the try build? Or can't reproduce in general?  

The later. And for those few that I contacted specifically about the last try build, I think only 2 of ~10 people replied. Sione being one.


> > For Sione, who is the most prolific crasher, David needs access to his
> > server - which isn't happening.
> Having him run the try server build would be fine.
> > 
> > I'm about ready to give up on this hunt. Perhaps it would be best to get
> > debug info actually IN the dumps reported to crash-stats, so we don't have
> > to do these intensive goose chases.
> 
> I'm happy to land the patch as is and see if it helps.

Closing in on a year anniversary, we probably do have to take our best shot at it. However, it will not be sufficient to have it only in the daily build. The crash rate is too low - only 7 crashes in the past 4 months for all betas and alphas (v12/13/14)[1]. I think we need to drive the patch into the beta. Then, if it doesn't break anything, it will still take us 2-3 weeks to know if it had any impact.

And if that doesn't work, we might blast a message via crash-stats to all the crashers that we need volunteers to work with the developer in devising a fix.


> But I don't know how to add logging information to dumps reported to crash-stats.

yeah, me neither. I wonder for how many issues an internal log might be helpful??
a separate discussion perhaps?


[1] 4 weeks of non-production crashes - https://crash-stats.mozilla.com/report/list?product=Thunderbird&version=Thunderbird%3A14.0a1&version=Thunderbird%3A13.0a2&version=Thunderbird%3A12.0b4&version=Thunderbird%3A12.0b3&version=Thunderbird%3A12.0b2&version=Thunderbird%3A12.0b1&query_search=signature&query_type=exact&query=nsMsgProtocol%3A%3ACloseSocket%28%29&reason_type=contains&date=04%2F18%2F2012%2014%3A57%3A55&range_value=4&range_unit=weeks&hang_type=any&process_type=any&do_query=1&signature=nsMsgProtocol%3A%3ACloseSocket%28%29
In general, the logging information is way too private to include in crash-stats. We'd have to sanitize it to the extent that it's hard to know if it would be useful.
one v13 crash in 4 weeks is bp-ff44a4a4-2c33-48e3-ad0c-a03062120409 - attempting to contact reporter.
hmm, if bug 701533 is the same issue, and if we give the patch a week on v13 alpha (soon to be beta) then we can probably assess it's effectiveness. Because in 4 weeks we had 5 v14 crashes and 10 v13 crashes.

can we get it into the v13 beta?
Comment on attachment 613031 [details] [diff] [review]
a bit more logging, move CloseSocket call - checked in.

The idea behind moving the CloseSocket call is that perhaps the code before it might have been causing the pop3 protocol object's ref count to go to zero and getting deleted. This isn't a proper fix, but if the crashes go down or move, that'll be a clue. And logging the destructor will also tell us if that's happening.
Attachment #613031 - Flags: review?(mbanner)
Attachment #613031 - Flags: review?(mbanner) → review+
extra logging and moving of close socket landed - http://hg.mozilla.org/comm-central/rev/ff9100b84368
Status: NEW → ASSIGNED
Attachment #613031 - Attachment description: a bit more logging, move CloseSocket call → a bit more logging, move CloseSocket call - checked in.
Still tracking this, we'll have to watch how the latest changes are going.
Taking off tracking for now. Might be good to see if the changes landed previously have affected anything.
sadly, no one can reproduce.
Whiteboard: [regression in v7?] → [regression in v7?][not reproducible for most]
Blocks: 714341
bug 701533 is more common. and is topcrash also for 10.0.6esr
Whiteboard: [regression in v7?][not reproducible for most] → [regression in v7?][not reproducible for most][bug 701533 more common]
currently attempting to get data via Larry who reported bp-f9ef40ba-f082-4fd6-8c69-286702120721 and other crashes
Larry is no longer able to reproduce


(In reply to Wayne Mery (:wsmwk) from comment #40)
> currently attempting to get data via Larry who reported
> bp-f9ef40ba-f082-4fd6-8c69-286702120721 and other crashes

perhaps a related crash is Bug 637338, which is another signature Larry was hitting
With combined signatures, it's #42 top crasher in TB 17.0.2 so not a top crasher.
Keywords: topcrash
Blocks: 637338
Whiteboard: [regression in v7?][not reproducible for most][bug 701533 more common] → [regression:tb7?][not reproducible][bug 701533 more common]
See Also: → 902158
Assignee: mozilla → nobody
Status: ASSIGNED → NEW
Removing myslef on all the bugs I'm cced on. Please NI me if you need something on MailNews Core bugs from me.
Crash Signature: [@ nsMsgProtocol::CloseSocket()] [@ @0x0 | nsMsgProtocol::CloseSocket()] [@ nsRefPtr<nsTypedSelection>::~nsRefPtr<nsTypedSelection>() | `anonymous namespace''::ServerSocketListenerProxy::Release()] → [@ nsMsgProtocol::CloseSocket()] [@ @0x0 | nsMsgProtocol::CloseSocket()] [@ nsRefPtr<nsTypedSelection>::~nsRefPtr<nsTypedSelection>() | `anonymous namespace''::ServerSocketListenerProxy::Release()] [@ nsMsgProtocol::CloseSocket] [@ @0x0 | nsMsgProtoco…
bp-bd2dce79-98e2-4a15-85d2-e24102170324 is perhaps a current example. But an extremely rare example - only 1 in 3 crashes have the same stack.

per comment 5 caused by "code changes to enable detecting the server dropping the connection on an auth failure". The reduction in crash rate since then might be because of patches or changes in password handling. Some are cited in https://hg.mozilla.org/comm-central/log/tip/mailnews/local/src/nsPop3Protocol.cpp ... but one must doubt the underlying protocol bugs have been fixed.
Summary: crash [@ nsMsgProtocol::CloseSocket()] | [@ nsPop3Protocol::ProcessProtocolState] → crash [@ nsMsgProtocol::CloseSocket()] | [@ nsPop3Protocol::ProcessProtocolState] caused by code changes to enable detecting the server dropping the connection on an auth failure
No crashes after 17.0.8 (which are still being reported)
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.