crash in nsHttpConnection::OnSocketWritable

RESOLVED FIXED in Firefox 20

Status

()

defect
--
critical
RESOLVED FIXED
7 years ago
2 years ago

People

(Reporter: wsmwk, Assigned: mcmanus)

Tracking

({crash, topcrash})

Trunk
mozilla22
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox20+ verified, firefox21+ verified, firefox22 verified, firefox-esr1721+ verified)

Details

(Whiteboard: [native-crash][regression:TB15][tbird topcrash], crash signature)

Attachments

(1 attachment)

Reporter

Description

7 years ago
This crash sig is new in Thunderbird 15. But didn't research enough to say it's a regression.

There are matching crashes is firefox, like bp-b2bd386d-b212-4d92-af40-42af92120824  BUT not all firefox crashes have the same line#. 

This bug was filed from the Socorro interface and is 
report bp-aa23ce79-8061-4b01-a9fb-b4af92120830 .
============================================================= 
0	xul.dll	nsHttpConnection::OnSocketWritable	netwerk/protocol/http/nsHttpConnection.cpp:1268
1	xul.dll	nsHttpConnection::OnOutputStreamReady	netwerk/protocol/http/nsHttpConnection.cpp:1526
2	xul.dll	nsSocketOutputStream::OnSocketReady	netwerk/base/src/nsSocketTransport2.cpp:490
3	xul.dll	nsSocketTransport::OnSocketDetached	netwerk/base/src/nsSocketTransport2.cpp:1624
4	xul.dll	nsSocketTransportService::DetachSocket	netwerk/base/src/nsSocketTransportService2.cpp:181
5	xul.dll	nsSocketTransportService::Run	netwerk/base/src/nsSocketTransportService2.cpp:638
6	xul.dll	nsThread::ProcessNextEvent	xpcom/threads/nsThread.cpp:624
Component: Networking → Networking
Product: MailNews Core → Core
Version: 15 → 15 Branch
Reporter

Updated

7 years ago
Whiteboard: [regression?] → [tbird crash][regression?]

Comment 1

7 years ago
I've seen this twice in the last two days.  It appears related, in my case, to a resume from suspend.  I would guess the network interface is in the process of being restarted.
The crash is a null crash, just for ref.
There's a spike in crashes starting in Firefox 20.0a1/20121220 that makes it #28 top browser crasher in 20.0a2 and #19 in 21.0a1. The regression range might be:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=bfd85c9652fa&tochange=21195f52311c
It might be caused by bug 794240.

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsHttpConnection%3A%3AOnSocketWritable%28%29
OS: Windows NT → Windows 7
Version: 15 Branch → Trunk

Updated

6 years ago
OS: Windows 7 → All
Hardware: x86 → All
Whiteboard: [tbird crash][regression?] → [tbird crash][native-crash][regression?]
Could we get URLs?  It might be some SPDY/related bug.

mTransaction of nsHttpConnection is null.  Could it be that we just passed the |if (!EnsureNPNComplete())| branch?  Sets n=0 and also rv to jump just to |if (n == 0)| branch w/o crashing on null mTransaction earlier.

Updated

6 years ago
Keywords: needURLs
10 	http://www.facebook.com/
9 	https://www.facebook.com/
5 	http://g1.globo.com/
5 	about:blank
3 	about:sessionrestore
3 	http://www.uai.com.br/
3 	https://www.facebook.com/?ref=tn_tnmn
2 	http://apps.facebook.com/drawmything/?fb_source=bookmark_apps&ref=bookmarks&coun
1 	https://www.facebook.com/messages/
1 	http://br.msn.com/?ocid=hmlogout
1 	http://phimvang.org/xem-phim/gia-dinh-la-so-mot-phan-3-high-kick-3-the-revenge-o
1 	http://www.gooddrama.net/japanese-drama/iryu-team-medical-dragon-2-episode-2
1 	about:newtab
1 	http://pelis24.com/page/3/
1 	http://edition.cnn.com/
1 	http://www.um.ac.id/
1 	https://www.facebook.com/settings/?tab=privacy&ref=mb&privacy_source=settings_me
1 	http://adf.ly/5DyEP
1 	https://apps.facebook.com/rubyblast/?fb_source=search&ref=ts&fref=ts
1 	http://www.odnoklassniki.ru/guests
1 	https://sector61.c2.galactic.wonderhill.com/platforms/kabam/game
1 	https://edit.yahoo.com/forgot?stage=fe100&src=pg&intl=us&done=http://www.yahoo.c
1 	http://www.reliancenetconnect.co.in/9381275487
1 	http://bazoocam.org/
1 	https://web.vodafone.de/sbb/showPopupErrorPage?ltfu=1359288003000&SESSION_TARGET
1 	http://www.itfacts.org/2013/01/windows-phone-7-8-update-fuer-alle-htc-geraete/
1 	http://www.audiomicro.com/sound-effects
1 	http://ziurim.lt/serialas/2803-vedes-ir-turi-vaiku-9-sezonas-married-with-childr
1 	http://www.youtube.com/watch?v=bpdNSOOnk_s
1 	http://85.198.15.99/Subsystem/Portal/Dashboard/Dashboard2.aspx?param=1846A103B6A
1 	https://www.facebook.com/ajax/pagelet/generic.php/PhotoViewerInitPagelet?ajaxpip
1 	http://www.tunisia-sat.com/vb/forumdisplay.php?f=349
1 	http://addmefast.com/bonus_points.html
1 	https://our.intern.facebook.com/intern/ui
1 	http://www.reliancenetconnect.co.in/
1 	https://platform.twitter.com/widgets/tweet_button.html?url=http%3A%2F%2Fon-msn.c
1 	http://www.myvideo.de/tv#808587_8924933
1 	file:///C:/Users/Puma/Pictures/maps%20casa.htm
1 	http://apps.facebook.com/stormfall/?fb_source=bookmark_apps&ref=bookmarks&count=
1 	https://de-de.facebook.com/
1 	http://www.reliancenetconnect.co.in/
Keywords: needURLs
It's #13 top browser crasher in 20.0a2 and #26 in 21.0a1.
Assignee: nobody → honzab.moz
Whiteboard: [tbird crash][native-crash][regression?] → [native-crash][regression?]
Reporter

Comment 7

6 years ago
I've done more research and for Thunderbird this definitely increased in version 15 such that I'd say it was a regression for TB.  Not so however for Firefox in that time period. And Thunderbird crashes highly correlate to lightning addon and google provider. 

    98% (57/58) vs.   7% (954/14292) {a62ef8ec-5fdc-40c2-873c-223b8a6925cc} (Provider for Google Calendar, https://addons.mozilla.org/addon/4631)
    100% (58/58) vs.  21% (3027/14292) {e2fda1a4-762b-4020-b5ad-a41df1933103} (Lightning, https://addons.mozilla.org/addon/2313)
 
Also TB crashes correlate highly to windows 7, whereas Firefox does not.  And that firefox crashes are a low rate compared to thunderbird, and mostly development releases vs public releases for Thunderbird. 
bp-bcfd5620-a1cf-40eb-b4c5-7274d2130204 is a current thunderbird crash

It looks like most TB users crash only once. And so far, I don't have a testcase user.

#41 crash for TB17.0.2
Whiteboard: [native-crash][regression?] → [native-crash][regression?][tbird topcrash]

Comment 8

6 years ago
This is topcrash #11 on Firefox Aurora 20.0a2 over the last 3 days.
Patrick, could this be related to fact that fb turned on SPDY recently?
(In reply to Honza Bambas (:mayhemer) from comment #9)
> could this be related to fact that fb turned on SPDY recently?
There was no spike across all versions on a certain date. It's a regression in the Firefox code. See comment 3 for an estimation of the regression range.
Then, it's definitely regression from bug 819044 that has landed that day, that was a risky patch and I had some (not fully addressed) concerns on it:

http://hg.mozilla.org/releases/mozilla-aurora/log/tip/netwerk/protocol/http/SpdySession2.cpp

http://hg.mozilla.org/mozilla-central/log/tip/netwerk/protocol/http/SpdySession2.cpp

By, the way, I don't see approval for aurora landing in that bug.

Reassigning.
Assignee: honzab.moz → mcmanus
Blocks: 819044
(In reply to Honza Bambas (:mayhemer) from comment #11)
> By, the way, I don't see approval for aurora landing in that bug.
Ah, that landed on central before merge!
Assignee

Comment 13

6 years ago
I'll look at this early next week - considering the backout possibility. I'm traveling and don't want to just do it blind.

note that this whole crash signature can't be related to 819044 and this bug predates that code and many of the urls aren't spdy related.. but the spike certainly could be caused by it.
Patrick, if you want I can backout, at least from aurora, for you.

There is definitely yet a different bug in that code.  However, this null deref crash is caused with hi probability by this patch.

Also the timing seems to be somehow related to fb doing spdy.  Lot's of urls are facebook and I checked with nightly we do spdy with them when https.
Let's move forward with backing out bug 819044, even if speculatively.
Assignee

Comment 16

6 years ago
(In reply to Alex Keybl [:akeybl] from comment #15)
> Let's move forward with backing out bug 819044, even if speculatively.

I'm working on it right now.. I just want to make sure I understand the implications of removing that code and want to test a couple scenarios with it backed out. I think its "no big deal".
Assignee

Comment 17

6 years ago
(In reply to Alex Keybl [:akeybl] from comment #15)
> Let's move forward with backing out bug 819044, even if speculatively.

done
Assignee

Comment 18

6 years ago
you might not want to mark this bug fixed as 819044 cannot be the sole source of it (though it is no doubt a contributor).
Let's consider the fixed status of Firefox 20 only for the spike, the reason it's tracked for that version. The bug is not fixed.
The spike stopped after 20.0a2/20130209 and 21.0a1/20130209 on February 10 so before the backout of bug 819044 (not yet backout in m-c).
In case it might be applicable, the working ranges would be:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=876652a0c8ef&tochange=29dd80c95b7d
http://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=0b515f5a44d4&tochange=c4aa14e5e8eb
There are no fixes in common so maybe it's an external cause that made them stop.
Assignee

Comment 21

6 years ago
I think a facebook spdy trial ended, so that's probably the external trigger. (the code backed out was spdy code - but that also means 819044 might not have really been the source it could be some other latent spdy thing)
(In reply to Patrick McManus [:mcmanus] from comment #21)
> I think a facebook spdy trial ended, so that's probably the external
> trigger. (the code backed out was spdy code - but that also means 819044
> might not have really been the source it could be some other latent spdy
> thing)
It might be a combination of both. The regression started with bug 819044 when the Facebook trial was already begun. It has worked again since the Facebook trial stopped.
It has spiked again in 20.0 since February 23, 3H UTC, and is #8 top browser crasher in 20.0b1. So there's likely a new SPDY Facebook trial and bug 819044 doesn't seem the only culprit.
It's also #10 top browser crasher in 21.0a2 and #19 in 22.0a1.

Comment 25

6 years ago
(In reply to Patrick McManus [:mcmanus] from comment #21)
> I think a facebook spdy trial ended, so that's probably the external
> trigger. (the code backed out was spdy code - but that also means 819044
> might not have really been the source it could be some other latent spdy
> thing)

Patrick, any chance we can find a fix for this before they turn SPDY on permanently?
From email with Patrick:
"there is nothing really actionable in that bug so I wouldn't
expect a fix in the next week. The traces are extremely generic. (i.e.
a crash on I/O ready.)

facebook has spdy on (presumably permamently) and has for quite a
while.. they just took a little break to fix an internal security
problem on their end. So I wouldn't expect it to explode from current
levels. You can get the spdy indincator addon and look for yourself.

I do have a couple of leads to look into (that really come from other
bugs but if I squint hard enough I can imagine they are related) that
I'm hopeful about - but they aren't anywhere near well understood
enough at this point to comment on their backportability."

Patrick - since this crash is not presently showing on 19 (with Facebook SPDY on presumably) but it is the #6 topcrasher in FF 20 right now - I am concerned that this will explode when we release FF20 and that there's something internal we need to dig up here.  Can you take a look at code that landed in the 20 timeframe and could be affecting this?
ni? for comment 26
Flags: needinfo?(mcmanus)
Assignee

Comment 28

6 years ago
bug 829120 has a crash that matches this.
Flags: needinfo?(mcmanus)
Reporter

Comment 29

6 years ago
1. regarding comment 0 ... this is indeed a regression for Thunderbird 15. (or at least the signature is new)  Crashes prior to TB15 were startup crashes, and were almost non-existent.

2. "resume from suspend" is the most common crash comment for thunderbird.
bp-a1734b76-dc32-4481-894e-1712c2130301 (firefox)
bp-0569d3c6-8125-4b92-bb4f-ae7182121209 (thunderbird)

The above two points are not true for Firefox.  Additionally, Thunderbird crash rate is higher than Firefox.  Therefore, does analysis in the above bug comments 3-26 about Firefox address the crashes in Thunderbird, for which this bug was originally filed?
Crash Signature: [@ nsHttpConnection::OnSocketWritable()] → [@ nsHttpConnection::OnSocketWritable()] [@ nsHttpConnection::OnSocketWritable]
Whiteboard: [native-crash][regression?][tbird topcrash] → [native-crash][regression:TB15][tbird topcrash]
Assignee

Comment 30

6 years ago
I think we should promote bug 829120 as a speculative cause of this bug. It has a weird stack that is consistent with this stack trace (though not an exact match). I'd say maybe 50/50 chance - its consistent but hardly a smoking gun. It could cause a failed activation which may lead to this problem in a non debug build.

Almost all the stacks I've looked at are part of a "detached socket" case (i.e. server hangup). The simple ones look like https://crash-stats.mozilla.com/report/index/72f49cbb-d7a4-4989-b394-11d8e2130312

Therefore another theory is that something on the stack of OnSocketWritable called CloseTransaction() which makes sense given the evidence of a hangup. That would result in mtransaction being null at the time of this crash but obviously not null when OnSocketWritable was put onto the stack. That's a pretty clear bug but I don't see any recent regressor for it that would tie it to FF20; though I've still got a little homework to do now that I have a better candidate.

its not clear why this would show up starting with FF20 - but its plausible something related has changed to create the "bunching" of the queue that 829120 needs to be triggered. Whatever it is I haven't identified it yet - but it could even be (perfectly valid and better) SSL code that changes the timing of when NPN completes or whether it detects hangups in one pass. It could also just be an optimizer change that makes this symbol visible for the first time (but not the origin of the crashes) - the 829120 stack isn't quite right nor are several with this signature I've explored.

Again, I haven't got a good theory on why this is a new phenomenon in FF20, but it looks pretty simple and safe to patch up the proximate crash reason here with a basic null check and given the possibility of CloseTransaction() from the its also the right thing to do.

To recap I think we should:
 a] speculatively take 829120 on beta
 b] take a patch on this bug to fix the obvious local cause related to closetranscation
Assignee

Comment 31

6 years ago
Posted patch patch 0Splinter Review
this is pretty close to trivial. Urgency needed on this review too - sorry :(
Attachment #723797 - Flags: review?(honzab.moz)

Updated

6 years ago
Depends on: 829120
Assignee

Comment 32

6 years ago
(In reply to Patrick McManus [:mcmanus] from comment #30)

> Almost all the stacks I've looked at are part of a "detached socket" case
>
[..]
> Therefore another theory is that something on the stack of OnSocketWritable
> called CloseTransaction() which makes sense given the evidence of a hangup.

[..]
> it to FF20; though I've still got a little homework to do now that I have a
> better candidate.
> 

I now think it is likely that we see this spike on FF20 because of 769288 which landed on FF20. That bug closes active private browsing sockets using detachsocket() when the last private browsing tab is closed. That's the right thing to do, but code using that path would have been quite rare (perhaps more dependent on server behavior quirks) before hand and now would be a much more common event (and indeed one that is commonly seen in the stack traces of this bug). That would be consistent with this bug pre-exisiting FF20 but spiking considerably at that version.

My recommendations in comment 30 are unchanged - I don't think we want to be backing out the private browsing changes just because they may tickle a latent bug.
Comment on attachment 723797 [details] [diff] [review]
patch 0

Review of attachment 723797 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab

Just a question, maybe a suggestion for a followup: what happens with the connection that doesn't have a transaction at this place?  Will it correctly move to the idle list?  It would be bad if it were in a state different from what the Idle/Active arrays show.
Attachment #723797 - Flags: review?(honzab.moz) → review+
Assignee

Comment 35

6 years ago
Comment on attachment 723797 [details] [diff] [review]
patch 0

[Approval Request Comment]
please see comments 30 and 32

Bug caused by (feature/regressing bug #): best guess is this is a latent bug triggered by 769288
User impact if declined: roughly top 10 crasher.. pretty safe null pointer deref.
Testing completed (on m-c, etc.): not much :)
Risk to taking this patch (and alternatives if risky): this is a very small change unlikely to make anything worse.
String or UUID changes made by this patch: none
Attachment #723797 - Flags: approval-mozilla-beta?
Attachment #723797 - Flags: approval-mozilla-aurora?
Comment on attachment 723797 [details] [diff] [review]
patch 0

With the low risk involved in this patch and the fact that we really want to give this a shot for our fourth week of beta - approving now for uplift with the caveat that this will be merged to central prior to landing on branches.
Attachment #723797 - Flags: approval-mozilla-beta?
Attachment #723797 - Flags: approval-mozilla-beta+
Attachment #723797 - Flags: approval-mozilla-aurora?
Attachment #723797 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/bfb1bb39e88b
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
(In reply to Manuela Muntean [:Manuela] [QA] from comment #39)
> 1) for the first siganture, [@ nsHttpConnection::OnSocketWritable()], there
> is a pretty high number of crashes on Firefox 20 beta 4:
The fix is in 20 Beta 5: http://hg.mozilla.org/releases/mozilla-beta/graph
> The fix is in 20 Beta 5: http://hg.mozilla.org/releases/mozilla-beta/graph

Ok, thanks. Hopefully won't appear new crashes until beta 6.
QA Contact: manuela.muntean
There are no crashes after 21.0a2/20130313 and in 20.0b5.
Verified fixed, since there are no more crash reports in Socorro after 2013-03-13, for both signatures.
Reporter

Comment 44

6 years ago
Comment on attachment 723797 [details] [diff] [review]
patch 0

#31 crash for thunderbird 17.0.4 
so I'd like to see this on ESR.
to recap, for thunderbird, per reporters this crash occurs when computer comes out of sleep

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined: thunderbird users crash
Fix Landed on Version: 22, 21, 20
Risk to taking this patch (and alternatives if risky): this is a very small change unlikely to make anything worse. (per comment 35)
String or UUID changes made by this patch: none

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #723797 - Flags: approval-mozilla-esr17?
(In reply to Wayne Mery (:wsmwk) from comment #44)
> Comment on attachment 723797 [details] [diff] [review]
> patch 0
> 
> #31 crash for thunderbird 17.0.4 

We'd rather take this early in the next ESR cycle than now. If TB disagrees, I'd suggest landing to comm-specific branches.
(In reply to Alex Keybl [:akeybl] from comment #45)
> (In reply to Wayne Mery (:wsmwk) from comment #44)
> > Comment on attachment 723797 [details] [diff] [review]
> > patch 0
> > 
> > #31 crash for thunderbird 17.0.4 
> 
> We'd rather take this early in the next ESR cycle than now. If TB disagrees,
> I'd suggest landing to comm-specific branches.

I'm happy to take it in the next cycle.
Comment on attachment 723797 [details] [diff] [review]
patch 0

Let's go ahead with the ESR uplift now, for the one shipping alongside FF21.
Attachment #723797 - Flags: approval-mozilla-esr17? → approval-mozilla-esr17+

Comment 49

6 years ago
I just got one of these crashes (I think) with TB 17.05.  Is the above fix in 17.05?

https://crash-stats.mozilla.com/report/index/bp-c4c94cf5-ebf6-4ba0-82ae-cd4f12130410

Comment 50

6 years ago
Sorry for not looking carefully.  17.05 was installed on 4/02, the above "fix" is dated 4/05.
Manuela, can you verify that this bug is fixed on 17esr, 21 and 22, since you were able to do so for 20?
While checking on Socorro the reports regarding the last month, this is what I've found:

1) for the first signature, [@ nsHttpConnection::OnSocketWritable()] :

- for 17.0.5esr - 3 crashes (on Firefox)
- for 21 - 2 crashes with 21.0b2; 8 crashes with 21.0b3; 1 crash with 21.0b4; 1 crash with 21.0b5; 6 crashes with 21.0b7; 
- for 22 -  no crashes


2) for the second signature, [@ nsHttpConnection::OnSocketWritable], I couldn't find any reports
(In reply to Manuela Muntean [:Manuela] [QA] from comment #52)
> - for 17.0.5esr - 3 crashes (on Firefox)
The patch hasn't landed in that version. You should verify the fix in TB 17.0.6:
https://crash-stats.mozilla.com/report/list?product=Thunderbird&signature=nsHttpConnection%3A%3AOnSocketWritable%28%29
There are no crashes with 17.0.6esr for the first signature, [@ nsHttpConnection::OnSocketWritable()], within last month.
(In reply to Manuela Muntean [:Manuela] [QA] from comment #55)
> There are no crashes with 17.0.6esr for the first signature
It's not released. Wait one week to have enough feedback.
I see only one crash in TB 17.0.6 and one in FF 17.0.6esr so we can't consider it as fixed.

Updated

6 years ago
Keywords: verifyme
Comment hidden (spam)
Comment hidden (spam)
Comment hidden (spam)
You need to log in before you can comment on or make changes to this bug.