Bug 1031697 (Closed). Opened 10 years ago, closed 10 years ago.

Huge Aurora 32 crash spike after landing of bug 1013587, several signatures

Categories

Product/Component: Core :: General
Version: 32 Branch
Platform: x86, Windows 7
Type: defect
Priority: Not set
Severity: critical

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla32
Tracking Status
firefox32 --- verified
firefox33 --- unaffected

People

(Reporter: alice0775, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression, topcrash)


This bug was filed from the Socorro interface and refers to
report bp-c90f7e96-2101-4cf8-8d55-5f0242140628.
=============================================================


https://hg.mozilla.org/releases/mozilla-aurora/rev/18d3fdc4a940
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 ID:20140628004001

Aurora 32 (build of 2014-06-28) crashes after updating.
Reproducible: often

Steps To Reproduce:
1. Open mail.live.com and log in.
2. Open www.jongla.com if necessary.
3. Exit and restart the browser.
4. Restore the previous session.

Regression window (mozilla-aurora):
Bad:
https://hg.mozilla.org/releases/mozilla-aurora/rev/d9c6731ad8c3
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 ID:20140627064630
Crash:
https://hg.mozilla.org/releases/mozilla-aurora/rev/04303003b896
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 ID:20140627073329
Pushlog:
http://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=d9c6731ad8c3&tochange=04303003b896
Keywords: regression
Summary: crash in nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions() → Aurora32.0a1, crash in nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions()
Other crash IDs:
bp-3b676ac9-8088-4573-ba5c-ebe4a2140628
bp-69c80ed5-b5dc-414d-9fe8-b92d02140628
Crash Signature: [@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions()] → [@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions()] , crash in NS_CycleCollectorSuspect3 , [@ nsXPCWrappedJS::Call…
In a local build:
Last Good: 417056b401e5
First Bad: 80f77d10896b

Regressed by:
	80f77d10896b	Honza Bambas — Bug 1013587 - HTTP cache v2: Start preload on input stream open for existing entries. r=michal, a=lmandel
Blocks: 1013587
Workaround: set browser.cache.use_new_backend_temp to false.
Bug 1013587 should be backed out from Aurora.
Flags: needinfo?(honzab.moz)
This is URGENT. Because of this, we have 20x as many crashes on Aurora as before (and we already had elevated crash rates before).
Crash Signature: [@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions()] , crash in NS_CycleCollectorSuspect3 , [@ nsXPCWrappedJS::Call… → [@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions()] [@ NS_CycleCollectorSuspect3 ] [@ nsXPCWrappedJS::CallMethod(uns…
Sheriffs, can the backout of bug 1013587 on Aurora please be done ASAP? This is causing a *really* bad amount of crashes (we had 77 crashes per 100 ADI yesterday, where usual rate should be around 2 crashes per 100 ADI).
Crash Signature: [@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions()] [@ NS_CycleCollectorSuspect3 ] [@ nsXPCWrappedJS::CallMethod(uns… → [@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions()] [@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithCons…
Flags: needinfo?(sheriffs)
Bug 1013587 cannot be landed without bug 1013638. I caused the confusion here: I dropped the approval request for that bug because of an intermittent test failure that is probably caused by that patch, and forgot to remove the a? for this patch as well. Sorry!
Flags: needinfo?(honzab.moz)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Thanks, Nigel. We should also try to get a "nightly" Aurora build done ASAP after the backout goes green.
Flags: needinfo?(sheriffs)
Summary: Aurora32.0a1, crash in nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithConstructors<JS::Heap<JSObject*> > >::IncrementLength(unsigned int) | mozilla::net::HttpBaseChannel::ApplyContentConversions() → Huge Aurora 32 crash spike after landing of bug 1013587, several signatures
The tree looks okay. I've triggered a new nightly build. It should be done in a while! Thanks for hand-holding me through this, Kairo :)
Is there an affordance for end users stuck in this scenario when a breakage like this makes it past Nightly? Getting the latest build (with the fix) required starting Aurora manually with a separate profile, which seems beyond the skill level of a typical user. Maybe it doesn't matter, on the assumption that a broken build like this would never reach the release channel. (I seem to remember that older builds of Firefox disabled session restore in this scenario, but here the session was restored on three restarts and the browser crashed immediately each time.)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #10)
> Sheriffs, can the backout of bug 1013587 on Aurora please be done ASAP? This
> is causing a *really* bad amount of crashes (we had 77 crashes per 100 ADI
> yesterday, where usual rate should be around 2 crashes per 100 ADI).

Be aware that needinfo doesn't work on watch-alias email addresses due to bug 179701, only CC does (which the needinfo may/may not do, depending on the watch-alias' settings).

Also in these situations anyone can perform the backout, not just sheriffs - so in many cases it's best to ask on IRC, or even ping the original developer since there may be conflicts when performing the backout - which they would be much better suited to handle :-)
(In reply to Ed Morley [:edmorley UTC+0] from comment #17)
> Also in these situations anyone can perform the backout, not just sheriffs -
> so in many cases it's best to ask on IRC, or even ping the original
> developer since there may be conflicts when performing the backout - which
> they would be much better suited to handle :-)

I did ask on IRC as well, and nigelb thankfully stepped in after a bit. It was Sunday, I was on my way to an event (otherwise I'd have done the backout myself), Honza wasn't online when I asked (he came in when Nigel was already on it), and this was *really* bad (see the crash rates I stated in comment #10; yesterday the rate even jumped to 280 crashes per 100 ADI, and I heard many comments that Aurora was completely unusable). So I pulled every lever I could think of to get this fixed as soon as possible.

FWIW, the second Aurora "nightly" build for yesterday, which Nigel triggered once the backout was green, looks pretty decent in the little crash data we have from it, back to normal levels. (Since the Windows version was only uploaded at 10pm UTC and crash-stats uses full UTC days, this is not yet reflected much in yesterday's data.)
Target Milestone: --- → mozilla32
Honza, I was just thinking about this one in more depth, and wondered why we did not catch such a large issue with our automation shown on tbpl - are we missing some test coverage there?
Flags: needinfo?(honzab.moz)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #19)
> Honza, I was just thinking about this one in more depth, and wondered why we
> did not catch such a large issue with our automation shown on tbpl - are we
> missing some test coverage there?

I think I've already explained this. I should have dropped the a+ from the second patch, which you landed on m-a: it depended on a prerequisite patch (without it, it crashes) whose a+ I removed at [1] (it was suspected of causing some test failures).

You've just landed something w/o a prerequisite, you have not checked dependencies.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1013638#c9
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #20)
> You've just landed something w/o a prerequisite, you have not checked
> dependencies.

Yes, I know all that!

What I was asking is why no automated tests caught it after it landed. It took us more than a day, and looking at crash-stats from the wild, to catch the problem, while we should be catching huge issues like that right after checkin through tests going orange. Could we reasonably have tests for this that we missed creating, or are the cases where we hit this just untestable and we were unlucky?
Flags: needinfo?(honzab.moz)
kairo has an extremely valid point - Honza, please can you follow this up (perhaps in a new bug). Whilst you weren't to blame for the crashes, it still highlighted a gap in our test coverage that would be good to fill if possible. Thanks :-)
Ah, I get it now.  I honestly hit the bug (and filed bug 1013638) only locally and only rarely (not with every request).

I can open a new bug here to build a test for this, but I will probably not have time to work on it anyway.

Are you OK with that?
Flags: needinfo?(honzab.moz)
New bug filed for making a test, with someone needinfo'd sounds good to me :-)
Depends on: 1034087
(In reply to Ed Morley [:edmorley UTC+0] from comment #24)
> New bug filed for making a test, with someone needinfo'd sounds good to me
> :-)

I agree, thanks for doing that!
This issue still accounts for half of our Aurora crashes on any given day. Do we have any way to get people past this build?
Depends on: 1045509
Reproduced on Aurora 2014-06-28 with STR from comment 1.
No crash encountered with Firefox 32 beta 4 (Build ID: 20140804164216) on Windows 7 x32:
Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0

Crash reports for 32 beta 3 are present for the last 2 signatures:

[@ nsXPCWrappedJS::AddRef()]: https://crash-stats.mozilla.com/report/list?signature=nsXPCWrappedJS%3A%3AAddRef%28%29&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&version=Firefox%3A32.0b&hang_type=any&date=2014-08-05+13%3A00%3A00&range_value=2#tab-reports

[@ NS_CycleCollectorSuspect3 ]: https://crash-stats.mozilla.com/report/list?signature=NS_CycleCollectorSuspect3&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&version=Firefox%3A32.0b&hang_type=any&date=2014-08-05+13%3A00%3A00&range_value=2#reports

Any idea why? Are those related?

If anyone else could help me out with this sooner (KaiRo is currently out of office), please let me know.
Flags: needinfo?(kairo)
(In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #27)
> Any idea why? Are those related?

As you can see in those signatures' "Bugzilla" tabs, there's a lot of different bugs connected to those rather generic signatures. It's expected that a few of the signatures on this bug will still exist, as they did before as well - as long as the spike is gone, everything's fine here, though.
Flags: needinfo?(kairo)
(In reply to Robert Kaiser (:kairo@mozilla.com, slow reaction due to vacation backlog) from comment #28)
> (In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #27)
> > Any idea why? Are those related?
> 
> As you can see in those signatures' "Bugzilla" tabs, there's a lot of
> different bugs connected to those rather generic signatures. It's expected
> that a few of the signatures on this bug will still exist, as they did
> before as well - as long as the spike is gone, everything's fine here,
> though.

Marking as Verified, thanks Robert!
Status: RESOLVED → VERIFIED
Keywords: verifyme