Closed Bug 474915 Opened 16 years ago Closed 15 years ago

intermittent Mac talos crash on www.104.com.tw or www.eastmoney.com

Categories

(Core :: General, defect)

x86
macOS
defect
Not set
blocker

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: dbaron, Unassigned)

References

Details

(Keywords: intermittent-failure, stackwanted)

For at least the past 24 hours or so (I'm still looking for when the crashes started), Mac talos builds have been intermittently orange.

I think the orange is a real crash, because it seems to be crashing all the time on the same URL.  The end of the output always looks like:


===============
NOISE: Cycle 8: loaded http://localhost/page_load_test/pages/www.104.com.tw/www.104.com.tw/index.html (next: http://localhost/page_load_test/pages/www.eastmoney.com/www.eastmoney.com/index.html)
Failed tp: 
		Stopped Wed, 21 Jan 2009 23:10:56
FAIL: Busted: tp
FAIL: browser crash
===============

Except I've seen the cycle number vary from 2 through 9.

I'm not sure if that means we're crashing on www.104.com.tw or www.eastmoney.com (or if we can even tell).
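To compare many logs quickly, the last page that finished loading before the crash can be pulled out of the log text. The following is only a sketch: the helper name `last_page_before_crash` is hypothetical, and the pattern is based solely on the `NOISE:` and `FAIL: browser crash` lines in the excerpt above. It reports which page loaded last and which was next; it cannot say which of the two actually crashed.

```python
import re

# Matches the page-load progress lines seen in the log excerpt, e.g.
# "NOISE: Cycle 8: loaded <url> (next: <url>)".
NOISE_RE = re.compile(
    r"NOISE: Cycle (?P<cycle>\d+): loaded (?P<loaded>\S+) \(next: (?P<next>\S+)\)"
)


def last_page_before_crash(log_text):
    """Return (cycle, loaded_url, next_url) for the last NOISE line seen
    before a 'FAIL: browser crash' line, or None if no crash is reported."""
    last = None
    for line in log_text.splitlines():
        match = NOISE_RE.search(line)
        if match:
            last = (
                int(match.group("cycle")),
                match.group("loaded"),
                match.group("next"),
            )
        elif "FAIL: browser crash" in line and last is not None:
            return last
    return None
```

Running this over each orange log and tallying the results would show whether the cycle number varies while the URL pair stays the same.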
I'd also note that I think this would have been filed sooner if the tinderbox logs had been easier to read.  I found stars on the builds with this problem (from both johnath and rstrong) labelling the problem as a twisted connection drop, which is what happens when something goes wrong with the connection to a slave, and often happens when there are network problems.  But in this case, the error message is the above, followed by the twisted connection dropped message, which leads people to see only the last message and miss the fact that just above it there was a report of a crash.  (Although the "crash" is TinderboxPrint:ed.)
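One way to avoid being misled by whatever message happens to come last is to collect every `FAIL:` line in the log rather than reading only the tail. This is a rough sketch with hypothetical helper names (`failure_lines`, `is_browser_crash`); it assumes only the `FAIL:` prefix and the `browser crash` wording visible in the log excerpt in the original report.

```python
def failure_lines(log_text):
    """Return every log line that starts with 'FAIL:', in log order."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.lstrip().startswith("FAIL:")
    ]


def is_browser_crash(log_text):
    """True if any FAIL: line reports a browser crash, regardless of
    what other messages follow it in the log."""
    return any("browser crash" in line for line in failure_lines(log_text))
```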
(In reply to comment #0)
> Except I've seen the cycle number vary from 2 through 9.

Which has gone up to 2 through 10 as I've looked at more samples.


There have also been a smaller number of crashes (especially in the older parts of the window I've been looking at) that have been crashing after:
NOISE: Cycle 7: loaded http://localhost/page_load_test/pages/www.target.com/www.target.com/gp/homepage.html (next: http://localhost/page_load_test/pages/www.it.com.cn/www.it.com.cn/index.html)
but I can't be sure that those aren't already fixed, or that they're the same bug.  I suggest saying for now that this bug does NOT cover those, but that we won't look into them aggressively until this bug is fixed.


The oldest build I've found with this crash is from
MacOSX Darwin 8.8.1 talos mozilla-central qm-pmac-trunk01 on 2009/01/19 10:37:53  
which was a build of http://hg.mozilla.org/mozilla-central/rev/cc3b3a8f35cb .  That said, these have been reasonably intermittent, so the problem could have been introduced up to 24 hours before that point.

It seems like there were at least 2 or 3 of these oranges in every 24 hour period back to then (and quite a few more than that in the past few hours), while the three 24-hour periods before that build showed no Mac talos orange at all.
Also, we may well have had a clearer regression window for this if we hadn't stretched the interval between periodic builds from 2 hours to 10.
So I think the regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2009-01-18+06%3A00%3A00&enddate=2009-01-19+11%3A00%3A00
with things in the later part of the range (i.e., near the beginning of the list) more likely than those earlier.
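For anyone who wants to adjust the window, the pushlog link above can be rebuilt from its two timestamps. This is a small sketch; the function name `pushlog_url` is hypothetical, and the only parameters assumed are the `startdate` and `enddate` query parameters that appear in the URL above.

```python
from urllib.parse import urlencode

# Base of the pushlog URL quoted in the comment above.
PUSHLOG = "http://hg.mozilla.org/mozilla-central/pushloghtml"


def pushlog_url(start, end):
    """Build a pushlog link for the window [start, end].

    Both arguments are 'YYYY-MM-DD HH:MM:SS' strings; urlencode turns
    spaces into '+' and ':' into '%3A', matching the link above."""
    return PUSHLOG + "?" + urlencode({"startdate": start, "enddate": end})


print(pushlog_url("2009-01-18 06:00:00", "2009-01-19 11:00:00"))
```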

I think that makes bzbarsky's frame construction changes the most likely candidate (although far from certain), since there was nothing other than backouts and test landings for 9 hours prior.
Since I figured something causing intermittent crashes in talos might also be causing intermittent crashes for users, I checked the crash-stats data and found one new topcrash among the high-frequency crashes.  I filed it as bug 474938, and *it* is definitely related to bzbarsky's frame construction changes.  This bug *may* be the same as that bug, although it's not certain.
So given that http://hg.mozilla.org/mozilla-central/rev/692ae2bf70de essentially backed out http://hg.mozilla.org/mozilla-central/rev/9ac7c363cf78 which was a fix for intermittent talos orange, I think it's pretty likely that was the cause.

It's interesting that it's also showing up in crash-stats this time, though.
Blocks: 473390
(In reply to comment #1)
> I'd also note that I think this would have been filed sooner if the tinderbox
> logs had been easier to read.  I found stars on the builds with this problem
> (from both johnath and rstrong) labelling the problem as a twisted connection
> drop, which is what happens when something goes wrong with the connection to a
> slave, and often happens when there are network problems.  But in this case,
> the error message is the above, followed by the twisted connection dropped
> message, which leads people to see only the last message and miss the fact that
> just above it there was a report of a crash.  (Although the "crash" is
> TinderboxPrint:ed.)

I filed this as bug 474950.
(In reply to comment #2)
> There have also been a smaller number of crashes (especially in the older parts
> of the window I've been looking at) that have been crashing after:
> NOISE: Cycle 7: loaded
> http://localhost/page_load_test/pages/www.target.com/www.target.com/gp/homepage.html
> (next:
> http://localhost/page_load_test/pages/www.it.com.cn/www.it.com.cn/index.html)
> but I can't be sure that those aren't already fixed, or that they're the same
> bug.  I suggest saying for now that this bug does NOT cover those, but that we
> won't look into them aggressively until this bug is fixed.

I've filed this as bug 474961, and that one is affecting all platforms.
Blocks: 474961
Depends on: 474938
I restored the null-check in bug 474938.  Let's see whether it helps.
Keywords: stackwanted
Whiteboard: [orange]
Blocks: 438871
No comments for a few months (and the ones after comment 9 were different URLs).  Should this be marked as WFM?
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Resolution: WONTFIX → WORKSFORME
Whiteboard: [orange]