Closed Bug 609543 Opened 14 years ago Closed 14 years ago

Frequent hangs in dromaeo_sunspider during sunspider-access-nsieve.html (after sunspider-access-nbody.html)

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 617505
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: philor, Assigned: gwagner)

References

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

I have no feeling for how this could be, since it seems to have been going on since long before numerous merges to mozilla-central. But on TraceMonkey dromaeo_sunspider hangs quite often, and on mozilla-central it does not. (On Linux the hang disguises itself as a crash, which is really the harness crashing the browser in an attempt to show where it hung.)

From the last 12 hours on TM:

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288840835.1288844159.12858.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/03 20:20:35 
s: talos-r3-fed-053

Running test dromaeo_sunspider: 
		Started Wed, 03 Nov 2010 20:47:44
	Screen width/height:1280/1024
	colorDepth:24
	Browser inner width/height: 1024/681

NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-morph.html (next: http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)
NOISE: 
NOISE: __FAILbrowser frozen__FAIL
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-morph.html (next: http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)
NOISE: 
NOISE: __FAILbrowser frozen__FAIL
NOISE: Found crashdump: /tmp/tmp8KwSdg/profile/minidumps/1c339def-6b29-b54f-0dc34568-76736310.dmp


http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288836283.1288839609.28223.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/03 19:04:43 
s: talos-r3-fed-052
Running test dromaeo_sunspider: 
		Started Wed, 03 Nov 2010 19:31:51
	Screen width/height:1280/1024
	colorDepth:24
	Browser inner width/height: 1024/681

NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-morph.html (next: http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)
NOISE: 
NOISE: __FAILbrowser frozen__FAIL
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-morph.html (next: http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)
NOISE: 
NOISE: __FAILbrowser frozen__FAIL
NOISE: Found crashdump: /tmp/tmpcSA5E_/profile/minidumps/702a9df3-032f-5c1d-2b95bb99-331c4a1c.dmp

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288820268.1288823768.26592.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/03 14:37:48 
s: talos-r3-w7-015
Running test dromaeo_sunspider: 
		Started Wed, 03 Nov 2010 15:06:54
	Screen width/height:1280/1024
	colorDepth:24
	Browser inner width/height: 1008/675

NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-morph.html (next: http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)
NOISE: 
NOISE: __FAILbrowser frozen__FAIL
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-morph.html (next: http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)
NOISE: 
NOISE: __FAILbrowser frozen__FAIL
Failed dromaeo_sunspider: 

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288821577.1288825085.31571.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/03 14:59:37 
s: talos-r3-w7-039
Running test dromaeo_sunspider: 
		Started Wed, 03 Nov 2010 15:28:48
	Screen width/height:1280/1024
	colorDepth:24
	Browser inner width/height: 1008/675

NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-morph.html (next: http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)
NOISE: 
NOISE: __FAILbrowser frozen__FAIL
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-morph.html (next: http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-3d-raytrace.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-binary-trees.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-fannkuch.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)
NOISE: 
NOISE: __FAILbrowser frozen__FAIL
Failed dromaeo_sunspider:
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288843658.1288847424.24753.gz
s: talos-r3-w7-022

FAIL: Busted: dromaeo_sunspider
FAIL: browser frozen
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288845009.1288848801.29548.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/03 21:30:09 
s: talos-r3-w7-029

FAIL: Busted: dromaeo_sunspider
FAIL: browser frozen
I guess the other possibility is that someone broke it recently, and people have broken it before, and that's why it seems like I've seen it before.

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288847377.1288850895.4789.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/03 22:09:37 
s: talos-r3-w7-005

FAIL: Busted: dromaeo_sunspider
FAIL: browser frozen
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288865498.1288869037.23187.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/04 03:11:38 
s: talos-r3-w7-041

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288872217.1288875561.24177.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/04 05:03:37 
s: talos-r3-fed-026
Might be a dup of bug 604961.
Could well be, though the closest tie there seems to be RyanVM saying he sees it in sunspider-access-nsieve (in that I could be misunderstanding where the hang is, and sunspider-access-nbody finishes fine but sunspider-access-nsieve hangs without saying anything).

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288906353.1288909714.16349.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/04 14:32:33 
s: talos-r3-fed-033
Yeah, we don't know for sure so we should keep this open. I just wanted to make sure you (and other followers of this bug) knew there was a related one out there.
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288911710.1288915245.9871.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/04 16:01:50 
s: talos-r3-w7-029

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288913348.1288916763.16593.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/04 16:29:08 
s: talos-r3-fed-030
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288920485.1288923995.17524.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/04 18:28:05 
s: talos-r3-w7-042
And whether or not it's the same hang that's been around since April, something has certainly happened recently to turn it from rare to nearly-constant - even with my enormous ability to ignore Talos orange, there is absolutely no way I could have been ignoring one to two instances of this on every single run for more than a couple of weeks, tops. Nor is there any way for it to be TM-only other than by being caused by something landed since the last merge to mozilla-central.

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288921965.1288925515.24510.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/04 18:52:45 
s: talos-r3-w7-019
blocking2.0: --- → ?
We need a blocking-next-merge flag :)
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288970383.1288973906.14178.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/05 08:19:43 
s: talos-r3-w7-005
We have had four pushes since then on which neither Windows nor Linux failed, so it could have been caused by something other than the first push where it hit. Still, it seems pretty suspicious that the first instance was on the push for bug 598650.
Blocks: 598650
Keywords: regression
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289059297.1289062807.19687.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/06 09:01:37 
s: talos-r3-w7-045
What is this test doing? Similar to http://dromaeo.com/?sunspider ?
I don't know why increasing the malloc trigger should result in a timeout. I looked at the working set size for this page and it looks the same as before. 
For the Prime Number Computation test we go up to 1GB, but access-nbody is not a high-throughput benchmark, afaik.
I may well be wrong about access-nbody - if the only output comes at the _end_ of a page's cycle, then the significant part of "Cycle 1: loaded http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next: http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)" would be the "next:" rather than the "loaded".

Files are in http://hg.mozilla.org/build/talos/file/tip/page_load_test/dromaeo, but I'm not sure where the harness bits are, or what decides on __FAILbrowser frozen__FAIL.
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289217283.1289220634.17403.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/08 03:54:43 
s: talos-r3-fed-051

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289205002.1289208539.19610.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/08 00:30:02 
s: talos-r3-w7-020
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289248063.1289251402.28653.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/08 12:27:43 
s: talos-r3-fed-028
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289240344.1289243869.24834.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/08 10:19:04 
s: talos-r3-w7-041
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289255973.1289259763.7655.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/08 14:39:33 
s: talos-r3-w7-005
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289257377.1289260725.12090.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/08 15:02:57 
s: talos-r3-fed-033
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289260194.1289263794.26757.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/08 15:49:54
s: talos-r3-w7-014
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289264684.1289268186.14899.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/08 17:04:44 
s: talos-r3-w7-050

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289265830.1289269364.20234.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/08 17:23:50 
s: talos-r3-w7-027

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289270567.1289273912.8326.gz
Rev3 Fedora 12 tracemonkey talos dromaeo on 2010/11/08 18:42:47 
s: talos-r3-fed-027
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289329845.1289333358.22942.gz
Rev3 WINNT 6.1 tracemonkey talos dromaeo on 2010/11/09 11:10:45 
s: talos-r3-w7-009
>  I may well be wrong about access-nbody - if the only output comes at the _end
>  of a page's cycle, then the significant part of "Cycle 1: loaded
>  http://localhost/page_load_test/dromaeo/sunspider-access-nbody.html (next:
>  http://localhost/page_load_test/dromaeo/sunspider-access-nsieve.html)" would be
>  the "next:" rather than the "loaded".

The output comes at the end of a page's cycle (once the page has fully loaded and timed) so we would be more interested in the 'next' page.  

Timeouts occur after 20 minutes with no output from the browser (longer for mobile tests).
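Since output only appears after a page finishes, the hung page is the one named in "next:". A minimal sketch of reading that out of a log like the ones above, plus a no-output watchdog check, might look like this. It is not the actual Talos code. The regex and the helper names `likely_hung_page` and `is_frozen` are hypothetical; only the log format and the 20-minute figure come from this bug.

```python
import re
from typing import Iterable, Optional

# Matches harness progress lines such as:
# "NOISE: Cycle 1: loaded http://.../sunspider-access-nbody.html (next: http://.../sunspider-access-nsieve.html)"
CYCLE_RE = re.compile(r"Cycle (\d+): loaded (\S+) \(next: (\S+)\)")

# The 20-minute no-output limit described above, in seconds.
NO_OUTPUT_TIMEOUT_S = 20 * 60


def likely_hung_page(log_lines: Iterable[str]) -> Optional[str]:
    """Return the 'next:' URL from the last progress line, or None.

    A progress line is printed only after a page has loaded and been timed,
    so when the browser freezes the page that was running is the 'next:' one,
    not the 'loaded' one.
    """
    hung = None
    for line in log_lines:
        m = CYCLE_RE.search(line)
        if m:
            hung = m.group(3)
    return hung


def is_frozen(last_output_time: float, now: float,
              timeout_s: float = NO_OUTPUT_TIMEOUT_S) -> bool:
    """Hypothetical watchdog check: frozen once there is no output for timeout_s."""
    return now - last_output_time >= timeout_s
```

Fed the lines from the logs above, `likely_hung_page` points at sunspider-access-nsieve.html rather than sunspider-access-nbody.html. That is the reasoning behind the summary change that follows.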
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1289343919.1289347276.25741.gz
Summary: Frequent hangs in dromaeo_sunspider during dromaeo/sunspider-access-nbody.html on TraceMonkey → Frequent hangs in dromaeo_sunspider during sunspider-access-nsieve.html (after sunspider-access-nbody.html) on TraceMonkey
What does it take to get a stack trace from the hung browser at that point?  Can we force a crash in a way that causes breakpad to trigger, or use cdb/ntsd?  That would probably go a long way towards finding the problem.
Alice/Ted: can we hack the scripts on windows so that when the browser hangs we run a command like this?

/c/windows/system32/ntsd.exe -pv -pn firefox.exe -g -noio -c ".dump /ma /u c:\temp\hang.dmp;q"
(cd /c/temp; echo *.dmp; zip -p `ls *.dmp | sed s/dmp/zip/` *.dmp)

That should give us a minidump for the hung process, at a cost of < 100MB accumulated disk space per failure.
(ntsd should be on win2k3 in that location by default, but if not we should probably install the debugging tools on the slaves anyway)
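If the harness hook were written in Python, the ntsd command and the zip step above could look roughly like the sketch below. The ntsd flags and the `.dump /ma /u ...;q` command are copied from the comment. The function names, the default `C:\temp` directory, and the idea of calling `dump_hung_browser` from the "browser frozen" path are assumptions, not existing Talos code. `.dump /u` appends a unique suffix to the file name, which is why the zip step globs for `*.dmp` instead of expecting `hang.dmp` exactly.

```python
import subprocess
import zipfile
from pathlib import Path
from typing import List

# Default location from the comment above; may differ per slave.
NTSD = r"C:\windows\system32\ntsd.exe"


def ntsd_dump_argv(dump_path: str, process_name: str = "firefox.exe") -> List[str]:
    """Build the ntsd invocation quoted above: attach to the named process,
    write a full minidump with '.dump /ma /u', then quit ('q')."""
    return [NTSD, "-pv", "-pn", process_name, "-g", "-noio",
            "-c", f".dump /ma /u {dump_path};q"]


def zip_dumps(dump_dir: str) -> List[Path]:
    """Compress each *.dmp in dump_dir into a sibling .zip, like the shell
    one-liner above; returns the zip files written."""
    written = []
    for dmp in sorted(Path(dump_dir).glob("*.dmp")):
        out = dmp.with_suffix(".zip")
        with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.write(dmp, arcname=dmp.name)
        written.append(out)
    return written


def dump_hung_browser(dump_dir: str = r"C:\temp") -> List[Path]:
    """Hypothetical harness hook (Windows only): run when the no-output
    timeout fires, before the browser is killed."""
    dump_path = str(Path(dump_dir) / "hang.dmp")
    subprocess.run(ntsd_dump_argv(dump_path), check=False)
    return zip_dumps(dump_dir)
```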
The unit test harnesses already use crashinject ( http://mxr.mozilla.org/mozilla-central/source/build/win32/crashinject.cpp ) to produce minidumps and get stacks, it probably wouldn't be hard to make Talos do the same thing.
Attached patch patch
This patch (reducing MAX_MALLOC_BYTES to 80 MB) fixes the hang (on tryserver) and still avoids GC runs during SS.
It is not ideal since we still don't know what the real problem is.
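The trade-off behind MAX_MALLOC_BYTES can be shown with a toy model: a threshold on malloc'd bytes since the last GC. It explains why a lower value (64 MB, mentioned later in this bug) costs GCs during SunSpider while 80 MB avoids them. It says nothing about the hang itself. This is a simplified sketch, not SpiderMonkey's implementation. `MallocGCTrigger`, `gcs_during`, and the 70 MB workload are hypothetical.

```python
from typing import Iterable

MB = 1024 * 1024


class MallocGCTrigger:
    """Toy byte-count GC trigger (hypothetical, not SpiderMonkey code):
    bytes malloc'd since the last GC are counted, and a GC is requested
    once the count reaches max_malloc_bytes."""

    def __init__(self, max_malloc_bytes: int) -> None:
        self.max_malloc_bytes = max_malloc_bytes
        self.malloc_bytes = 0
        self.gc_count = 0

    def on_malloc(self, nbytes: int) -> None:
        self.malloc_bytes += nbytes
        if self.malloc_bytes >= self.max_malloc_bytes:
            self.gc_count += 1     # stand-in for actually running a GC
            self.malloc_bytes = 0  # the counter restarts after each GC


def gcs_during(allocs: Iterable[int], max_malloc_bytes: int) -> int:
    """Number of GCs the trigger would request over a sequence of mallocs."""
    trigger = MallocGCTrigger(max_malloc_bytes)
    for n in allocs:
        trigger.on_malloc(n)
    return trigger.gc_count
```

For a hypothetical benchmark that mallocs 70 MB in 1 MB chunks, `gcs_during([MB] * 70, 64 * MB)` requests one GC, while `gcs_during([MB] * 70, 80 * MB)` requests none.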
Attachment #489816 - Flags: review?(gal)
I think you said yourself why I can't r+ this. We have to know what's going on.
I suspect this is no help at all, but while taras was trying yet again to land the switch to GCC 4.5 this morning, he got a hang in dromaeo_sunspider during nsieve, and unlike us he got to have a stack. http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1289489121.1289492488.19861.gz
Blocks: gcc4.5
(In reply to comment #45)
> I suspect this is no help at all, but while taras was trying yet again to land
> the switch to GCC 4.5 this morning, he got a hang in dromaeo_sunspider during
> nsieve, and unlike us he got to have a stack.
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1289489121.1289492488.19861.gz

Hopefully this is the same bug, so a fix here is the final thing needed for gcc 4.5.
Is it reasonable to expect this bug to get fixed soon or should we put off GCC 4.5 deployment for a few weeks?
Gregor: could you comment on the prognosis here?
Assignee: general → anygregor
From the build provided for me in bug 612445, I'm attempting to isolate a freeze and grab a minidump.
Depends on: 612445
Not sure what to make of http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1290034182.1290037568.32085.gz, since from the outside it appears to be the same hang, but it is on m-c on the rev *before* TM merged, where we shouldn't be seeing this hang.
And http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1290039066.1290042465.22039.gz is post-merge, so it should be this, but the stack in it doesn't look very obviously helpful.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1290041016.1290044664.31016.gz&fulltext=1
Summary: Frequent hangs in dromaeo_sunspider during sunspider-access-nsieve.html (after sunspider-access-nbody.html) on TraceMonkey → Frequent hangs in dromaeo_sunspider during sunspider-access-nsieve.html (after sunspider-access-nbody.html)
Gregor needs access to a windows machine. Filing an IT bug.
I filed an IT bug (bug 613402) for this. If there is no easy way for IT to do this, I can give you access to a box with vmware + vnc, but you have to install windows and the necessary tools.
I hadn't quite thought of it this way, but now that this has been merged to mozilla-central, this bug really does block bug 559964 even if there's no connection. The switch to GCC 4.5 bounced out of mozilla-central because it made dromaeo_sunspider hang in sunspider-access-nsieve.html. We can't turn it on again until the hang from this bug in sunspider-access-nsieve.html is gone; otherwise we can't tell whether a hang comes from this bug or from GCC 4.5.
(In reply to comment #78)
> I filed an IT bug (bug 613402) for this. If there is no easy way for IT to do
> this, I can give you access to a box with vmware + vnc, but you have to install
> windows and the necessary tools.

Any progress here? You merged a very frequent intermittent orange to mozilla-central, comment 61 has a full memory dump that may assist in analysis, but there's been no visible progress here. This is not a great place to be, and someone needs to fix it.
blocking2.0: ? → beta9+
(In reply to comment #121)
> (In reply to comment #78)
> > I filed an IT bug (bug 613402) for this. If there is no easy way for IT to do
> > this, I can give you access to a box with vmware + vnc, but you have to install
> > windows and the necessary tools.
> 
> Any progress here? You merged a very frequent intermittent orange to
> mozilla-central, comment 61 has a full memory dump that may assist in analysis,
> but there's been no visible progress here. This is not a great place to be, and
> someone needs to fix it.

I am still waiting to get access to a windows machine. IT is working on it.
This is really driving people crazy on mozilla-central.  Raising the priority in hopes that it gets some review love soon.
Severity: normal → critical
(In reply to comment #150)
> This is really driving people crazy on mozilla-central.  Raising the priority
> in hopes that it gets some review love soon.

Let's land it, then, and we can try to figure out the underlying problem locally. Andreas, what do you think?
(In reply to comment #162)
> (In reply to comment #150)
> > This is really driving people crazy on mozilla-central.  Raising the priority
> > in hopes that it gets some review love soon.
> 
> Let's land it, then, and we can try to figure out the underlying problem
> locally. Andreas, what do you think?

Yeah I am also for this temporary solution.
It seems we get stuck in the cycle collector:

 	nspr4.dll!_PR_MD_WAIT_CV(_MDCVar * cv, _MDLock * lock, unsigned int timeout)  Line 280 + 0x14 bytes	C
 	nspr4.dll!_PR_WaitCondVar(PRThread * thread, PRCondVar * cvar, PRLock * lock, unsigned int timeout)  Line 204 + 0x17 bytes	C
 	nspr4.dll!PR_WaitCondVar(PRCondVar * cvar, unsigned int timeout)  Line 547 + 0x17 bytes	C
 	xul.dll!mozilla::CondVar::Wait(unsigned int interval)  Line 373 + 0x11 bytes	C++
>	xul.dll!nsCycleCollectorRunner::Collect(nsICycleCollectorListener * aListener)  Line 3362	C++
 	xul.dll!nsCycleCollector_collect(nsICycleCollectorListener * aListener)  Line 3473 + 0xf bytes	C++
 	xul.dll!nsJSContext::CC(nsICycleCollectorListener * aListener)  Line 3635 + 0x9 bytes	C++
 	xul.dll!nsDOMWindowUtils::GarbageCollect(nsICycleCollectorListener * aListener)  Line 657 + 0x9 bytes	C++
 	xul.dll!NS_InvokeByIndex_P(nsISupports * that, unsigned int methodIndex, unsigned int paramCount, nsXPTCVariant * params)  Line 103	C++
 	xul.dll!CallMethodHelper::Invoke()  Line 3058 + 0x1c bytes	C++
 	xul.dll!CallMethodHelper::Call()  Line 2320 + 0x8 bytes	C++
Comment on attachment 489816
patch

(In reply to comment #174)
> (In reply to comment #162)
> > (In reply to comment #150)
> > > This is really driving people crazy on mozilla-central.  Raising the priority
> > > in hopes that it gets some review love soon.
> > 
> > Let's land it, then, and we can try to figure out the underlying problem
> > locally. Andreas, what do you think?
> 
> Yeah I am also for this temporary solution.

OK, let's do it.

We should probably create a new bug for the underlying problem--the comments here are a mess.
Attachment #489816 - Flags: review?(gal) → review+
FYI, I just told ehsan it would be fine to land this patch in m-c, since it's got review and the orange is driving people crazy and all.
http://hg.mozilla.org/mozilla-central/rev/5d4678e9fc37

Let's tentatively call this fixed, and reopen if it happens on future builds!
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Looks to me like it happened again:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1291169503.1291172990.13400.gz

On

http://hg.mozilla.org/mozilla-central/rev/824f8a023254

Which has that revision.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Backed out the patch for being in the range of bug 615736
This doesn't look like it affects gcc 4.5 deployment at all.  Can we remove this as a blocker for bug 578880?
(In reply to comment #197)
> This doesn't look like it affects gcc 4.5 deployment at all.  Can we remove
> this as a blocker for bug 578880?

The test went from intermittent orange to permaorange last time we tried to deploy it, and it's pretty hard to discern if there's a difference without this getting fixed first.
Did it go to permaorange, or was it the way I remember it, that we landed the switch to GCC 4.5, got a single build and a single test run, had a single hang, and backed out? Personally, if I was the one who wanted GCC 4.5, I'd be pushing the switch to the tryserver and then asking releng to run ten sets of dromaeo on it.

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1291301879.1291305511.30391.gz
(In reply to comment #201)
> Personally, if I was the one who wanted GCC 4.5, I'd be pushing
> the switch to the tryserver and then asking releng to run ten sets of dromaeo
> on it.

Sure, that sounds great.  I don't really know how to push on that significantly, though.  If gcc 4.5 didn't cause this to go permaorange (as phil remembers, that we only had one build), I would say it's worth doing the switch again even on non-tryserver if it's possible.
(In reply to comment #205)
> (In reply to comment #201)
> > Personally, if I was the one who wanted GCC 4.5, I'd be pushing
> > the switch to the tryserver and then asking releng to run ten sets of dromaeo
> > on it.
> 
> Sure, that sounds great.  I don't really know how to push on that
> significantly, though.  If gcc 4.5 didn't cause this to go permaorange (as phil
> remembers, that we only had one build), I would say it's worth doing the switch
> again even on non-tryserver if it's possible.

GCC 4.5 triggered the same permaorange as was present on the tracemonkey branch. We then backed out 4.5, and the tracemonkey merge caused m-c to go permaorange.

I later checked on try and gcc 4.5(without tm) causes the same permaorange bug on try given even runs. https://bugzilla.mozilla.org/show_bug.cgi?id=590181#c101
I guess statistics was against me with this fix. I got 3 green tryserver runs with it.
We can also reset the MAX_MALLOC_BYTES to 64 again and take the SS regression if you want to be on the safe side with the GCC switch.
I got access to a windows machine yesterday but I can only start serious debugging next week.
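"3 green tryserver runs" and "ten sets of dromaeo" are statements about probability. If each run hangs independently with some rate p, then n green runs in a row happen with probability (1 - p)^n. The smallest n that makes an all-green streak unlikely is ceil(log(1 - confidence) / log(1 - p)). The sketch below assumes independent runs. The per-run rates used are hypothetical; this bug never measured one.

```python
import math


def p_all_green(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass when each run
    hangs with probability failure_rate."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of consecutive green runs whose probability is at
    most 1 - confidence if the bug were still present at failure_rate."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

At an assumed 30% hang rate per run, three green runs still happen about 34% of the time (`p_all_green(0.3, 3)` is 0.343). Nine green runs are needed before that chance falls below 5% (`runs_needed(0.3)`). At an assumed 10% rate, 29 are needed.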
If we're deadlocked in the CC, maybe bent can help before next week?  I think we have to take the regression for now, though.
(In reply to comment #176)
> It seems we get stuck in the cycle collector:

Since CC is off the main thread now we will block while we wait for CC to finish, so the stack you posted is expected. Is it really deadlocked? What is the CC thread doing?

Sorry, I wasn't CC'd so I didn't know this was happening until today.
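Why the posted stack alone doesn't prove a deadlock: with the cycle collector on its own thread, the main thread is expected to sit in `CondVar::Wait` under `nsCycleCollectorRunner::Collect` until the collector signals completion. A healthy wait and a hang look identical from the main thread's side, so only the collector thread's stack can tell them apart. A small Python analogue of that hand-off, with a timeout added only so the example terminates, is sketched below. `CollectorRunner` and its methods are hypothetical names, not the Gecko classes.

```python
import threading
from typing import Callable


class CollectorRunner:
    """Toy analogue (hypothetical names): the caller hands a collection to a
    background thread, then blocks on a condition variable until that
    thread reports it is done."""

    def __init__(self) -> None:
        self._cv = threading.Condition()
        self._done = False

    def _collector_thread(self, work: Callable[[], object]) -> None:
        work()                     # the actual collection
        with self._cv:
            self._done = True
            self._cv.notify_all()  # wake the waiting caller

    def collect(self, work: Callable[[], object], timeout: float) -> bool:
        """Run `work` on a collector thread and wait for it.

        Returns False if the collector did not finish within `timeout`;
        without a timeout the caller would simply stay blocked in the wait."""
        self._done = False
        t = threading.Thread(target=self._collector_thread, args=(work,), daemon=True)
        t.start()
        with self._cv:
            return self._cv.wait_for(lambda: self._done, timeout=timeout)
```

If `work` never returns, for example because the collector thread is itself blocked on a lock the caller holds, `collect` reports False here. In the browser, with no timeout, that is the hang.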
I haven't seen the hang on tracemonkey for a few days now (there were about 13 builds). Phil, you know more about the frequency of this bug. Do you think it's just luck, or did it get fixed?
Maybe, but it's hard to be confident with the limited number of pushes (zero Sunday, one Saturday, though seven Friday). I'm trying to get a bunch of runs on the last finished TM rev, so we can see.

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1291661655.1291665279.22944.gz
Fun fact: I accidentally failed to say that I wanted extra runs of Win7, so I got WinXP, which... hey, wait, why doesn't WinXP ever hang?

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1291664536.1291668261.3836.gz
We are able to reproduce this hang now. 
I noticed 3 different hangs:

1) The browser freezes in the nsieve benchmark. The main thread waits in the constructor of AutoGCSession. (Probably most common).

2) The dromaeo tab hangs in the nsieve benchmark and doesn't render properly any more but the browser is still responsive and I can open another tab for example.

3) The browser stops for a second at the nsieve benchmark and continues to always execute 2 benchmarks in parallel afterwards.
Depends on: 617505
Status: NEW → RESOLVED
Closed: 14 years ago
No longer depends on: 612445, 617505
Resolution: --- → DUPLICATE
As per today's meeting, beta 9 will be a time-based release. Marking these all betaN+. Please move it back to beta9+ if you believe it MUST be in the next beta (ie: trunk is in an unshippable state without this)
No longer blocks: 438871, gcc4.5, 598650
blocking2.0: beta9+ → betaN+
Keywords: regression
Keywords: regression
Blocks: gcc4.5, 598650
Whiteboard: [orange]