Closed Bug 591898 Opened 14 years ago Closed 12 years ago

[Linux SeaMonkey 2.1, crashtest] 457362-1.xhtml segfaults on tinderbox but not locally

Categories

(SeaMonkey :: Testing Infrastructure, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: iannbugzilla, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

On 13th August the SeaMonkey Linux comm-central-trunk debug test crashtest was switched from cn-sea-qm-centos5-01 to cb-seamonkey-linux-01; since then, crashtest 457362-1.xhtml has segfaulted on that tinderbox with:
TEST-UNEXPECTED-FAIL | file:///builds/slave/comm-central-trunk-linux-debug-unittest-crashtest/build/reftest/tests/layout/base/crashtests/457362-1.xhtml | Exited with code -11 during test run
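The "Exited with code -11" in that log line matches the common POSIX convention in which a negative exit code means the process was killed by the signal with that number; signal 11 on Linux is SIGSEGV. A minimal Python sketch of that decoding, assuming the harness reports exit codes the way Python's subprocess module does (the helper name describe_exit_code is hypothetical, not part of the test harness):

```python
import signal


def describe_exit_code(code):
    """Describe a subprocess-style exit code.

    Assumption: a negative value means "killed by signal -code", as with
    subprocess.Popen.returncode on POSIX systems.
    """
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_exit_code(-11))
```

On a Linux box this should report signal 11 as SIGSEGV, which is why the failure is described as a segfault.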

If you run the tests locally with:
make crashtest TEST_PATH=layout/base/crashtests/crashtests.list
there is no segfault:
REFTEST INFO | Result summary:
REFTEST INFO | Successful: 305 (0 pass, 305 load only)
REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
REFTEST INFO | Known problems: 2 (0 known fail, 0 known asserts, 0 random, 2 skipped, 0 slow)
REFTEST INFO | Total canvas count = 0
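To compare a local run against a tinderbox log, the summary counts can be pulled out of the log text. This is only a sketch that assumes the summary lines always have the shape shown above; parse_summary is a hypothetical helper, not something the reftest harness provides:

```python
import re

# Matches the three count lines of the "Result summary" block shown above.
SUMMARY_LINE = re.compile(
    r"^REFTEST INFO \| (Successful|Unexpected|Known problems): (\d+)"
)


def parse_summary(log_text):
    """Return a dict mapping each summary category found to its total."""
    totals = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


sample = """\
REFTEST INFO | Result summary:
REFTEST INFO | Successful: 305 (0 pass, 305 load only)
REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
REFTEST INFO | Known problems: 2 (0 known fail, 0 known asserts, 0 random, 2 skipped, 0 slow)
REFTEST INFO | Total canvas count = 0
"""

print(parse_summary(sample))
```

A run that dies with a segfault may never print this block at all, so an empty result is itself a hint that the run did not finish.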

The pushes that happened in the window between passing and failing are:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=29114207a571&tochange=e1d20276ef6d
http://hg.mozilla.org/comm-central/pushloghtml?fromchange=16d674d435cf&tochange=579f8b02ac29
Seems more like releng territory, punting to them.
Assignee: server-ops → nobody
Component: Server Operations: Tinderbox Maintenance → Release Engineering
QA Contact: mrz → release
(In reply to comment #0)
> On 13th August the SeaMonkey Linux comm-central-trunk debug test crashtest was
> switched from cn-sea-qm-centos5-01 to cb-seamonkey-linux-01,

It was never switched between any boxes; it just executes on whatever box is free for it, and those two are just two random possibilities out of the five we have. If it only fails on a single box, it might be a box problem and we need to find it; otherwise this is a code problem, which from all I've seen is far more likely.
Component: Release Engineering → Testing Infrastructure
Product: mozilla.org → SeaMonkey
QA Contact: release → testing-infrastructure
Version: other → Trunk
Summary: Crashtest 457362-1.xhtml segfaults on tinderbox but not locally → [SeaMonkey 2.1, crashtest] 457362-1.xhtml segfaults on tinderbox but not locally
It's just strange that the tests work locally but don't on tinderbox; someone with tinderbox access would need to debug the seg fault. Perhaps there is a package requirement introduced by one of the pushes in http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=29114207a571&tochange=e1d20276ef6d
(In reply to comment #3)
> It's just strange that the tests work locally but don't on tinderbox, someone
> with tinderbox access would need to debug the seg fault. Perhaps a package
> requirement introduced by one of the pushes in
> http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=29114207a571&tochange=e1d20276ef6d

I haven't seen any trace of any such thing, at least for Linux. But then, all the OSes tested for Firefox are different from what we test right now, because we can't afford Talos minis. They run all their tests exclusively on such boxes, with different OSes than the builder machines, while we run them all on the builder machines. So if anything is needed on test boxes only, we don't see it, and we don't have much of a chance to get the right packages.

Oh, and I doubt anyone with debug knowledge and access to our build boxes exists, unless Callek knows enough (I don't), and we have few enough boxes that we can't just set one aside for a longer period to dig in very much.
I got a hang during that test, attached ddd to it, and got the output in the attachment. I cut it after line #40 since it went on endlessly with the same output around nsWindow::OnExposeEvent (I hit Ctrl+C at about #6000).

But it's not completely reproducible; I also got clean test runs without any crash or hang for that test.
I let the test run again and at some point got the following output:

REFTEST TEST-START | file:///home/i6stud/sibresch/nobackup/seamonkey_hg/trees/comm-central/mozilla/layout/base/crashtests/457362-1.xhtml
++DOMWINDOW == 92 (0x2ab7fff84468) [serial = 1526] [outer = 0x2ab7f80b0c00]
WARNING: g_closure_ref: assertion `closure->ref_count < CLOSURE_MAX_REF_COUNT' failed: 'glib warning', file /home/i6stud/sibresch/nobackup/seamonkey_hg/trees/comm-central/mozilla/toolkit/xre/nsSigHandlers.cpp, line 193

(seamonkey-bin:31479): GLib-GObject-CRITICAL **: g_closure_ref: assertion `closure->ref_count < CLOSURE_MAX_REF_COUNT' failed
WARNING: g_closure_ref: assertion `closure->ref_count < CLOSURE_MAX_REF_COUNT' failed: 'glib warning', file /home/i6stud/sibresch/nobackup/seamonkey_hg/trees/comm-central/mozilla/toolkit/xre/nsSigHandlers.cpp, line 193

The second block was repeated endlessly again. I interrupted the test again, and by then crashtest.log had 4.7 million lines.
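With a log of that size, counting identical lines is a quick way to confirm which messages make up the repetition. The following is a rough sketch, not part of the test harness; most_repeated_lines is a hypothetical helper, and the sample lines are shortened stand-ins for the real output above:

```python
from collections import Counter


def most_repeated_lines(lines, top=3):
    """Count identical non-blank lines and return the `top` most frequent.

    `lines` can be any iterable of strings, including an open file, so a
    multi-million-line log is streamed rather than loaded at once.
    """
    counts = Counter(line.rstrip("\n") for line in lines if line.strip())
    return counts.most_common(top)


# Shortened stand-ins for the repeated block; not the literal log text.
sample = [
    "REFTEST TEST-START | 457362-1.xhtml\n",
    "GLib-GObject-CRITICAL **: g_closure_ref: assertion failed\n",
    "WARNING: g_closure_ref: assertion failed: 'glib warning'\n",
    "\n",
    "GLib-GObject-CRITICAL **: g_closure_ref: assertion failed\n",
    "WARNING: g_closure_ref: assertion failed: 'glib warning'\n",
    "\n",
    "GLib-GObject-CRITICAL **: g_closure_ref: assertion failed\n",
]

for text, count in most_repeated_lines(sample):
    print(count, text)
```

For a real run, the same function could be given an open file object, for example the crashtest.log mentioned above.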
m-c rev 29114207a571 doesn't give me a hang, but m-c rev 9fd11a17eb1a does (all tests run with a 64-bit SeaMonkey debug build). I sometimes also get a warning after 457362-1.xhtml is loaded, but I have no idea whether this is related:

WARNING: ContentViewer exists outside gHistoryMaxViewer range: '!viewer', file /home/i6stud/sibresch/nobackup/seamonkey_hg/trees/comm-central/mozilla/docshell/shistory/src/nsSHistory.cpp, line 846

The strange thing is that I don't see the issue if I only run the crashtests from mozilla/layout/base/. I have to run make crashtest in mozilla/.

CCing Markus as the patch author of bug 506826 and roc/dbaron as the reviewers. Perhaps they can tell us more about this issue.
Additional info: every time I attach ddd, I get a different beginning of the stack trace. Only the repeating OnExposeEvent calls are the same.
I've tested again on another machine with a newer kernel (2.6.31.12 instead of 2.6.27.45) and a newer Xorg (1.6.5 instead of 1.5.2) and can't see the problem there. So maybe it's an issue with the OS itself :/
(In reply to comment #9)
> I've tested again on another machine with a newer kernel (2.6.31.12 instead of
> 2.6.27.45) and a newer Xorg (1.6.5 instead of 1.5.2) and can't see the problem
> there. So maybe it's an issue with the OS itself :/

Make sure what you test is a debug build, as you found an assertion failure there, and assertions are fatal on debug but ignored on optimized builds.

It's of course entirely possible that the GTK version in CentOS 5 has some subtle problem we are running into there, but while we can upgrade it in some way as long as it doesn't harm the runtime requirements of the builds generated on the same machines, we'd need RPMs applicable to this OS. And we can't run tests on any other platform, as we can't afford the luxury of running Talos with a different set of machines and OSes like FF does.
I've tested again on a debug build, still not crashing.
My kernel is 2.6.32.19-163.fc12.i686
My xorg-x11-server-Xorg is 1.7.6-4.fc12.i686
My gtk2 is 2.18.9-3.fc12.i686
My gcc is 4.4.4 2010630 (Red Hat 4.4.4-10)
Summary: [SeaMonkey 2.1, crashtest] 457362-1.xhtml segfaults on tinderbox but not locally → [Linux SeaMonkey 2.1, crashtest] 457362-1.xhtml segfaults on tinderbox but not locally
Whiteboard: [orange]
Depends on: 587189
Mass marking whiteboard:[orange] bugs WFM (to clean up TBPL bug suggestions) that:
* Haven't changed in > 6 months
* Whose whiteboard contains none of the strings: {disabled,marked,random,fuzzy,todo,fails,failing,annotated,leave open,time-bomb}
* Passed a (quick) manual inspection of bug summary/whiteboard to ensure they weren't a false positive.

I've also gone through and searched for cases where the whiteboard wasn't labelled correctly after test disabling, by using attachment description & basic comment searches. However, if the test that this bug was about has in fact been disabled/annotated/..., please accept my apologies & reopen/mark the whiteboard appropriately so this doesn't get re-closed in the future (and please ping me via IRC or email so I can try to tweak the saved searches to avoid more edge cases).

Sorry for the spam! Filter on: #FFA500
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Whiteboard: [orange]