Bug 431745
Opened 16 years ago
Closed 16 years ago
qm-centos5-02 is intermittently failing test_sleep_wake.js
Categories
(Toolkit :: Downloads API, defect, P2)
RESOLVED
FIXED
People
(Reporter: Gavin, Assigned: sdwilsh)
References
Details
(Keywords: fixed1.9.0.2, intermittent-failure)
Attachments
(2 files, 2 obsolete files)
patch, 6.92 KB
patch, 2.64 KB
qm-centos5-02 is failing, but qm-centos5-01 and qm-centos5-03 are green, so it seems likely that whatever's wrong is specific to that machine. Can it be restarted?
Comment 1•16 years ago
This machine is not on the Tier 1 support document. http://wiki.mozilla.org/Buildbot/IT_Support_Document
Assignee: server-ops → nobody
Component: Server Operations: Tinderbox Maintenance → Release Engineering
QA Contact: justin → release
Comment 2•16 years ago
(In reply to comment #1)
> This machine is not on the Tier 1 support document.

Filed bug 431784 on that documentation bug.
Updated•16 years ago
|
OS: Windows Vista → Linux
Comment 3•16 years ago
qm-centos-02 and qm-centos-03 were newly added; see bug#425791 for details. The idea was that, so long as two of the three identical machines stayed consistent, the overall need for tier-1 pager support would be reduced. I note that qm-centos-01 has also gone orange. We're investigating whether this is really a unit test machine problem impacting multiple machines or whether any code landings could be causing this.
Component: Release Engineering → Release Engineering: Maintenance
Priority: -- → P1
Reporter
Comment 4•16 years ago
(In reply to comment #3)
> I note that qm-centos-01 has also gone orange. We're investigating if this is
> really a unit test machine problem impacting multiple machines or if any code
> landings could be causing this.

qm-centos5-01 wasn't orange when I filed this bug, and hasn't been consistently orange like qm-centos5-02 (it has been sporadically orange, but that's somewhat normal for unit test machines). Given comment 0, and the fact that it's been consistently orange for days, can we just reboot the machine? It's extremely unlikely that the test would be failing on only one of the 3 identical machines this consistently due to a code problem.
Updated•16 years ago
Assignee: nobody → ccooper
Comment 6•16 years ago
Slave restarted.
Unfortunately, rebooting didn't seem to help.
Reporter
Comment 8•16 years ago
I've hidden the box from the waterfall, since its perma-orange status is misleading people into thinking they can't check in. The other two machines are both green.
Comment 9•16 years ago
Adding to the fun was nthomas's discovery this morning that some drives were read-only; see bug#432012. The machine had already failed out with the "make check" errors, but this adds to the fun. To get around that, we've restarted the VM just now. We also discovered qm-centos5-02 was configured with low RAM, and bumped it from 512MB to 1024MB while we were rebooting.
Comment 10•16 years ago
Taking this bug, as coop is on leave. I note that the most recent runs on qm-centos5-02 are all failing out during "make -f client.mk checkout" with:

/bin/sh: mozilla/.mozconfig.out: Read-only file system
Adding client.mk options from /builds/slave/trunk_centos5_2/mozilla/.mozconfig:
    MOZ_CO_PROJECT=browser
    MOZ_OBJDIR=$(TOPSRCDIR)/objdir
    MOZ_CO_MODULE=mozilla/testing/tools
rm: cannot remove `.mozconfig.out': Read-only file system
make: *** [checkout] Error 1
program finished with exit code 2
Assignee: ccooper → joduinn
Status: ASSIGNED → NEW
Comment 11•16 years ago
Updating the kernel fixed the read-only drive; see details in https://bugzilla.mozilla.org/show_bug.cgi?id=407796#c64. The next run passed green, so I've added the machine back onto tinderbox and closed this bug. Please reopen if this happens again.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 12•16 years ago
Unfortunately it started failing in make check on the second and subsequent runs:

../../../../_tests/xpcshell-simple/test_dm/unit/test_sleep_wake.js:
command timed out: 2400 seconds without output, killing pid 2743

Reopening, and re-hidden the machine from the Firefox tree.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 13•16 years ago
Ugh. On the buildbot waterfall, I see this running clean since at least 15:43 this afternoon, which is as far back as the waterfall page goes. I'm still tracking back looking for the builds that failed out. I won't put this back on tinderbox until I find the breaking build and see what happened.
Comment 14•16 years ago
Changing summary to match the new symptoms, as the previous problem is now fixed.
Summary: qm-centos5-02 is failing |make check| → qm-centos5-02 is intermittently failing test_sleep_wake.js
Updated•16 years ago
Assignee: joduinn → rcampbell
Status: REOPENED → NEW
Comment 15•16 years ago
Currently this VM is failing a different check:

../../../../_tests/xpcshell-simple/test_dm/unit/test_resume.js:
/builds/slave/trunk_centos5_2/mozilla/tools/test-harness/xpcshell-simple/test_all.sh: line 111: 18021 Segmentation fault (core dumped)

That's from the most-recent run.
Comment 16•16 years ago
From an IRC conversation this morning:

08:11 < nthomas> all three of qm-centos5-01,02,03 are VM's. 02 is on netapp-b-vmware, the other two on netapp-d-fcal1

This is certainly one difference between the VMs. Could it account for these failures? I have no idea.
Comment 17•16 years ago
Since this was filed, we've seen this failure on a few other machines, notably qm-centos5-moz2-01. Has anyone looked at the test code at all?
Assignee
Comment 18•16 years ago
(In reply to comment #17)
> Has anyone looked at the test code at all?

Me! I don't see why on earth it'd be failing. It's a straightforward test case. The only issue would be if these tests time out, but I don't think xpcshell ones do...
Assignee
Comment 19•16 years ago
Note: the previous post was about test_sleep_wake.js. If the question is about test_resume.js, that test looks pretty straightforward, so a backtrace on it would be really helpful.
Comment 20•16 years ago
No, that's right; I was still asking about test_sleep_wake.js. I don't know why this could be failing either, but the only thing I can think of is some weird timing interaction on VMs due to emulated clocks. It sounds wonky, and I'd expect to see other failures if that were the case, but maybe this one test exercises the timers in some subtle way. Another question: is this test valuable, or can we mark it random on Linux?
Assignee | ||
Comment 21•16 years ago
I think it's valuable, but then I'm the module owner and think all the tests in my module are valuable... That, and there's no way to mark xpcshell tests as random as far as I know.
Comment 22•16 years ago
yeah, I guess we'd have to do some makefile trickery to exclude it. We're going to try some real hardware and hopefully this problem will go away then. Sounds like there's nothing to do here until we get a different setup.
Assignee
Comment 23•16 years ago

Just saw this. I'm not seeing the test fail, but I see gtk assertions:

(process:31466): Gdk-CRITICAL **: gdk_screen_get_root_window: assertion `GDK_IS_SCREEN (screen)' failed
(process:31466): Gdk-CRITICAL **: gdk_window_new: assertion `GDK_IS_WINDOW (parent)' failed
(process:31466): Gdk-CRITICAL **: gdk_window_set_user_data: assertion `window != NULL' failed
(process:31466): Gtk-CRITICAL **: gtk_style_attach: assertion `window != NULL' failed
(process:31466): Gdk-CRITICAL **: gdk_window_set_back_pixmap: assertion `GDK_IS_WINDOW (window)' failed
(process:31466): Gdk-CRITICAL **: gdk_drawable_get_colormap: assertion `GDK_IS_DRAWABLE (drawable)' failed
(process:31466): Gdk-CRITICAL **: gdk_window_set_background: assertion `GDK_IS_WINDOW (window)' failed
(process:31466): Gtk-CRITICAL **: gtk_style_set_background: assertion `GTK_IS_STYLE (style)' failed
(process:31466): Gtk-CRITICAL **: _gtk_style_peek_property_value: assertion `GTK_IS_STYLE (style)' failed
Assignee
Comment 24•16 years ago

I hit an interesting issue with this test locally the other day. I had a VM in the background compiling Mozilla, and I was compiling it on the host machine as well. While doing that, I was playing with a test I'm adding that is essentially the same as this one, and (I'm guessing) since the computer was under such load, the download actually finished before we could pause it, which caused a test failure. However, that doesn't seem to be what has been reported here, so I don't think that's what we are hitting.
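The race described above can be sketched in plain JavaScript. This is only an illustration of the timing problem, not the real download manager API: `tryPause` and the state constants below are hypothetical names standing in for the actual nsIDownloadManager machinery. Under heavy load, the download can reach a finished state before the test gets a chance to pause it, so the pause attempt comes too late.

```javascript
// Hypothetical download states, loosely modeled on the download manager's
// numeric state codes; these names are illustrative, not the real constants.
const STATE_DOWNLOADING = 0;
const STATE_FINISHED = 1;
const STATE_PAUSED = 4;

// Attempt to pause a download; returns true only if the download was still
// in progress. If the data already finished transferring (as can happen on
// a heavily loaded machine), pausing is no longer possible.
function tryPause(download) {
  if (download.state !== STATE_DOWNLOADING) {
    return false; // too late: the download already completed (or failed)
  }
  download.state = STATE_PAUSED;
  return true;
}

const fast = { state: STATE_FINISHED };    // finished before we could pause
const slow = { state: STATE_DOWNLOADING }; // still transferring
console.log(tryPause(fast)); // false: the race described in comment 24
console.log(tryPause(slow)); // true
```

A retry wrapper around `tryPause` is exactly the mitigation comment 25 floats below: detect the failed pause and rerun the test a few times.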
Comment 25•16 years ago
Interesting. We could certainly handle that failure condition by checking whether we were able to pause and, if not, retrying the test a few times. The symptom we're seeing on the VMs is a total freeze of the test. I've never actually observed one failing in real time, though. Not that there'd be much to see, I expect. The VMs definitely add some timing strangeness to these types of things.
Assignee
Comment 26•16 years ago

Well, when it fails, the unit test actually fails: an exception is thrown that isn't caught.
Updated•16 years ago
Status: NEW → ASSIGNED
Reporter
Comment 27•16 years ago

My theory is that http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/toolkit/components/downloads/test/unit/test_sleep_wake.js&rev=1.1&mark=71,72#69 is just taking a very long time on the VM, causing the test to time out. I'd like to land this patch temporarily to test that theory.
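The theory above, that a wait step simply takes too long on a slow VM, is the classic poll-until-condition-or-deadline pattern. A minimal sketch (the names `pollUntil` and its parameters are hypothetical, not taken from the test): if the condition takes longer to become true than the harness-level timeout (e.g. buildbot's 2400-seconds-without-output limit), the harness kills the run first.

```javascript
// Poll a condition until it holds or a deadline passes. On a slow VM the
// condition may take far longer to become true, and an outer harness
// timeout can fire before the inner loop ever succeeds.
function pollUntil(condition, timeoutMs) {
  const deadline = Date.now() + timeoutMs;
  let ticks = 0;
  while (Date.now() < deadline) {
    ticks++;
    if (condition()) return { ok: true, ticks };
    // (a real xpcshell test would spin the event loop here, not busy-wait)
  }
  return { ok: false, ticks };
}

let count = 0;
const result = pollUntil(() => ++count >= 5, 1000);
console.log(result.ok); // true: condition met well before the deadline
```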
Reporter
Comment 28•16 years ago
I wonder whether this failure might be caused by bug 443843.
Assignee
Comment 29•16 years ago

Comment on attachment 328274 [details] [diff] [review]
debugging patch

r=sdwilsh for the download manager changes, but please fix the indentation (and I don't care if it stays enabled or not).
Attachment #328274 - Flags: review+
Reporter
Comment 30•16 years ago

Caught this on qm-centos5-03: http://pastebin.mozilla.org/479732
Log of a pass on my machine: http://pastebin.mozilla.org/479733

So it seems as though we're just not getting DOWNLOAD_FINISHED. I'm going to land:

     do_test_finished();
-  }
+  } else
+    dump("%%%aDl.state: " + aDl.state + "\n");

and see what we are getting.
Reporter
Comment 31•16 years ago
Caught this again, and it's getting DOWNLOAD_FAILED instead of DOWNLOAD_FINISHED in the failure case. Not sure how to proceed here... sdwilsh, any suggestions?
Reporter
Comment 32•16 years ago
Oh, I missed the status in the output. It's failing because of NS_ERROR_ENTITY_CHANGED, which is one of the cases at: http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/netwerk/protocol/http/src/nsHttpChannel.cpp&rev=1.333#920
Assignee
Comment 33•16 years ago
Are we hitting http://mxr.mozilla.org/mozilla-central/source/toolkit/components/downloads/test/unit/test_sleep_wake.js#91 perhaps?
Reporter
Comment 34•16 years ago

Landed a couple more debugging patches:
http://hg.mozilla.org/mozilla-central/index.cgi/rev/679d24253049
http://hg.mozilla.org/mozilla-central/index.cgi/rev/e6f889658bf5
Reporter
Comment 35•16 years ago

(In reply to comment #33)
> Are we hitting
> http://mxr.mozilla.org/mozilla-central/source/toolkit/components/downloads/test/unit/test_sleep_wake.js#91
> perhaps?

Yep:

%%%meta.getHeader('Range'): bytes=101111-
%%%from: 101111
%%%to: 101110
%%%data.length: 101111
%%% Returning early - from >= data.length

From: http://hg.mozilla.org/mozilla-central/index.cgi/file/ad1fc6c8e351/toolkit/components/downloads/test/unit/test_sleep_wake.js#l84

So we're hitting:

1090239168[90ec548]: Unexpected response status while resuming, aborting [this=92d9900]
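The failure mode in the dump above can be reproduced with a few lines of plain JavaScript (a sketch with hypothetical helper names, not the httpd.js code): the resume request asks for `bytes=101111-`, but the served body is only 101111 bytes long, so valid offsets run 0 through 101110 and the requested start is past the end. The server can't satisfy that range, and the channel surfaces it as a resume failure.

```javascript
// Parse the start offset out of an open-ended Range header such as
// "bytes=101111-". Returns NaN if the header isn't in that form.
function rangeStart(rangeHeader) {
  const m = /^bytes=(\d+)-$/.exec(rangeHeader);
  return m ? Number(m[1]) : NaN;
}

// A resume request is satisfiable only if it starts before the end of the
// entity. With data.length === 101111, valid offsets are 0..101110.
function rangeSatisfiable(rangeHeader, dataLength) {
  const from = rangeStart(rangeHeader);
  return Number.isFinite(from) && from < dataLength;
}

console.log(rangeStart("bytes=101111-"));               // 101111
console.log(rangeSatisfiable("bytes=101111-", 101111)); // false: comment 35's case
console.log(rangeSatisfiable("bytes=50000-", 101111));  // true
```

In other words, by the time the test tried to resume, the whole body had already been delivered, so there was nothing left for the range to cover.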
Reporter
Comment 36•16 years ago
So does that mean that we're pausing the download after it's already complete, but before we received FINISHED?
Assignee
Comment 37•16 years ago
I think so. Tomorrow I'll try to cook up a patch that just pumps data until the test is actually done.
Assignee
Comment 38•16 years ago

OK, I stand corrected. It doesn't look like it's possible to keep pumping data until we are done. So... I guess we have two options:

1) Make the test not fail if it hits this condition. This kinda sucks, since it seems to happen often on certain boxes.
2) Make the data we send out bigger, so that the test has more time to finish.
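Option 1 can be sketched as follows. This is a hypothetical illustration of the approach, not the patch that landed: `handleResume` and the `sawEarlyFinish` flag are invented names. Instead of failing outright when the resume offset is past the end of the data, the handler records that the condition occurred and treats it as an allowed outcome.

```javascript
// Hypothetical flag mirroring the "known, non-fatal condition" approach:
// a resume offset at or past the end of the data means the download
// effectively finished before the pause took effect.
let sawEarlyFinish = false;

function handleResume(from, dataLength) {
  if (from >= dataLength) {
    // Nothing left to serve; not an error for this test's purposes.
    sawEarlyFinish = true;
    return { served: 0, earlyFinish: true };
  }
  return { served: dataLength - from, earlyFinish: false };
}

const outcome = handleResume(101111, 101111);
console.log(outcome.earlyFinish); // true: the intermittent case from comment 35
console.log(handleResume(0, 101111).served); // 101111: the normal full resume
```

The trade-off, as comment 39 argues below the original text, is slightly weaker coverage on the boxes that hit this path, in exchange for a tree that doesn't go orange at random.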
Reporter
Comment 39•16 years ago

(In reply to comment #38)
> 1) Make the test not fail if it hits this condition. This kinda sucks since it
> seems to happen often on certain boxes.

A tree that turns orange intermittently sucks a lot more. We'll still have the test coverage from all of the boxes that are passing now, so we'll be no worse off (in fact we'll likely be better off, since if the test fails for some other reason we'll notice the failure instead of assuming it's the same known bustage).
Comment 40•16 years ago
I agree that slightly reduced test coverage is a fair tradeoff for less random orange. What you ought to do is file a new bug on improving the test, blocked on the httpd.js changes, and reference it in a comment in that test, so that at some point we can improve httpd.js and then make the test better.
Assignee
Comment 41•16 years ago
This should do the trick.
Assignee: lukasblakk → sdwilsh
Attachment #328274 - Attachment is obsolete: true
Attachment #328386 - Flags: review?(gavin.sharp)
Assignee
Updated•16 years ago
Whiteboard: [has patch][needs review gavin]
Reporter
Comment 42•16 years ago

Comment on attachment 328386 [details] [diff] [review]
v1.0

Can you revert the test_all.sh changes too?
Attachment #328386 - Flags: review?(gavin.sharp) → review+
Comment 43•16 years ago
Comment on attachment 328386 [details] [diff] [review]
v1.0

Just a drive-by here, but:

+let doNotError = false;
+  dump("Returning early - from >= data.length. Not an error (bug 431745)\n");
+  doNotError = true;
+  // this is only ok if we are not supposed to fail
+  do_check_true(doNotFail);

Is that last one supposed to be doNotError like the previous two?
Assignee
Comment 44•16 years ago

(In reply to comment #43)
> is that last one supposed to be doNotError like the previous two?

Yeah... I'm having difficulties triggering this error locally, clearly :)
Assignee
Comment 45•16 years ago
Addresses comments.
Attachment #328386 - Attachment is obsolete: true
Assignee
Updated•16 years ago
Whiteboard: [has patch][needs review gavin] → [has patch][has review][can land]
Assignee
Comment 46•16 years ago
I'm also going to land the test change on branch. It's a test change, so I don't need approval.
Assignee
Comment 47•16 years ago
Pushed to mozilla-central: http://hg.mozilla.org/mozilla-central/index.cgi/rev/aadfa8776a70
Status: ASSIGNED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Whiteboard: [has patch][has review][can land]
Reporter
Updated•16 years ago
Component: Release Engineering: Maintenance → Download Manager
Product: mozilla.org → Firefox
QA Contact: release → download.manager
Version: other → Trunk
Assignee
Comment 48•16 years ago
Assignee
Comment 49•16 years ago
Checking in toolkit/components/downloads/test/unit/test_sleep_wake.js; new revision: 1.2; previous revision: 1.1
Keywords: fixed1.9.0.2
Updated•16 years ago
Product: Firefox → Toolkit
Updated•15 years ago
Whiteboard: [orange]
Updated•12 years ago
Keywords: intermittent-failure
Updated•12 years ago
Whiteboard: [orange]