Closed Bug 431745 Opened 12 years ago Closed 12 years ago

qm-centos5-02 is intermittently failing test_sleep_wake.js

Categories

(Toolkit :: Downloads API, defect, P2, critical)

x86
Linux
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: Gavin, Assigned: sdwilsh)

References

Details

(Keywords: fixed1.9.0.2, intermittent-failure)

Attachments

(2 files, 2 obsolete files)

qm-centos5-02 is failing, but qm-centos5-01 and qm-centos5-03 are green, so it seems likely that whatever's wrong is specific to that machine. Can it be restarted?
This machine is not on the Tier 1 support document.
http://wiki.mozilla.org/Buildbot/IT_Support_Document
Assignee: server-ops → nobody
Component: Server Operations: Tinderbox Maintenance → Release Engineering
QA Contact: justin → release
Depends on: 431784
(In reply to comment #1)
> This machine is not on the Tier 1 support document.

Filed bug 431784 on that documentation bug.
OS: Windows Vista → Linux
qm-centos5-02 and qm-centos5-03 were newly added; see bug#425791 for details. The idea was that, so long as two of the three identical machines were consistent, the overall need for tier 1 pager support would be reduced.

I note that qm-centos5-01 has also gone orange. We're investigating whether this is
really a unit test machine problem impacting multiple machines or whether any code landings could be causing this.
Component: Release Engineering → Release Engineering: Maintenance
Priority: -- → P1
(In reply to comment #3)
> I note that qm-centos5-01 has also gone orange. We're investigating if this is
> really a unit test machine problem impacting multiple machines or if any code
> landings could be causing this.

qm-centos5-01 wasn't orange when I filed this bug, and hasn't been consistently orange like qm-centos5-02 (it has been sporadically orange, but that's somewhat normal for unit test machines).

Given comment 0, and the fact that it's been consistently orange for days, can we just reboot the machine? It's extremely unlikely that the test would be failing on only one of the 3 identical machines this consistently due to a code problem.
Assignee: nobody → ccooper
Rebooting now.
Status: NEW → ASSIGNED
Slave restarted.
Unfortunately, rebooting didn't seem to help.
I've hidden the box from the waterfall since its perma-orange status is misleading people into thinking they can't check in. The other two machines are both green.
Adding to the fun was nthomas's discovery that some drives were read-only this morning. See bug#432012. The machine had already failed out with the "make check" errors, but this adds to the fun. To get around that, we've restarted the VM just now.

Also, discovered qm-centos5-02 was configured with low RAM. We bumped it up from 512->1024 while we were rebooting. 
Taking this bug, as coop is on leave.

I note that the most recent runs on qm-centos5-02 are all failing out during 
"make -f client.mk checkout" with:

/bin/sh: mozilla/.mozconfig.out: Read-only file system
Adding client.mk options from /builds/slave/trunk_centos5_2/mozilla/.mozconfig:
    MOZ_CO_PROJECT=browser
    MOZ_OBJDIR=$(TOPSRCDIR)/objdir
    MOZ_CO_MODULE=mozilla/testing/tools
rm: cannot remove `.mozconfig.out': Read-only file system
make: *** [checkout] Error 1
program finished with exit code 2
Assignee: ccooper → joduinn
Status: ASSIGNED → NEW
Updating the kernel fixed the read-only drive, see details in https://bugzilla.mozilla.org/show_bug.cgi?id=407796#c64. The next run passed green, so I've added machine back onto tinderbox and closed this bug.

Please reopen if this happens again.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Unfortunately it started failing in make check on the second and subsequent runs:

../../../../_tests/xpcshell-simple/test_dm/unit/test_sleep_wake.js: 
command timed out: 2400 seconds without output, killing pid 2743

Reopening, and re-hidden from Firefox tree.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
ugh. On buildbot waterfall, I see this is running clean since at least 15:43 this afternoon, which is as far back as the waterfall page goes. I'm still tracking back looking for the builds that failed out.

I won't put this back on tinderbox again until I find the breaking build and see what happened.
Changing summary to match new symptoms, as previous problem now fixed.
Summary: qm-centos5-02 is failing |make check| → qm-centos5-02 is intermittently failing test_sleep_wake.js
Assignee: joduinn → rcampbell
Status: REOPENED → NEW
currently this vm is failing a different check:

../../../../_tests/xpcshell-simple/test_dm/unit/test_resume.js: /builds/slave/trunk_centos5_2/mozilla/tools/test-harness/xpcshell-simple/test_all.sh: line 111: 18021 Segmentation fault      (core dumped) 

That's from the most-recent run.
from an irc conversation this morning.

08:11 < nthomas> all three of qm-centos5-01,02,03 are VM's.  02 is on netapp-b-vmware, the other two on netapp-d-fcal1

this is certainly one difference between the different VMs. Could it account for these failures? I have no idea.
since this was filed, we've seen this failure on a few other machines. Notably qm-centos5-moz2-01.

Has anyone looked at the test code at all?
(In reply to comment #17)
> Has anyone looked at the test code at all?
Me!  I don't see why on earth it'd be failing.  It's a straightforward test case.

The only issue would be if these tests time out, but I don't think xpcshell ones do...
Note: previous post was about test_sleep_wake.js.  If it's about test_resume.js, it looks pretty straightforward, so a backtrace on that would be really helpful.
no, that's right. I was still asking about test_sleep_wake.js. I don't know why this could be failing either, but the only thing I can think of is that there's some weird timing interaction on VMs due to emulated clocks. It sounds wonky and I'd expect to see other failures if that were the case, but maybe this one test exercises the timers in some subtle way.

Another question: Is this test valuable or can we mark it random on linux?
I think it's valuable, but then I'm the module owner and think all the tests in my module are valuable...

That, and there's no way to mark xpcshell tests as random as far as I know.
yeah, I guess we'd have to do some makefile trickery to exclude it. We're going to try some real hardware and hopefully this problem will go away then. Sounds like there's nothing to do here until we get a different setup.
Just saw this.  I'm not seeing the test fail, but I see gtk assertions:
(process:31466): Gdk-CRITICAL **: gdk_screen_get_root_window: assertion `GDK_IS_SCREEN (screen)' failed

(process:31466): Gdk-CRITICAL **: gdk_window_new: assertion `GDK_IS_WINDOW (parent)' failed

(process:31466): Gdk-CRITICAL **: gdk_window_set_user_data: assertion `window != NULL' failed

(process:31466): Gtk-CRITICAL **: gtk_style_attach: assertion `window != NULL' failed

(process:31466): Gdk-CRITICAL **: gdk_window_set_back_pixmap: assertion `GDK_IS_WINDOW (window)' failed

(process:31466): Gdk-CRITICAL **: gdk_drawable_get_colormap: assertion `GDK_IS_DRAWABLE (drawable)' failed

(process:31466): Gdk-CRITICAL **: gdk_window_set_background: assertion `GDK_IS_WINDOW (window)' failed

(process:31466): Gtk-CRITICAL **: gtk_style_set_background: assertion `GTK_IS_STYLE (style)' failed

(process:31466): Gtk-CRITICAL **: _gtk_style_peek_property_value: assertion `GTK_IS_STYLE (style)' failed
I hit an interesting issue with this test locally the other day.  I had a vm in the background compiling mozilla, and I was compiling it on the host machine.  While I was doing this, I was playing with a test that is essentially the same as this that I'm adding, and (I'm guessing) since the computer was under such a load, the download actually finished before we could pause it, which caused a test failure.  However, that doesn't seem to be what has been reported here, so I don't think that's what we are hitting.
Interesting. We could certainly fix up that failure condition by checking whether we were able to pause and, if not, retrying the test a few times. The symptom we're seeing on the VMs is a total freeze in the test. I've never actually observed one failing in real time though. Not that there'd be much to see, I expect.
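
The retry idea in this comment could look roughly like the sketch below. It is plain JavaScript rather than xpcshell code, and `runWithPauseRetry`, `attempt`, and the result strings are hypothetical names invented for illustration; none of them exist in the download manager or the test harness.

```javascript
// Hypothetical sketch: retry a pause/resume test when the download
// finished before it could be paused. None of these names come from
// the real test harness.
function runWithPauseRetry(attempt, maxTries) {
  for (let i = 1; i <= maxTries; i++) {
    // attempt() is assumed to return "paused-ok", "finished-early",
    // or "failed".
    const result = attempt();
    if (result !== "finished-early") {
      // Either a real pass or a real failure: report it as-is.
      return { result, tries: i };
    }
    // The download completed before we could pause; try again.
  }
  // Every attempt finished too early, so nothing was actually tested.
  return { result: "inconclusive", tries: maxTries };
}

// Example: the first two attempts finish too early, the third pauses.
const outcomes = ["finished-early", "finished-early", "paused-ok"];
const retryDemo = runWithPauseRetry(() => outcomes.shift(), 5);
console.log(retryDemo.result, retryDemo.tries);
```

Reporting "inconclusive" separately from "failed" keeps a loaded machine from turning the tree orange while still surfacing genuine failures.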

The VMs definitely add some timing strangeness to these types of things.
well, when it fails the unit test actually fails - an exception is thrown that isn't caught.
Assignee: rcampbell → lukasblakk
Depends on: 438871
Priority: P1 → P2
Blocks: 438871
No longer depends on: 438871
Status: NEW → ASSIGNED
Attached patch: debugging patch (obsolete)
My theory is that http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/toolkit/components/downloads/test/unit/test_sleep_wake.js&rev=1.1&mark=71,72#69 is just taking a very long time on the VM, causing the test to time out. I'd like to land this patch temporarily to test that theory.
I wonder whether this failure might be caused by bug 443843.
Comment on attachment 328274
debugging patch

r=sdwilsh for the download manager changes, but please fix the indentation (and I don't care if it stays enabled or not).
Attachment #328274 - Flags: review+
Caught this on qm-centos5-03:
http://pastebin.mozilla.org/479732

Log of a pass on my machine:
http://pastebin.mozilla.org/479733

So it seems as though we're just not getting DOWNLOAD_FINISHED. I'm going to land:

         do_test_finished();
-      }
+      } else
+        dump("%%%aDl.state: " + aDl.state + "\n");

and see what we are getting.
Caught this again, and it's getting DOWNLOAD_FAILED instead of DOWNLOAD_FINISHED in the failure case. Not sure how to proceed here... sdwilsh, any suggestions?
Oh, I missed the status in the output. It's failing because of NS_ERROR_ENTITY_CHANGED, which is one of the cases at:

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/netwerk/protocol/http/src/nsHttpChannel.cpp&rev=1.333#920
(In reply to comment #33)
> Are we hitting
> http://mxr.mozilla.org/mozilla-central/source/toolkit/components/downloads/test/unit/test_sleep_wake.js#91
> perhaps?

Yep.

%%%meta.getHeader('Range'): bytes=101111-
%%%from: 101111
%%%to: 101110
%%%data.length: 101111
%%% Returning early - from >= data.length

From:
http://hg.mozilla.org/mozilla-central/index.cgi/file/ad1fc6c8e351/toolkit/components/downloads/test/unit/test_sleep_wake.js#l84

So we're hitting:
1090239168[90ec548]: Unexpected response status while resuming, aborting [this=92d9900]
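
The numbers in the log above can be reproduced with a small sketch of how a test server might resolve an open-ended Range header. This is an illustration in plain JavaScript, not the code from test_sleep_wake.js; `resolveRange` and its return shape are hypothetical.

```javascript
// Hypothetical sketch of resolving "bytes=N-" or "bytes=N-M" against a
// body of dataLength bytes. Suffix ranges ("bytes=-N") are not handled.
function resolveRange(rangeHeader, dataLength) {
  const match = /^bytes=(\d+)-(\d*)$/.exec(rangeHeader);
  if (!match) {
    return { ok: false, reason: "unparseable" };
  }
  const from = parseInt(match[1], 10);
  // With no explicit end, the range runs to the last byte.
  const to = match[2] === "" ? dataLength - 1 : parseInt(match[2], 10);
  if (from >= dataLength) {
    // Nothing left to send: the client already has every byte.
    return { ok: false, reason: "from >= data.length", from, to };
  }
  return { ok: true, from, to };
}

const rangeDemo = resolveRange("bytes=101111-", 101111);
console.log(rangeDemo.from, rangeDemo.to, rangeDemo.reason);
```

With the values from the log, `from` is 101111 and `to` is 101110, so the resume request asks for bytes past the end of data that has already been fully sent.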
So does that mean that we're pausing the download after it's already complete, but before we received FINISHED?
I think so.  Tomorrow I'll try to cook up a patch that just pumps data until the test is actually done.
OK, I stand corrected.  Doesn't look like it's possible to keep pumping data until we are done.  So...I guess we have two options:
1) Make the test not fail if it hits this condition.  This kinda sucks since it seems to happen often on certain boxes.
2) Make the data that we send out bigger such that we have more time for the test to finish.
(In reply to comment #38)
> 1) Make the test not fail if it hits this condition.  This kinda sucks since it
> seems to happen often on certain boxes.

A tree that turns orange intermittently sucks a lot more. We'll still have the test coverage from all of the boxes that are passing now, so we'll be no worse off (in fact we'll likely be better off, since if the test fails for some other reason we'll notice the failure instead of assuming that it's the same known bustage).
I agree that slightly reduced test coverage is a fair tradeoff for less random orange. What you ought to do is file a new bug on improving the test, blocked on the httpd.js changes, and reference it in a comment in that test, so that at some point we can improve httpd.js and then make the test better.
Attached patch: v1.0 (obsolete)
This should do the trick.
Assignee: lukasblakk → sdwilsh
Attachment #328274 - Attachment is obsolete: true
Attachment #328386 - Flags: review?(gavin.sharp)
Whiteboard: [has patch][needs review gavin]
Comment on attachment 328386
v1.0

Can you revert the test_all.sh changes too?
Attachment #328386 - Flags: review?(gavin.sharp) → review+
Comment on attachment 328386
v1.0

Just a drive by here, but 

+let doNotError = false;

+        dump("Returning early - from >= data.length.  Not an error (bug 431745)\n");
+        doNotError = true;

+        // this is only ok if we are not supposed to fail
+        do_check_true(doNotFail);

is that last one supposed to be doNotError like the previous two?
(In reply to comment #43)
> is that last one supposed to be doNotError like the previous two?
Yeah... clearly I'm having difficulty triggering this error locally :)
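
For readers following the review, the intent of the three quoted hunks can be sketched as one consistent piece of logic. This is a standalone illustration, not the attached patch: only the `doNotError` name and the dump message come from the hunks quoted above, while `handleRange`, `onDownloadFailed`, and the thrown error are hypothetical stand-ins for the test's server handler and download listener.

```javascript
// Set when the server sees a resume request past the end of the data.
let doNotError = false;

// Hypothetical stand-in for the test server's range handling.
function handleRange(from, dataLength) {
  if (from >= dataLength) {
    console.log("Returning early - from >= data.length.  Not an error (bug 431745)");
    doNotError = true;
    return false; // nothing was sent
  }
  return true; // data would be sent starting at `from`
}

// Hypothetical stand-in for the listener's failure branch.
function onDownloadFailed() {
  // A failure is only acceptable if the early-return path was hit.
  if (!doNotError) {
    throw new Error("download failed unexpectedly");
  }
  return "tolerated";
}

handleRange(101111, 101111);
const failureOutcome = onDownloadFailed();
console.log(failureOutcome);
```

Using the same flag name in all three places is the point of the drive-by comment: a check against an undeclared `doNotFail` would throw a ReferenceError instead of tolerating the known condition.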
Attached patch: v1.1
Addresses comments.
Attachment #328386 - Attachment is obsolete: true
Whiteboard: [has patch][needs review gavin] → [has patch][has review][can land]
I'm also going to land the test change on branch.  It's a test change, so I don't need approval.
Pushed to mozilla-central:
http://hg.mozilla.org/mozilla-central/index.cgi/rev/aadfa8776a70
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [has patch][has review][can land]
Component: Release Engineering: Maintenance → Download Manager
Product: mozilla.org → Firefox
QA Contact: release → download.manager
Version: other → Trunk
Attached patch: branch version
Checking in toolkit/components/downloads/test/unit/test_sleep_wake.js;
new revision: 1.2; previous revision: 1.1
Keywords: fixed1.9.0.2
Product: Firefox → Toolkit
Whiteboard: [orange]