unit testers might be updating to newer source before running tests?





(Reporter: dbaron, Unassigned)


Firefox Tracking Flags: (Not tracked)


When tracking down a regression in unit test results on mozilla-central I hit what appears to be a problem with the unit test infrastructure, although I can't prove that it actually happened (partly since the package of tests the builder produced has already been deleted from stage).

What happened is as follows.  In the late evening hours (PDT) of July 30 or the early morning hours (PDT) of July 31, there was a regression on mozilla-central, of which the primary symptoms were that mochitest-browser-chrome had a timeout and failure on browser/components/sessionstore/test/browser/browser_491168.js , usually accompanied by leaks.  This orange was ***intermittent*** -- it did not occur every cycle, although I think it was occurring in more than half.

Looking at tinderboxpushlog (both the "U" machines, which build and test, and the "E" machines, which download the build/tests from the "U" machine and then run their own tests), which matches up (on the ones I checked) with the "rev:" printed in the unit test logs, this failure first occurred on most machines in the second cycle of builds that built http://hg.mozilla.org/mozilla-central/rev/8cd49a8cbb88 , although on the Linux "U" machine it occurred on the first cycle that tested that changeset.

However, backing out http://hg.mozilla.org/mozilla-central/rev/8cd49a8cbb88 did not cause the problem to go away, but backing out the ***following*** changeset, http://hg.mozilla.org/mozilla-central/rev/6a5f22ccbe0e did.  (That following changeset seems an expected cause of such a regression as well, since it modified the failing test.)

This suggests to me that a number of unit test boxes somehow updated the tests that they were running to newer source code after identifying what revision they were building.  If this is the case, it would be a serious bug in our test infrastructure.

The builds that showed this problem, i.e., showed the regression but were allegedly building the changeset prior to the one that caused the regression, were the following.  Note that 8cd49... was the tip-most changeset for long enough that it got two cycles of unit tests on all platforms.

Linux mozilla-central unit test on 2009/07/30 22:27:22  
(first cycle)

Linux mozilla-central unit test on 2009/07/31 00:27:22  
(second cycle)

OS X 10.5.2 mozilla-central unit test on 2009/07/31 00:27:22  
(second cycle)

WINNT 5.2 mozilla-central unit test on 2009/07/31 00:27:22  
(second cycle)

Linux mozilla-central test everythingelse on 2009/07/31 01:23:11  
(second cycle)

OS X 10.5.2 mozilla-central test everythingelse on 2009/07/31 00:38:22  
(second cycle)

WINNT 5.2 mozilla-central test everythingelse on 2009/07/31 01:28:17  
(second cycle)

I took a close look at the log of the first of these seven.  The hg clone command pulled the correct number of changesets for the revision labeled, the hg update command updated to the correct revision, and I didn't see anything else in the log that would have indicated updating files.  (Note that for this to be a problem, the update could be happening after the build; it would just need to happen before (or during) the packaging.  Or it could be because multiple builds were somehow operating on the same tree.)  I tried to download the build that it uploaded, but that build is no longer on the FTP server (purged after 1 day).  So I have no explanation for why this would happen, but it's the only explanation I have for when the orange started.
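The log check described above amounts to comparing the "rev:" line a test log prints against the changeset the build claims to have built. A minimal sketch of such a check is below; the `rev: <hash>` log format and the function name are assumptions for illustration, not the actual buildbot output format.

```python
import re

def rev_matches(log_text, expected_changeset):
    """Return True if the 'rev:' line in a test log matches the expected
    changeset hash. Compares on the shorter of the two strings, since
    logs often print abbreviated hashes.

    The 'rev: <hash>' line format is assumed for illustration.
    """
    m = re.search(r"^rev:\s*([0-9a-f]+)", log_text, re.MULTILINE)
    if m is None:
        return False  # no revision recorded at all: treat as a mismatch
    logged = m.group(1)
    n = min(len(logged), len(expected_changeset))
    return logged[:n] == expected_changeset[:n]
```

A packaging step could run a check like this just before tarring up the tests, catching any tree that was silently updated between clone and packaging.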
My first thought is that this was another manifestation of bug 499161, but that was happening only on Linux and Mac, and only on unittests on packaged builds.

This looks like it's happening on all platforms, and on unittest build+test jobs as well.  Two of the build+test jobs above were using a clean working directory as well (free space clobber), so it doesn't look like hg misbehaving is the issue here.

Something we can do first is to re-build the revisions in question, and save the builds for later examination.
Assignee: nobody → lsblakk
I haven't even looked at this and don't plan to soon; throwing it back in the pool for now.
Assignee: lsblakk → nobody
Another possibility here is that it was some sort of network issue: a test server from an earlier run hanging around, combined with a test that somehow uses the wrong port number when the server hits the case where the first port number it tries is busy. (At least, that's how reftest works around problems like this.)
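The port-fallback workaround mentioned here can be sketched roughly as follows. This is an illustration of the general technique (try the next port when the first is busy), not reftest's actual code; the function name is made up.

```python
import socket

def bind_first_free_port(start_port, attempts=10):
    """Try to bind a listening socket, advancing to the next port whenever
    the current one is already in use. Returns (socket, port).

    A stale test server from a previous run would hold the first port,
    silently pushing the new server to a later port; any test hard-coded
    to the first port would then talk to the wrong (old) server.
    """
    for port in range(start_port, start_port + attempts):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            s.listen(1)
            return s, port
        except OSError:
            s.close()  # port busy; try the next candidate
    raise RuntimeError("no free port in range")
```

This makes the stale-server hypothesis concrete: if the old server still owns `start_port`, the new server ends up on `start_port + 1`, and the mismatch is invisible unless the tests also learn the new port.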
Ah, so the hypothesis is that the test server from an old build is running on the machine when the new build starts up to be tested?

How do we check for this happening?

Comment 5 (10 years ago)
If the issue is an old test server running, we are now rebooting after every test/build run on production, and this should be fixed.

Has anyone seen this recently? (past couple weeks?)

Comment 6 (10 years ago)
If we don't see this again, we can close this during a future Future triage session.
Component: Release Engineering → Release Engineering: Future
(In reply to comment #6)
> If we don't see this again, we can close this during a future Future triage
> session.

Haven't seen it again.
Last Resolved: 9 years ago
Resolution: --- → FIXED
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering