Closed Bug 425987 Opened 16 years ago Closed 16 years ago

image reftest fails occasionally on test boxes

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: MatsPalmgren_bugz, Assigned: Dolske)


Details

Attachments

(1 file)

jpeg image reftest fails occasionally on qm-xserve01:

REFTEST UNEXPECTED FAIL (LOADING): file:///builds/slave/trunk_osx/mozilla/modules/libpr0n/test/reftest/jpeg/jpg-size-32x32.jpg

5 of the last 17 'qm-xserve01' Tinderbox runs were orange with this error,
although the exact file that failed varies.
FWIW, 'qm-centos5-01' has had a few similar test failures over the past few days:

REFTEST UNEXPECTED FAIL (LOADING): file:///builds/slave/trunk_centos5/mozilla/modules/libpr0n/test/reftest/pngsuite-gamma/g04n0g16.png
The latest 'qm-xserve01' failure:
REFTEST UNEXPECTED FAIL (LOADING): file:///builds/slave/trunk_osx/mozilla/modules/libpr0n/test/reftest/pngsuite-zlib/z00n2c08.png
Could it perhaps be a problem with the reftest framework or Tinderboxes?
karlt pointed out that there was a PNG failure on qm-centos5-01 (Linux) today as well...

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1207177226.1207180464.32340.gz

The last qm-xserve01 failure happened after the box was clobbered, but this qm-centos5-01 failure happened before the clobber (albeit with other test failures).

If there aren't random failures in other reftests, maybe this is something unique to image documents? </random guess>
Summary: jpeg image reftest fails occasionally on qm-xserve01 → image reftest fails occasionally on test boxes
This is no longer intermittent. qm-centos5-01 has failed with these libpng reftest failures for every run today since gerv checked in bug 424841 (licensing file changes).
I see a failure loading g03n3p04.html in one recent run (on the Linux box) -- this is just a big HTML <table>, no images. That strikes me as odd, because if this is happening to HTML files there should be reftest failures all over the tree.

More guesses:

* Maybe there's a bug that only gets triggered once? The image reftests are the earliest batch of tests to run (after a handful of reftest self-checks), so maybe whatever's broken here is just hitting whoever happens to come along first.

* Maybe the pngsuite-corrupted tests are getting something into a bad state, and causing a following test to fail? [Might be possible to eliminate this if any of the Tinderbox failures are in the reftests prior to this.] Not sure how this would break loading an HTML page, though.

Actually, hmm, wasn't there a recent bugfix having to do with onerror events? *pokes* Ah, yes, bug 421602. "Image onload fails to fire intermittently", checked in ~3/20. Has anyone tracked back when this failure first started appearing? Perhaps this is a regression from that?

I'm also now reminded of the maybe-similar issue I hit with bug 414185, where I couldn't get a new reftest to work with a particular corrupt image because it never fired a load event.
I would be pretty surprised if the patch from bug 421602 led to intermittent failures like that...  All it does is keep an extra ref to the image element until the image is done loading.
(In reply to comment #6)
> Maybe there's a bug that only gets triggered once?

I see some logs with more than one "REFTEST UNEXPECTED FAIL (LOADING)",
although usually it's only one.

The first occurrence on 'qm-centos5-01' appears to be 2008/03/16 03:42:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1205664160.1205668602.1858.gz
The first occurrence on 'qm-xserve01' appears to be 2008/03/20 21:50:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1206076781.1206079774.4845.gz

What's really odd is that the failures sometimes are synchronised
between 'qm-centos5-01' and 'qm-xserve01' - compare:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1206075050.1206076553.29462.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1206076781.1206079774.4845.gz

Is the source checked out in a central location and then distributed
to multiple Tinderboxes?  If so, maybe some files are simply missing
in the source package occasionally?  (failing disk?)
Are we risking training people to ignore orange right now? I'm sure these test failures are pointing to a real problem, but their usefulness as tests right now seems compromised. If someone checked in code that, for instance, regressed our PNG behaviour in a way that the g04n0g16 test would catch, people are likely to just ignore it.

Is it appropriate to pull these tests while we figure out what's going on?
Well, one problem is that the earlier failures Mats found (comment 8) are not in the libpr0n reftests at all (various layout reftests). So we could end up playing whack-a-mole disabling tests. :(
I went through the last month of runs on qm-centos5-01 (this afternoon back through March 6th). My notes, a summary of the tinderbox logs, are attached.

* Although there are a few other "FAIL (LOADING)" failures in this period, the recent libpr0n image reftests account for the vast majority. I'd suspect the few others are unrelated to whatever's going on here. It might be possible to confirm that by cross-referencing those failures with checkins (to see if they were caused by a bad patch that was then fixed or backed out).

* The earliest failure is on 2008/03/28 at 10:05:20.

2008/03/28 10:05:20 (failed)
...6 good runs...
2008/03/28 16:34:56 (failed)
...4 good runs...
2008/03/29 04:48:00 (failed)
...27 good runs...
2008/03/30 11:18:25 (failed)
...3 good runs...
2008/03/30 23:26:00 (failed)
...5 good runs...
2008/03/31 08:36:14 (failed)
...1 good run...
2008/03/31 10:27:45 (failed)

and so on.

Checkins for the 2 days before the first failure:

http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2008-03-28+11%3A00&maxdate=2008-03-26+02%3A00&cvsroot=%2Fcvsroot

No smoking gun, though.


More things to try:

* Perhaps a problem on the box itself? Bug 392788 describes a vaguely similar problem on this box, which was caused by excessive load on the VM host. We should check what the load's like on this box, and maybe try increasing the timeout to see if that's the problem. The box sometimes has long stretches of green runs at night (midnight - 9am), which might suggest the problem is made worse by daytime load?

* Related to the above: I notice on OS X nightly builds that I often get the pinwheel of death for a long moment shortly after launching the browser. First GC? Loading places? Something else? If this mini-hang is also happening on this box (and perhaps made worse by a slow VM, etc.), then it could put tests that run shortly after the browser starts at risk of timeout errors.

* Investigate qm-xserve01, see if data points there help show what might be up.

* Try disabling the pngsuite-corrupted tests to rule them out

* Try disabling all of pngsuite and see if the problem shifts to later tests (a sketch of the manifest change is below).
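
For reference, here's roughly how one might disable those suites on the slave, assuming the usual include-based reftest.list layout under modules/libpr0n/test/reftest (the exact include lines are an assumption, so check the real manifest first):

$ cd /builds/slave/trunk_centos5/mozilla/modules/libpr0n/test/reftest
$ # comment out the pngsuite-* include lines; sed keeps a .bak copy of the manifest
$ sed -i.bak 's|^include pngsuite-|# include pngsuite-|' reftest.list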
I disabled pngsuite-corrupted last night, but that had no effect.

I then disabled all of the libpr0n reftests, and then the next run failed with:

REFTEST UNEXPECTED FAIL (LOADING): file:///builds/slave/trunk_centos5/mozilla/layout/reftests/bugs/40596-1f.html

Note that this test is now one of the first batch to run... The previous run failed on f04n0g08.png (the 80th reftest). With libpr0n tests disabled, 40596-1f.html is now the 84th reftest to run. The tree is currently closed, and there have been no other checkins, so it's unlikely something else is causing that test to fail.

I'll let it run a couple more cycles before reenabling the libpr0n tests (since disabling them didn't help, and it's at least a recognizable failure :).

The next step is to see if increasing the test timeout helps, although that's not a desirable long-term fix.
Have we verified that the file that fails is actually there (and not corrupt)?
Can anyone explain the synchronized failures? (comment 8)
(In reply to comment #13)
> Have we verified that the file that fails is actually there (and not corrupt)?

I don't think that's likely, because the tests fail at different spots in the libpr0n suite... Grepping my tinderbox notes attachment shows failures in 20 different files (one at a time), although some are more common than others.

> Can anyone explain the synchronized failures? (comment 8)

Not sure. I've mostly been looking at Linux, because it seemed to be failing most consistently.
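
That said, if someone wants to rule out a missing or corrupt file on the slave, a quick check could look something like this (the path is taken from one of the failure messages; md5sum is the Linux tool, the Mac slave would use md5, and the reference copy is assumed to be a known-good checkout):

$ cd /builds/slave/trunk_centos5/mozilla/modules/libpr0n/test/reftest
$ ls -l pngsuite-filtering/f04n0g08.png     # present, with a sane size and date?
$ md5sum pngsuite-filtering/f04n0g08.png    # compare against a known-good checkout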
Results from today:

...
2008/04/05 11:45:56 failed in .../pngsuite-filtering/f04n0g08.png
2008/04/05 13:10    (checkin to reenable pngsuite-corrupted)
2008/04/05 13:19:51 failed in .../pngsuite-filtering/f04n0g08.png
2008/04/05 13:20    (checkin to disable all libpr0n reftests)
2008/04/05 14:16:34 failed in layout/reftests/bugs/40596-1f.html
2008/04/05 15:23:36 failed in layout/reftests/bugs/40596-1e.html
2008/04/05 16:50:00 failed in layout/reftests/bugs/40596-1d.html
2008/04/05 16:55    (checkin to reenable all libpr0n reftests)

With the last checkin, I also increased the timeout from 30 seconds to 45 seconds. Hopefully that will make things go green (and get the tree back open). We'll still need to figure out what's causing such a long delay, though.
With the 45 second timeout, the Linux box reftests passed (although the mochitests crapped out halfway through... another problem?), but the OS X box still failed. So I've doubled the timeout to 90 seconds to see what happens. :(
Ugh, this has apparently started happening again, even with the 90 second timeout. This bandaid isn't working, and whatever's causing things to hang up at the beginning of a test run is probably getting worse.

That would tend to imply there's something specific to the box itself causing the problem, e.g. excessive load, accumulated junk in the profile, etc. But it's still very odd, since this is occurring cross-platform.

Is there a way to get access to qm-xserve01 or qm-centos5-01 (or a clone thereof) to take a look at what might be going on? This is sufficiently mysterious that I'd like to actually diagnose what the problem is.
Assignee: nobody → dolske
OS: Mac OS X → All
Hardware: PC → All
...and now this problem is affecting Windows. qm-win2k3-pgo01:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1209261354.1209270427.18708.gz

It seems suspicious to me that a problem would be happening on all 3 platforms (and apparently getting worse), yet doesn't seem to be reproducible locally. Can someone check out these boxes when they're running the reftests and see if there are unusual load / disk / network perf problems? Do all these machines use shared network storage? That would be a common factor to explain the failures...
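
Even something as basic as the following, run on a box while the reftests are going, would tell us whether it's CPU load or I/O (these are the stock Linux commands; the Mac and Windows slaves would need their local equivalents, and iostat assumes the sysstat package is installed):

$ uptime          # load averages
$ vmstat 5 5      # CPU, swap, and block I/O sampled over ~25 seconds
$ iostat -x 5 3   # per-device I/O utilization
$ df -h           # any full or nearly-full disks?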
Assignee: dolske → nobody
Severity: normal → major
Component: ImageLib → Release Engineering: Maintenance
Product: Core → mozilla.org
QA Contact: imagelib → release
Version: Trunk → other
I just bumped the test timeout to 120 seconds, since the Linux boxes now appear to be in a perma-orange state due to this problem.

joduinn: can we get someone to help out with the last two comments here?
Could this be related to running the tests in VMs on increasingly-overloaded hosts?  (Is that what we're doing?)
(In reply to comment #20)
> Could this be related to running the tests in VMs on increasingly-overloaded
> hosts?  (Is that what we're doing?)

I believe the Windows boxes are physical hardware (as are the Macs, of course), but the Centos machines are VMs, which does explain why we see the timeouts there most often.

Justin: I have no problem giving you access to diagnose. We're not fond of the perma-orange status either.

I bumped the timeout to 150 seconds (!!) because there are still timeouts.

I'm on qm-centos5-02 right now, just in time to catch a reftest run. With the system monitor running, I saw *low* CPU usage, and no notable network/memory spikes. [Although this is all in a VM; I'm not familiar with how the guest looks when the host is under load.]

vmstat has some interesting output; here's the snippet from when the test run was stalled waiting for the timeout:

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 1  1  41140  11152  87372 198836    0    0   104     6 1185 1558  4 14  0 82  0
 0  1  41140  10644  86808 198836    0    0   113     5 1181 1513  4 14  0 82  0
 0  1  41140  11512  85528 198224    0    0    96     4 1186 1537  4 14  0 82  0
 0  1  41140  11496  84948 198388    0    0   140     2 1269 1656  4 14  0 82  0
 0  1  41140  11108  83604 198276    0    0   121     4 1267 1565  5 16  0 79  0
 0  1  41140  11868  82148 197912    0    0   200     0 1322 1843  5 17  0 78  0
 3  1  41140  13060  78168 197456    0    0    90    61 1421 2233 44 32  0 24  0
 1  0  41140  13356  78232 197552    0    0    20    91 1218 2707 61 35  0  4  0
 1  0  41140  12896  78304 198020    0    0    97   142 1202 2243 61 32  0  7  0
 1  0  41140  14188  77332 198472    0    0    95    60 1237 2167 60 31  0  9  0
 1  0  41140  13272  77412 198920    0    0    94    82 1167 1964 63 32  0  5  0

Things to note:

* CPU usage was *low* (System Monitor showed a load of ~20% overall) while waiting for the timeout, and then spiked again when testing resumed.
* Process time was mostly blocked in IO wait ("wa"). si/so are 0, so there's no swapping; bi/bo implies the IO wait was likely due to reads, with not much being written.
* buff (file buffers) is dropping, which would imply increasing memory usage by processes? Stale buffers being released?

Looking in the FF profile:

* The reftests are run on an existing profile that's nearly a year old (oldest files are July 2007).
* 133MB places.sqlite file! (my own never-cleared DB is only 45MB)
* 2.6MB urlclassifier2.sqlite, last updated Jul 2007
* 9.2KB urlclassifier3.sqlite, last updated Feb 2008
* prefs.js doesn't have anything disabling safebrowsing
* The json backups in $profile/bookmarkbackups/ are ~120K. That's a little strange, because in *my* profile (with actual bookmarks), these files are only ~60K.

I'll try to catch the next reftest run with strace/lsof, and see what file IO it's doing during this period. I wouldn't be surprised if the problem ends up being some combination of:

(1) 133MB places DB blowing out caches
(2) Safebrowsing updates (eg, bug 430530)

I'm a little confused why the urlclassifier DBs don't seem to be updating. Are the servers blocked by our firewalls? Or is there a failure mode that could be causing them to not update the file?
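
Roughly what I have in mind for the strace/lsof pass (the PID lookup and process name are assumptions and will differ per box; strace, lsof, and pgrep are the stock Linux tools):

$ pid=$(pgrep -n firefox-bin)               # newest firefox process on the box (name may vary)
$ strace -f -tt -p $pid -o /tmp/ff.strace   # attach, follow threads, timestamp each syscall
$ lsof -p $pid | grep -i sqlite             # map the fd numbers seen in strace back to files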
Looks like it's Places.

strace shows a flood of output like:

read(32, "\r\17\256\0\217\1B\0\17\312\17\224\17z\17`\17F\17,"..., 4096) = 4096
_llseek(32, 28454912, [28454912], SEEK_SET) = 0
read(32, "\r\16\370\0\217\1B\0\17\346\17\314\17\262\17\230"..., 4096) = 4096
_llseek(32, 28229632, [28229632], SEEK_SET) = 0
read(32, "\r\f,\0\210\0014\0\17\346\17\314\17\262\17\230\17~"..., 4096) = 4096
_llseek(32, 28954624, [28954624], SEEK_SET) = 0
read(32, "\r\f\374\0\210\0014\0\17\346\17\314\17\262\17\230"..., 4096) = 4096
_llseek(32, 29855744, [29855744], SEEK_SET) = 0
read(32, "\r\n>\0\210\0014\0\17\346\17\314\17\262\17\230\17~"..., 4096) = 4096
_llseek(32, 30208000, [30208000], SEEK_SET) = 0
read(32, "\r\tT\0\210\0014\0\17\346\17\314\17\262\17\230\17~"..., 4096) = 4096
_llseek(32, 30593024, [30593024], SEEK_SET) = 0
...

lsof shows that fd32 is places.sqlite:

...
firefox-b 18848 buildbot   32uw  REG        8,1 133189632  134322 /home/buildbot/.mozilla/firefox/vrttezm8.default/places.sqlite
...

So, nuking places.sqlite should be a short-term fix for the problem. [Dietrich, would you like a copy of it to see if there's something else broken to make it so large?]

Longer term, the problem might just repeat. Should buildbot be creating a fresh new profile for each run? The downside of that is it's not as representative of real-world usage as a "dirty" profile, but perhaps that's an issue to deal with elsewhere.
(I guess I own this by now :-)
Assignee: nobody → dolske
(In reply to comment #23)

> Longer term, the problem might just repeat. Should buildbot be creating a fresh
> new profile for each run? The downside of that is it's not as representative of
> real-world usage as a "dirty" profile, but perhaps that's an issue to deal with
> elsewhere.
> 
My vote would be that unit tests should be run against a static configuration, so I'd say a fresh new profile. On the other hand, I think performance tests should run both ways: against a fresh new profile, to identify performance regressions caused by checkins, and against an old profile, to identify performance issues that appear after running for a while with an existing profile.

But, my vote probably does not count for much. :-)
(In reply to comment #23)
> So, nuking places.sqlite should be a short-term fix for the problem.

This worries me since I've seen users complain about large places.sqlite
slowing things down too.
We used to delete places.sqlite explicitly as its own step in buildbot, but now that's supposed to happen as part of the "create profile" step, i.e. the whole profile is meant to be nuked and recreated before every run. What I'm hearing is that that step is failing to work as advertised.

Not that that makes the real world case of having a largish places.sqlite file any less worrisome, but our test infrastructure should at least be performing as expected/advertised.
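
For reference, the intent of that step boils down to something like the following sketch (the buildbot specifics, paths, and exact command-line flags should be double-checked; the profile path comes from the lsof output above):

$ rm -rf /home/buildbot/.mozilla/firefox/vrttezm8.default        # nuke the stale profile
$ ./firefox -CreateProfile "reftest /tmp/reftest-profile"        # create a fresh, empty one
$ ./firefox -profile /tmp/reftest-profile -reftest reftest.list  # run the tests against it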
Additional note:

I landed bug 431435 for measuring the slowest reftest in a run...

centos5-01 and centos5-03 are experiencing long pauses (91s and 59s, respectively, for one run). Presumably that's the same issue as here. But the slowest tests on qm-xserve01 and qm-win2k3-01/02/03 are SVG tests, at 2-7 seconds. That's a little surprising given the history of this bug. Either they're only sporadically hitting this issue, or they've recently been fixed (new profile, or something).
I deleted the places.sqlite earlier this evening on qm-centos5-01:

18:53 - 19:28 = 35 min, (timeout)
20:11 - 20:42 = 31 min, (timeout)
...cleared...
21:22 - 21:41 = 19 min, (ok, slowest = 2018ms)
00:38 - 00:59 = 21 min, (ok, slowest = 2013ms)

Nice that the reftests now take 40% less time to run, although that could just be load variations on the VM. [Although the associated mochitest runs stayed about the same, 15 -> 13 minutes]

Also, Dietrich took a look at the 133MB places DB:

  the db is full of reftest URLs. it's *definitely* behind on
  expiration, as there are 1.6 million visits. some URLs have 24k+
  visits. we expire 500 at exit, so we clearly add more than that for
  each test run.

Sounds like the cause is understood. So, we should go ahead and blow away any such grotesquely large places.sqlite, and ensure that profiles are being created anew for each test run.
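
(For the record, those numbers are easy to double-check straight from the DB, assuming the standard Places schema with moz_places and moz_historyvisits:)

$ sqlite3 places.sqlite "SELECT COUNT(*) FROM moz_historyvisits;"
$ sqlite3 places.sqlite "SELECT p.url, COUNT(*) AS visits FROM moz_historyvisits v JOIN moz_places p ON v.place_id = p.id GROUP BY v.place_id ORDER BY visits DESC LIMIT 10;"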
Depends on: 431494
(In reply to comment #27)
> We used to delete places.sqlite explicitly as its own step in buildbot, but now
> that's supposed to happen as part of the "create profile" step, i.e. the whole
> profile is meant to be nuked and recreated before every run. What I'm hearing
> is that that step is failing to work as advertised.

Yeah. I've spun this off into bug 431494. Deleting places.sqlite on all the unit test boxes experiencing this problem should be enough to fix the immediate problem, but if we can go ahead and fix the profile creation step, that would be best.
(Random data point: I've seen pretty long HD thrashing and some lag when poking at the urlclassifier file recently.)
The fix for bug 431494 landed this morning. I've checked the log for the test runs since then -- on all platforms, the slowest test takes about 2 seconds.

The reftest timeout was 30s when this problem first hit, and had previously been bumped from 10s last August (for reasons lost to the sands of time). So, I've rolled the timeout all the way back to 10s now, on the theory that a short timeout may help catch real problems in the future. If that proves to be too short for normal VM load glitches, then we can increase it. dbaron noted that running reftests under valgrind may need a higher timeout, but such developers can easily just increase the timeout locally.

(In reply to comment #31)
> (random data point, I've seen pretty long hd thrashing and some lag poking at
> the urlclassifier file, recently)

Probably bug 430530.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Maintenance → Release Engineering
Product: mozilla.org → Release Engineering