Closed Bug 425987 Opened 16 years ago Closed 16 years ago

image reftest fails occasionally on test boxes

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: MatsPalmgren_bugz, Assigned: Dolske)


Details

Attachments

(1 file)

jpeg image reftest fails occasionally on qm-xserve01:

REFTEST UNEXPECTED FAIL (LOADING): file:///builds/slave/trunk_osx/mozilla/modules/libpr0n/test/reftest/jpeg/jpg-size-32x32.jpg

5 of the last 17 'qm-xserve01' Tinderbox runs were orange with this error,
although the exact file that failed varies.
FWIW, 'qm-centos5-01' has had a few similar test failures over the past few days:

REFTEST UNEXPECTED FAIL (LOADING): file:///builds/slave/trunk_centos5/mozilla/modules/libpr0n/test/reftest/pngsuite-gamma/g04n0g16.png
The latest 'qm-xserve01' failure:
REFTEST UNEXPECTED FAIL (LOADING): file:///builds/slave/trunk_osx/mozilla/modules/libpr0n/test/reftest/pngsuite-zlib/z00n2c08.png
Could it perhaps be a problem with the reftest framework or Tinderboxes?
karlt pointed out that there was a PNG failure on qm-centos5-01 (Linux) today as well...

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1207177226.1207180464.32340.gz

The last qm-xserve01 failure happened after the box was clobbered, but this qm-centos5-01 failure happened before the clobber (albeit with other test failures).

If there aren't random failures in other reftests, maybe this is something unique to image documents? </random guess>
Summary: jpeg image reftest fails occasionally on qm-xserve01 → image reftest fails occasionally on test boxes
This is no longer intermittent. qm-centos5-01 has failed with these libpng reftest failures for every run today since gerv checked in bug 424841 (licensing file changes).
I see a failure loading g03n3p04.html in one recent run (on the Linux box) -- this is just a big HTML <table>, no images. That strikes me as odd, because if this is happening to HTML files there should be reftest failures all over the tree.

More guesses:

* Maybe there's a bug that only gets triggered once? The image reftests are the earliest batch of tests to run (after a handful of reftest self-checks), so maybe whatever's broken here is just hitting whoever happens to come along first.

* Maybe the pngsuite-corrupted tests are getting something into a bad state, and causing a following test to fail? [Might be possible to eliminate this if any of the Tinderbox failures are in the reftests prior to this.] Not sure how this would break loading an HTML page, though.

Actually, hmm, wasn't there a recent bugfix having to do with onerror events? *pokes* Ah, yes, bug 421602. "Image onload fails to fire intermittently", checked in ~3/20. Has anyone tracked back when this failure first started appearing? Perhaps this is a regression from that?

I'm also now reminded of the maybe-similar issue I hit with bug 414185, where I couldn't get a new reftest to work with a particular corrupt image because it never fired a load event.
I would be pretty surprised if the patch from bug 421602 led to intermittent failures like that...  All it does is keep an extra ref to the image element until the image is done loading.
(In reply to comment #6)
> Maybe there's a bug that only gets triggered once?

I see some logs with more than one "REFTEST UNEXPECTED FAIL (LOADING)",
although usually it's only one.

The first occurrence on 'qm-centos5-01' appears to be 2008/03/16 03:42:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1205664160.1205668602.1858.gz
The first occurrence on 'qm-xserve01' appears to be 2008/03/20 21:50:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1206076781.1206079774.4845.gz

What's really odd is that the failures sometimes are synchronised
between 'qm-centos5-01' and 'qm-xserve01' - compare:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1206075050.1206076553.29462.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1206076781.1206079774.4845.gz

Is the source checked out in a central location and then distributed
to multiple Tinderboxes?  If so, maybe some files are simply missing
in the source package occasionally?  (failing disk?)
Are we risking training people to ignore orange right now? I'm sure these test failures are pointing to a real problem, but their usefulness as tests right now seems compromised. If someone checked in code that, for instance, regressed our PNG behaviour in a way that the g04n0g16 test would catch, people are likely to just ignore it.

Is it appropriate to pull these tests while we figure out what's going on?
Well, one problem is that the earlier failures Mats found (comment 8) are not in the libpr0n reftests at all (various layout reftests). So we could end up playing whack-a-mole disabling tests. :(
I went through the last month of runs on qm-centos5-01 (this afternoon back through March 6th). My notes, a summary of the tinderbox logs, are attached.

* Although there are a few other "FAIL (LOADING)" failures in this period, the recent libpr0n image reftests account for the vast majority. I'd suspect the few others are unrelated to whatever's going on here. It might be possible to confirm that by cross-referencing those failures with checkins (to see if they were caused by a bad patch that was then fixed or backed out).

* The earliest failure is on 2008/03/28 at 10:05:20.

2008/03/28 10:05:20 (failed)
...6 good runs...
2008/03/28 16:34:56 (failed)
...4 good runs...
2008/03/29 04:48:00 (failed)
...27 good runs...
2008/03/30 11:18:25 (failed)
...3 good runs...
2008/03/30 23:26:00 (failed)
...5 good runs...
2008/03/31 08:36:14 (failed)
...1 good run...
2008/03/31 10:27:45 (failed)

and so on.

Checkins for the 2 days before the first failure:

http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2008-03-28+11%3A00&maxdate=2008-03-26+02%3A00&cvsroot=%2Fcvsroot

No smoking gun, though.


More things to try:

* Perhaps a problem on the box itself? Bug 392788 describes a vaguely similar problem on this box, which was caused by excessive load on the VM host. We should check what the load's like on this box, and maybe try increasing the timeout to see if that's the problem. The box sometimes has long stretches of green runs at night (midnight - 9am), which might suggest the problem is made worse by daytime load?

* Related to the above: I notice on OS X nightly builds that I often get the pinwheel of death for a long moment shortly after launching the browser. First GC? Loading places? Something else? If this mini-hang is also happening on this box (and perhaps made worse by a slow VM, etc.), then it could put tests that run shortly after the browser starts at risk of timeout errors.

* Investigate qm-xserve01, see if data points there help show what might be up.

* Try disabling the pngsuite-corrupted tests to rule them out

* Try disabling all of pngsuite and see if the problem shifts to later tests (a sketch of the manifest change is below).
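
For reference, here's roughly how one might disable those suites on the slave, assuming the usual include-based reftest.list layout under modules/libpr0n/test/reftest (the exact include lines are an assumption, so check the real manifest first):

$ cd /builds/slave/trunk_centos5/mozilla/modules/libpr0n/test/reftest
$ # comment out the pngsuite-* include lines; sed keeps a .bak copy of the manifest
$ sed -i.bak 's|^include pngsuite-|# include pngsuite-|' reftest.list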
I disabled pngsuite-corrupted last night, but that had no effect.

I then disabled all of the libpr0n reftests, and then the next run failed with:

REFTEST UNEXPECTED FAIL (LOADING): file:///builds/slave/trunk_centos5/mozilla/layout/reftests/bugs/40596-1f.html

Note that this test is now one of the first batch to run... The previous run failed on f04n0g08.png (the 80th reftest). With libpr0n tests disabled, 40596-1f.html is now the 84th reftest to run. The tree is currently closed, and there have been no other checkins, so it's unlikely something else is causing that test to fail.

I'll let it run a couple more cycles before reenabling the libpr0n tests (since disabling them didn't help, and it's at least a recognizable failure :).

The next step is to see if increasing the test timeout helps, although that's not a desirable long-term fix.
Have we verified that the file that fails is actually there (and not corrupt)?
Can anyone explain the synchronized failures? (comment 8)
(In reply to comment #13)
> Have we verified that the file that fails is actually there (and not corrupt)?

I don't think that's likely, because the tests fail at different spots in the libpr0n suite... Grepping my tinderbox notes attachment shows failures in 20 different files (one at a time), although some are more common than others.

> Can anyone explain the synchronized failures? (comment 8)

Not sure. I've mostly been looking at Linux, because it seemed to be failing most consistently.
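
That said, if someone wants to rule out a missing or corrupt file on the slave, a quick check could look something like this (the path is taken from one of the failure messages; md5sum is the Linux tool, the Mac slave would use md5, and the reference copy is assumed to be a known-good checkout):

$ cd /builds/slave/trunk_centos5/mozilla/modules/libpr0n/test/reftest
$ ls -l pngsuite-filtering/f04n0g08.png     # present, with a sane size and date?
$ md5sum pngsuite-filtering/f04n0g08.png    # compare against a known-good checkout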
Results from today:

...
2008/04/05 11:45:56 failed in .../pngsuite-filtering/f04n0g08.png
2008/04/05 13:10    (checkin to reenable pngsuite-corrupted)
2008/04/05 13:19:51 failed in .../pngsuite-filtering/f04n0g08.png
2008/04/05 13:20    (checkin to disable all libpr0n reftests)
2008/04/05 14:16:34 failed in layout/reftests/bugs/40596-1f.html
2008/04/05 15:23:36 failed in layout/reftests/bugs/40596-1e.html
2008/04/05 16:50:00 failed in layout/reftests/bugs/40596-1d.html
2008/04/05 16:55    (checkin to reenable all libpr0n reftests)

With the last checkin, I also increased the timeout from 30 seconds to 45 seconds. Hopefully that will make things go green (and get the tree back open). We'll still need to figure out what's causing such a long delay, though.
With the 45 second timeout, the Linux box reftests passed (although the mochitests crapped out halfway through... another problem?), but the OS X box still failed. So I've doubled the timeout to 90 seconds to see what happens. :(
Ugh, this has apparently started happening again, even with the 90 second timeout. This bandaid isn't working, and whatever's causing things to hang up at the beginning of a test run is probably getting worse.

That would tend to imply there's something specific to the box itself causing the problem, e.g. excessive load, accumulated junk in the profile, etc. But it's still very odd, since this is occurring cross-platform.

Is there a way to get access to qm-xserve01 or qm-centos5-01 (or a clone thereof) to take a look at what might be going on? This is sufficiently mysterious that I'd like to actually diagnose what the problem is.
Assignee: nobody → dolske
OS: Mac OS X → All
Hardware: PC → All
...and now this problem is affecting Windows. qm-win2k3-pgo01:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1209261354.1209270427.18708.gz

It seems suspicious to me that a problem would be happening on all 3 platforms (and apparently getting worse), yet doesn't seem to be reproducible locally. Can someone check out these boxes when they're running the reftests and see if there are unusual load / disk / network perf problems? Do all these machines use shared network storage? That would be a common factor to explain the failures...
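
Even something as basic as the following, run on a box while the reftests are going, would tell us whether it's CPU load or I/O (these are the stock Linux commands; the Mac and Windows slaves would need their local equivalents, and iostat assumes the sysstat package is installed):

$ uptime          # load averages
$ vmstat 5 5      # CPU, swap, and block I/O sampled over ~25 seconds
$ iostat -x 5 3   # per-device I/O utilization
$ df -h           # any full or nearly-full disks?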
Assignee: dolske → nobody
Severity: normal → major
Component: ImageLib → Release Engineering: Maintenance
Product: Core → mozilla.org
QA Contact: imagelib → release
Version: Trunk → other
I just bumped the test timeout to 120 seconds, since the Linux boxes now appear to be in a perma-orange state due to this problem.

joduinn: can we get someone to help out with the last two comments here?
Could this be related to running the tests in VMs on increasingly-overloaded hosts?  (Is that what we're doing?)
(In reply to comment #20)
> Could this be related to running the tests in VMs on increasingly-overloaded
> hosts?  (Is that what we're doing?)

I believe the Windows boxes are physical hardware (as are the Macs, of course), but the Centos machines are VMs, which does explain why we see the timeouts there most often.

Justin: I have no problem giving you access to diagnose. We're not fond of the perma-orange status either.

I bumped the timeout to 150 seconds (!!) because there are still timeouts.

I'm on qm-centos5-02 right now, just in time to catch a reftest run. With the system monitor running, I saw *low* CPU usage, and no notable network/memory spikes. [Although this is all in a VM; I'm not familiar with how the guest looks when the host is under load.]

vmstat has some interesting output; here's the snippet from when the test run was stalled waiting for the timeout:

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 1  1  41140  11152  87372 198836    0    0   104     6 1185 1558  4 14  0 82  0
 0  1  41140  10644  86808 198836    0    0   113     5 1181 1513  4 14  0 82  0
 0  1  41140  11512  85528 198224    0    0    96     4 1186 1537  4 14  0 82  0
 0  1  41140  11496  84948 198388    0    0   140     2 1269 1656  4 14  0 82  0
 0  1  41140  11108  83604 198276    0    0   121     4 1267 1565  5 16  0 79  0
 0  1  41140  11868  82148 197912    0    0   200     0 1322 1843  5 17  0 78  0
 3  1  41140  13060  78168 197456    0    0    90    61 1421 2233 44 32  0 24  0
 1  0  41140  13356  78232 197552    0    0    20    91 1218 2707 61 35  0  4  0
 1  0  41140  12896  78304 198020    0    0    97   142 1202 2243 61 32  0  7  0
 1  0  41140  14188  77332 198472    0    0    95    60 1237 2167 60 31  0  9  0
 1  0  41140  13272  77412 198920    0    0    94    82 1167 1964 63 32  0  5  0

Things to note:

* CPU usage was *low* (System Monitor showed a load of ~20% overall) while waiting for the timeout, and then spiked again when testing resumed.
* Process time was mostly blocked in IO wait ("wa"). si/so are 0, so there's no swapping; bi/bo implies the IO wait was likely due to reads, with not much being written.
* buff (file buffers) is dropping, which would imply increasing memory usage by processes? Stale buffers being released?

Looking in the FF profile:

* The reftests are run on an existing profile that's nearly a year old (oldest files are July 2007).
* 133MB places.sqlite file! (my own never-cleared DB is only 45MB)
* 2.6MB urlclassifier2.sqlite, last updated Jul 2007
* 9.2KB urlclassifier3.sqlite, last updated Feb 2008
* prefs.js doesn't have anything disabling safebrowsing
* The json backups in $profile/bookmarkbackups/ are ~120K. That's a little strange, because in *my* profile (with actual bookmarks), these files are only ~60K.

I'll try to catch the next reftest run with strace/lsof, and see what file IO it's doing during this period. I wouldn't be surprised if the problem ends up being some combination of:

(1) 133MB places DB blowing out caches
(2) Safebrowsing updates (eg, bug 430530)

I'm a little confused why the urlclassifier DBs don't seem to be updating. Are the servers blocked by our firewalls? Or is there a failure mode that could be causing them to not update the file?
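
Roughly what I have in mind for the strace/lsof pass (the PID lookup and process name are assumptions and will differ per box; strace, lsof, and pgrep are the stock Linux tools):

$ pid=$(pgrep -n firefox-bin)               # newest firefox process on the box (name may vary)
$ strace -f -tt -p $pid -o /tmp/ff.strace   # attach, follow threads, timestamp each syscall
$ lsof -p $pid | grep -i sqlite             # map the fd numbers seen in strace back to files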
Looks like it's Places.

strace shows a flood of output like:

read(32, "\r\17\256\0\217\1B\0\17\312\17\224\17z\17`\17F\17,"..., 4096) = 4096
_llseek(32, 28454912, [28454912], SEEK_SET) = 0
read(32, "\r\16\370\0\217\1B\0\17\346\17\314\17\262\17\230"..., 4096) = 4096
_llseek(32, 28229632, [28229632], SEEK_SET) = 0
read(32, "\r\f,\0\210\0014\0\17\346\17\314\17\262\17\230\17~"..., 4096) = 4096
_llseek(32, 28954624, [28954624], SEEK_SET) = 0
read(32, "\r\f\374\0\210\0014\0\17\346\17\314\17\262\17\230"..., 4096) = 4096
_llseek(32, 29855744, [29855744], SEEK_SET) = 0
read(32, "\r\n>\0\210\0014\0\17\346\17\314\17\262\17\230\17~"..., 4096) = 4096
_llseek(32, 30208000, [30208000], SEEK_SET) = 0
read(32, "\r\tT\0\210\0014\0\17\346\17\314\17\262\17\230\17~"..., 4096) = 4096
_llseek(32, 30593024, [30593024], SEEK_SET) = 0
...

lsof shows that fd32 is places.sqlite:

...
firefox-b 18848 buildbot   32uw  REG        8,1 133189632  134322 /home/buildbot/.mozilla/firefox/vrttezm8.default/places.sqlite
...

So, nuking places.sqlite should be a short-term fix for the problem. [Dietrich, would you like a copy of it to see if there's something else broken to make it so large?]

Longer term, the problem might just repeat. Should buildbot be creating a fresh new profile for each run? The downside of that is it's not as representative of real-world usage as a "dirty" profile, but perhaps that's an issue to deal with elsewhere.
(I guess I own this by now :-)
Assignee: nobody → dolske
(In reply to comment #23)

> Longer term, the problem might just repeat. Should buildbot be creating a fresh
> new profile for each run? The downside of that is it's not as representative of
> real-world usage as a "dirty" profile, but perhaps that's an issue to deal with
> elsewhere.
> 
My vote would be that unit tests should be run against a static configuration, so I'd say a fresh new profile. On the other hand, I think performance tests should run both ways: against a fresh new profile, to identify performance regressions caused by checkins, and against an old profile, to identify performance issues that appear after running for a while with an existing profile.

But, my vote probably does not count for much. :-)
(In reply to comment #23)
> So, nuking places.sqlite should be a short-term fix for the problem.

This worries me since I've seen users complain about large places.sqlite
slowing things down too.
We used to delete places.sqlite explicitly as its own step in buildbot, but now that's supposed to happen as part of the "create profile" step, i.e. the whole profile is meant to be nuked and recreated before every run. What I'm hearing is that that step is failing to work as advertised.

Not that that makes the real world case of having a largish places.sqlite file any less worrisome, but our test infrastructure should at least be performing as expected/advertised.
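
For reference, the intent of that step boils down to something like the following sketch (the buildbot specifics, paths, and exact command-line flags should be double-checked; the profile path comes from the lsof output above):

$ rm -rf /home/buildbot/.mozilla/firefox/vrttezm8.default        # nuke the stale profile
$ ./firefox -CreateProfile "reftest /tmp/reftest-profile"        # create a fresh, empty one
$ ./firefox -profile /tmp/reftest-profile -reftest reftest.list  # run the tests against it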
Additional note:

I landed bug 431435 for measuring the slowest reftest in a run...

centos5-01 and centos5-03 are experiencing long pauses (91s and 59s, respectively, for one run). Presumably that's the same issue as here. But the slowest tests on qm-xserve01 and qm-win2k3-01/02/03 are SVG tests, at 2-7 seconds. That's a little surprising given the history of this bug. Either they're only sporadically hitting this issue, or they've recently been fixed (new profile, or something).
I deleted the places.sqlite earlier this evening on qm-centos5-01:

18:53 - 19:28 = 35 min, (timeout)
20:11 - 20:42 = 31 min, (timeout)
...cleared...
21:22 - 21:41 = 19 min, (ok, slowest = 2018ms)
00:38 - 00:59 = 21 min, (ok, slowest = 2013ms)

Nice that the reftests now take 40% less time to run, although that could just be load variations on the VM. [Although the associated mochitest runs stayed about the same, 15 -> 13 minutes]

Also, Dietrich took a look at the 133MB places DB:

  the db is full of reftest URLs. it's *definitely* behind on
  expiration, as there are 1.6 million visits. some URLs have 24k+
  visits. we expire 500 at exit, so we clearly add more than that for
  each test run.

Sounds like the cause is understood. So, we should go ahead and blow away any such grotesquely large places.sqlite, and ensure that profiles are being created anew for each test run.
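
(For the record, those numbers are easy to double-check straight from the DB, assuming the standard Places schema with moz_places and moz_historyvisits:)

$ sqlite3 places.sqlite "SELECT COUNT(*) FROM moz_historyvisits;"
$ sqlite3 places.sqlite "SELECT p.url, COUNT(*) AS visits FROM moz_historyvisits v JOIN moz_places p ON v.place_id = p.id GROUP BY v.place_id ORDER BY visits DESC LIMIT 10;"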
Depends on: 431494
(In reply to comment #27)
> We used to delete places.sqlite explicitly as its own step in buildbot, but now
> that's supposed to happen as part of the "create profile" step, i.e. the whole
> profile is meant to be nuked and recreated before every run. What I'm hearing
> is that that step is failing to work as advertised.

Yeah. I've spun this off into bug 431494. Deleting places.sqlite on all the unit test boxes experiencing this problem should be enough to fix the immediate problem, but if we can go ahead and fix the profile creation step, that would be best.
(Random data point: I've seen pretty long HD thrashing and some lag when poking at the urlclassifier file recently.)
The fix for bug 431494 landed this morning. I've checked the log for the test runs since then -- on all platforms, the slowest test takes about 2 seconds.

The reftest timeout was 30s when this problem first hit, and had previously been bumped from 10s last August (for reasons lost to the sands of time). So, I've rolled the timeout all the way back to 10s now, on the theory that a short timeout may help catch real problems in the future. If that proves to be too short for normal VM load glitches, then we can increase it. dbaron noted that running reftests under valgrind may need a higher timeout, but such developers can easily just increase the timeout locally.

(In reply to comment #31)
> (random data point, I've seen pretty long hd thrashing and some lag poking at
> the urlclassifier file, recently)

Probably bug 430530.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Maintenance → Release Engineering
Product: mozilla.org → Release Engineering