Closed Bug 933090 Opened 11 years ago Closed 11 years ago

Clobberer doesn't clobber right somehow

Categories

(Release Engineering :: General, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 934166

People

(Reporter: philor, Assigned: jhopkins)

Details

(Keywords: sheriffing-P1)

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=29901037&tree=Mozilla-Inbound is a failure that should have long since been wiped away by clobbering, and then by probably a dozen more clobbers for other reasons later on.

In the clobber step, it claims

m-in-l64-pgo-00000000000000000:Our last clobber date:  2013-10-30 12:00:31
m-in-l64-pgo-00000000000000000:Server clobber date:    2013-10-15 19:47:52

which is certainly not intended in the case of the server clobber date, because we've done what we think should cause it to be clobbered in the clobberer UI over and over and over since 10-15. According to the build log, it was not in fact a clobber, so it has two problems: it lies about having clobbered, and the server failed to note our desire over and over to have it clobber.
catlee, this has pretty serious implications for sheriffing and meaning we may make incorrect conclusions as to the cause of breakage - could you find someone to take a look? :-)
Flags: needinfo?(catlee)
Keywords: sheriffing-P1
Assignee: nobody → jhopkins
The build in comment 0 checked clobber times at 2013-10-30 18:01:21.485014
for build slave bld-linux64-ec2-305
and branch mozilla-inbound
and build dir m-in-l64-pgo-00000000000000000

Looking at the clobber history for that combination:

mysql> select id,master,branch,builddir,slave,from_unixtime(lastclobber) from clobber_times where branch = 'mozilla-inbound' and builddir='m-in-l64-pgo-00000000000000000' and slave='bld-linux64-ec2-305' order by id desc limit 10;
+----------+--------+-----------------+--------------------------------+---------------------+----------------------------+
| id       | master | branch          | builddir                       | slave               | from_unixtime(lastclobber) |
+----------+--------+-----------------+--------------------------------+---------------------+----------------------------+
| 17204301 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-30 20:42:54        |
| 17160059 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-30 19:37:18        |
| 15856267 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 19:47:52        |
| 15837531 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 18:01:39        |
| 15807291 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 13:08:07        |
| 15774199 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 12:50:31        |
| 15727175 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 10:56:03        |
| 15668139 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-14 20:07:50        |
| 15636077 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-14 12:11:20        |
| 15603421 | NULL   | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-14 11:24:44        |
+----------+--------+-----------------+--------------------------------+---------------------+----------------------------+
10 rows in set (0.00 sec)

I can see that the clobber time for that combination - at the time it was checked - was indeed 2013-10-15 19:47:52
Next time you clobber, can you check whether there are multiple dates shown on the clobber page for "Last clobbered"?  I'm wondering whether there have been script timeouts preventing all slaves from getting an updated clobber time.

For example, I see multiple "last clobbered" dates on the mozilla-inbound clobber page for quite a few builders.
Yes, there are almost always multiple dates shown if it's been more than a few hours since the last clobber, this was just the first time we've had a really obvious "it's busted because it didn't clobber even though it's been multiple clobbers since we did clobber" thing to correlate with that.
RyanVM usually re-clobbers until the multiple entries have gone away and says he hasn't noticed any misbehaving clobbers after doing so.

philor: can you please try using that workflow and reopen if you see the problem reappear?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Sorry, I think we were misunderstanding each other. Philor's point that there were dozen's more clobbers pretty much points to this problem not being fixed by that workflow. Sorry :(
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
I haven't ever seen the odd dates persist past a clobbering, anyway; what they do is reappear afterward. My guess without having looked at the code was that despite looking like you are clobbering "everything that runs Linux64 PGO on this branch" you are actually clobbering "the set of slaves which the clobberer thinks ought to be clobbered" and that that set of slaves is wrong, so that bld-linux64-ec2-305 wasn't ever included in any of the clobberer uses that happened in between.

Guessing from view-source, if I clobbered inbound right now, I'd clobber Linux64 PGO for
 "783403|779367|757481|761127|803615|804559|726463|805991|805813|771511|792723|737043|781539|766373|809385|812787|808199|807225|815097|814133|815461|809381|814665|817683|813735|809679|807361|736017|760525|787085|797241|797133|774499|793017|796895|805719|823299|764091|764701|766797|767109|772159|765399|808455|793649|793479|798143|754603|786839|753737|794205|755853|779865|731099|783587|794567|787385|787913|792349|741923|776561|785829|773963|704263|795995|752367|792079|790371|801037|825579|823771|753279|789793|765793|775149|818397|764269|762319|780721|753869|785347|752991|780257|730861|810367|816867|791533|741919|763435|823967|755191|777189|802033|784143|797041|805511|760243|804885|793913|759939|758545|758133|805035|794699|799371|733771|736693|789463|734085|733891|782527|622333|811115|778527|796441|822089|767819|802839|776313|763465|782759" 

whatever slaves those might be. Is that actually the set of slaves which have a Linux64 PGO inbound objdir, or is there a way for a slave to currently have a m-in-l64-pgo-00000000000000000 without being in that set?
The six-to-last id in your list is 822089. The matching id in the builds table of the clobberer db is bld-linux64-ec2-305 for 'Linux x86-64 mozilla-inbound pgo-build' in 'm-in-l64-pgo-00000000000000000'. So I think a clobber would have an effect for this slave, if you were to do one now.

The clobberer output in the log in comment #0 is misleading, ideally it would be
  m-in-l64-pgo-00000000000000000:Our last clobber date:  2013-10-30 12:00:31    ## this has not changed
  m-in-l64-pgo-00000000000000000:Server clobber date:    2013-10-30 11:49:15    ## this line has changed

In a perfect world, using the clobberer webUI should set some kind of 'every slave' row in the db, rather than just the slaves it knows did builds recently. And the webUI table would show something like 
   all: <some_time>, <specific_slave>: <some_later_time>
(where the specific_slave stuff comes from cancelled builds, and we ignore any of those older than <some_time>). 

That the build at 18:00 failed is either a build system problem or a slave-specific one, because there was no clobber between the 12:00 and 18:00 builds that I can find.


More details:
bld-linux64-ec2-305 has this history on this job:

Builder name	                       Build #  Master		Result	Start time (sorted desc)
Linux x86-64 mozilla-inbound pgo-build	209	bm58-build1	2	2013-10-30 18:00:15
Linux x86-64 mozilla-inbound pgo-build	208	bm58-build1	0	2013-10-30 12:00:13
Linux x86-64 mozilla-inbound pgo-build	212	bm64-build1	0	2013-09-25 00:09:56

There's a cron job that trims the clobberer db of rows which are more than 21 days old, so the 09-25 reference was gone when the slave came back to this work on 10-30. That doesn't matter because 21 days is greater than the 7 day limit for stale objdirs, and the job at 12:00 on the 30th was a purged clobber.

From comment #2 it looks like the next clobber was at 19:37, which is confirmed by:
mysql> select distinct from_unixtime(lastclobber) as c_time from clobber_times where builddir='m-in-l64-pgo-00000000000000000' order by c_time desc limit 6;
+---------------------+
| c_time              |
+---------------------+
| 2013-10-31 13:10:29 |
| 2013-10-31 13:09:24 |
| 2013-10-30 20:42:54 |
| 2013-10-30 19:37:18 |
| 2013-10-30 11:49:15 |
| 2013-10-29 14:59:01 |
+---------------------+
6 rows in set (0.03 sec)

ie purged clobber when it built at 12:00, no clobber set before job at 18:00, build blew up all by itself.
The ids in this SQL select were copied from the HTML.  Note the differing builddir.
Differing builddir? I filed a dupe of my own bug 626164 from 2011?
Flags: needinfo?(catlee)
And, to compound my awesome, the thing I filed this about turns out to be bug 934166, not that we were failing to clobber but just that the (vanishingly rare) non-clobber PGO builds were in fact busted.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → DUPLICATE
Well, we *did* have builds that were supposed to be clobbered that weren't.
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: