Closed
Bug 933090
Opened 11 years ago
Closed 11 years ago
Clobberer doesn't clobber right somehow
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 934166
People
(Reporter: philor, Assigned: jhopkins)
Details
(Keywords: sheriffing-P1)
Attachments
(1 file)
90.83 KB,
text/plain
|
Details |
https://tbpl.mozilla.org/php/getParsedLog.php?id=29901037&tree=Mozilla-Inbound is a failure that should have long since been wiped away by clobbering, and then by probably a dozen more clobbers for other reasons later on. In the clobber step, it claims m-in-l64-pgo-00000000000000000:Our last clobber date: 2013-10-30 12:00:31 m-in-l64-pgo-00000000000000000:Server clobber date: 2013-10-15 19:47:52 which is certainly not intended in the case of the server clobber date, because we've done what we think should cause it to be clobbered in the clobberer UI over and over and over since 10-15. According to the build log, it was not in fact a clobber, so it has two problems: it lies about having clobbered, and the server failed to note our desire over and over to have it clobber.
Comment 1•11 years ago
|
||
catlee, this has pretty serious implications for sheriffing and meaning we may make incorrect conclusions as to the cause of breakage - could you find someone to take a look? :-)
Flags: needinfo?(catlee)
Keywords: sheriffing-P1
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → jhopkins
Assignee | ||
Comment 2•11 years ago
|
||
The build in comment 0 checked clobber times at 2013-10-30 18:01:21.485014 for build slave bld-linux64-ec2-305 and branch mozilla-inbound and build dir m-in-l64-pgo-00000000000000000 Looking at the clobber history for that combination: mysql> select id,master,branch,builddir,slave,from_unixtime(lastclobber) from clobber_times where branch = 'mozilla-inbound' and builddir='m-in-l64-pgo-00000000000000000' and slave='bld-linux64-ec2-305' order by id desc limit 10; +----------+--------+-----------------+--------------------------------+---------------------+----------------------------+ | id | master | branch | builddir | slave | from_unixtime(lastclobber) | +----------+--------+-----------------+--------------------------------+---------------------+----------------------------+ | 17204301 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-30 20:42:54 | | 17160059 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-30 19:37:18 | | 15856267 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 19:47:52 | | 15837531 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 18:01:39 | | 15807291 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 13:08:07 | | 15774199 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 12:50:31 | | 15727175 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-15 10:56:03 | | 15668139 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-14 20:07:50 | | 15636077 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-14 12:11:20 | | 15603421 | NULL | mozilla-inbound | m-in-l64-pgo-00000000000000000 | bld-linux64-ec2-305 | 2013-10-14 11:24:44 | +----------+--------+-----------------+--------------------------------+---------------------+----------------------------+ 10 rows in set (0.00 sec) I can see that the clobber time for that combination - at the time it was checked - was indeed 2013-10-15 19:47:52
Assignee | ||
Comment 3•11 years ago
|
||
Next time you clobber, can you check whether there are multiple dates shown on the clobber page for "Last clobbered"? I'm wondering whether there have been script timeouts preventing all slaves from getting an updated clobber time. For example, I see multiple "last clobbered" dates on the mozilla-inbound clobber page for quite a few builders.
Reporter | ||
Comment 4•11 years ago
|
||
Yes, there are almost always multiple dates shown if it's been more than a few hours since the last clobber, this was just the first time we've had a really obvious "it's busted because it didn't clobber even though it's been multiple clobbers since we did clobber" thing to correlate with that.
Assignee | ||
Comment 5•11 years ago
|
||
RyanVM usually re-clobbers until the multiple entries have gone away and says he hasn't noticed any misbehaving clobbers after doing so. philor: can you please try using that workflow and reopen if you see the problem reappear?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Comment 6•11 years ago
|
||
Sorry, I think we were misunderstanding each other. Philor's point that there were dozen's more clobbers pretty much points to this problem not being fixed by that workflow. Sorry :(
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Reporter | ||
Comment 7•11 years ago
|
||
I haven't ever seen the odd dates persist past a clobbering, anyway; what they do is reappear afterward. My guess without having looked at the code was that despite looking like you are clobbering "everything that runs Linux64 PGO on this branch" you are actually clobbering "the set of slaves which the clobberer thinks ought to be clobbered" and that that set of slaves is wrong, so that bld-linux64-ec2-305 wasn't ever included in any of the clobberer uses that happened in between. Guessing from view-source, if I clobbered inbound right now, I'd clobber Linux64 PGO for "783403|779367|757481|761127|803615|804559|726463|805991|805813|771511|792723|737043|781539|766373|809385|812787|808199|807225|815097|814133|815461|809381|814665|817683|813735|809679|807361|736017|760525|787085|797241|797133|774499|793017|796895|805719|823299|764091|764701|766797|767109|772159|765399|808455|793649|793479|798143|754603|786839|753737|794205|755853|779865|731099|783587|794567|787385|787913|792349|741923|776561|785829|773963|704263|795995|752367|792079|790371|801037|825579|823771|753279|789793|765793|775149|818397|764269|762319|780721|753869|785347|752991|780257|730861|810367|816867|791533|741919|763435|823967|755191|777189|802033|784143|797041|805511|760243|804885|793913|759939|758545|758133|805035|794699|799371|733771|736693|789463|734085|733891|782527|622333|811115|778527|796441|822089|767819|802839|776313|763465|782759" whatever slaves those might be. Is that actually the set of slaves which have a Linux64 PGO inbound objdir, or is there a way for a slave to currently have a m-in-l64-pgo-00000000000000000 without being in that set?
Comment 8•11 years ago
|
||
The six-to-last id in your list is 822089. The matching id in the builds table of the clobberer db is bld-linux64-ec2-305 for 'Linux x86-64 mozilla-inbound pgo-build' in 'm-in-l64-pgo-00000000000000000'. So I think a clobber would have an effect for this slave, if you were to do one now. The clobberer output in the log in comment #0 is misleading, ideally it would be m-in-l64-pgo-00000000000000000:Our last clobber date: 2013-10-30 12:00:31 ## this has not changed m-in-l64-pgo-00000000000000000:Server clobber date: 2013-10-30 11:49:15 ## this line has changed In a perfect world, using the clobberer webUI should set some kind of 'every slave' row in the db, rather than just the slaves it knows did builds recently. And the webUI table would show something like all: <some_time>, <specific_slave>: <some_later_time> (where the specific_slave stuff comes from cancelled builds, and we ignore any of those older than <some_time>). That the build at 18:00 failed is either a build system problem or a slave-specific one, because there was no clobber between the 12:00 and 18:00 builds that I can find. More details: bld-linux64-ec2-305 has this history on this job: Builder name Build # Master Result Start time (sorted desc) Linux x86-64 mozilla-inbound pgo-build 209 bm58-build1 2 2013-10-30 18:00:15 Linux x86-64 mozilla-inbound pgo-build 208 bm58-build1 0 2013-10-30 12:00:13 Linux x86-64 mozilla-inbound pgo-build 212 bm64-build1 0 2013-09-25 00:09:56 There's a cron job that trims the clobberer db of rows which are more than 21 days old, so the 09-25 reference was gone when the slave came back to this work on 10-30. That doesn't matter because 21 days is greater than the 7 day limit for stale objdirs, and the job at 12:00 on the 30th was a purged clobber. From comment #2 it looks like the next clobber was at 19:37, which is confirmed by: mysql> select distinct from_unixtime(lastclobber) as c_time from clobber_times where builddir='m-in-l64-pgo-00000000000000000' order by c_time desc limit 6; +---------------------+ | c_time | +---------------------+ | 2013-10-31 13:10:29 | | 2013-10-31 13:09:24 | | 2013-10-30 20:42:54 | | 2013-10-30 19:37:18 | | 2013-10-30 11:49:15 | | 2013-10-29 14:59:01 | +---------------------+ 6 rows in set (0.03 sec) ie purged clobber when it built at 12:00, no clobber set before job at 18:00, build blew up all by itself.
Assignee | ||
Comment 9•11 years ago
|
||
The ids in this SQL select were copied from the HTML. Note the differing builddir.
Reporter | ||
Comment 10•11 years ago
|
||
Differing builddir? I filed a dupe of my own bug 626164 from 2011?
Updated•11 years ago
|
Flags: needinfo?(catlee)
Reporter | ||
Comment 11•11 years ago
|
||
And, to compound my awesome, the thing I filed this about turns out to be bug 934166, not that we were failing to clobber but just that the (vanishingly rare) non-clobber PGO builds were in fact busted.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → DUPLICATE
Comment 12•11 years ago
|
||
Well, we *did* have builds that were supposed to be clobbered that weren't.
Updated•7 years ago
|
Component: Tools → General
You need to log in
before you can comment on or make changes to this bug.
Description
•