Closed
Bug 608886
Opened 15 years ago
Closed 15 years ago
Multiple jobs run for some try server pushes
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ehsan.akhgari, Assigned: dustin)
Details
Today I pushed c1e8a6dca18a to the try server, with the following trychooser syntax:
try: -b d -p macosx64 -m none -u mochitest-o -t none
which should only do a macosx64 build and run mochitest-other on it. However, I got two build jobs and two test runs, one of each being entirely useless to me, and just wasting resources.
There also seem to be two build jobs on my other push with the same trychooser string (f8bf2e7c8eb9), and looking at the try server in general, this seems to be happening for some other pushes as well.
I also have two build and test logs in <http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/eakhgari@mozilla.com-c1e8a6dca18a/tryserver-macosx64-debug/>, but it seems that the second build uploaded has overwritten the first one.
Comment 1•15 years ago
The two debug compile jobs for c1e8a6dca18a have this data:
Submitted            Started              Finished
2010-11-01 12:55:28  2010-11-01 12:56:20  2010-11-01 14:54:52
2010-11-01 12:55:28  2010-11-01 13:56:21  2010-11-01 15:48:44
They were both done on production-master02, with reason 'scheduler'. Strangely, they started almost exactly an hour apart; the clocks on that master and production-master are both correct as of now. I can't see anything in the logs for the try or scheduler masters that helps explain what's going on.
On f8bf2e7c8eb9, there is:
Submitted            Started              Finished
2010-11-01 17:23:28  2010-11-01 17:23:57  2010-11-01 19:18:18
2010-11-01 17:23:28  2010-11-01 18:24:20
and interestingly the unfinished build already had a buildrequests.result of 0. And on d56744be8596 there are now four windows builds running.
A regression from the changes deployed this morning?
Comment 2•15 years ago
It doesn't look like either pm01:9009 (scheduler) or pm02:8011 (try) was reconfigured this morning, so this shouldn't be a regression. It looks more like something thinks the job is dead and reschedules it, running on an hourly schedule.
Summary: Double jobs run for some try server pushes → Multiple jobs run for some try server pushes
Updated•15 years ago
Assignee: nobody → dustin
Comment 3•15 years ago
Nick's diagnosis is exactly right - Buildbot has code to restart "old" buildrequests -- but never refreshes the claimed_at timestamp after initially claiming them. The real fix is to refresh these timestamps - http://buildbot.net/trac/ticket/1035.
The more immediate fix within Mozilla is just to increase the RECLAIM_INTERVAL to something much larger than one hour. In the short term, I think the additional cost of redundant builds every hour is higher than the cost of "lost" builds in the db.
I propose we do this at the next restart.
Status: NEW → ASSIGNED
Comment 4•15 years ago
The first time I see multiple builds is http://hg.mozilla.org/try/rev/80fda7cced59, pushed at 8:03 today. An earlier (all platforms) push, http://hg.mozilla.org/try/rev/6e99f7c6b6b1 at 4:21 didn't have a problem. Strangely it doesn't happen all the time, and happens more often for windows builds.
Comment 5•15 years ago
Catlee pointed out that my assessment was incorrect: builders do try to update the claimed_at counter every 10 minutes. We should try to replicate this in a more controlled circumstance to narrow down the cause.
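To make the timing concrete, here is a minimal sketch of the kind of staleness check being discussed. It is not Buildbot's actual code: `is_reclaimable` and `parse` are hypothetical names, the one-hour `RECLAIM_INTERVAL` is assumed from Comment 3, and the claim time is assumed to equal the first build's start time from Comment 1.

```python
from datetime import datetime, timedelta

# Assumed value: the thread implies the interval is currently one hour.
RECLAIM_INTERVAL = timedelta(hours=1)


def parse(ts):
    """Parse a timestamp in the format used in Comment 1."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def is_reclaimable(claimed_at, now, reclaim_interval=RECLAIM_INTERVAL):
    """Hypothetical check: a claim older than the interval looks abandoned."""
    return now - claimed_at > reclaim_interval


# Assumption: the first build's claim time equals its start time.
claimed_at = parse("2010-11-01 12:56:20")
second_start = parse("2010-11-01 13:56:21")

# No refresh ever happened: the claim is just over an hour old.
print(is_reclaimable(claimed_at, second_start))  # True

# A refresh ten minutes earlier would have kept the claim fresh.
refreshed_at = second_start - timedelta(minutes=10)
print(is_reclaimable(refreshed_at, second_start))  # False
```

Under these assumptions, the second c1e8a6dca18a build starting at 13:56:21 is what you would expect if the periodic refresh never ran for that request.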
Comment 6•15 years ago
http://production-master02.build.mozilla.org:8011/builders/Linux%20x86-64%20tryserver%20build/builds/4216
http://production-master02.build.mozilla.org:8011/builders/Linux%20x86-64%20tryserver%20build/builds/4218
The second started while the first was in the middle of compiling...
Comment 7•15 years ago
Lots of occurrences of this since November 1st. None from Oct 20th to the 1st.
Comment 8•15 years ago
So the cause of this was an exception raised inside the reclaimAllBuilds call when DNS went away temporarily and the master couldn't resolve the hostname of the database server. The uncaught exception broke the TimerService, so the periodic reclaim stopped running.
2010-11-01 11:27:16-0700 [-] Unhandled Error
Traceback (most recent call last):
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/base.py", line 1166, in run
    self.mainLoop()
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/base.py", line 1175, in mainLoop
    self.runUntilCurrent()
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/base.py", line 779, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/task.py", line 194, in __call__
    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
--- <exception caught here> ---
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 102, in maybeDeferred
    result = f(*args, **kw)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/process/builder.py", line 653, in reclaimAllBuilds
    self.master_incarnation, brids)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/connector.py", line 822, in claim_buildrequests
    now, master_name, master_incarnation, brids)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
    return self._runInteractionNow(interaction, *args, **kwargs)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/connector.py", line 234, in _runInteractionNow
    conn = self.get_sync_connection()
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/connector.py", line 228, in get_sync_connection
    self._nonpool = self._spec.get_sync_connection()
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/dbspec.py", line 250, in get_sync_connection
    conn = dbapi.connect(*self.connargs, **connkw)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/MySQL_python-1.2.3c1-py2.6-linux-i686.egg/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/MySQL_python-1.2.3c1-py2.6-linux-i686.egg/MySQLdb/connections.py", line 188, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2005, "Unknown MySQL server host 'tm-b01-master01.mozilla.org' (2)")
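The failure mode can be illustrated with a toy model. This is only a sketch, not Twisted's or Buildbot's implementation: `ToyTimer`, `make_reclaimer`, `guarded`, and `simulate` are hypothetical names, and the outage is simulated by raising `OSError` on one tick. It shows a periodic callback that stops permanently after one uncaught exception, and how catching the error inside the callback keeps later refreshes running.

```python
class ToyTimer:
    """Hypothetical stand-in for a periodic timer that stops for good
    once its callback raises (not Twisted's actual implementation)."""

    def __init__(self, callback):
        self.callback = callback
        self.running = True

    def tick(self, now):
        if not self.running:
            return
        try:
            self.callback(now)
        except Exception:
            # An uncaught error kills the loop; nothing restarts it.
            self.running = False


def make_reclaimer(refreshed, outage_ticks):
    """Build a callback that records refreshes and fails during an outage."""
    def reclaim(now):
        if now in outage_ticks:
            raise OSError("Unknown MySQL server host (simulated)")
        refreshed.append(now)
    return reclaim


def guarded(callback, errors):
    """Wrap a callback so a failure is recorded and the timer keeps running."""
    def wrapper(now):
        try:
            callback(now)
        except Exception as exc:
            errors.append((now, str(exc)))
    return wrapper


def simulate(guard):
    """Run six ticks with a simulated outage at tick 3."""
    refreshed, errors = [], []
    callback = make_reclaimer(refreshed, outage_ticks={3})
    if guard:
        callback = guarded(callback, errors)
    timer = ToyTimer(callback)
    for now in range(6):
        timer.tick(now)
    return refreshed, timer.running


print(simulate(guard=False))  # ([0, 1, 2], False)
print(simulate(guard=True))   # ([0, 1, 2, 4, 5], True)
```

In the unguarded run, no refresh happens after the outage even though the outage lasts a single tick, which matches the symptom of claims going stale long after DNS recovered.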
Comment 9•15 years ago
Reconfiguring the try master to see if that fixes it (it should re-instantiate the TimerService objects...)
Comment 10•15 years ago
The reconfig seems to have fixed it.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•12 years ago
Product: mozilla.org → Release Engineering