Closed Bug 608886 Opened 15 years ago Closed 15 years ago

Multiple jobs run for some try server pushes

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Assigned: dustin)

Details

Today I pushed c1e8a6dca18a to the try server, with the following trychooser syntax: try: -b d -p macosx64 -m none -u mochitest-o -t none which should only do a macosx64 build and run mochitest-other on it. However, I got two build jobs and two test runs, one of each being entirely useless to me and just wasting resources. There also seem to be two build jobs on my other push with the same trychooser string (f8bf2e7c8eb9), and looking at the try server in general, this seems to be happening for some other pushes as well. I also have two build and test logs in <http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/eakhgari@mozilla.com-c1e8a6dca18a/tryserver-macosx64-debug/>, but it seems that the second build uploaded has overwritten the first one.
The two debug compile jobs for c1e8a6dca18a have this data

Submitted            Started              Finished
2010-11-01 12:55:28  2010-11-01 12:56:20  2010-11-01 14:54:52
2010-11-01 12:55:28  2010-11-01 13:56:21  2010-11-01 15:48:44

They were both done on production-master02, with reason 'scheduler'. Strange they started almost exactly an hour apart; the clocks on that master and production-master are both correct as of now. I can't see anything in the logs for the try or scheduler masters that helps explain what's going on. On f8bf2e7c8eb9, there is

Submitted            Started              Finished
2010-11-01 17:23:28  2010-11-01 17:23:57  2010-11-01 19:18:18
2010-11-01 17:23:28  2010-11-01 18:24:20

and interestingly the unfinished build already had a buildrequests.result of 0. And on d56744be8596 there are now four windows builds running. A regression from the changes deployed this morning?
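The "almost exactly an hour apart" observation can be checked from the Started timestamps quoted above. A small sketch using only those values; the helper name start_gap is made up for illustration:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (first build started, duplicate build started), copied from the comment above
started = {
    "c1e8a6dca18a": ("2010-11-01 12:56:20", "2010-11-01 13:56:21"),
    "f8bf2e7c8eb9": ("2010-11-01 17:23:57", "2010-11-01 18:24:20"),
}


def start_gap(first, second):
    """Return the timedelta between two 'Started' timestamps."""
    return datetime.strptime(second, FMT) - datetime.strptime(first, FMT)


gaps = {rev: start_gap(*pair) for rev, pair in started.items()}
for rev, gap in gaps.items():
    print(rev, gap)
```

Both gaps come out a few seconds over one hour, which fits a job being rescheduled on a fixed one-hour interval rather than two independent submissions.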
It doesn't look like either pm01:9009 (scheduler) or pm02:8011 (try) was reconfigured this morning, so it shouldn't be a regression. It's more like something thinks the job is dead and reschedules it, running on an hourly schedule.
Summary: Double jobs run for some try server pushes → Multiple jobs run for some try server pushes
Assignee: nobody → dustin
Nick's diagnosis is exactly right - Buildbot has code to restart "old" buildrequests -- but never refreshes the claimed_at timestamp after initially claiming it. The real fix is to refresh these timestamps - http://buildbot.net/trac/ticket/1035. The more immediate fix within Mozilla is just to increase the RECLAIM_INTERVAL to something much larger than one hour. In the short term, I think the additional cost of redundant builds every hour is higher than the cost of "lost" builds in the db. I propose we do this at the next restart.
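To illustrate the mechanism described in this comment: a claimed build request is treated as abandoned once its claimed_at timestamp is older than RECLAIM_INTERVAL, at which point it can be claimed and started again. This is a rough sketch, not Buildbot's actual code; is_reclaimable is a hypothetical helper, and the one-hour value is taken from the "one hour" mentioned above.

```python
RECLAIM_INTERVAL = 60 * 60  # seconds; "one hour" per the comment above


def is_reclaimable(claimed_at, now, reclaim_interval=RECLAIM_INTERVAL):
    """Hypothetical helper: True if a claim is old enough to be treated as dead.

    claimed_at -- epoch seconds of the last claim or refresh, None if unclaimed
    now        -- current epoch seconds
    """
    if claimed_at is None:
        return True
    return now - claimed_at > reclaim_interval


claimed = 1000
# A build still running 30 minutes after the claim is left alone.
still_owned = is_reclaimable(claimed, claimed + 30 * 60)
# With no refresh, the same request looks dead after an hour and gets rerun.
looks_dead = is_reclaimable(claimed, claimed + 61 * 60)
# A much larger interval (the proposed stopgap) avoids the hourly duplicate.
with_larger_interval = is_reclaimable(
    claimed, claimed + 61 * 60, reclaim_interval=24 * 60 * 60
)
```

The trade-off of the stopgap shows up in the last call: a larger interval stops the duplicates, but a request whose master really did die would also sit unclaimed for that long.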
Status: NEW → ASSIGNED
The first time I see multiple builds is http://hg.mozilla.org/try/rev/80fda7cced59, pushed at 8:03 today. An earlier (all platforms) push, http://hg.mozilla.org/try/rev/6e99f7c6b6b1 at 4:21 didn't have a problem. Strangely it doesn't happen all the time, and happens more often for windows builds.
Catlee pointed out that my assessment was incorrect: builders do try to update the claimed_at counter every 10 minutes. We should try to replicate this in a more controlled circumstance to narrow down the cause.
Lots of occurrences of this since November 1st. None from Oct 20th to the 1st.
So the cause of this was an exception raised inside the reclaimAllBuilds call when DNS went away temporarily and the master couldn't resolve the hostname of the database server. The uncaught exception broke the TimerService.

2010-11-01 11:27:16-0700 [-] Unhandled Error
    Traceback (most recent call last):
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/base.py", line 1166, in run
        self.mainLoop()
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/base.py", line 1175, in mainLoop
        self.runUntilCurrent()
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/base.py", line 779, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/task.py", line 194, in __call__
        d = defer.maybeDeferred(self.f, *self.a, **self.kw)
    --- <exception caught here> ---
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 102, in maybeDeferred
        result = f(*args, **kw)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/process/builder.py", line 653, in reclaimAllBuilds
        self.master_incarnation, brids)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/connector.py", line 822, in claim_buildrequests
        now, master_name, master_incarnation, brids)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
        return self._runInteractionNow(interaction, *args, **kwargs)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/connector.py", line 234, in _runInteractionNow
        conn = self.get_sync_connection()
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/connector.py", line 228, in get_sync_connection
        self._nonpool = self._spec.get_sync_connection()
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.1-py2.6.egg/buildbot/db/dbspec.py", line 250, in get_sync_connection
        conn = dbapi.connect(*self.connargs, **connkw)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/MySQL_python-1.2.3c1-py2.6-linux-i686.egg/MySQLdb/__init__.py", line 81, in Connect
        return Connection(*args, **kwargs)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/MySQL_python-1.2.3c1-py2.6-linux-i686.egg/MySQLdb/connections.py", line 188, in __init__
        super(Connection, self).__init__(*args, **kwargs2)
    _mysql_exceptions.OperationalError: (2005, "Unknown MySQL server host 'tm-b01-master01.mozilla.org' (2)")
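The failure mode can be sketched without Twisted or a database. The toy loop below is only a stand-in for a periodic timer that, as described in this comment, stops for good after the first uncaught exception. ToyTimer, make_flaky_reclaim and guarded are hypothetical names, not Buildbot or Twisted APIs. It shows one transient failure ending all later refreshes, and a guarded callback that records the error and carries on.

```python
class ToyTimer:
    """Stand-in for a periodic timer that stops after an uncaught exception."""

    def __init__(self, func):
        self.func = func
        self.running = True

    def tick(self):
        if not self.running:
            return
        try:
            self.func()
        except Exception:
            self.running = False  # the timer never fires again


def make_flaky_reclaim(fail_on_call):
    """Build a refresh callback that fails once, like a transient DNS outage."""
    state = {"calls": 0, "refreshes": 0}

    def reclaim():
        state["calls"] += 1
        if state["calls"] == fail_on_call:
            raise OSError("unknown database host (simulated)")
        state["refreshes"] += 1

    return reclaim, state


def guarded(func, errors):
    """Wrap a callback so failures are recorded instead of propagating."""
    def wrapper():
        try:
            func()
        except Exception as exc:
            errors.append(exc)
    return wrapper


# Unguarded: the failure on the 2nd tick stops all later refreshes.
reclaim, plain_state = make_flaky_reclaim(fail_on_call=2)
timer = ToyTimer(reclaim)
for _ in range(6):
    timer.tick()

# Guarded: the same failure is recorded and refreshing continues.
errors = []
reclaim, guarded_state = make_flaky_reclaim(fail_on_call=2)
timer = ToyTimer(guarded(reclaim, errors))
for _ in range(6):
    timer.tick()
```

In the unguarded run the callback is invoked only twice out of six ticks, so claims stop being refreshed and running builds eventually look abandoned. Catching and logging the error inside the periodic callback is one possible hardening; whether it suits the real reclaimAllBuilds code is an open question here.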
Reconfiguring the try master to see if that fixes it (it should re-instantiate the TimerService objects...)
The reconfig seems to have fixed it.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering