Closed
Bug 728104
Opened 12 years ago
Closed 12 years ago
AggregatingScheduler resets its state on reconfig
Categories
(Release Engineering :: General, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rail, Assigned: rail)
Details
(Whiteboard: [automation])
Attachments
(2 files)
4.74 KB,
patch
|
bhearsum
:
review+
rail
:
checked-in+
|
Details | Diff | Splinter Review |
1.17 KB,
patch
|
catlee
:
review+
dustin
:
feedback+
rail
:
checked-in+
|
Details | Diff | Splinter Review |
During 11.0b3 release we needed to reconfig the master to start another release. Linux build was finished before reconfig happened. Post signing steps are triggered after all builds and l10n repacks are done, but this never happened. After a quick investigation I saw that the scheduler DB was in a weird state: | 3746 | release-mozilla-beta-firefox_deliverables_ready | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-firefox_source", "release-mozilla-beta-linux_build", "release-mozilla-beta-linux_repack_complete"], "lastReset": 1329372616.24244, "lastCheck": 1329404507.84547} | | 3752 | release-mozilla-beta-firefox_signing_done | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-linux_build", "release-mozilla-beta-linux_repack_complete"], "lastReset": 1329372616.224206, "lastCheck": 1329404507.889406} To work around I ran the following SQL statements: UPDATE schedulers SET state='{"remainingBuilders": ["release-mozilla-beta-linux_repack_complete"], "lastReset": 1329372616.224206, "lastCheck": 1329405704.6713779}' WHERE name='release-mozilla-beta-firefox_signing_done'; UPDATE schedulers SET state='{"remainingBuilders": ["release-mozilla-beta-linux_repack_complete"], "lastReset": 1329372616.24244, "lastCheck": 1329406064.67486}' WHERE name='release-mozilla-beta-firefox_deliverables_ready'; and triggered release-mozilla-beta-linux_repack_complete to make it work. It looks like there is a logic mistake in the scheduler code, which resets the state in case if reconfig happens in the middle of the process. Patch incoming
Assignee | ||
Comment 1•12 years ago
|
||
Attachment #598100 -
Flags: review?(bhearsum)
Assignee | ||
Comment 2•12 years ago
|
||
Comment on attachment 598100 [details] [diff] [review] smarter reconfig Review of attachment 598100 [details] [diff] [review]: ----------------------------------------------------------------- This patch incorporates "more logging" patch from bug 727849. It passes minimal staging testing. ::: scheduler.py @@ +389,5 @@ > state['remainingBuilders'].remove(b) > # Add new upstream builders > for b in self.upstreamBuilders: > + if b not in state.get('upstreamBuilders', []) and \ > + b not in state['remainingBuilders']: Check if builders from self.upstreamBuilders are already in one of the state['upstreamBuilders'] or state['remainingBuilders']. I had to use state.get('upstreamBuilders', []) here because we have no such variable in the current state. @@ +394,2 @@ > state['remainingBuilders'].append(b) > + state['upstreamBuilders'] = self.upstreamBuilders Reset it to the new value.
Comment 3•12 years ago
|
||
Comment on attachment 598100 [details] [diff] [review] smarter reconfig Review of attachment 598100 [details] [diff] [review]: ----------------------------------------------------------------- r=me with the comment addition! ::: scheduler.py @@ +389,5 @@ > state['remainingBuilders'].remove(b) > # Add new upstream builders > for b in self.upstreamBuilders: > + if b not in state.get('upstreamBuilders', []) and \ > + b not in state['remainingBuilders']: Can you add a comment to the code about why we only do this conditionally?
Attachment #598100 -
Flags: review?(bhearsum) → review+
Assignee | ||
Comment 4•12 years ago
|
||
Comment on attachment 598100 [details] [diff] [review] smarter reconfig http://hg.mozilla.org/build/buildbotcustom/rev/8d4e66f79f5d
Attachment #598100 -
Flags: checked-in+
Assignee | ||
Comment 5•12 years ago
|
||
Saw this in the logs: 2012-02-21 08:13:23-0800 [-] Unhandled Error Traceback (most recent call last): File "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap self.__bootstrap_inner() File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner self.run() File "/usr/lib64/python2.6/threading.py", line 484, in run self.__target(*self.__args, **self.__kwargs) --- <exception caught here> --- File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/python/threadpool.py", line 207, in _worker result = context.call(ctx, function, *args, **kwargs) File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction result = interaction(trans, *args, **kw) File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbotcustom/scheduler.py", line 448, in _run File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbotcustom/scheduler.py", line 455, in processRequest propfuncs = pf File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/schedulers/base.py", lin e 90, in get_state return self.parent.db.scheduler_get_state(self.schedulerid, t) File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 6 57, in scheduler_get_state return json.loads(state_json) File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads return _default_decoder.decode(s) File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/lib64/python2.6/json/decoder.py", line 336, in raw_decode obj, end = self._scanner.iterscan(s, **kw).next() File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan rval, next_pos = action(m, context) File "/usr/lib64/python2.6/json/decoder.py", line 183, in JSONObject value, end = iterscan(s, idx=end, context=context).next() File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan rval, next_pos = action(m, context) File "/usr/lib64/python2.6/json/decoder.py", line 217, in JSONArray value, end = iterscan(s, idx=end, context=context).next() File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan rval, next_pos = action(m, context) File "/usr/lib64/python2.6/json/decoder.py", line 155, in JSONString return scanstring(match.string, match.end(), encoding, strict) exceptions.ValueError: Unterminated string starting at: line 1 column 991 (char 991) It turns out that mysql silently trims the state column to 1024 characters: | 3731 | release-mozilla-release-firefox_deliverables_ready | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-release-updates", "release-mozilla-release-firefox_source", "release-mozilla-release-linux_build", "release-mozilla-release-linux_repack_complete", "release-mozilla-release-linux64_build", "release-mozilla-release-win32_build", "release-mozilla-release-macosx64_build", "release-mozilla-release-macosx64_repack_complete", "release-mozilla-release-linux_partner_repack", "release-mozilla-release-linux64_repack_complete", "release-mozilla-release-win32_repack_complete", "release-mozilla-release-linux64_partner_repack", "release-mozilla-release-win32_partner_repack", "release-mozilla-release-macosx64_partner_repack"], "upstreamBuilders": ["release-mozilla-release-firefox_source", "release-mozilla-release-linux_build", "release-mozilla-release-linux_repack_complete", "release-mozilla-release-linux64_build", "release-mozilla-release-linux64_repack_complete", "release-mozilla-release-win32_build", "release-mozilla-release-win32_repack_complete", "release-mozilla-release-macosx64 |
Assignee | ||
Comment 6•12 years ago
|
||
Dustin, do you think that increasing schedulers.state field (VARCHAR(4096) ?) is the right way to go?
Comment 7•12 years ago
|
||
That seems reasonable, although it may have adverse consequences on the DB. Sheeri, any particular reason not to do so (using MyISAM, btw)?
Comment 8•12 years ago
|
||
Sure, that's fine. If you're worried about having to do it again, if things increase past 4k, you might want to just change it to a TEXT type - it's also variable length, and the overhead is the same - 2 bytes per field, per row. http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
Assignee | ||
Comment 9•12 years ago
|
||
I'm going to test this in stating with the following changes: ALTER TABLE schedulers CHANGE COLUMN state state TEXT;
Attachment #601002 -
Flags: feedback?(dustin)
Comment 10•12 years ago
|
||
Comment on attachment 601002 [details] [diff] [review] sql schema changes wow, that's a blast from the past! :)
Attachment #601002 -
Flags: feedback?(dustin) → feedback+
Comment 11•12 years ago
|
||
Do you want this table to be InnoDB or MyISAM? And what's the default charset for the table, utf8 or latin1? It'd be great to have that explicitly set, just in case the defaults aren't what you want.
Assignee | ||
Comment 12•12 years ago
|
||
I have no preferences here. Dustin?
Comment 13•12 years ago
|
||
Buildbot requires MyISAM due to index-length limits in InnoDB. I have no idea what teh charset is. For new installs of Buildbot, it's utf8. I wouldn't worry about that in tables.sql, though - IIRC that file needs to work properly for sqlite and mysql.
Assignee | ||
Comment 14•12 years ago
|
||
As a temporary solution for one busted scheduler I ran this: Select all schedulers where state field doesn't have a closing '}' mysql> select schedulerid, name, class_name from schedulers where name like 'release-mozilla-%' and substring(state, -1) != '}'; +-------------+----------------------------------------------------+-----------------------------------------------+ | schedulerid | name | class_name | +-------------+----------------------------------------------------+-----------------------------------------------+ | 3731 | release-mozilla-release-firefox_deliverables_ready | buildbotcustom.scheduler.AggregatingScheduler | +-------------+----------------------------------------------------+-----------------------------------------------+ 1 row in set (0.00 sec) "Fix" the state by putting there fake entries. I used one of the real builders to prevent state reset on reconfig (reconfig compares real builder names and the ones in the state column and removes not real ones). mysql> UPDATE schedulers SET state='{"remainingBuilders": ["release-mozilla-release-updates"], "upstreamBuilders": ["release-mozilla-release-updates"], "lastReset": 1329372510.640672, "lastCheck": 1329840293.9109211}' WHERE schedulerid=3731; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0 The state of this particular scheduler will be explicitly reset as a part of 11.0 automation.
Assignee | ||
Comment 15•12 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #9) > I'm going to test this in stating with the following changes: > > ALTER TABLE schedulers CHANGE COLUMN state state TEXT; Sheeri, do you think that running the statement above requires a tree closure? We have 3933 entries in schedulers table.
Assignee | ||
Comment 16•12 years ago
|
||
So far everything looks nice in staging. I converted the column without any data loss. The scheduler work fine with the new column type. Upgrade scenario: 0) Do it early morning EST 1) backup data: mysqldump -h db_host -u buildbot -p db_name schedulers> schedulers.sql 2) ALTER TABLE schedulers CHANGE COLUMN state state TEXT; 3) Watch twistd.log for exceptions on all masters If something goes wrong: * Revert column conversion: ALTER TABLE schedulers CHANGE COLUMN state state VARCHAR(1024); ** Inspect state column data *** If something looks not good try to restore that particular row from the dump * Repeat the same using VARCHAR(4096)
Comment 17•12 years ago
|
||
I don't think running the statement will take very long with <4000 rows. No tree closure needed. Just to be on the safe side, I'd specifically put UTF8 in the definition - just in case: ALTER TABLE schedulers CHANGE COLUMN state state TEXT CHARACTER SET utf8;
Status: NEW → ASSIGNED
Assignee | ||
Comment 18•12 years ago
|
||
on bm07 1) dumped schedulers to ~/bug728104_schedulers_dump.sql 2) mysql> ALTER TABLE schedulers CHANGE COLUMN state state TEXT CHARACTER SET utf8; Query OK, 3935 rows affected (0.14 sec) Records: 3935 Duplicates: 0 Warnings: 0
Assignee | ||
Updated•12 years ago
|
Attachment #601002 -
Flags: review?(catlee)
Updated•12 years ago
|
Attachment #601002 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 19•12 years ago
|
||
Comment on attachment 601002 [details] [diff] [review] sql schema changes http://hg.mozilla.org/build/buildbot/rev/df68f41e240f
Attachment #601002 -
Flags: checked-in+
Assignee | ||
Comment 20•12 years ago
|
||
All done here.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•