Closed Bug 728104 Opened 12 years ago Closed 12 years ago

AggregatingScheduler resets its state on reconfig

Categories

(Release Engineering :: General, defect, P1)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: rail)

Details

(Whiteboard: [automation])

Attachments

(2 files)

During 11.0b3 release we needed to reconfig the master to start another release. Linux build was finished before reconfig happened. Post signing steps are triggered after all builds and l10n repacks are done, but this never happened. After a quick investigation I saw that the scheduler DB was in a weird state:

|        3746 | release-mozilla-beta-firefox_deliverables_ready  | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-firefox_source", "release-mozilla-beta-linux_build", "release-mozilla-beta-linux_repack_complete"], "lastReset": 1329372616.24244, "lastCheck": 1329404507.84547}                                                                                                                                    |
|        3752 | release-mozilla-beta-firefox_signing_done        | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-linux_build", "release-mozilla-beta-linux_repack_complete"], "lastReset": 1329372616.224206, "lastCheck": 1329404507.889406} 

To work around I ran the following SQL statements:
UPDATE schedulers SET state='{"remainingBuilders": ["release-mozilla-beta-linux_repack_complete"], "lastReset": 1329372616.224206, "lastCheck": 1329405704.6713779}' WHERE name='release-mozilla-beta-firefox_signing_done';

UPDATE schedulers SET state='{"remainingBuilders": ["release-mozilla-beta-linux_repack_complete"], "lastReset": 1329372616.24244, "lastCheck": 1329406064.67486}' WHERE name='release-mozilla-beta-firefox_deliverables_ready';

and triggered release-mozilla-beta-linux_repack_complete to make it work.

It looks like there is a logic mistake in the scheduler code, which resets the state in case if reconfig happens in the middle of the process.

Patch incoming
Attached patch smarter reconfigSplinter Review
Attachment #598100 - Flags: review?(bhearsum)
Comment on attachment 598100 [details] [diff] [review]
smarter reconfig

Review of attachment 598100 [details] [diff] [review]:
-----------------------------------------------------------------

This patch incorporates "more logging" patch from bug 727849. It passes minimal staging testing.

::: scheduler.py
@@ +389,5 @@
>                  state['remainingBuilders'].remove(b)
>          # Add new upstream builders
>          for b in self.upstreamBuilders:
> +            if b not in state.get('upstreamBuilders', []) and \
> +               b not in state['remainingBuilders']:

Check if builders from self.upstreamBuilders are already in one of the state['upstreamBuilders'] or state['remainingBuilders']. I had to use state.get('upstreamBuilders', []) here because we have no such variable in the current state.

@@ +394,2 @@
>                  state['remainingBuilders'].append(b)
> +        state['upstreamBuilders'] = self.upstreamBuilders

Reset it to the new value.
Comment on attachment 598100 [details] [diff] [review]
smarter reconfig

Review of attachment 598100 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the comment addition!

::: scheduler.py
@@ +389,5 @@
>                  state['remainingBuilders'].remove(b)
>          # Add new upstream builders
>          for b in self.upstreamBuilders:
> +            if b not in state.get('upstreamBuilders', []) and \
> +               b not in state['remainingBuilders']:

Can you add a comment to the code about why we only do this conditionally?
Attachment #598100 - Flags: review?(bhearsum) → review+
Saw this in the logs:

2012-02-21 08:13:23-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
            self.run()
          File "/usr/lib64/python2.6/threading.py", line 484, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/python/threadpool.py", line 207, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
            return func(*args,**kw)
          File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction
            result = interaction(trans, *args, **kw)
          File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbotcustom/scheduler.py", line 448, in _run

          File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbotcustom/scheduler.py", line 455, in processRequest
            propfuncs = pf
          File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/schedulers/base.py", lin
e 90, in get_state
            return self.parent.db.scheduler_get_state(self.schedulerid, t)
          File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 6
57, in scheduler_get_state
            return json.loads(state_json)
          File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
            return _default_decoder.decode(s)
          File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
            obj, end = self.raw_decode(s, idx=_w(s, 0).end())
          File "/usr/lib64/python2.6/json/decoder.py", line 336, in raw_decode
            obj, end = self._scanner.iterscan(s, **kw).next()
          File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
            rval, next_pos = action(m, context)
          File "/usr/lib64/python2.6/json/decoder.py", line 183, in JSONObject
            value, end = iterscan(s, idx=end, context=context).next()
          File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
            rval, next_pos = action(m, context)
          File "/usr/lib64/python2.6/json/decoder.py", line 217, in JSONArray
            value, end = iterscan(s, idx=end, context=context).next()
          File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
            rval, next_pos = action(m, context)
          File "/usr/lib64/python2.6/json/decoder.py", line 155, in JSONString
            return scanstring(match.string, match.end(), encoding, strict)
        exceptions.ValueError: Unterminated string starting at: line 1 column 991 (char 991)



It turns out that mysql silently trims the state column to 1024 characters:
|        3731 | release-mozilla-release-firefox_deliverables_ready | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-release-updates", "release-mozilla-release-firefox_source", "release-mozilla-release-linux_build", "release-mozilla-release-linux_repack_complete", "release-mozilla-release-linux64_build", "release-mozilla-release-win32_build", "release-mozilla-release-macosx64_build", "release-mozilla-release-macosx64_repack_complete", "release-mozilla-release-linux_partner_repack", "release-mozilla-release-linux64_repack_complete", "release-mozilla-release-win32_repack_complete", "release-mozilla-release-linux64_partner_repack", "release-mozilla-release-win32_partner_repack", "release-mozilla-release-macosx64_partner_repack"], "upstreamBuilders": ["release-mozilla-release-firefox_source", "release-mozilla-release-linux_build", "release-mozilla-release-linux_repack_complete", "release-mozilla-release-linux64_build", "release-mozilla-release-linux64_repack_complete", "release-mozilla-release-win32_build", "release-mozilla-release-win32_repack_complete", "release-mozilla-release-macosx64 |
Dustin, do you think that increasing schedulers.state field (VARCHAR(4096) ?) is the right way to go?
That seems reasonable, although it may have adverse consequences on the DB.  Sheeri, any particular reason not to do so (using MyISAM, btw)?
Sure, that's fine. If you're worried about having to do it again, if things increase past 4k, you might want to just change it to a TEXT type - it's also variable length, and the overhead is the same - 2 bytes per field, per row. http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
I'm going to test this in stating with the following changes:

ALTER TABLE schedulers CHANGE COLUMN state state TEXT;
Attachment #601002 - Flags: feedback?(dustin)
Comment on attachment 601002 [details] [diff] [review]
sql schema changes

wow, that's a blast from the past! :)
Attachment #601002 - Flags: feedback?(dustin) → feedback+
Do you want this table to be InnoDB or MyISAM? And what's the default charset for the table, utf8 or latin1? It'd be great to have that explicitly set, just in case the defaults aren't what you want.
I have no preferences here. Dustin?
Buildbot requires MyISAM due to index-length limits in InnoDB.  I have no idea what teh charset is.  For new installs of Buildbot, it's utf8.  I wouldn't worry about that in tables.sql, though - IIRC that file needs to work properly for sqlite and mysql.
As a temporary solution for one busted scheduler I ran this:

Select all schedulers where state field doesn't have a closing '}'

mysql> select schedulerid, name, class_name from schedulers where name like 'release-mozilla-%' and substring(state, -1) != '}';
+-------------+----------------------------------------------------+-----------------------------------------------+
| schedulerid | name                                               | class_name                                    |
+-------------+----------------------------------------------------+-----------------------------------------------+
|        3731 | release-mozilla-release-firefox_deliverables_ready | buildbotcustom.scheduler.AggregatingScheduler | 
+-------------+----------------------------------------------------+-----------------------------------------------+
1 row in set (0.00 sec)



"Fix" the state by putting there fake entries. I used one of the real builders to prevent state reset on reconfig (reconfig compares real builder names and the ones in the state column and removes not real ones).

mysql> UPDATE schedulers SET state='{"remainingBuilders": ["release-mozilla-release-updates"], "upstreamBuilders": ["release-mozilla-release-updates"], "lastReset": 1329372510.640672, "lastCheck": 1329840293.9109211}' WHERE schedulerid=3731;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0


The state of this particular scheduler will be explicitly reset as a part of 11.0 automation.
(In reply to Rail Aliiev [:rail] from comment #9)
> I'm going to test this in stating with the following changes:
> 
> ALTER TABLE schedulers CHANGE COLUMN state state TEXT;

Sheeri, do you think that running the statement above requires a tree closure? We have 3933 entries in schedulers table.
So far everything looks nice in staging. I converted the column without any data loss. The scheduler work fine with the new column type.

Upgrade scenario:

0) Do it early morning EST
1) backup data: mysqldump -h db_host -u buildbot -p db_name schedulers> schedulers.sql
2) ALTER TABLE schedulers CHANGE COLUMN state state TEXT;
3) Watch twistd.log for exceptions on all masters

If something goes wrong:

* Revert column conversion: ALTER TABLE schedulers CHANGE COLUMN state state VARCHAR(1024);
  ** Inspect state column data
    *** If something looks not good try to restore that particular row from the dump
* Repeat the same using VARCHAR(4096)
I don't think running the statement will take very long with <4000 rows. No tree closure needed.

Just to be on the safe side, I'd specifically put UTF8 in the definition - just in case:

ALTER TABLE schedulers CHANGE COLUMN state state TEXT CHARACTER SET utf8;
Status: NEW → ASSIGNED
on bm07
1) dumped schedulers to ~/bug728104_schedulers_dump.sql

2)
mysql> ALTER TABLE schedulers CHANGE COLUMN state state TEXT CHARACTER SET utf8;
Query OK, 3935 rows affected (0.14 sec)
Records: 3935  Duplicates: 0  Warnings: 0
Attachment #601002 - Flags: review?(catlee)
Attachment #601002 - Flags: review?(catlee) → review+
All done here.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: