reconfigs should not rely on seta server

RESOLVED FIXED

Status

defect
--
major
RESOLVED FIXED
4 years ago
Last year

People

(Reporter: rail, Assigned: kmoir)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments, 12 obsolete attachments)

997 bytes, patch
jlund
: review+
kmoir
: review+
3.35 KB, patch
kmoir
: checked-in+
3.55 KB, patch
kmoir
: checked-in+
3.48 KB, patch
Reporter

Description

4 years ago
Hit an issue with a release reconfig when the seta server was returning 500 errors:

Requested: make checkconfig
Executed: /bin/bash -l -c "cd /builds/buildbot/tests_scheduler && make checkconfig"

=============================== Standard output ===============================

cd master && /builds/buildbot/tests_scheduler/bin/buildbot checkconfig
HTTPError = 500
Traceback (most recent call last):
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/runner.py", line 1042, in doCheckConfig
    ConfigLoader(configFileName=configFileName)
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/checkconfig.py", line 31, in __init__
    self.loadConfig(configFile, check_synchronously_only=True)
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
    exec f in localDict
  File "/builds/buildbot/tests_scheduler/master/master.cfg", line 7, in <module>
    import config
  File "/tmp/tmpQRFxAr/config.py", line 1966, in <module>
  File "/tmp/tmpQRFxAr/config_seta.py", line 113, in loadSkipConfig
  File "/tmp/tmpQRFxAr/config_seta.py", line 31, in get_seta_platforms
  File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/tools/python27/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/tools/python27/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/tools/python27/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/tools/python27/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
make: *** [checkconfig] Error 1
Reporter

Comment 1

4 years ago
Tried to checkconfig manually:

[cltbld@buildbot-master81.bb.releng.scl3.mozilla.com tests_scheduler]$ make checkconfig
cd master && /builds/buildbot/tests_scheduler/bin/buildbot checkconfig
HTTPError = 500
Traceback (most recent call last):
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/runner.py", line 1042, in doCheckConfig
    ConfigLoader(configFileName=configFileName)
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/checkconfig.py", line 31, in __init__
    self.loadConfig(configFile, check_synchronously_only=True)
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
    exec f in localDict
  File "/builds/buildbot/tests_scheduler/master/master.cfg", line 7, in <module>
    import config
  File "/tmp/tmpJkacwm/config.py", line 1966, in <module>
  File "/tmp/tmpJkacwm/config_seta.py", line 113, in loadSkipConfig
  File "/tmp/tmpJkacwm/config_seta.py", line 31, in get_seta_platforms
  File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/tools/python27/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/tools/python27/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/tools/python27/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/tools/python27/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
make: *** [checkconfig] Error 1
Bug 1176802 for the alertmanager.allizom.org outage.

It would be good if we had a copy on disk, or in a repo, so that we have something to fall back on if the server is down.
See Also: → 1176802
So what if we had a cron job that polls seta and lands the result in buildbot-configs (only if it gets a 200 with valid data)? The automated reconfigs would then deploy it to the masters on the hour. This would give us an audit trail and a way to roll back the seta config. It would have been helpful when 20k pending jobs turned up today.
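For illustration only, the "200 with valid data" gate could look roughly like the sketch below. It is written in Python 3 rather than the Python 2.7 the masters run, the function name is made up, and treating a non-empty 'jobtypes' mapping as "valid" is an assumption based on the payload shape shown elsewhere in this bug.

```python
import json


def is_valid_seta_payload(status_code, body):
    """Return True only for a 200 response whose body is usable SETA JSON.

    Hypothetical helper: "usable" here means a JSON object with a
    non-empty 'jobtypes' mapping, which is an assumption.
    """
    if status_code != 200:
        return False
    try:
        data = json.loads(body)
    except ValueError:  # covers json.JSONDecodeError
        return False
    if not isinstance(data, dict):
        return False
    jobtypes = data.get("jobtypes")
    return isinstance(jobtypes, dict) and len(jobtypes) > 0
```

A cron job would only commit the fetched file when this returns True, so a 500 page or an empty payload never lands in buildbot-configs.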
Flags: needinfo?(kmoir)
Assignee

Comment 4

4 years ago
Sorry I didn't see this bug until today. And I didn't realize that it had caused problems in the past. I'll look at how to make this more resilient.

Didn't realize alertmanager isn't managed by moc either and that we don't have a nagios alert on it.  Will look into this.
Assignee: nobody → kmoir
Flags: needinfo?(kmoir)
Assignee

Comment 5

4 years ago
Just as an FYI, I checked the logs on a Linux test master yesterday.
A reconfig occurred at 17:00:
2015-08-31 17:00:01 - INFO  - Checking whether we need to reconfig...
2015-08-31 17:00:05 - INFO  - buildbotcustom: production-0.8 tag has moved - old rev: b2ecb14104783c60f9a4276f3031213aa7634a9a; new rev: 730073773a8c85759a7c592f7287e885845052f6
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:08 - INFO  - buildbot-configs: production tag has moved - old rev: 7d72b9711f492a4c99fc84dd97ef0c76ed0ebec1; new rev: 84ef93b01b62774d9066908176fbf8103b5f8971
5 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:11 - INFO  - tools: default tag has moved - old rev: df4e897f9bc7fb877dfcacf19d150543af619589; new rev: 487bb16f9bdfd8c05301692e4de3e2b1a2a105cc
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:11 - INFO  - Starting reconfig. - 1441065601
cd master && /builds/buildbot/tests1-linux/bin/buildbot checkconfig
Config file is good!
2015-08-31 17:01:29 - INFO  - Reconfig completed successfuly. - 1441065601

This corresponds to when the scheduling started skipping again on the scheduling master 
2015-08-31 17:03:48-0700 [-] tests-fx-team-win8_64-debug-unittest-7-3600: skipping with 1/7 important changes since only 79/3600s have elapsed
2015-08-31 17:05:53-0700 [-] tests-fx-team-win8_64-debug-unittest-7-3600: skipping with 1/7 important changes since only 204/3600s have elapsed
2015-08-31 17:05:54-0700 [-] t

Looking at the logs on the scheduling master, no jobs were skipped on Sunday Aug 30 (but I think trees were mostly closed due to downtime).  And no jobs were skipped until 2015-08-31 17:03.  So my suspicion is that when the buildbot masters were brought up after the downtime, the SETA server was unavailable (although I can't see that in the logs) and thus the masters didn't have any SETA data until the reconfig at 17:00 on August 31.
Assignee

Updated

4 years ago
Duplicate of this bug: 1199347
Assignee

Updated

4 years ago
Depends on: 1200838
Assignee

Comment 7

4 years ago
I was working on this for the past few days with another approach that didn't work for various reasons.

I have a Python script that I previously used to generate seta configs, before I changed it to manipulate the BRANCHES config itself. I can use it to generate the config files via cron, and update the configs to load these files instead of changing the branches config.

My question is what credentials should I use to land this in bb-configs - ffxbld?  Are there other examples of scripts that we use to land content in bbconfigs besides tagging for releases etc?
Flags: needinfo?(nthomas)
We also land into gecko repos in http://hg.mozilla.org/build/tools/file/default/scripts/periodic_file_updates/periodic_file_updates.sh. That and tagging use the ffxbld ssh key to push to ssh://hg.m.o.

We could implement this as a cron that runs on bm81, or schedule jobs on the builders themselves. Either way you'd have access to the key you need.
Flags: needinfo?(nthomas)
not sure if this is the place to put this but travis says 15 test masters are failing right now:

```
2015-09-16 21:45:30,881 - Couldn't load test-output/bm109-tests1-windows/master.cfg
Traceback (most recent call last):
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 111, in dump_master
    c = loadMaster(path)
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 26, in loadMaster
    execfile(path, g, g)
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/master.cfg", line 10, in <module>
    import config
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config.py", line 2152, in <module>
    loadSkipConfig(BRANCHES,"desktop")
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 115, in loadSkipConfig
    define_configs(b, platforms, BRANCHES)
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 98, in define_configs
    platform = seta_platforms[p][0]
KeyError: 'Windows 8'

Traceback (most recent call last):
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 165, in <module>
    main()
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 146, in main
    dump = dump_master(args.masters[0])
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 111, in dump_master
    c = loadMaster(path)
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 26, in loadMaster
    execfile(path, g, g)
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/master.cfg", line 10, in <module>
    import config
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config.py", line 2152, in <module>
    loadSkipConfig(BRANCHES,"desktop")
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 115, in loadSkipConfig
    define_configs(b, platforms, BRANCHES)
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 98, in define_configs
    platform = seta_platforms[p][0]
KeyError: 'Windows 8'
```


taking a look, it seems like http://alertmanager.allizom.org/data/setadetails/?date=2015-09-16&buildbot=1&branch=mozilla-inbound&inactive=1 today is including a key 'Windows 8' that I guess https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config_seta.py?offset=1800#14 is not happy about.

In [30]: url = "http://alertmanager.allizom.org/data/setadetails/?date=" + today + "&buildbot=1&branch=" + 'fx-team' + "&inactive=1"

In [31]: content = json.load(urllib2.urlopen(url))

In [32]: c = {}

In [33]: c['jobtypes'] = content['jobtypes']

# copy code from config_seta.py
In [34]:     for p in c['jobtypes'][today]:
            platform = ' '.join(p.encode('utf-8').split()[0:-4])

            if platform not in platforms:
                    platforms.append(platform)
   ....:

In [35]: platforms
Out[35]:
['Windows 8 64-bit',
 'Windows 8',
 'Rev5 MacOSX Yosemite 10.10',
 'Rev5 MacOSX Yosemite',
 'Ubuntu VM 12.04',
 'Ubuntu VM',
 'Rev4 MacOSX Snow Leopard 10.6',
 'Windows 7',
 'Windows 7 32-bit',
 'Ubuntu VM 12.04 x64',
 'Ubuntu ASAN VM 12.04 x64',
 'Windows XP 32-bit',
 'Windows XP',
 'android-2-3-armv7-api9',
 'android-4-3-armv7-api11']
let me know if I can change SETA at all.
We're wondering if SETA changed today, either code, data format, or data itself, and if that might be causing this.
For example, did we add talos jobs like 'Windows 8 64-bit fx-team talos g2-e10s' for the first time today ?

There's code which does 
  platform = ' '.join(p.encode('utf-8').split()[0:-4])
which turns unittest style jobs into 'Windows 8 64-bit', but talos into 'Windows 8', and the latter isn't in seta_platforms at http://hg.mozilla.org/build/buildbot-configs/annotate/c71f4b72e5db/mozilla-tests/config_seta.py#l14
Flags: needinfo?(jmaher)
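To make the slicing concrete, here is a small sketch of what that line does to the two kinds of builder name. The talos name is the one quoted above; the unittest-style name is only an illustrative guess at the usual "four trailing words" shape, and the helper name is made up. It is Python 3, so the `.encode('utf-8')` from the original is dropped.

```python
def platform_from_builder(name):
    """Derive the platform by dropping the last four words of a builder name.

    Same slicing as the config_seta.py line quoted above.
    """
    return ' '.join(name.split()[0:-4])


# Builder name quoted in this comment (talos: only three trailing words).
talos = 'Windows 8 64-bit fx-team talos g2-e10s'
# Illustrative unittest-style name (assumption): four trailing words.
unittest_style = 'Windows 8 64-bit fx-team debug test mochitest-1'

print(platform_from_builder(unittest_style))  # Windows 8 64-bit
print(platform_from_builder(talos))           # Windows 8
```

Because talos names have one word fewer after the platform, the fixed `[0:-4]` slice eats the last word of the platform itself, which is where the unexpected 'Windows 8' key comes from.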
This reduces the set of platforms to
['Windows 8 64-bit',
 'Rev5 MacOSX Yosemite 10.10',
 'Ubuntu VM 12.04',
 'Rev4 MacOSX Snow Leopard 10.6',
 'Windows 7 32-bit',
 'Ubuntu VM 12.04 x64',
 'Ubuntu ASAN VM 12.04 x64',
 'Windows XP 32-bit',
 'android-2-3-armv7-api9',
 'android-4-3-armv7-api11']
which are all in seta_platforms.

This will fix up master reconfigs, regardless of whether talos is meant to be in there or not (seems odd to me, but I'm lacking SETA context). It's obviously just as fragile as what it replaces, so consider it a short-term hack.
Attachment #8662124 - Flags: review?(jlund)
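The attachment itself isn't reproduced in this thread, so the following is only a guess at the shape of such a filter: derive the platform as before, then keep it only if it is a key of seta_platforms. The function name and the sample seta_platforms value are placeholders, not the real config.

```python
def known_platforms(builder_names, seta_platforms):
    """Return derived platforms that exist in seta_platforms, in first-seen order.

    Sketch under assumptions: names that slice to an unknown platform
    (e.g. talos builders) are silently ignored.
    """
    platforms = []
    for name in builder_names:
        platform = ' '.join(name.split()[0:-4])
        if platform in seta_platforms and platform not in platforms:
            platforms.append(platform)
    return platforms


# Placeholder mapping; the real one lives in config_seta.py.
example_seta_platforms = {'Windows 8 64-bit': ['placeholder-value']}
example_builders = [
    'Windows 8 64-bit fx-team debug test mochitest-1',  # illustrative name
    'Windows 8 64-bit fx-team talos g2-e10s',           # quoted earlier in this bug
]
print(known_platforms(example_builders, example_seta_platforms))
```

This keeps reconfigs working when SETA starts publishing a new kind of builder, at the cost of quietly dropping it, which is why it is only a short-term hack.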
Assignee

Updated

4 years ago
Attachment #8662124 - Flags: review+
Comment on attachment 8662124 [details] [diff] [review]
[buildbot-configs] Handle talos builders

Review of attachment 8662124 [details] [diff] [review]:
-----------------------------------------------------------------

makes sense. r+
Attachment #8662124 - Flags: review?(jlund) → review+
Yes, we added talos to it in preparation for upcoming work.  This is easy to revert on the SETA side. Quite possibly we need both a live and a staging API that SETA publishes?
Flags: needinfo?(jmaher)
Assignee

Updated

4 years ago
Severity: critical → major
Depends on: 1208292
We had bm121 hang in a reconfig today, it was waiting on a socket to the SETA server. We could mitigate that by adding at
  http://hg.mozilla.org/build/buildbot-configs/file/fc34e111082d/mozilla-tests/config_seta.py#l31
bm67 got itself into a funk today while reconfiguring, hitting socket errors while reading the seta config

twistd.log:
226118 2016-01-04 15:02:56-0800 [-] generic exception: Traceback (most recent call last):
226119 2016-01-04 15:02:56-0800 [-]   File "/builds/buildbot/tests1-linux64/master/config_seta.py", line 45, in get_seta_platforms
226120 2016-01-04 15:02:56-0800 [-]     response = urllib2.urlopen(url)
226121 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
226122 2016-01-04 15:02:56-0800 [-]     return _opener.open(url, data, timeout)
226123 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/urllib2.py", line 400, in open
226124 2016-01-04 15:02:56-0800 [-]     response = self._open(req, data)
226125 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/urllib2.py", line 418, in _open
226126 2016-01-04 15:02:56-0800 [-]     '_open', req)
226127 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
226128 2016-01-04 15:02:56-0800 [-]     result = func(*args)
226129 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/urllib2.py", line 1207, in http_open
226130 2016-01-04 15:02:56-0800 [-]     return self.do_open(httplib.HTTPConnection, req)
226131 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/urllib2.py", line 1180, in do_open
226132 2016-01-04 15:02:56-0800 [-]     r = h.getresponse(buffering=True)
226133 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/httplib.py", line 1030, in getresponse
226134 2016-01-04 15:02:56-0800 [-]     response.begin()
226135 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/httplib.py", line 407, in begin
226136 2016-01-04 15:02:56-0800 [-]     version, status, reason = self._read_status()
226137 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/httplib.py", line 365, in _read_status
226138 2016-01-04 15:02:56-0800 [-]     line = self.fp.readline()
226139 2016-01-04 15:02:56-0800 [-]   File "/tools/python27/lib/python2.7/socket.py", line 447, in readline
226140 2016-01-04 15:02:56-0800 [-]     data = self._sock.recv(self._rbufsize)
226141 2016-01-04 15:02:56-0800 [-] error: [Errno 104] Connection reset by peer
226142 2016-01-04 15:02:56-0800 [-]
226143 2016-01-04 15:02:56-0800 [-] error while parsing config file
226144 2016-01-04 15:02:56-0800 [-] error during loadConfig
226145 2016-01-04 15:02:56-0800 [-] Unhandled Error

irc log:
17:56:54 <hwine> KWierso|afk: looks like issues, but not mine - I'll find someone
18:00:21 <relengbot> [sns alert] Mon 18:01:07 PST buildbot-master67.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
18:03:38 — jlund looks
18:07:03 <jlund> I wonder if kim and releaserunner's reconfig requests walked over each other
18:10:16 <jlund> hmm, nope. kim's finished in minutes. this looks like the 43.0.4 releaserunner reconfig never finished:
18:10:19 <jlund> [cltbld@buildbot-master67.bb.releng.use1.mozilla.com tests1-linux64]$ grep -r 'configuration' master/twistd.log.1 master/twistd.log
18:10:19 <jlund> master/twistd.log.1:2016-01-04 11:03:52-0800 [-] configuration update started
18:10:19 <jlund> master/twistd.log.1:2016-01-04 11:05:57-0800 [-] configuration update complete
18:10:19 <jlund> master/twistd.log:2016-01-04 15:00:31-0800 [-] loading configuration from /builds/buildbot/tests1-linux64/master/master.cfg
18:11:41 <jlund> 226190 2016-01-04 15:02:56-0800 [-] The new config file is unusable, so I'll ignore it.
18:11:41 <jlund> 226191 2016-01-04 15:02:56-0800 [-] I will keep using the previous config file instead.
18:13:00 <jlund> I think seta raised an error while reading the config:
18:13:01 <jlund> 2016-01-04 15:02:56-0800 [-]   File "/builds/buildbot/tests1-linux64/master/config_seta.py", line 45, in get_seta_platforms
18:13:13 <jlund> 2016-01-04 15:02:56-0800 [-] generic exception: Traceback
18:13:51 <jlund> lost a socket connection. probably needs a beefier retry logic
18:14:12 <jlund> I'll try running a reconfig again since it looks like buildbot gave up trying and got itself out of sync with the rest of the masters
18:15:14 — hwine thanks jlund and goes to update docs with bad information
18:16:55 <jlund> np
18:19:57 <jlund> looks like stacktraces have stopped after:
18:19:59 <jlund> master/twistd.log:2016-01-04 18:18:07-0800 [-] configuration update complete
Flags: needinfo?(kmoir)
also, for aid in updating docs, I'll brain dump what I did here via history cmds

  635  cd /builds/buildbot/tests1-linux64/

  642  grep -r 'configuration' master/twistd.log.1 master/twistd.log
master/twistd.log.1:2016-01-04 11:03:52-0800 [-] configuration update started
master/twistd.log.1:2016-01-04 11:05:57-0800 [-] configuration update complete
master/twistd.log:2016-01-04 15:00:31-0800 [-] loading configuration from /builds

  643  vim master/twistd.log  # found out why 15:00 reconfig never finished via log output in comment 17

  644  source bin/activate  # use buildbot venv

  645  make checkconfig  # check if current repos look good
  # also good to check buildbot-configs and custom are up to date

  646  rm reconfig.lock # safe bc now we know that the reconfig gave up after log line "The new config file is unusable, so I'll ignore it."

  647  make reconfig  # reconfig to get master in sync with other masters
I think the best thing to do here is have a .json file checked into the tree which we can pull from.  In that regard we would need to adjust the seta tools slightly since the .json file has hardcoded branch names in it.  I could land the SETA change in tree when needed and the reconfig will pick up whenever it is done, even if there are no changes.
Assignee

Comment 20

4 years ago
I revisited my patches to update the configs in tree/buildbot-configs today, so that we don't have problems with reconfigs if the seta server is unavailable. Making progress.
Flags: needinfo?(kmoir)
We've been getting this occasionally:

2016-03-21 18:00:52-0700 [-] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/application/app.py", line 311, in runReactorWithLogging
            reactor.run()
          File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 1165, in run
            self.mainLoop()
          File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 1174, in mainLoop
            self.runUntilCurrent()
          File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 796, in runUntilCurrent
            call.func(*call.args, **call.kw)
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/master.py", line 628, in loadTheConfigFile
            d = self.loadConfig(f)
          File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
            exec f in localDict
          File "/builds/buildbot/tests1-linux32/master/master.cfg", line 21, in <module>
            reload(mobile_config)
          File "/builds/buildbot/tests1-linux32/master/mobile_config.py", line 3014, in <module>
            loadSkipConfig(BRANCHES, "mobile")
          File "/builds/buildbot/tests1-linux32/master/config_seta.py", line 145, in loadSkipConfig
            platforms = get_seta_platforms(b, platform_filter)
          File "/builds/buildbot/tests1-linux32/master/config_seta.py", line 70, in get_seta_platforms
            data = json.loads(response.read())
          File "/tools/python27/lib/python2.7/socket.py", line 351, in read
            data = self._sock.recv(rbufsize)
          File "/tools/python27/lib/python2.7/httplib.py", line 561, in read
            s = self.fp.read(amt)
          File "/tools/python27/lib/python2.7/socket.py", line 380, in read
            data = self._sock.recv(left)
        socket.error: [Errno 104] Connection reset by peer
2016-03-21 18:00:52-0700 [-] The new config file is unusable, so I'll ignore it.
2016-03-21 18:00:52-0700 [-] I will keep using the previous config file instead.
In Q2 we are looking at moving SETA to Heroku to make it more reliable.  In addition, work will be done to make SETA data useful for taskcluster; ideally the target state is for SETA to live in taskcluster and work more smoothly there.

If there is something we could do outside of moving SETA to Heroku, please ask for it either here or in a new bug.
Assignee

Comment 23

3 years ago
Joel: is there a bug tracking the move of seta data to Heroku?
Flags: needinfo?(jmaher)
and we do have a bug 1253020 just for that!
Flags: needinfo?(jmaher)
Do you have a rough timeline for that move? In the meantime we may need to add a retry to the buildbot code, or beef up the SETA AWS instance to handle more concurrent connections.
Turns out we have retries, but they're not catching the socket.error exceptions. Bug 1259325 to fix that up. Need to do that because a failing reconfig leaves the master in a state where it burns builds in a few seconds, and they stay permapending on treeherder.
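As a rough sketch of what "retries that also catch socket errors" means (not the actual patch from bug 1259325, which isn't shown here): the helper below retries on both URL-level and socket-level failures. It is Python 3, where `urllib2` became `urllib.request`/`urllib.error` and `socket.error` is an alias of `OSError`; the function name and parameters are made up.

```python
import socket
import time
import urllib.error


def fetch_with_retries(fetch, attempts=3, delay=0):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable, e.g. a wrapper around urlopen().
    Both HTTP/URL errors and raw socket errors (such as
    "Connection reset by peer") trigger a retry.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except (urllib.error.URLError, socket.error) as e:
            last_error = e
            if attempt + 1 < attempts:
                time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

On Python 2.7 the two exception types are unrelated, which is how a retry loop written around urllib2's own errors can still let socket.error escape.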
Blocks: 1264618
SETA is returning consistent 500 errors, which blocks reconfigs on test masters (repos get updated, checkconfig fails, reconfig never happens). IIRC they don't have the previous config to fall back on.
Assignee

Comment 28

3 years ago
The seta server is available again. Looking at the logs on a test master, it appears it was an intermittent error.
Depends on: 1286358
$ curl -I "http://alertmanager.allizom.org/data/setadetails/?date=2016-08-08&buildbot=1&branch=mozilla-inbound&inactive=1"
HTTP/1.1 500 Internal Server Error
Date: Tue, 09 Aug 2016 02:19:21 GMT
Server: Apache/2.4.7 (Ubuntu)
Connection: close
Content-Type: text/html; charset=iso-8859-1

Looks pretty consistent, and I would guess for 5+ hours based on some stuck reconfigs.
Assignee

Comment 30

3 years ago
Have a script in progress to wget seta data, save it locally and push to bbconfigs for storage. Each branch we run seta on needs a different copy of the data, working on an elegant way to store that now. My plan is to run this script via cron from the scheduling test master once a day (config in puppet), since seta is only updated once a day.  I'm thinking that we would just land the data on default because merging to production would trigger an unnecessary reconfig.
(In reply to Kim Moir [:kmoir] from comment #30)
> Have a script in progress to wget seta data, save it locally and push to
> bbconfigs for storage. Each branch we run seta on needs a different copy of
> the data, working on an elegant way to store that now. My plan is to run
> this script via cron from the scheduling test master once a day, (config in
> puppet) since seta is only updated once a day.  I'm thinking that we would
> just land the data on default because merging to production would trigger an
> unnecessary reconfig.

I'm confused about the approach here. Is the thing we're storing in buildbot-configs an ultimate fallback if we can't find anything else?

Here's what I envisage as the SETA process fallthrough:

* buildbot master starts or reconfig is triggered
** download new SETA data from server
*** if server unreachable, fall back to existing SETA data from last run (triggers warning)
**** if no previous run, fall back to archived copy of SETA data in buildbot-configs (triggers louder warning)

We'd run Kim's script on some reasonable cadence to refresh the in-tree SETA data.

Is this accurate?
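The fallthrough above could be sketched roughly as follows. This is an illustration under assumptions, not the eventual patch: it is Python 3, the function and path names are hypothetical, and the standard `logging` module stands in for whatever the master would really log to.

```python
import json
import logging
import os

log = logging.getLogger("seta")


def load_seta_data(fetch, last_run_path, archive_path):
    """Return SETA data from the server, else the last run, else the archive.

    fetch is a zero-argument callable that returns parsed SETA data and
    raises on any failure. Both paths are hypothetical file locations.
    """
    try:
        data = fetch()
    except Exception as e:  # broad on purpose: any failure falls through
        log.warning("SETA server unreachable (%s); trying last-run data", e)
    else:
        # Remember this good copy so the next run has something to fall back on.
        with open(last_run_path, "w") as f:
            json.dump(data, f)
        return data

    if os.path.exists(last_run_path):
        with open(last_run_path) as f:
            return json.load(f)

    # The "louder warning" case: nothing from a previous run either.
    log.error("No last-run SETA data; using archived copy %s", archive_path)
    with open(archive_path) as f:
        return json.load(f)
```

If even the archived copy is missing, the final open() raises, which matches today's behaviour of failing the reconfig rather than running with no SETA data at all.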

Comment 32

3 years ago
Side topic, as part of the SETA re-write I'm doing (bug 1306709), I hope to upload SETA information to an S3 archive.
Assignee

Comment 33

3 years ago
coop: 

Yes, this is the same approach I had envisioned.  I'll look at this bug again now that the nightly tcmigration stuff on my plate is mostly done.
Assignee

Comment 34

3 years ago
Posted patch wip patches (obsolete) — Splinter Review
Assignee

Comment 35

3 years ago
Posted patch wip patches (obsolete) — Splinter Review
Assignee

Comment 36

3 years ago
Alin: is buildbot-master69.bb.releng.use1.mozilla.com still available for testing?  I'd like to test these patches on it via my puppet testing instance. I see that it is up but not running jobs.
Flags: needinfo?(aselagea)
(In reply to Kim Moir [:kmoir] from comment #36)
> Alin: is buildbot-master69.bb.releng.use1.mozilla.com still available for
> testing?  I'd like to test these patches on it via my puppet testing
> instance. I see that it is up but not running jobs.

Yes, it's still available.
Flags: needinfo?(aselagea)
Assignee

Comment 38

3 years ago
Posted file bug1176784puppet.patch (obsolete) —
Attachment #8809512 - Attachment is obsolete: true
Assignee

Comment 39

3 years ago
Posted patch bug1176784tools.patch (obsolete) — Splinter Review
Attachment #8809513 - Attachment is obsolete: true
Assignee

Comment 40

3 years ago
Posted patch bug1176784puppet2.patch (obsolete) — Splinter Review
Attachment #8810667 - Attachment is obsolete: true
Assignee

Comment 41

3 years ago
Posted patch bug1176784puppet2.patch (obsolete) — Splinter Review
Attachment #8810951 - Attachment is obsolete: true
Comment on attachment 8810942 [details] [diff] [review]
bug1176784tools.patch

>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
...
>+        with open(temp_file, 'wt') as f:
>+            json.dump(data, f)

How would you feel about adding 'indent=4' to the dump(), and also sorting the list of builders ? That would give us nice diffs in hg to track changes over time.
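A minimal sketch of that suggestion, assuming the data has the shape seen earlier in this bug (a 'jobtypes' mapping from date to a list of builder names); the helper name is made up and this is not the code from the patch:

```python
import json


def serialize_seta(data):
    """Serialize SETA data so that hg diffs stay small and readable.

    Builder lists are sorted and keys are emitted in a stable order, so a
    reordering on the server side does not show up as a change.
    """
    normalized = dict(data)
    jobtypes = data.get("jobtypes") or {}
    normalized["jobtypes"] = {
        day: sorted(builders) for day, builders in jobtypes.items()
    }
    return json.dumps(normalized, indent=4, sort_keys=True) + "\n"
```

The script would then write `serialize_seta(data)` to the temp file instead of calling `json.dump(data, f)` directly.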
Assignee

Comment 43

3 years ago
good suggestion nick, I'll do that and update the patch
Assignee

Comment 44

3 years ago
Posted patch bug1176784bb.patch (obsolete) — Splinter Review
Assignee

Comment 45

3 years ago
Posted patch bug1176784tools2.patch (obsolete) — Splinter Review
Attachment #8810942 - Attachment is obsolete: true
Assignee

Updated

3 years ago
Attachment #8811020 - Flags: feedback?(nthomas)
Comment on attachment 8811020 [details] [diff] [review]
bug1176784tools2.patch

>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
>+# main
>+today = date.today().strftime("%Y-%m-%d")
>+remote = "ssh://hg.mozilla.org/build/buildbot-configs"
>+ssh_key = "/home/cltbld/.ssh/ffxbuild_rsa"
>+ssh_username = "ffxbuild"

A little safety net until this is ready for primetime ?

>+revision = "default"
>+localrepo = "/tmp/buildbot-configs"
>+configs_path = localrepo + "/mozilla-tests/"
>+msg = "updating seta data for " + today

msg seems unused, there's something very similar in the commit() though.

>+if os.path.exists(localrepo):
>+    shutil.rmtree(localrepo)
>+os.mkdir(localrepo)
>+clone(remote, localrepo,revision)

purge() is an alternative for an existing repo, but for buildbot-configs it doesn't really matter.

>+#assume data could not be fetched
>+status = False
>+for branch in seta_branches:
>+    status = update_seta_data(branch, configs_path)
>+    if status:
>+    #add files
>+        cmd = ['hg', 'add', '.']
>+    #commit files
>+value = run_cmd(cmd, cwd=localrepo)

So we'll have the .old file committed in the repo ? Does that give us something over the hg history of the main file ? hg add is a bit dangerous for committing temp files by accident.
Attachment #8811020 - Flags: feedback?(nthomas) → feedback+
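On the `hg add .` concern, one option is to name the expected files explicitly. The sketch below only builds the command list; the helper name is made up, and the `<branch>-seta.json` naming under mozilla-tests/ is taken from the patches quoted in this bug.

```python
def hg_add_command(branches, configs_dir="mozilla-tests"):
    """Build an 'hg add' command that names only the expected SETA files."""
    if not branches:
        # A bare 'hg add' would add every untracked file, so refuse.
        raise ValueError("no branches given")
    paths = ["%s/%s-seta.json" % (configs_dir, b) for b in sorted(branches)]
    return ["hg", "add"] + paths


print(hg_add_command(["mozilla-inbound", "fx-team"]))
```

The resulting list can be passed to whatever command runner the script already uses, and stray temp or .old files in the checkout are never picked up.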
Comment on attachment 8811015 [details] [diff] [review]
bug1176784bb.patch

>diff --git a/mozilla-tests/config_seta.py b/mozilla-tests/config_seta.py
--- a/mozilla-tests/config_seta.py
+++ b/mozilla-tests/config_seta.py
@@ -72,7 +72,14 @@ def get_seta_platforms(branch, platform_
...
>     c['jobtypes'] = data.get('jobtypes', None)
>     platforms = []
>     for p in c['jobtypes'][today]:

This last line could be a problem when we fall back to the in-repo data. today may well not match the date when the data was cached.
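One way to avoid the date mismatch, sketched under the assumption that 'jobtypes' is keyed by "%Y-%m-%d" date strings as in the code above (the helper name is hypothetical): use today's entry when present, otherwise the most recent date in the data.

```python
def builders_for(jobtypes, today):
    """Return the builder list for today, or for the newest cached date.

    jobtypes maps "YYYY-MM-DD" strings to lists of builder names. ISO
    dates sort lexicographically, so max() picks the latest one.
    """
    if today in jobtypes:
        return jobtypes[today]
    if not jobtypes:
        return []
    return jobtypes[max(jobtypes)]
```

With this, `for p in builders_for(c['jobtypes'], today):` works for both live data and an in-repo copy cached on an earlier day.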
Assignee

Comment 48

3 years ago
Posted patch bug1176784tools3.patch (obsolete) — Splinter Review
Attachment #8811020 - Attachment is obsolete: true
Assignee

Comment 49

3 years ago
Posted patch bug1176784bb2.patch (obsolete) — Splinter Review
Attachment #8811015 - Attachment is obsolete: true
Attachment #8811919 - Flags: review?(nthomas)
Assignee

Updated

3 years ago
Attachment #8811794 - Flags: review?(nthomas)
Assignee

Updated

3 years ago
Attachment #8810961 - Flags: feedback?(nthomas)
Comment on attachment 8811919 [details] [diff] [review]
bug1176784bb2.patch

>diff --git a/mozilla-tests/config_seta.py b/mozilla-tests/config_seta.py
>@@ -71,9 +70,21 @@ def get_seta_platforms(branch, platform_
>     if os.environ.get('DISABLE_SETA'):
>         return []
>
>+    global today
>+    today = date.today().strftime("%Y-%m-%d")

Doesn't look like this needs to be a global.

>+    if data == "":
>+        path = os.path.join(path, branch + "-seta.json")
>+        with open(path, 'r') as f:
>+            data = json.load(f)

Please add a log message when we fallback to disk data. We're not pumping the master logs into papertrail so there's no easy alerting, but a message in twistd.log would be useful if the sheriffs get in touch about SETA behaving strangely.
Attachment #8811919 - Flags: review?(nthomas) → review+
Comment on attachment 8811794 [details] [diff] [review]
bug1176784tools3.patch

>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
...
+sys.path.append("/builds/buildbot/tests_scheduler/tools/lib/python")

We often do this relative to the script, to make it more portable, e.g.
sys.path.append(os.path.join(os.path.dirname(__file__), "../../lib/python"))

>+        backup_file = configs_path + branch  + "-seta.json.old"

No longer used ?

>+ssh_key = "/home/cltbld/.ssh/ffxbuild_rsa"
>+ssh_username = "ffxbuild"

s/ffxbuild/ffxbld/ please.

>+revision = "default"

If we merge infrequently from default to production, will the seta config changes be picked up "soon enough" ?

>+clone(remote, localrepo,revision)

Nit, whitespace missing.

>+#assume data could not be fetched
>+status = False
>+for branch in seta_branches:
>+    status = update_seta_data(branch, configs_path)

Did you mean to have a status check here ?
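As written, status is overwritten on every pass, so only the last branch counts. A possible shape for the check, as a sketch: update_all is a hypothetical name, and update_one stands in for update_seta_data(branch, configs_path), which I'm assuming returns something truthy on success.

```python
def update_all(branches, update_one):
    """Run update_one for each branch and return the branches that failed.

    Hypothetical helper; update_one(branch) is assumed to return a truthy
    value when the SETA data for that branch was fetched and written.
    """
    failed = []
    for branch in branches:
        if not update_one(branch):
            failed.append(branch)
    return failed
```

The caller could then skip the commit and push when every branch failed, and log the failed branches otherwise.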

>+revision = commit(localrepo, msg, user=ssh_username)
>+push_cmd = ['hg', 'push']
>+push_value = run_cmd(push_cmd, cwd=localrepo)

You've imported push from util.hg, did it not work out ?
Attachment #8811794 - Flags: review?(nthomas) → review+
Comment on attachment 8810961 [details] [diff] [review]
bug1176784puppet2.patch

This looks plausible, but a puppet expert I'm not.

>diff --git a/modules/buildmaster/manifests/seta_update.pp b/modules/buildmaster/manifests/seta_update.pp
>+# this class manages cleanup functionality for the buildbot databases

Showing its origin there.

>+    python::virtualenv {
>+        "$seta_update_dir":
>+            python  => "${packages::mozilla::python27::python}",
>+            require => Class['packages::mozilla::python27'],
>+            user    => "${users::builder::username}",
>+            group   => "${users::builder::group}",
>+            packages => [
>+                "SQLAlchemy==0.7.9",
>+                "MySQL-python==1.2.3",

Are these packages really required ?

>diff --git a/modules/buildmaster/templates/buildmaster-seta-update.erb b/modules/buildmaster/templates/buildmaster-seta-update.erb
>+@weekly <%=scope.lookupvar('users::builder::username')%> <%=@seta_update_dir%>/bin/python /builds/buildbot/tests_scheduler/tools/buildfarm/maintenance/update_seta.py -l <%=@seta_update_dir%>/update-seta.log

You have a tools checkout in <%=@seta_update_dir%>, so should use that I think. Does anything keep that up to date, or do we deploy changes with a manual hg pull ?

I like the idea of a log, but no sign of -l argument in the script.
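For the script side, something along these lines would make the -l in the crontab entry meaningful. This is a sketch, not the actual script: parse_args and the --logfile long option are names I've picked for illustration.

```python
import logging
from optparse import OptionParser


def parse_args(argv):
    """Parse command-line options; only -l/--logfile and -v are defined here."""
    parser = OptionParser()
    parser.set_defaults(loglevel=logging.INFO, logfile=None)
    parser.add_option("-l", "--logfile", dest="logfile",
                      help="append log output to this file")
    parser.add_option("-v", "--verbose", dest="loglevel",
                      action="store_const", const=logging.DEBUG,
                      help="log at DEBUG level")
    options, _args = parser.parse_args(argv)
    return options
```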
Attachment #8810961 - Flags: feedback?(nthomas) → feedback+
Assignee

Comment 53

3 years ago
I made the variable "today" a global because if the local seta server is not available, its value is overwritten by the date value in the local copy of the seta data in the json file. If this is the case, then we need to use this value in the other methods, so I thought this was a better approach.
Attachment #8811919 - Attachment is obsolete: true
Assignee

Comment 54

3 years ago
Posted patch bug1176784tools4.patch (obsolete) — Splinter Review
Attachment #8811794 - Attachment is obsolete: true
Assignee

Comment 55

3 years ago
Posted patch bug1176784puppet3.patch (obsolete) — Splinter Review
Attachment #8810961 - Attachment is obsolete: true
Assignee

Comment 56

3 years ago
Comment on attachment 8813281 [details] [diff] [review]
bug1176784tools4.patch

r? on this since I only asked for f? before
Attachment #8813281 - Flags: review?(nthomas)
(In reply to Kim Moir [:kmoir] from comment #53)
> I made the variable "today" a global because if the local seta server is not
> available, its value is overwritten by the date value in the local copy of
> the seta data in the json file. If this is the case, then we need to use
> this value in the other methods, so I thought this was a better approach.

Ah, I see what you mean. Another approach would be to strip the date part out, and only work with data['jobtypes'][<date>].
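To illustrate, a sketch of stripping the date level when the data is cached, so that nothing downstream needs to know which day it came from. flatten_jobtypes is a hypothetical name, and picking the newest date is my assumption for the case where the response carries more than one.

```python
def flatten_jobtypes(data):
    """Drop the date level: {'jobtypes': {date: jobs}} -> {'jobtypes': jobs}.

    Hypothetical helper; if several dates are present, the newest one wins.
    """
    by_date = data.get('jobtypes') or {}
    if not by_date:
        return {'jobtypes': []}
    newest = max(by_date)  # ISO dates sort chronologically as strings
    return {'jobtypes': list(by_date[newest])}
```

Readers of the cached file would then iterate over data['jobtypes'] directly, and "today" would not need to be shared between functions.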
Comment on attachment 8813281 [details] [diff] [review]
bug1176784tools4.patch

>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
>+import logging
>+log = logging.getLogger(__name__)

I'm a convert to sending logs into papertrail. You could have a look at aws_watch_pending.py for an example of setting up a StreamHandler to send output to stdout/stderr, while also keeping a local log. Then you can append ' 2>&1 | logger -t update_seta' when you call the python script.
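The general pattern looks something like the following. This is a sketch from memory of the stdlib logging API rather than a copy of aws_watch_pending.py; setup_logging and the format string are illustrative choices.

```python
import logging
import sys


def setup_logging(logfile=None, loglevel=logging.INFO):
    """Log to stdout, and additionally to logfile when one is given."""
    root = logging.getLogger()
    root.setLevel(loglevel)
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

    # stdout copy: this is what '| logger -t update_seta' would pick up
    stream = logging.StreamHandler(sys.stdout)
    stream.setFormatter(formatter)
    root.addHandler(stream)

    # optional local copy on disk
    if logfile:
        file_handler = logging.FileHandler(logfile)
        file_handler.setFormatter(formatter)
        root.addHandler(file_handler)
    return root
```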

>+        except socket.error, e:
>+            log.warming("Socket error when accessing %s: %s" % (url, str(e)))

Nit, warming log, must be on fire ;-)

>+        print("Retrying")

Nit, missed one print -> log.<level> change.

>+    data = wfetch(url)
>+    if data:
>+        #test if data cannot be fetched

Nit, old comment, or slightly in the wrong place?

>+    parser = OptionParser()
>+    parser.set_defaults(
>+        filename=None,
>+        loglevel=logging.INFO,
>+        logfile=None,
>+        skip_orphans=False,

Nit, unused filename and skip_orphans.
Attachment #8813281 - Flags: review?(nthomas) → review+
Assignee

Comment 59

3 years ago
Attachment #8813281 - Attachment is obsolete: true
Assignee

Comment 60

3 years ago
Attachment #8813283 - Attachment is obsolete: true
Assignee

Updated

3 years ago
Attachment #8814977 - Flags: checked-in+
Assignee

Updated

3 years ago
Attachment #8813277 - Flags: checked-in+

Comment 62

2 years ago
Correct if not fixed.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Component: General Automation → General