Multiple jobs failing with "Error contacting server for clobberer information."

RESOLVED FIXED

Status

Release Engineering
Platform Support
--
critical
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: emorley, Unassigned)

Tracking

({intermittent-failure})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [buildduty])

Sigh, more infra issues :-(

Seem to be all EC2.

https://tbpl.mozilla.org/php/getParsedLog.php?id=28287334&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28287337&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28287339&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28287336&tree=Mozilla-Inbound

{
08:20:43     INFO - #####
08:20:43     INFO - ##### Running clobber step.
08:20:43     INFO - #####
08:20:43     INFO - Running main action method: clobber
08:20:43     INFO - rmtree: /builds/slave/b2g_m-in_emu_dep-0000000000000/build/upload
08:20:43     INFO - rmtree: /builds/slave/b2g_m-in_emu_dep-0000000000000/build/testdata
08:20:43     INFO - Running command: ['/builds/slave/b2g_m-in_emu_dep-0000000000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-t', '168', 'http://clobberer.pvt.build.mozilla.org/index.php', u'mozilla-inbound', u'b2g_mozilla-inbound_emulator_dep', u'b2g_m-in_emu_dep-0000000000000', u'bld-linux64-ec2-154', u'http://buildbot-master61.srv.releng.use1.mozilla.com:8001/'] in /builds/slave
08:20:43     INFO - Copy/paste: /builds/slave/b2g_m-in_emu_dep-0000000000000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -t 168 http://clobberer.pvt.build.mozilla.org/index.php mozilla-inbound b2g_mozilla-inbound_emulator_dep b2g_m-in_emu_dep-0000000000000 bld-linux64-ec2-154 http://buildbot-master61.srv.releng.use1.mozilla.com:8001/
08:21:14     INFO -  Checking clobber URL: http://clobberer.pvt.build.mozilla.org/index.php?master=http%3A%2F%2Fbuildbot-master61.srv.releng.use1.mozilla.com%3A8001%2F&slave=bld-linux64-ec2-154&builddir=b2g_m-in_emu_dep-0000000000000&branch=mozilla-inbound&buildername=b2g_mozilla-inbound_emulator_dep
08:21:14    ERROR -  Error contacting server
08:21:14    ERROR -  Error contacting server for clobberer information.
08:21:14    ERROR - Return code: 1
08:21:14    FATAL - failed to clobber build
08:21:14    FATAL - Running post_fatal callback...
08:21:14    FATAL - Exiting 2
}
Haven't seen any more since.

Would still quite like us to make either the clobberer or our EC2 VPNs more resilient...
Severity: blocker → critical

Comment 2

4 years ago
This does not seem to be a tree closing issue but intermittent.

Adding people who have touched the file.

Failure times:
2013-09-24 08:20:37 PDT for push 28316508a37f
2013-09-24 08:20:36 PDT for push 28316508a37f
2013-09-24 08:20:38 PDT for push 28316508a37f
2013-09-24 08:20:40 PDT for push 28316508a37f

Hosts that have failed:
bld-linux64-ec2-154
bld-linux64-ec2-039
bld-linux64-ec2-037
bld-linux64-ec2-040

The failing code is in here:
http://mxr.mozilla.org/build/source/mozharness/external_tools/clobberer.py#141

Should we setup a clobberer host on EC2 and talk to it for EC2 instances?

Should we clobber on failure to reach the host? (Very expensive if this was going to happen many times but would be the fail gracefully approach)
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
We could also retry the request to the clobberer server, like we do have elsewhere in the infra.
adding buildduty to make the guys aware
Whiteboard: [buildduty]
Happening again:
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458314&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458318&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458324&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458321&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458327&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458331&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458319&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458317&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458328&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458322&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458315&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458311&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458312&tree=B2g-Inbound
Depends on: 916692
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458313&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458309&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458308&tree=Mozilla-Inbound
Keywords: intermittent-failure
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
https://tbpl.mozilla.org/php/getParsedLog.php?id=28664735&tree=B2g-Inbound&full=1 is from bld-centos6-hp-008, so this isn't specific to EC2 hosts. Looks like clobberer gets sleepy every morning.

Dustin, Amy - any light you can shed on this?
Flags: needinfo?(dustin)
Flags: needinfo?(arich)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Dustin has already left for the summit and won't be back till the 3rd.  I don't know of any changes that were made around the 24th, but webops (the folks who manage the web cluster and load balancer might have a better idea).
Flags: needinfo?(arich)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Is there any way to get a more detailed error message than "Error contacting server"?

For the error in comment 39, I see it in the clobberer logs, with a 200 result:

10.22.81.211 - - [02/Oct/2013:05:24:52 -0700] "GET /index.php?master=http%3A%2F%2Fbuildbot-master58.srv.releng.usw2.mozilla.com%3A8001%2F&slave=bld-linux64-ec2-355&builddir=b2g_b2g-in_unagi_eng_dep-00000&branch=b2g-inbound&buildername=b2g_b2g-inbound_unagi_eng_dep HTTP/1.1" 200 4485 "-" "Python-urllib/2.6"

Note, however, that the timings in the log list a 30-second timeout:

05:24:52     INFO - Copy/paste: /builds/slave/b2g_b2g-in_unagi_eng_dep-00000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -t 168 http://clobberer.pvt.build.mozilla.org/index.php b2g-inbound b2g_b2g-inbound_unagi_eng_dep b2g_b2g-in_unagi_eng_dep-00000 bld-linux64-ec2-355 http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/
05:25:22     INFO -  Checking clobber URL: http://clobberer.pvt.build.mozilla.org/index.php?master=http%3A%2F%2Fbuildbot-master58.srv.releng.usw2.mozilla.com%3A8001%2F&slave=bld-linux64-ec2-355&builddir=b2g_b2g-in_unagi_eng_dep-00000&branch=b2g-inbound&buildername=b2g_b2g-inbound_unagi_eng_dep
05:25:22    ERROR -  Error contacting server

Apache doesn't reflect the request duration in its logs, but if curl or ZLB timed out, it's quite likely Apache would still log the 200 response.

So: I think this is most likely a 30-second timeout.  If I recall, John landed some clobberer changes in late September - maybe those are related?
Flags: needinfo?(dustin) → needinfo?(jhopkins)
The 30s timeout is set at:
https://hg.mozilla.org/build/tools/file/f5eef9660426/clobberer/clobberer.py#l151
...and the mozharness copy of the same file:
https://hg.mozilla.org/build/mozharness/file/76d08a2d15f6/external_tools/clobberer.py#l151
So, adding better logging of that timeout would be a good idea, and increasing it would probably stop the immediate pain, but HTTP requests really shouldn't be taking that long.
Yeah currently the traceback is only output if the verbose option is set (for some reason?!):
https://hg.mozilla.org/build/mozharness/file/76d08a2d15f6/external_tools/clobberer.py#l198
The patch I added in [1] makes single-slave requests more expensive which could be the root of the problem here - it was optimized for fetching clobberer details for multiple slaves by prefetching other slaves having the same builddir/branch/master which is wasteful in the case when only a single build slave is needed.

We could modify the script to pass in a boolean "prefetch" argument to getClobberTime and set it to False when doing a single-slave request.

[1] http://hg.mozilla.org/build/tools/rev/1ba62dad0efa
Flags: needinfo?(jhopkins)

Updated

4 years ago
Depends on: 927525
Comment hidden (Treeherder Robot)
This should be fixed by bug 927525.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.