Closed
Bug 920078
Opened 11 years ago
Closed 11 years ago
Multiple jobs failing with "Error contacting server for clobberer information."
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Unassigned)
References
Details
(Keywords: intermittent-failure, Whiteboard: [buildduty])
Sigh, more infra issues :-( Seem to be all EC2. https://tbpl.mozilla.org/php/getParsedLog.php?id=28287334&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28287337&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28287339&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28287336&tree=Mozilla-Inbound { 08:20:43 INFO - ##### 08:20:43 INFO - ##### Running clobber step. 08:20:43 INFO - ##### 08:20:43 INFO - Running main action method: clobber 08:20:43 INFO - rmtree: /builds/slave/b2g_m-in_emu_dep-0000000000000/build/upload 08:20:43 INFO - rmtree: /builds/slave/b2g_m-in_emu_dep-0000000000000/build/testdata 08:20:43 INFO - Running command: ['/builds/slave/b2g_m-in_emu_dep-0000000000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-t', '168', 'http://clobberer.pvt.build.mozilla.org/index.php', u'mozilla-inbound', u'b2g_mozilla-inbound_emulator_dep', u'b2g_m-in_emu_dep-0000000000000', u'bld-linux64-ec2-154', u'http://buildbot-master61.srv.releng.use1.mozilla.com:8001/'] in /builds/slave 08:20:43 INFO - Copy/paste: /builds/slave/b2g_m-in_emu_dep-0000000000000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -t 168 http://clobberer.pvt.build.mozilla.org/index.php mozilla-inbound b2g_mozilla-inbound_emulator_dep b2g_m-in_emu_dep-0000000000000 bld-linux64-ec2-154 http://buildbot-master61.srv.releng.use1.mozilla.com:8001/ 08:21:14 INFO - Checking clobber URL: http://clobberer.pvt.build.mozilla.org/index.php?master=http%3A%2F%2Fbuildbot-master61.srv.releng.use1.mozilla.com%3A8001%2F&slave=bld-linux64-ec2-154&builddir=b2g_m-in_emu_dep-0000000000000&branch=mozilla-inbound&buildername=b2g_mozilla-inbound_emulator_dep 08:21:14 ERROR - Error contacting server 08:21:14 ERROR - Error contacting server for clobberer information. 08:21:14 ERROR - Return code: 1 08:21:14 FATAL - failed to clobber build 08:21:14 FATAL - Running post_fatal callback... 08:21:14 FATAL - Exiting 2 }
Reporter | ||
Comment 1•11 years ago
|
||
Haven't seen any more since. Would still quite like us to make either the clobberer or our EC2 VPNs more resilient...
Severity: blocker → critical
Comment 2•11 years ago
|
||
This does not seem to be a tree closing issue but intermittent. Adding people who have touched the file. Failure times: 2013-09-24 08:20:37 PDT for push 28316508a37f 2013-09-24 08:20:36 PDT for push 28316508a37f 2013-09-24 08:20:38 PDT for push 28316508a37f 2013-09-24 08:20:40 PDT for push 28316508a37f Hosts that have failed: bld-linux64-ec2-154 bld-linux64-ec2-039 bld-linux64-ec2-037 bld-linux64-ec2-040 The failing code is in here: http://mxr.mozilla.org/build/source/mozharness/external_tools/clobberer.py#141 Should we setup a clobberer host on EC2 and talk to it for EC2 instances? Should we clobber on failure to reach the host? (Very expensive if this was going to happen many times but would be the fail gracefully approach)
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
Comment 3•11 years ago
|
||
We could also retry the request to the clobberer server, like we do have elsewhere in the infra.
Reporter | ||
Comment 5•11 years ago
|
||
Happening again: https://tbpl.mozilla.org/php/getParsedLog.php?id=28458314&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458318&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458324&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458321&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458327&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458331&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458319&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458317&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458328&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458322&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458315&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458311&tree=B2g-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458312&tree=B2g-Inbound
Depends on: 916692
Reporter | ||
Comment 6•11 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=28458313&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458309&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=28458308&tree=Mozilla-Inbound
Reporter | ||
Updated•11 years ago
|
Keywords: intermittent-failure
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 14•11 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=28664735&tree=B2g-Inbound&full=1 is from bld-centos6-hp-008, so this isn't specific to EC2 hosts. Looks like clobberer gets sleepy every morning. Dustin, Amy - any light you can shed on this?
Flags: needinfo?(dustin)
Flags: needinfo?(arich)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 21•11 years ago
|
||
Dustin has already left for the summit and won't be back till the 3rd. I don't know of any changes that were made around the 24th, but webops (the folks who manage the web cluster and load balancer might have a better idea).
Flags: needinfo?(arich)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 40•11 years ago
|
||
Is there any way to get a more detailed error message than "Error contacting server"? For the error in comment 39, I see it in the clobberer logs, with a 200 result: 10.22.81.211 - - [02/Oct/2013:05:24:52 -0700] "GET /index.php?master=http%3A%2F%2Fbuildbot-master58.srv.releng.usw2.mozilla.com%3A8001%2F&slave=bld-linux64-ec2-355&builddir=b2g_b2g-in_unagi_eng_dep-00000&branch=b2g-inbound&buildername=b2g_b2g-inbound_unagi_eng_dep HTTP/1.1" 200 4485 "-" "Python-urllib/2.6" Note, however, that the timings in the log list a 30-second timeout: 05:24:52 INFO - Copy/paste: /builds/slave/b2g_b2g-in_unagi_eng_dep-00000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -t 168 http://clobberer.pvt.build.mozilla.org/index.php b2g-inbound b2g_b2g-inbound_unagi_eng_dep b2g_b2g-in_unagi_eng_dep-00000 bld-linux64-ec2-355 http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/ 05:25:22 INFO - Checking clobber URL: http://clobberer.pvt.build.mozilla.org/index.php?master=http%3A%2F%2Fbuildbot-master58.srv.releng.usw2.mozilla.com%3A8001%2F&slave=bld-linux64-ec2-355&builddir=b2g_b2g-in_unagi_eng_dep-00000&branch=b2g-inbound&buildername=b2g_b2g-inbound_unagi_eng_dep 05:25:22 ERROR - Error contacting server Apache doesn't reflect the request duration in its logs, but if curl or ZLB timed out, it's quite likely Apache would still log the 200 response. So: I think this is most likely a 30-second timeout. If I recall, John landed some clobberer changes in late September - maybe those are related?
Flags: needinfo?(dustin) → needinfo?(jhopkins)
Reporter | ||
Comment 41•11 years ago
|
||
The 30s timeout is set at: https://hg.mozilla.org/build/tools/file/f5eef9660426/clobberer/clobberer.py#l151 ...and the mozharness copy of the same file: https://hg.mozilla.org/build/mozharness/file/76d08a2d15f6/external_tools/clobberer.py#l151
Comment 42•11 years ago
|
||
So, adding better logging of that timeout would be a good idea, and increasing it would probably stop the immediate pain, but HTTP requests really shouldn't be taking that long.
Reporter | ||
Comment 43•11 years ago
|
||
Yeah currently the traceback is only output if the verbose option is set (for some reason?!): https://hg.mozilla.org/build/mozharness/file/76d08a2d15f6/external_tools/clobberer.py#l198
Comment 44•11 years ago
|
||
The patch I added in [1] makes single-slave requests more expensive which could be the root of the problem here - it was optimized for fetching clobberer details for multiple slaves by prefetching other slaves having the same builddir/branch/master which is wasteful in the case when only a single build slave is needed. We could modify the script to pass in a boolean "prefetch" argument to getClobberTime and set it to False when doing a single-slave request. [1] http://hg.mozilla.org/build/tools/rev/1ba62dad0efa
Flags: needinfo?(jhopkins)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Reporter | ||
Comment 46•11 years ago
|
||
This should be fixed by bug 927525.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•6 years ago
|
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•