Closed Bug 1451193 Opened 7 years ago Closed 7 years ago

Categories

(Infrastructure & Operations :: RelOps: General, task, P5)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure)

Filed by: apavel [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=171739832&repo=mozilla-central
https://queue.taskcluster.net/v1/task/Fjp2yy_XQQe0kxBb4zba9Q/runs/0/artifacts/public/logs/live_backing.log
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/Fjp2yy_XQQe0kxBb4zba9Q/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1

[task 2018-04-03T23:08:57.677Z] 23:08:57 WARNING - URL Error: https://hg.mozilla.org/mozilla-central/raw-file/00bdc9451be6557ccce1492b9b966d4435615380/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest
[task 2018-04-03T23:08:57.678Z] 23:08:57 INFO - retry: attempt #5 caught URLError exception: <urlopen error timed out>
[task 2018-04-03T23:08:57.678Z] 23:08:57 ERROR - Can't download from https://hg.mozilla.org/mozilla-central/raw-file/00bdc9451be6557ccce1492b9b966d4435615380/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest to /builds/worker/workspace/build/.android/releng.manifest!
[task 2018-04-03T23:08:57.679Z] 23:08:57 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-03T23:08:57.679Z] 23:08:57 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-03T23:08:57.679Z] 23:08:57 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-03T23:08:57.679Z] 23:08:57 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-03T23:08:57.680Z] 23:08:57 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-03T23:08:57.680Z] 23:08:57 INFO - Running post-action listener: _resource_record_post_action
[task 2018-04-03T23:08:57.680Z] 23:08:57 INFO - [mozharness: 2018-04-03 23:08:57.680403Z] Finished setup-avds step (failed)
[task 2018-04-03T23:08:57.681Z] 23:08:57 FATAL - Uncaught exception: Traceback (most recent call last):
[task 2018-04-03T23:08:57.682Z] 23:08:57 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 2076, in run
[task 2018-04-03T23:08:57.682Z] 23:08:57 FATAL - self.run_action(action)
[task 2018-04-03T23:08:57.682Z] 23:08:57 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 2015, in run_action
[task 2018-04-03T23:08:57.683Z] 23:08:57 FATAL - self._possibly_run_method(method_name, error_if_missing=True)
[task 2018-04-03T23:08:57.683Z] 23:08:57 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 1955, in _possibly_run_method
[task 2018-04-03T23:08:57.683Z] 23:08:57 FATAL - return getattr(self, method_name)()
[task 2018-04-03T23:08:57.684Z] 23:08:57 FATAL - File "/builds/worker/workspace/mozharness/scripts/android_emulator_unittest.py", line 545, in setup_avds
[task 2018-04-03T23:08:57.684Z] 23:08:57 FATAL - self._tooltool_fetch(url, dirs['abs_avds_dir'])
[task 2018-04-03T23:08:57.685Z] 23:08:57 FATAL - File "/builds/worker/workspace/mozharness/scripts/android_emulator_unittest.py", line 514, in _tooltool_fetch
[task 2018-04-03T23:08:57.685Z] 23:08:57 FATAL - if not os.path.exists(manifest_path):
[task 2018-04-03T23:08:57.685Z] 23:08:57 FATAL - File "/usr/lib/python2.7/genericpath.py", line 26, in exists
[task 2018-04-03T23:08:57.686Z] 23:08:57 FATAL - os.stat(path)
[task 2018-04-03T23:08:57.686Z] 23:08:57 FATAL - TypeError: coercing to Unicode: need string or buffer, NoneType found
[task 2018-04-03T23:08:57.686Z] 23:08:57 FATAL - Running post_fatal callback...
[task 2018-04-03T23:08:57.687Z] 23:08:57 FATAL - Exiting -1
[task 2018-04-03T23:08:57.687Z] 23:08:57 INFO - Running post-run listener: _resource_record_post_run
[task 2018-04-03T23:08:57.687Z] 23:08:57 INFO - Running post-run listener: copy_logs_to_upload_dir
[task 2018-04-03T23:08:57.688Z] 23:08:57 INFO - Copying logs to upload dir...
[task 2018-04-03T23:08:57.688Z] 23:08:57 INFO - mkdir: /builds/worker/workspace/build/upload/logs
[task 2018-04-03T23:08:57.696Z] cleanup
I quarantined https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-xlarge/workers/us-east-1/i-0d64098f0e738e2f7 since it has run two jobs and we've filed two "intermittent" failure bugs from those two runs.
Geoff, can you take a look at this Android issue, please?
Flags: needinfo?(pmoore)
Flags: needinfo?(pmoore) → needinfo?(gbrown)
Android test tasks do a couple of downloads of small files (tooltool manifests -- just a few lines) from hg.mozilla.org. Desktop tests do not require those downloads, which is why this is Android-specific.

I checked several of the failure logs and had no trouble accessing the urls that failed. For example, https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=172940139&lineNumber=676

[task 2018-04-10T19:47:04.606Z] 19:47:04 WARNING - URL Error: https://hg.mozilla.org/integration/autoland/raw-file/8ef553153fc05f7e72bac8df0cc71d909c1578d6/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest
[task 2018-04-10T19:47:04.606Z] 19:47:04 INFO - retry: attempt #5 caught URLError exception: <urlopen error timed out>
[task 2018-04-10T19:47:04.606Z] 19:47:04 ERROR - Can't download from https://hg.mozilla.org/integration/autoland/raw-file/8ef553153fc05f7e72bac8df0cc71d909c1578d6/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest to /builds/worker/workspace/build/.android/releng.manifest!
[task 2018-04-10T19:47:04.607Z] 19:47:04 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-10T19:47:04.607Z] 19:47:04 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-10T19:47:04.607Z] 19:47:04 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-10T19:47:04.607Z] 19:47:04 ERROR - Caught exception: <urlopen error timed out>
[task 2018-04-10T19:47:04.608Z] 19:47:04 ERROR - Caught exception: <urlopen error timed out>

I can access https://hg.mozilla.org/integration/autoland/raw-file/8ef553153fc05f7e72bac8df0cc71d909c1578d6/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest from my laptop without any trouble. Obviously I don't expect 100% reliability from the network, but we seem to be hitting this half a dozen times a day now, with no failures before April 3 -- something has changed for the worse.

:dustin -- Do you have any idea who might be able to investigate the cause of the hg.mozilla.org timeouts?
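For readers unfamiliar with that flow, here is a minimal Python 3 sketch of the retry/download pattern the log shows -- an approximation only, not the mozharness implementation (which ran on Python 2 / urllib2); the URL is the one from the failure log, and the attempt count and sleep are illustrative values:

import time
import urllib.error
import urllib.request

# Rough sketch of the retry/download pattern visible in the logs above: fetch a
# small tooltool manifest with a per-attempt timeout and a handful of retries.
MANIFEST_URL = ("https://hg.mozilla.org/mozilla-central/raw-file/"
                "00bdc9451be6557ccce1492b9b966d4435615380/"
                "testing/config/tooltool-manifests/androidarm_4_3/releng.manifest")

def download_with_retry(url, dest, attempts=5, timeout=30, sleep=5):
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                with open(dest, "wb") as fh:
                    fh.write(resp.read())
            return dest
        except urllib.error.URLError as err:
            print("retry: attempt #%d caught URLError exception: %s" % (attempt, err))
            time.sleep(sleep)
    # When every attempt times out, the caller ends up with no manifest path,
    # which is what produces the TypeError in the traceback above.
    return None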
Flags: needinfo?(gbrown) → needinfo?(dustin)
The reason we've been quarantining is that this isn't about the network at a particular time, or about fetching a particular resource; it's that certain workers get created which consistently hit this on every run until they get quarantined, get killed, or die a natural death.
I would guess the hg operations team?
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #11)
> I would guess the hg operations team?

Isn't that inconsistent with philor's assertion that the problem is isolated to specific worker instances?
Actually, looking more closely at the logs:

[task 2018-04-10T19:30:04.013Z] 19:30:03 INFO - Downloading https://hg.mozilla.org/integration/autoland/raw-file/8ef553153fc05f7e72bac8df0cc71d909c1578d6/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest to /builds/worker/workspace/build/.android/releng.manifest
[task 2018-04-10T19:30:04.013Z] 19:30:03 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://hg.mozilla.org/integration/autoland/raw-file/8ef553153fc05f7e72bac8df0cc71d909c1578d6/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest', 'file_name': '/builds/worker/workspace/build/.android/releng.manifest'}, attempt #1
[task 2018-04-10T19:30:07.757Z] compiz (core) - Warn: Attempted to restack relative to 0x1400006 which is not a child of the root window or a window compiz owns
[task 2018-04-10T19:31:04.057Z] 19:31:04 WARNING - URL Error: https://hg.mozilla.org/integration/autoland/raw-file/8ef553153fc05f7e72bac8df0cc71d909c1578d6/testing/config/tooltool-manifests/androidarm_4_3/releng.manifest
[task 2018-04-10T19:31:04.057Z] 19:31:04 INFO - retry: attempt #1 caught URLError exception: <urlopen error timed out>

It looks like the "timeout" here is 1 second. Given that a few RTTs are needed, and that some of those RTTs are across the US at about 100 ms each, even with nothing going wrong you would barely complete the transaction within 1 second. Just a few extra ms of queueing delay on some workers in us-east-{1,2} (notably, both quarantined workers mentioned here are in us-east-1) would put it over the limit. I suspect setting that to something like 10 s or even 30 s would improve the situation.
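For context on that RTT arithmetic, here is a rough back-of-the-envelope estimate under the 1 s assumption in this comment (the timeout figure is revisited in the next comment); the ~100 ms RTT and the round-trip counts are illustrative, not measurements:

# Latency budget for a fresh HTTPS GET, assuming a ~100 ms cross-country RTT.
rtt_ms = 100
round_trips = {
    "TCP handshake": 1,
    "TLS 1.2 handshake": 2,
    "HTTP request/response": 1,
}
total_ms = sum(round_trips.values()) * rtt_ms
print("minimum transaction time: ~%d ms of a 1000 ms budget" % total_ms)  # ~400 ms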
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #13)
> It looks like the "timeout" here is 1 second.

I see something different (truncating lines to make this easier to read):

[task 2018-04-10T19:30:04.013Z] 19:30:03 INFO - Downloading ... releng.manifest to /builds...
[task 2018-04-10T19:30:04.013Z] 19:30:03 INFO - retry: Calling _download_file with args: (), kwargs: ...
[task 2018-04-10T19:31:04.057Z] 19:31:04 WARNING - URL Error: https://hg.mozilla.org...
[task 2018-04-10T19:31:04.057Z] 19:31:04 INFO - retry: attempt #1 caught URLError exception: <urlopen ...

I think _download_file was called at 19:30:03 and the URLError was thrown at 19:31:04: 61 seconds later.

_download_file calls self._urlopen(...timeout=30):
https://dxr.mozilla.org/mozilla-central/rev/0528a414c2a86dad0623779abde5301d37337934/testing/mozharness/mozharness/base/script.py#475

Slightly surprisingly, that calls _urlopen in TestingMixin:
https://dxr.mozilla.org/mozilla-central/rev/0528a414c2a86dad0623779abde5301d37337934/testing/mozharness/mozharness/base/script.py#475

which just calls urllib2.urlopen with the same timeout -- 30 seconds.

Do you agree? (I don't understand why there is a 60-second delay if the timeout is 30 seconds.)
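One possible explanation for the ~60 s gap, offered as a hypothesis rather than a confirmed reading of the mozharness/urllib2 code path: the timeout passed to urlopen becomes a per-socket-operation budget, and the standard-library connect loop applies it to each address returned by getaddrinfo in turn, so a host that resolves to two addresses which both swallow SYNs can cost roughly 2 x 30 s before the URLError surfaces. A minimal sketch of that connect loop:

import socket

# Sketch of the connect loop used by socket.create_connection (and therefore by
# httplib/urllib2 under the hood): the timeout bounds each address attempt, not
# the whole call, so two dead addresses at timeout=30 can take ~60 s total.
def connect_like_stdlib(host, port, timeout=30):
    last_err = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, 0, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)      # per-attempt budget
        try:
            sock.connect(sockaddr)    # may block up to `timeout` for this address
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    if last_err is None:
        raise OSError("getaddrinfo returned no results")
    raise last_err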
Flags: needinfo?(dustin)
If I recorded a video of myself sweating like a pig, dancing around, shouting "Workers! Workers! Workers!", would it help you focus on the fact that there are individual workers, two or three per day, which will hit this in 100% of the runs they take, while all of the other workers will hit this in 0% of the runs they take?
Oh, I must have read those logs with my head tilted to the side, sorry. Your read is absolutely correct.

What I don't understand is why some characteristic of certain EC2 instances would cause *this* HTTP request to fail, when a typical job performs many dozens of HTTP requests.

All four quarantined hosts are in us-east-1 so far. Perhaps there's some intersection of hg.mozilla.org behavior and EC2 region -- maybe routing from a particular AZ to or from scl3 is broken?

i-0a32d1d7e272cedb7 - (too old)
i-092ed4ba8ee5ff24c - us-east-1e
i-029d2ab0d3eee0552 - us-east-1e
i-06e5cffa49e93d949 - us-east-1e

Greg, is there anything unique about hg.m.o from EC2 at this point, or is that still in the planning stages? Is there any abuse or fraud prevention that might be blackholing particular IPs? If not, let's see if we can get netops involved to look at the routing for (what we see as) us-east-1e.
Flags: needinfo?(dustin) → needinfo?(gps)
hg.mo is still in SCL3 and serving data from there; there are no automatic mechanisms in place that should affect this (EIS might have some, but I'd be incredibly surprised if that got triggered for only a handful of instances). Generally, what we see with clones failing from hg is GeoIP data being wrong and traffic going well out of its way when trying to get bundles from the CDN; but I don't think that's applicable here, since we're pulling raw files. What does traceroute/mtr look like from good vs. bad instances? Is traffic using the VPN tunnels?
Flags: needinfo?(gps)
All https://hg.mo/ traffic is hitting a collection of nearly identical servers in SCL3. There *is* a small window after push operations where the mirrors could be inconsistent, but as long as downstream consumers wait for the Pulse or SNS notification for the push, things should be consistent. (That window is typically 1-2 s.) And because these are Firefox test tasks in CI that run minutes after a push, the chance of us hitting that window is ~0.

I don't think our network infra is automatically throttling/blocking IPs from CI, although it is a possibility.

This bug is eerily similar to bug 1451080 and bug 1420643. Bug 1451080 even has some investigation of load balancer logs. Anyway, same behavior: intermittent timeouts connecting to hg.mo. I suspect intermittent capacity issues on the hg.mo origin servers or the load balancer. That's definitely something we should get to the bottom of... Maybe we should chain all these related bugs to somewhere central? Bug 1451080 has the most context so far...
(In reply to Gregory Szorc [:gps] from comment #20)
> This bug is eerily similar to bug 1451080 and bug 1420643. Bug 1451080 even
> has some investigation of load balancer logs. Anyway, same behavior:
> intermittent timeouts connecting to hg.mo. I suspect intermittent capacity
> issues on the hg.mo origin servers or the load balancer. That's definitely
> something we should get to the bottom of...

So far it looks like issues connecting from USE1; if it were an issue with the endpoint, I'd expect a distribution of issues across USE1, MDC1+2, and SCL3. I'd be surprised if it's a Zeus capacity issue, as we've been steadily moving sites off that cluster, but I've been surprised before.
Ugh, I seem to have deleted the comment with another traceroute in it. Here's one from us-east-1a:

ubuntu@ip-172-31-1-112:~$ traceroute hg.mozilla.org
traceroute to hg.mozilla.org (63.245.215.25), 30 hops max, 60 byte packets
 1  216.182.231.82 (216.182.231.82) 12.594 ms  216.182.226.34 (216.182.226.34) 16.609 ms  216.182.226.52 (216.182.226.52) 21.635 ms
 2  100.66.8.150 (100.66.8.150) 19.511 ms  100.66.12.222 (100.66.12.222) 13.102 ms  100.66.9.250 (100.66.9.250) 15.016 ms
 3  100.66.14.154 (100.66.14.154) 14.802 ms  100.66.10.234 (100.66.10.234) 56.026 ms  100.66.14.66 (100.66.14.66) 16.269 ms
 4  100.66.7.241 (100.66.7.241) 15.639 ms  100.66.7.199 (100.66.7.199) 13.036 ms  100.66.6.67 (100.66.6.67) 13.282 ms
 5  100.66.4.229 (100.66.4.229) 17.487 ms  100.66.4.89 (100.66.4.89) 14.691 ms  100.66.4.235 (100.66.4.235) 17.470 ms
 6  100.65.8.97 (100.65.8.97) 0.286 ms  100.65.10.129 (100.65.10.129) 0.416 ms  100.65.8.1 (100.65.8.1) 0.455 ms
 7  52.93.24.28 (52.93.24.28) 41.777 ms  205.251.244.76 (205.251.244.76) 25.683 ms  72.21.197.16 (72.21.197.16) 8.158 ms
 8  52.93.27.208 (52.93.27.208) 1.886 ms  52.93.24.35 (52.93.24.35) 1.137 ms  52.93.24.21 (52.93.24.21) 1.671 ms
 9  54.239.108.130 (54.239.108.130) 23.510 ms  52.93.26.33 (52.93.26.33) 21.402 ms  54.239.111.18 (54.239.111.18) 5.816 ms
10  54.239.111.243 (54.239.111.243) 1.231 ms  54.239.111.255 (54.239.111.255) 1.221 ms  54.239.108.109 (54.239.108.109) 1.237 ms
11  52.95.216.231 (52.95.216.231) 1.041 ms  3.128 ms  1.051 ms
12  ae16.er5.iad10.us.zip.zayo.com (64.125.31.77) 1.392 ms  ae15.cr2.dca2.us.zip.zayo.com (64.125.31.41) 2.352 ms  ae15.cr1.dca2.us.zip.zayo.com (64.125.31.21) 2.895 ms
13  ae27.cs2.dca2.us.eth.zayo.com (64.125.30.248) 69.862 ms  69.849 ms  69.891 ms
14  ae27.cs2.dca2.us.eth.zayo.com (64.125.30.248) 69.855 ms  ae3.cs2.iah1.us.eth.zayo.com (64.125.29.45) 77.102 ms  ae27.cs2.dca2.us.eth.zayo.com (64.125.30.248) 69.859 ms
15  ae3.cs2.iah1.us.eth.zayo.com (64.125.29.45) 77.199 ms  ae5.cs2.dfw2.us.eth.zayo.com (64.125.28.103) 77.183 ms  ae3.cs2.iah1.us.eth.zayo.com (64.125.29.45) 77.115 ms
16  ae5.cs2.dfw2.us.eth.zayo.com (64.125.28.103) 77.227 ms  ae3.cs2.lax112.us.eth.zayo.com (64.125.29.21) 71.059 ms  71.000 ms
17  ae2.cs2.sjc2.us.eth.zayo.com (64.125.28.196) 76.886 ms  ae3.cs2.lax112.us.eth.zayo.com (64.125.29.21) 70.967 ms  ae2.cs2.sjc2.us.eth.zayo.com (64.125.28.196) 76.781 ms
18  ae2.cs2.sjc2.us.eth.zayo.com (64.125.28.196) 95.909 ms  79.977 ms  ae5.mpr2.pao1.us.zip.zayo.com (64.125.31.17) 69.644 ms
19  64.125.170.34.t00539-05.above.net (64.125.170.34) 77.280 ms  ae5.mpr2.pao1.us.zip.zayo.com (64.125.31.17) 69.627 ms  69.602 ms
20  xe-0-0-1.border1.scl3.mozilla.net (63.245.219.130) 71.148 ms  64.125.170.34.t00539-05.above.net (64.125.170.34) 77.562 ms  77.240 ms
21  xe-0-0-1.border1.scl3.mozilla.net (63.245.219.130) 71.124 ms  v-1026.core1.scl3.mozilla.net (63.245.214.70) 72.739 ms  72.743 ms
22  * v-1026.core1.scl3.mozilla.net (63.245.214.70) 72.635 ms  72.752 ms
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *

The us-east-1e one looked, from the IATA codes, like DC -> LA -> Palo Alto -> Santa Clara.
us-east-1e:

ubuntu@ip-172-31-55-57:~$ traceroute hg.mozilla.org
traceroute to hg.mozilla.org (63.245.215.25), 30 hops max, 60 byte packets
 1  216.182.225.128 (216.182.225.128) 16.975 ms  216.182.225.124 (216.182.225.124) 18.083 ms  216.182.225.128 (216.182.225.128) 16.928 ms
 2  100.66.8.254 (100.66.8.254) 18.908 ms  100.66.8.12 (100.66.8.12) 12.707 ms  100.66.8.4 (100.66.8.4) 16.885 ms
 3  100.66.15.110 (100.66.15.110) 21.926 ms  100.66.11.128 (100.66.11.128) 16.831 ms  100.66.11.42 (100.66.11.42) 16.604 ms
 4  100.66.7.37 (100.66.7.37) 15.347 ms  100.66.6.241 (100.66.6.241) 18.116 ms  100.66.6.49 (100.66.6.49) 17.715 ms
 5  100.66.4.27 (100.66.4.27) 11.542 ms  100.66.4.173 (100.66.4.173) 18.511 ms  100.66.4.215 (100.66.4.215) 16.568 ms
 6  100.65.9.193 (100.65.9.193) 0.414 ms  100.65.10.65 (100.65.10.65) 0.263 ms  100.65.9.1 (100.65.9.1) 0.303 ms
 7  205.251.245.175 (205.251.245.175) 1.146 ms  205.251.244.217 (205.251.244.217) 2.513 ms  205.251.245.175 (205.251.245.175) 1.127 ms
 8  54.239.108.228 (54.239.108.228) 20.009 ms  54.239.111.58 (54.239.111.58) 21.702 ms  54.239.108.194 (54.239.108.194) 25.478 ms
 9  54.239.108.177 (54.239.108.177) 1.224 ms  54.239.111.243 (54.239.111.243) 1.375 ms  54.239.111.251 (54.239.111.251) 1.532 ms
10  ae1.er5.iad10.us.zip.zayo.com (64.125.12.29) 1.090 ms  52.95.216.231 (52.95.216.231) 1.034 ms  ae1.er5.iad10.us.zip.zayo.com (64.125.12.29) 1.045 ms
11  ae15.cr2.dca2.us.zip.zayo.com (64.125.31.41) 2.255 ms  ae15.cr1.dca2.us.zip.zayo.com (64.125.31.21) 2.951 ms  ae16.er5.iad10.us.zip.zayo.com (64.125.31.77) 1.377 ms
12  ae27.cs2.dca2.us.eth.zayo.com (64.125.30.248) 77.143 ms  77.135 ms  ae15.cr2.dca2.us.zip.zayo.com (64.125.31.41) 2.231 ms
13  ae27.cs2.dca2.us.eth.zayo.com (64.125.30.248) 77.205 ms  ae3.cs2.iah1.us.eth.zayo.com (64.125.29.45) 99.028 ms  ae27.cs2.dca2.us.eth.zayo.com (64.125.30.248) 77.176 ms
14  ae5.cs2.dfw2.us.eth.zayo.com (64.125.28.103) 77.933 ms  77.910 ms  ae3.cs2.iah1.us.eth.zayo.com (64.125.29.45) 98.953 ms
15  ae3.cs2.lax112.us.eth.zayo.com (64.125.29.21) 79.569 ms  70.744 ms  ae5.cs2.dfw2.us.eth.zayo.com (64.125.28.103) 77.147 ms
16  ae3.cs2.lax112.us.eth.zayo.com (64.125.29.21) 70.713 ms  ae2.cs2.sjc2.us.eth.zayo.com (64.125.28.196) 70.043 ms  70.071 ms
17  ae5.mpr2.pao1.us.zip.zayo.com (64.125.31.17) 71.066 ms  71.046 ms  71.174 ms
18  ae5.mpr2.pao1.us.zip.zayo.com (64.125.31.17) 71.105 ms  64.125.170.34.t00539-05.above.net (64.125.170.34) 69.838 ms  ae5.mpr2.pao1.us.zip.zayo.com (64.125.31.17) 71.092 ms
19  64.125.170.34.t00539-05.above.net (64.125.170.34) 69.828 ms  xe-0-0-1.border1.scl3.mozilla.net (63.245.219.130) 77.610 ms  77.616 ms
20  v-1026.core1.scl3.mozilla.net (63.245.214.70) 80.205 ms  xe-0-0-1.border1.scl3.mozilla.net (63.245.219.130) 77.975 ms  77.671 ms
21  v-1026.core1.scl3.mozilla.net (63.245.214.70) 80.010 ms  79.994 ms  *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
If we're only provisioning a few instances in us-east-1e (perhaps it's typically more expensive than other AZs?) and that routing is slower or goes through a lossy link, then it might align nicely with the symptoms we're seeing here. John, is there any way to tell what portion of instances of a particular workerType are in us-east-1e? https://tools.taskcluster.net/aws-provisioner/gecko-t-linux-large/ is full of zeroes right now, due to bug 1453649. And yes, I agree that it might be good to join this up with the other bugs :gps mentioned.
Flags: needinfo?(jhford)
I think that during the failure windows these requests are not making it from the workers to the ZLB: at the times the failing requests were made to hg.m.o, I don't see them recorded in the ZLB logs. Instead, I see only requests from other tools and users for the same hash, and none of those are failing.

Re: bug 1451080, and Treeherder's tracking of the failures: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-04-12&endday=2018-04-19&tree=trunk&bug=1451080
See Also: → 1451080
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #24)
> John, is there any way to tell what portion of instances of a particular
> workerType are in us-east-1e?
> https://tools.taskcluster.net/aws-provisioner/gecko-t-linux-large/ is full
> of zeroes right now, due to bug 1453649.

In the meantime,

$ heroku pg:psql --app ec2-manager
ec2-manager::CRIMSON=> select "workerType", count(case when az='us-east-1e' then 1 end) as "us-east-1e", count(case when az<>'us-east-1e' then 1 end) as "other" from instances group by "workerType";

      workerType       | us-east-1e | other
-----------------------+------------+-------
 gecko-t-win10-64-gpu  |         15 |   188
 gecko-t-win10-64      |         62 |   754
 taskcluster-generic   |          0 |     2
 gecko-t-linux-xlarge  |        165 |  2215
 win2012r2-cu          |          0 |     1
 gecko-1-b-win2012     |          8 |    99
 gecko-1-b-macosx64    |          0 |    11
 releng-svc-prod       |          0 |     1
 gecko-1-images        |          0 |     1
 gecko-3-b-win2012     |         29 |   211
 gecko-3-b-linux       |          7 |   210
 gecko-3-b-macosx64    |          0 |    38
 gecko-1-b-android     |          1 |    12
 github-worker         |          0 |     8
 releng-svc            |          0 |     2
 gecko-1-b-linux       |         21 |   144
 gecko-2-decision      |          0 |     2
 releng-svc-compute    |          1 |     2
 gecko-t-win7-32       |         57 |   518
 hg-worker             |          0 |     1
 gecko-1-decision      |          0 |     2
 gecko-t-linux-large   |         19 |  1619
 gecko-t-win7-32-gpu   |         14 |   192
 gecko-3-decision      |          2 |     4
 gecko-3-b-android     |          2 |    44
 gecko-t-linux-medium  |          0 |     1
(26 rows)
Flags: needinfo?(jhford)
Also, you can do the same query with s/instances/terminations/ to get historic stats going back to some time in February. I'm almost tempted to make an ec2-manager endpoint that dumps a sqlite copy of the instances table, so that we can run sqlite queries on it locally -- SQL queries are often easiest with such simple data.
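If such a dump ever materializes, something like the following would answer the "what portion is in us-east-1e" question locally. This is a sketch only: the instances.sqlite file name and the endpoint that would produce it are hypothetical, but the workerType/az columns match the psql query above.

import sqlite3

# Compute the us-east-1e share per worker type from a hypothetical sqlite dump
# of the ec2-manager "instances" table (columns as in the psql query above).
conn = sqlite3.connect("instances.sqlite")
rows = conn.execute("""
    SELECT workerType,
           SUM(CASE WHEN az = 'us-east-1e' THEN 1 ELSE 0 END) AS use1e,
           COUNT(*) AS total
      FROM instances
     GROUP BY workerType
     ORDER BY total DESC
""")
for worker_type, use1e, total in rows:
    print("%-22s %5d / %5d  (%.1f%%)" % (worker_type, use1e, total, 100.0 * use1e / total))
conn.close()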
That worker is in us-east-1d. Looking at https://tools.taskcluster.net/groups/bHimRC4vQ9u7txfI8zer5g/tasks/fDE1GzDvTsGY6Waqk3_TtQ/runs/0/logs/public%2Flogs%2Flive_backing.log I see lots of successful requests to queue.tc.net (including by the worker itself), for example

[task 2018-04-14T16:32:57.780Z] 16:32:57 INFO - Downloading https://queue.taskcluster.net/v1/task/X-r_kWR8RbO6fiv7E8bZ9Q/artifacts/public/build/target.tar.bz2 to /builds/worker/workspace/build/target.tar.bz2
[task 2018-04-14T16:32:57.780Z] 16:32:57 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/X-r_kWR8RbO6fiv7E8bZ9Q/artifacts/public/build/target.tar.bz2', 'file_name': '/builds/worker/workspace/build/target.tar.bz2'}, attempt #1
[task 2018-04-14T16:33:30.974Z] 16:33:30 INFO - Downloaded 61476491 bytes.

but a whole bunch of failures to reach pypi.pub.build.mozilla.org:

[task 2018-04-14T16:34:02.803Z] 16:34:02 INFO - Ignoring indexes: https://pypi.python.org/simple
[task 2018-04-14T16:34:02.813Z] 16:34:02 INFO - Collecting psutil>=3.1.1
[task 2018-04-14T16:36:02.862Z] 16:36:02 INFO - Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7f65b3250510>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-14T16:38:03.617Z] 16:38:03 INFO - Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7f65b3250690>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-14T16:40:04.938Z] 16:40:04 INFO - Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7f65b3250810>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-14T16:42:07.252Z] 16:42:07 INFO - Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7f65b3250990>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-14T16:44:11.653Z] 16:44:11 INFO - Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7f65b3250b10>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-14T16:46:11.965Z] 16:46:11 INFO - Could not find a version that satisfies the requirement psutil>=3.1.1 (from versions: )

I don't see any attempts in that task to reach hg.mozilla.org. On the other task that host failed, it was the more common failure to reach hg.mozilla.org.

I believe pypi.pub.build.mozilla.org is still on the ZLB, and probably on the same ZLB as hg. So the pattern here seems to be that a host cannot access services hosted by the Mozilla ZLBs over a long stretch of time, yet can still reach other services. Kendall mentioned not seeing any requests on the ZLB, which is consistent with a connect timeout -- for whatever reason, the SYN/SYN+ACK/ACK three-way handshake is not completing, repeatedly.
And we don't see this from all hosts in the AZ, just a few. This really smells like abuse prevention or some other "smart" network behavior. Perhaps it's time to ask netops?
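A quick way to test the connect-timeout hypothesis from a suspect worker versus a healthy one would be a bare TCP probe, which exercises only the SYN/SYN+ACK/ACK handshake and nothing application-level. This is a sketch: the hosts are taken from the logs above, and the 10 s budget is an arbitrary choice.

import socket
import time

# Bare TCP connect probe: if the three-way handshake itself is failing (as the
# ZLB logs suggest), this times out; if only higher layers are at fault, it
# connects quickly.
def probe(host, port=443, timeout=10):
    start = time.time()
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return "connected in %.2fs" % (time.time() - start)
    except OSError as err:
        return "failed after %.2fs: %s" % (time.time() - start, err)

for host in ("hg.mozilla.org", "pypi.pub.build.mozilla.org", "queue.taskcluster.net"):
    print("%-30s %s" % (host, probe(host)))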
See Also: → 1451432
Chris found in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1454715 that these were caught by the "remotely triggered blackhole system". The blocks were removed for (I think limited to these):

- 35.172.213.0/24 (on 3/29 -- reason "Hammering RelEng")
- 18.232.118.0/24 (on 3/29 -- reason "Another automated scanner")
- 34.207.64.0/24 (on 4/2 -- reason "Hammering HG")
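For anyone checking whether a given worker was affected, a quick membership test against those ranges might look like the following sketch (the sample IP is hypothetical):

import ipaddress

# The three ranges that were blackholed and then unblocked, per the comment above.
BLACKHOLED = [ipaddress.ip_network(cidr) for cidr in
              ("35.172.213.0/24", "18.232.118.0/24", "34.207.64.0/24")]

def was_blackholed(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLACKHOLED)

print(was_blackholed("34.207.64.17"))   # hypothetical worker public IP -> True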