Seeing timeouts when fetching from tooltool

RESOLVED WORKSFORME

Status

Infrastructure & Operations
NetOps
4 years ago

People

(Reporter: armenzg, Assigned: adam)

(Reporter)

Description

4 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=34624410&tree=Mozilla-Inbound

Running the fetch locally on the machine, I can see that we only transfer at 100-200 KB/s, and the files are going to take longer than 300 seconds to download.
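
As a rough sanity check of those numbers: at a steady 100-200 KB/s, a 300 second window lets through only about 30 MB to 60 MB, so anything larger cannot finish before the timeout. The sketch below does that arithmetic. The function names are made up for illustration, 1 KB is assumed to be 1024 bytes, and the 50 MiB file size in the usage lines is hypothetical (the actual sizes in the manifest are not given in this bug).

```python
def max_bytes_within_timeout(rate_kb_per_s, timeout_s):
    """Upper bound on bytes transferred at a steady rate before the timeout.

    Assumes 1 KB = 1024 bytes and a constant transfer rate.
    """
    return rate_kb_per_s * 1024 * timeout_s


def fits_in_timeout(size_bytes, rate_kb_per_s, timeout_s=300):
    """True if a file of size_bytes can finish downloading within timeout_s."""
    return size_bytes <= max_bytes_within_timeout(rate_kb_per_s, timeout_s)


# Hypothetical 50 MiB file: finishes at 200 KB/s, times out at 100 KB/s.
hypothetical_size = 50 * 1024 * 1024
print(fits_in_timeout(hypothetical_size, 200))
print(fits_in_timeout(hypothetical_size, 100))
```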

Fetching...
retry: Calling <function run_with_timeout at 0x7fd3530a0410> with args: (['/tools/tooltool.py', '--url', 'http://runtime-binaries.pvt.build.mozilla.org/tooltool', '--overwrite', '-m', 'mobile/android/config/tooltool-manifests/android-armv6/releng.manifest', 'fetch'], 300, None, None, False, True), kwargs: {}, attempt #1
Executing: ['/tools/tooltool.py', '--url', 'http://runtime-binaries.pvt.build.mozilla.org/tooltool', '--overwrite', '-m', 'mobile/android/config/tooltool-manifests/android-armv6/releng.manifest', 'fetch']
WARNING: Timeout (300) exceeded, killing process 2286
Traceback (most recent call last):
  File "/tools/tooltool.py", line 835, in <module>
    main()
  File "/tools/tooltool.py", line 832, in main
    exit(0 if process_command(options, args) else 1)
  File "/tools/tooltool.py", line 727, in process_command
    cache_folder=options['cache_folder'])
  File "/tools/tooltool.py", line 508, in fetch_files
    temp_file_name = fetch_file(base_urls, f)
  File "/tools/tooltool.py", line 425, in fetch_file
    indata = f.read(grabchunk)
  File "/tools/python27/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/tools/python27/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/tools/python27/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
KeyboardInterrupt
retry: Failed, sleeping 30 seconds before retrying
retry: Calling <function run_with_timeout at 0x7fd3530a0410> with args: (['/tools/tooltool.py', '--url', 'http://runtime-binaries.pvt.build.mozilla.org/tooltool', '--overwrite', '-m', 'mobile/android/config/tooltool-manifests/android-armv6/releng.manifest', 'fetch'], 300, None, None, False, True), kwargs: {}, attempt #2
Executing: ['/tools/tooltool.py', '--url', 'http://runtime-binaries.pvt.build.mozilla.org/tooltool', '--overwrite', '-m', 'mobile/android/config/tooltool-manifests/android-armv6/releng.manifest', 'fetch']
WARNING: Timeout (300) exceeded, killing process 2320
All trees closed as of 09:24 due to this.
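
For readers unfamiliar with the pattern in the log above (run the command with a 300 second limit, kill it on timeout, sleep 30 seconds, try again), here is a minimal sketch of that behaviour. It is not the actual buildbot helper: it borrows the names `retry` and `run_with_timeout` from the log, but the signatures are invented for illustration, it targets Python 3 rather than the Python 2.7 in the traceback, and the attempt count of 3 is an assumption (the log only shows that there is at least a second attempt).

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return True on exit code 0, False on failure or timeout.

    subprocess.run kills the child process if the timeout expires.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        print("WARNING: Timeout (%s) exceeded" % timeout_s)
        return False


def retry(cmd, timeout_s=300, attempts=3, sleep_s=30, sleep=time.sleep):
    """Call run_with_timeout up to `attempts` times, sleeping between failures.

    attempts=3 is an assumed value; `sleep` is injectable so the pause can
    be skipped when experimenting.
    """
    for attempt in range(1, attempts + 1):
        print("retry: attempt #%d" % attempt)
        if run_with_timeout(cmd, timeout_s):
            return True
        if attempt < attempts:
            print("retry: Failed, sleeping %d seconds before retrying" % sleep_s)
            sleep(sleep_s)
    return False
```

Note that retrying with the same timeout only helps if the slowdown is transient. With a link that stays at 100-200 KB/s, every attempt hits the same 300 second limit.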
(Reporter)

Updated

4 years ago
Assignee: nobody → network-operations
Severity: normal → critical
Component: General Automation → NetOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee → adam
Version: unspecified → other
From #aws:

09:27 < XioNoX> found it, usw2
09:28 < XioNoX> according to smokeping, something with USW2 isn't very happy right now
09:28 < XioNoX> http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
09:28 < XioNoX> some packet loss on that link, even if latency is fine
09:30 < XioNoX> traffic is not too high, SPUs aren't overloaded
09:58 < hwine> XioNoX: should we open a case with AWS? trees closed
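
A note on why packet loss alone can produce the slow transfers seen here even though latency is fine: a common rule of thumb (the Mathis et al. approximation) bounds steady-state TCP throughput by roughly (MSS / RTT) * sqrt(3/2) / sqrt(p), where p is the packet loss rate. The sketch below applies that approximation. The MSS, RTT, and loss values in the usage lines are purely illustrative; none of them were measured on this link.

```python
import math


def mathis_throughput_bytes_per_s(mss_bytes, rtt_s, loss_rate):
    """Approximate upper bound on TCP throughput under random packet loss.

    Uses the Mathis et al. approximation:
        rate <= (MSS / RTT) * sqrt(3/2) / sqrt(p)
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)


# Illustrative values only: 1460-byte MSS, 30 ms RTT, varying loss rates.
for loss in (0.001, 0.01, 0.05):
    rate = mathis_throughput_bytes_per_s(1460, 0.030, loss)
    print("loss=%.3f -> about %.0f KB/s" % (loss, rate / 1024))
```

Because the bound scales with 1/sqrt(p), quadrupling the loss rate halves the achievable throughput, with no change in latency.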
We switched over to the other tunnel and things seem stable. Reopened.
Lowering severity since switching tunnels resolved the issue.
Severity: critical → normal
(Reporter)

Comment 5

4 years ago
What was the root cause?

Thanks!

Updated

4 years ago
QA Contact: adam → jbarnell
Since switching to the second AWS tunnel solved the issue, I'd guess the problem is on one of Amazon's endpoints. Did anyone open a ticket with them?
(Assignee)

Updated

4 years ago
Assignee: network-operations → adam
(Assignee)

Comment 7

4 years ago
Closing because we have other bugs tracking the packet loss issues with AWS.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WORKSFORME