Closed Bug 1124303 Opened 10 years ago Closed 9 years ago

setup windows build master for testing building on AWS

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Product:

Component:

Platform:

x86

Windows Server 2008

Type:

task

Priority:

Not set

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: jlund)

References

Details

Attachments

(1 file)

150416_windows_aws-add_10_builders-bbot-cfgs.patch, 10 years ago, Jordan Lund (:jlund), 794 bytes, patch. Flags: jlund: review+, jlund: checked-in+

Kim Moir [:kmoir] ET

Reporter

Description

•

10 years ago

Master is up here http://dev-master1.srv.releng.scl3.mozilla.com:8401/ in /builds/buildbot/qfortier

Kim Moir [:kmoir] ET

Reporter

Comment 1

•

10 years ago

I need the names of the AWS and scl3 slaves that will connect to it so Q and markco can test the build process.

Assignee: nobody → kmoir

Flags: needinfo?(q)

Flags: needinfo?(mcornmesser)

Kim Moir [:kmoir] ET

Reporter

Comment 2

•

10 years ago

Also, the notes on how I set up the master are at https://releng.etherpad.mozilla.org/windowsmasters and you can use them to set up the test master in a similar way when you get to that stage.

Mark Cornmesser [:markco]

Comment 3

•

10 years ago

Sorry about the lag on updating. We are still working on builder images that will function in the cloud. Once we have something I will update this bug.

Flags: needinfo?(mcornmesser)

Comment 4

•

10 years ago

So my AWS GPO clone machines should be:
b-2008-ec2-0001
b-2008-ec2-0002
b-2008-ec2-0003
b-2008-ec2-0004
b-2008-ec2-0005

Flags: needinfo?(q)

Kim Moir [:kmoir] ET

Reporter

Comment 5

•

10 years ago

Are those machines up yet? I didn't see them as available in the aws console. As an aside, I moved your master to dev-master2.bb.releng.use1.mozilla.com because dustin is bringing up a new server for us and the old one is being deprecated.

Comment 6

•

10 years ago

They have been rebuilt. 0001 is definitely ready to go. I think we should start with that machine and go from there.

Updated

•

10 years ago

OS: Mac OS X → Windows Server 2008

Kim Moir [:kmoir] ET

Reporter

Comment 7

•

10 years ago

Okay, I can see b-2008-ec2-0001 in the aws console. However, it's in a different security group than all of the other machines (for instance, other loaners are in the security group tests or build). What is its IP address? I can't see what it has been assigned. I need to be able to connect via ssh and set up the buildbot.tac to talk to the master.

Comment 8

•

10 years ago

After some discussion, we aren't going to change the build sec group to incorporate vnc, rdp etc. until after the testing. A new, very broad security group titled "build-win-test" has been created and all the slaves have been added. These machines should be able to go into try:
b-2008-ec2-0001 10.134.54.71
b-2008-ec2-0002 10.134.54.68
b-2008-ec2-0003 10.134.54.70
b-2008-ec2-0004 10.134.54.69
b-2008-ec2-0005 10.134.54.71

Jordan Lund (:jlund)

Assignee

Comment 9

•

10 years ago

myself and Q iterated on this yesterday and today. some status:

After some trials and tribulations, we were able to ssh into these machines, ensure buildbot was running and trying to connect to a master, and then point it to q's staging master.

After triggering some builds on it, our current issue is:

full job: http://dev-master2.bb.releng.use1.mozilla.com:8401/builders/WINNT%205.2%20mozilla-central%20build/builds/0

failed step log snippet:
'bash' '-c' 'wget -Orepository_manifest.py --no-check-certificate --tries=10 --waitretry=3 http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/utils/repository_manifest.py'
--18:53:29-- http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/utils/repository_manifest.py
=> `repository_manifest.py'
Resolving hg.mozilla.org... 63.245.215.25
Connecting to hg.mozilla.org|63.245.215.25|:80... failed: Connection timed out.

Q, if you are able to get the flows corrected, feel free to trigger another build on this slave. I've left the master setup with 0002 slave enabled. navigate to http://dev-master2.bb.releng.use1.mozilla.com:8401/builders/WINNT%205.2%20mozilla-central%20build and click on 'Force Build' at the bottom of this page
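A minimal sketch of how the "Connection timed out" symptom above could be checked from a slave before forcing another build. It only tests whether a TCP connection opens within a timeout; the function name `can_connect` and the default timeout are assumptions for illustration, not part of buildbot or mozharness.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


# Example call (not executed here), mirroring the failing wget target:
#   can_connect("hg.mozilla.org", 80)
```

A False result for hg.mozilla.org on port 80 would point at network flows or routing rather than at the build itself.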

Comment 10

•

10 years ago

So it looks like we have to assign a public IP for our IGW to work and get to hg.mozilla.org. I relaunched the instance with a public IP and the build is running! Thank you for the instructions.

Comment 11

•

10 years ago

After several successful builds we have new ips for instances that were rebuilt to try to load them up:
b-2008-ec2-0001 10.134.54.21
b-2008-ec2-0002 10.134.54.68 (same)
b-2008-ec2-0003 10.134.54.22
b-2008-ec2-0004 10.134.54.24
b-2008-ec2-0005 10.134.54.18

Comment 12

•

10 years ago

correction: b-2008-ec2-0004 10.134.54.23

Comment 13

•

10 years ago

Rename scripts worked and all five are taking builds: http://dev-master2.bb.releng.use1.mozilla.com:8401/builders/WINNT%205.2%20mozilla-central%20build

Jordan Lund (:jlund)

Assignee

Comment 14

•

10 years ago

(In reply to Q from comment #13)
> Rename scripts worked and all five are taking builds:
> http://dev-master2.bb.releng.use1.mozilla.com:8401/builders/WINNT%205.2%20mozilla-central%20build

\o/ http://dev-master2.bb.releng.use1.mozilla.com:8401/one_line_per_build

so, I think we should try out moar builders: debug, nightly, pgo, and some win64 equivalents. I can help kick off the variety. We should have all 5 slaves enabled to speed up the work.

/me syncs up with Q over irc

Jordan Lund (:jlund)

Assignee

Comment 15

•

10 years ago

he's on board and state is ready. triggered lots of jobs at the bottom of this page: http://dev-master2.bb.releng.use1.mozilla.com:8401/builders

Kim Moir [:kmoir] ET

Reporter

Comment 16

•

10 years ago

Mark contacted me via email. He would like some new ec2 build instances up to test configuring via puppet. I looked at the console; the existing builders such as b-2008-ec2-0001 are r3.xlarge. Jordan, when you set them up, were they just an image? I assume there are no configs in cloud-tools. I can set them up if needed.

Flags: needinfo?(jlund)

Comment 17

•

10 years ago

I will need to roll out the images for testing. They should go in as b-2008-ec2-0100 through b-2008-ec2-0104.

Flags: needinfo?(jlund)

Kim Moir [:kmoir] ET

Reporter

Comment 18

•

10 years ago

Okay Q, I'll wait until you roll out the images before I help setup the puppet test env. Mark, does that sound good to you?

Flags: needinfo?(mcornmesser)

Mark Cornmesser [:markco]

Comment 19

•

10 years ago

(In reply to Kim Moir [:kmoir] from comment #18)
> Okay Q, I'll wait until you roll out the images before I help setup the
> puppet test env. Mark, does that sound good to you?

Sounds good.

Flags: needinfo?(mcornmesser)

Jordan Lund (:jlund)

Assignee

Comment 20

•

10 years ago

(In reply to Jordan Lund (:jlund) from comment #15)
> he's on board and state is ready. triggered lots of jobs at the bottom of
> this page: http://dev-master2.bb.releng.use1.mozilla.com:8401/builders

quite a number of these failed.

- the win64 ones failed to sendchange because the port defaulted in staging is to point to prod test master. I updated buildbot-configs which included the change to switch win64 builds to mh. and mh defines the port so that should fix that issue
- the win32 nightlies failed due to a credential problem: "auth = (options.username, credentials['balrog_credentials'][options.username])" so I fixed up our creds on the master end.

going re-trigger both of these now

Jordan Lund (:jlund)

Assignee

Comment 21

•

10 years ago

(In reply to Mark Cornmesser [:markco] from comment #19)
> Sounds good.
>
> (In reply to Kim Moir [:kmoir] from comment #18)
> > Okay Q, I'll wait until you roll out the images before I help setup the
> > puppet test env. Mark, does that sound good to you?

I can help with this when we are ready. feel free to ping me markco and we can put AWS and puppet windows efforts together :D

Jordan Lund (:jlund)

Assignee

Updated

•

10 years ago

Assignee: kmoir → jlund

Jordan Lund (:jlund)

Assignee

Comment 22

•

10 years ago

> quite a number of these failed.
>
> - the win64 ones failed to sendchange because the port defaulted in staging
> is to point to prod test master. I updated buildbot-configs which included
> the change to switch win64 builds to mh. and mh defines the port so that
> should fix that issue
>
> - the win32 nightlies failed due to a credential problem: "auth =
> (options.username, credentials['balrog_credentials'][options.username])" so
> I fixed up our creds on the master end.
>
> going re-trigger both of these now

fwiw - these are now running green (trigger l10n step fails for nightlies but that's expected as I disabled them in staging)

Jordan Lund (:jlund)

Assignee

Comment 23

•

10 years ago

so we decided over irc that we are ready to take these out of staging and try them in production. So first up, we will try them in Try.

as a prerequisite, we need to:
1) patch buildbot-configs to tell the masters about them
2) patch slave health to let it know about them
3) reconfig

I'll file those bugs now and action them tomorrow

Jordan Lund (:jlund)

Assignee

Updated

•

10 years ago

Depends on: 1145510

Jordan Lund (:jlund)

Assignee

Updated

•

10 years ago

Depends on: 1145511

Jordan Lund (:jlund)

Assignee

Comment 24

•

10 years ago

been delayed here with other tasks. I updated bug 1145510 and can reconfig once that's reviewed. I think that's the one we care about here. bug 1145511 is not a hard blocker at least not immediately

Jordan Lund (:jlund)

Assignee

Comment 25

•

10 years ago

these have been on try (discussed and enabled over irc) for over a day now. the bad news is they are not looking too good. I added them to slave health for ease of view:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0001
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0002
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0003
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0004
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0005

poking a couple, they seem to compile for the most part fine but stumble on what looks like the taskcluster submission. I suspect it is just a network connection issue or else tc and mh are not configured to handle windows ec2

I can have another look tomorrow at some point

log: http://buildbot-master75.bb.releng.use1.mozilla.com:8101/builders/WINNT%205.2%20try%20leak%20test%20build/builds/18/steps/run_script/logs/stdio

here is a snippet of what I'm seeing:

22:26:51 INFO - Starting new HTTPS connection (1): taskcluster-public-artifacts.s3-us-west-2.amazonaws.com
22:29:50 DEBUG - Received HTTP Status: 200
22:29:50 DEBUG - Received HTTP Headers: {'content-length': '0', 'x-amz-id-2': 'b12qxFqIH790xPQt77DuFZt2FvM9s5LdYfjv1ABImBB7GVmRrgsX2v53Gy2HGm3R8iq5Ztg2xHM=', 'server': 'AmazonS3', 'x-amz-request-id': 'A33F83110072E572', 'etag': '"961ac2fb46c025cf654bcff7cf97f717"', 'date': 'Fri, 27 Mar 2015 05:26:53 GMT'}
22:29:50 DEBUG - Received HTTP Payload: (limit 1024 char)
22:29:50 INFO - Setting buildbot property symbolsUrl to https://queue.taskcluster.net/v1/task/kL6QyuXZQBCHi52n0VutZw/artifacts/public/build/firefox-39.0a1.en-US.win32.crashreporter-symbols.zip
22:29:50 INFO - Uploading to S3: filename=c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/obj-firefox/dist/firefox-39.0a1.en-US.win32.txt mimetype=text/plain length=59
22:29:50 DEBUG - Found a positional argument: kL6QyuXZQBCHi52n0VutZw
22:29:50 DEBUG - Found a positional argument: 0
22:29:50 DEBUG - Found a positional argument: public/build/firefox-39.0a1.en-US.win32.txt
22:29:50 DEBUG - After processing positional arguments, we have: {u'runId': 0, u'taskId': u'kL6QyuXZQBCHi52n0VutZw', u'name': u'public/build/firefox-39.0a1.en-US.win32.txt'}
22:29:50 DEBUG - After keyword arguments, we have: {u'runId': 0, u'taskId': u'kL6QyuXZQBCHi52n0VutZw', u'name': u'public/build/firefox-39.0a1.en-US.win32.txt'}
22:29:50 DEBUG - Route is: task/kL6QyuXZQBCHi52n0VutZw/runs/0/artifacts/public/build/firefox-39.0a1.en-US.win32.txt
22:29:50 DEBUG - Full URL used is: https://queue.taskcluster.net/v1/task/kL6QyuXZQBCHi52n0VutZw/runs/0/artifacts/public/build/firefox-39.0a1.en-US.win32.txt
22:29:51 DEBUG - Making attempt 0
22:29:51 DEBUG - Making a POST request to https://queue.taskcluster.net/v1/task/kL6QyuXZQBCHi52n0VutZw/runs/0/artifacts/public/build/firefox-39.0a1.en-US.win32.txt
22:29:51 DEBUG - HTTP Headers: {'Content-Type': 'application/json', 'Authorization': 'Hawk id="KHa1Y5wARRGL8R6GAsgW3w", ts="1427434191", nonce="MwOquF", ext="e30=", mac="soQYW+HhKsP7U6Tdqvw3LiFjB6nbnBDxPv3/RcKymE4="'}
22:29:51 DEBUG - HTTP Payload: {"storageType":"s3","expires":"2016-03-25T05:29:50.999000Z","contentType":"text/plain"} (limit 100 char)
22:29:51 INFO - Starting new HTTPS connection (1): queue.taskcluster.net
22:29:52 DEBUG - Received HTTP Status: 409
22:29:52 DEBUG - Received HTTP Headers: {'content-length': '1056', 'via': '1.1 vegur', 'x-powered-by': 'Express', 'server': 'Cowboy', 'access-control-request-method': '*', 'connection': 'keep-alive', 'date': 'Fri, 27 Mar 2015 05:29:52 GMT', 'access-control-allow-origin': '*', 'access-control-allow-methods': 'OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT', 'content-type': 'application/json; charset=utf-8', 'access-control-allow-headers': 'X-Requested-With,Content-Type,Authorization,Accept,Origin'}
22:29:52 DEBUG - Received HTTP Payload: { "message": "The given is not running", "error": { "status": { "taskId": "kL6QyuXZQBCHi52n0VutZw", "provisionerId": "null-provisioner", "workerType": "buildbot-try", "schedulerId": "-", "taskGroupId": "kL6QyuXZQBCHi52n0VutZw", "deadline": "2015-03-27T06:08:18.484Z", "expires": "2016-03-27T06:08:18.484Z", "retriesLeft": 4, "state": "pending", "runs": [ { "runId": 0, "state": "exception", "reasonCreated": "scheduled", "scheduled": "2015-03-27T05:08:20.351Z", "workerGroup": "buildbot-try", "workerId": "buildbot-try", "takenUntil": "2015-03-27T05:28:21.965Z", "started": "2015-03-27T05:08:22.301Z", "reasonResolved": "claim-expired", "resolved": "2015-03-27T05:28:27.250Z" }, { "runId": 1, "state": "pending", "reasonCreated": "retry", "scheduled": "2015-03-27T05:28:27.250Z" } (limit 1024 char)
22:29:52 FATAL - Uncaught exception: Traceback (most recent call last):
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\base\script.py", line 1288, in run
22:29:52 FATAL - self.run_action(action)
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\base\script.py", line 1231, in run_action
22:29:52 FATAL - self._possibly_run_method("postflight_%s" % method_name)
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\base\script.py", line 1171, in _possibly_run_method
22:29:52 FATAL - return getattr(self, method_name)()
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1500, in postflight_build
22:29:52 FATAL - self.upload_files()
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1346, in upload_files
22:29:52 FATAL - tc.create_artifact(task, upload_file)
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\mozilla\taskcluster_helper.py", line 97, in create_artifact
22:29:52 FATAL - "contentType": mime_type,
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 455, in apiCall
22:29:52 FATAL - return self._makeApiCall(e, *args, **kwargs)
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 232, in _makeApiCall
22:29:52 FATAL - return self._makeHttpRequest(entry['method'], route, payload)
22:29:52 FATAL - File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 424, in _makeHttpRequest
22:29:52 FATAL - superExc=rerr
22:29:52 FATAL - TaskclusterRestFailure: The given is not running
22:29:52 FATAL - Running post_fatal callback...
22:29:52 ERROR - setting return code to 2 because fatal was called
22:29:52 FATAL - Exiting -1
22:29:52 INFO - Running post-run listener: _summarize
22:29:52 ERROR - # TBPL FAILURE #
22:29:52 INFO - #####

Jordan Lund (:jlund)

Assignee

Comment 26

•

10 years ago

note, I've disabled the 0001-0005 for now

Amy Rich [:arr] [:arich]

Comment 27

•

10 years ago

Jordan, have you been looking into root cause here? We want to make sure either you or Q (or someone) are investigating this since it's blocking rolling out windows on AWS.

Flags: needinfo?(q)

Flags: needinfo?(jlund)

Comment 28

•

10 years ago

Unfortunately, as far as I know this is beyond my current scope. I have been focusing on getting the puppetized version of a builder into AWS. If I need to switch focus I can do so, as I am finalizing the post mdt captured puppet builder now.

Flags: needinfo?(q)

Mark Cornmesser [:markco]

Comment 29

•

10 years ago

I sent pmoore an email asking him to join the Windows meeting tomorrow morning because of the mention of Taskcluster in the traceback. Jlund, would you be down to join us as well? It is at 8:30 am West Coast.

Chris AtLee [:catlee]

Comment 30

•

10 years ago

So the failure looks like it's because the upload has taken too long. The mozharness script creates a placeholder taskcluster task to contain the uploaded artifacts. Tasks must be periodically claimed to let the taskcluster infra know that the worker hasn't died. If a worker fails to claim a task within the specified time range, the task is marked as failed. In this case, the task is expiring because we're taking too long to upload.

There are a few problems here:
- upload speed seems way too slow. we should double check our AWS routing to make sure we're going from these AWS nodes to S3 properly
- the mozharness taskcluster plugin should periodically reclaim the task
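The "periodically reclaim the task" idea above can be sketched as a background keep-alive around the upload loop. This is only an illustration under stated assumptions: `TaskReclaimer` and the `reclaim` callable are hypothetical names, and the real taskcluster client call that extends a claim is deliberately left out.

```python
import threading


class TaskReclaimer:
    """Call `reclaim` every `interval` seconds on a background thread until stopped."""

    def __init__(self, reclaim, interval):
        self._reclaim = reclaim    # hypothetical callable that extends the task claim
        self._interval = interval  # should be well below the claim window (about 20 min here)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop has been requested.
        while not self._stop.wait(self._interval):
            self._reclaim()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False  # never swallow exceptions raised by the uploads
```

Wrapping the serial artifact uploads in `with TaskReclaimer(reclaim, interval):` would keep the claim alive for as long as the uploads run, independent of how slow the network is.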

Jordan Lund (:jlund)

Assignee

Comment 31

•

10 years ago

(In reply to Mark Cornmesser [:markco] from comment #29)
> I sent pmoore an email asking him to join the Windows meeting tomorrow
> morning because of the mention of Taskcluster in the trace back.
>
> Jlund would you be down to join us as well? It is at 8:30 am West coast?

I'll be there.

Jordan Lund (:jlund)

Assignee

Comment 32

•

10 years ago

> - upload speed seems way too slow. we should double check our AWS routing to
> make sure we're going from these AWS nodes to S3 properly

Q, can you double check this?

> - the mozharness taskcluster plugin should periodically reclaim the task

I can look into not tripping over when the task takes too long

Flags: needinfo?(jlund)

Jordan Lund (:jlund)

Assignee

Comment 34

•

10 years ago

there were a few taskcluster client fixes recently that might provide us with a better traceback or maybe even clear the https://bugzilla.mozilla.org/show_bug.cgi?id=1124303#c25 issue. I am enabling one aws windows slave (0001) back into try for a job, to gain some insight and get more production results. In the meantime, I'll also be adding logic similar to this: https://bugzilla.mozilla.org/show_bug.cgi?id=1149703#c1

Jordan Lund (:jlund)

Assignee

Comment 35

•

10 years ago

for this with visibility: here is that build currently being run: http://buildbot-master75.bb.releng.use1.mozilla.com:8101/builders/WINNT%206.1%20x86-64%20try%20build/builds/7 I've gone ahead and disabled 0001 again in case it burns this job and continues to burn all through the night. Will check back in the morning

Jordan Lund (:jlund)

Assignee

Comment 36

•

10 years ago

(In reply to Jordan Lund (:jlund) from comment #35)
> for this with visibility: here is that build currently being run:
> http://buildbot-master75.bb.releng.use1.mozilla.com:8101/builders/WINNT%206.1%20x86-64%20try%20build/builds/7

this last build failed in a similar manner. It is something network related. We tried bumping the instance type and some attrs (advanced networking, HVM) but had no such luck.

Q appears to have had a breakthrough with b-2008-ec2-0004:

20:20:14 <Q> jlund: I got 0004 up to 60mbs
20:20:24 <Q> can you enable and test it ?
20:23:11 <Q> https://relops.pastebin.mozilla.org/8828257
20:37:32 <jonasfj> Q, I'm just curious but what was wrong?
20:47:15 <Q> Network Task Offload
20:47:24 <jlund> Q easily the best thing I've heard all day. I'll be at a computer in 10min
20:53:09 <Q> jonasfj: After registry fix. So the keys are using the PV driver for the card and turn off Network Task Offload in the registry no enhanced network settings on the instance or ami

This machine is enabled and connected to bm75. It should pick up a job soon.

note: Friday is a STAT holiday for much of Canada and I will be unavailable until Monday.

Chris AtLee [:catlee]

Comment 37

•

10 years ago

Looks like it failed during upload again: https://treeherder.mozilla.org/logviewer.html#?job_id=6246234&repo=try Is it possible that these settings got reset after a reboot?

Jordan Lund (:jlund)

Assignee

Comment 38

•

10 years ago

over irc Q confirmed that the reason is still unknown and he is debugging. He would like this machine to stay in on try till the end of the day to collect metrics on network performance. He may need my help in comparing jobs once he has a better idea of what is going on.

Comment 39

•

10 years ago

Continuing to tweak 0004: I have changed the EBS settings and re-enabled it in slavealloc. After this I am going to try to remove some QoS multi-tenancy variables (fluctuations in network download speed) by doing a short run as a dedicated instance.

Comment 40

•

10 years ago

I need some clarification on "taking too long". When does the clock start ticking on the taskcluster timeout? Is it at the start of the build or at some other point? Looking at a job that started at 21:18:50, I see a taskcluster mention at 21:18:53. Then again at 22:42:54 (and the first put statement at 22:43:02) there is only a 22 minute gap until the 409 error at 23:04:00. I assume we get the 409 error because the s3 bucket for the task has been reclaimed.

Flags: needinfo?(jlund)

Flags: needinfo?(catlee)

Comment 41

•

10 years ago

http://buildbot-master75.bb.releng.use1.mozilla.com:8101/builders/b2g_try_win32_gecko%20build/builds/11/steps/run_script/logs/stdio

Chris AtLee [:catlee]

Comment 42

•

10 years ago

For this particular problem, the clock starts ticking when we initially create the task in taskcluster. Look for the line that says "Making a PUT request to https://queue.taskcluster.net/v1/task/XXXXXXX"

The response further down (in the call to /task/XXXXX/claim/0) indicates how long we have to finish uploading all the artifacts, or to reclaim the task (which we're not doing ATM).

At 22:43:06 PT, we got a claim on the task until: "takenUntil": "2015-04-09T06:03:07.555Z", so that looks like a 20min claim.

After that we're uploading all the artifacts in serial. Right at the end you can see we start trying to upload log_warning.log, but we're past the deadline and so the task has timed out.

Contrast that with this log: http://ftp.mozilla.org/pub/mozilla.org/b2g/try-builds/catlee@mozilla.com-19b778c1db25/try-win32_gecko/try-win32_gecko-bm87-try1-build1607.txt.gz

We create the task at 16:50:46, and are done uploading by 16:58:05. This is from a machine in scl3.
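The timing in this comment can be checked with a few lines of arithmetic. The sketch below assumes that "PT" means PDT (UTC-7) in April 2015 and that the 22:43:06 claim falls on 2015-04-08 local time, which is what the takenUntil date implies; both are assumptions, not statements from the log.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PT" in April 2015 is Pacific Daylight Time, UTC-7.
PDT = timezone(timedelta(hours=-7))

# Claim obtained at 22:43:06 PT; the local date is inferred from takenUntil.
claimed_at = datetime(2015, 4, 8, 22, 43, 6, tzinfo=PDT)
taken_until = datetime.strptime(
    "2015-04-09T06:03:07.555Z", "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=timezone.utc)

# How long the worker has to finish uploading or to reclaim the task.
claim_window = taken_until - claimed_at

# The scl3 comparison: task created at 16:50:46, uploads done by 16:58:05.
scl3_upload = timedelta(hours=16, minutes=58, seconds=5) - timedelta(
    hours=16, minutes=50, seconds=46
)

print("claim window:", claim_window)
print("scl3 create-to-done:", scl3_upload)
```

Under those assumptions the claim window is just over 20 minutes, while the scl3 machine finishes in a little over 7 minutes, so the same serial upload fits comfortably there.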

Flags: needinfo?(catlee)

Comment 43

•

10 years ago

I can now confirm that after some routing changes I can get speeds up to 109 mb/s to usw2 using s3 browser. Speed tests also show the same speed. However, the upload through the python process still seems slower than it should be.

My current late night questions:
1) Is it possible that there is something inherent to the windows python distro causing a problem here?
2) Do we take advantage of multi-threaded uploads to S3 for artifacts? If not, is it possible?
3) Who is our python on windows expert in releng or dev?
4) Who owns the vpc routing configs for the releng aws setup?
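On question 2, a minimal sketch of what running the artifact uploads concurrently rather than in serial could look like. `upload_all` and `upload_one` are hypothetical names; `upload_one` stands in for whatever function uploads a single artifact, and nothing here claims this is how mozharness does or should do it.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(paths, upload_one, max_workers=4):
    """Run upload_one(path) for every path on a thread pool.

    Returns {path: result}. If any upload raised, the exception is
    re-raised here when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the uploads overlap in time.
        futures = {path: pool.submit(upload_one, path) for path in paths}
        return {path: future.result() for path, future in futures.items()}
```

Threads suit this case because uploads are I/O bound; whether it helps in practice depends on whether the bottleneck is per-connection throughput or the route itself.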

Comment 44

•

10 years ago

Further notes: I do see better and more consistent performance in usw1 vs usw2 using s3 browser in single thread mode over the span of 20 or so tests using buckets in both locations.

Chris AtLee [:catlee]

Comment 45

•

10 years ago

Our routing configs are here: https://github.com/mozilla/build-cloud-tools/blob/master/configs/routingtables.yml

and Amazon's published set of IPs is here: https://ip-ranges.amazonaws.com/ip-ranges.json

Which IPs were missing when you were doing your tests?
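One way to answer "which IPs were missing" is to compare the S3 prefixes Amazon publishes for a region against the prefixes already routed. The sketch below relies on the published ip-ranges.json layout (a top-level "prefixes" list whose entries carry "ip_prefix", "region" and "service"); the sample data and the `routed` set are made up for illustration and use documentation address ranges, not real AWS or releng values.

```python
import json


def s3_prefixes(ip_ranges, region):
    """Return the sorted S3 CIDR prefixes published for one region."""
    return sorted(
        entry["ip_prefix"]
        for entry in ip_ranges["prefixes"]
        if entry["service"] == "S3" and entry["region"] == region
    )


# Made-up sample in the same shape as ip-ranges.json.
SAMPLE = json.loads("""
{
  "prefixes": [
    {"ip_prefix": "192.0.2.0/24",    "region": "us-west-2", "service": "S3"},
    {"ip_prefix": "198.51.100.0/24", "region": "us-west-2", "service": "EC2"},
    {"ip_prefix": "203.0.113.0/24",  "region": "us-west-1", "service": "S3"}
  ]
}
""")

# Hypothetical set of prefixes already present in the routing table.
routed = {"203.0.113.0/24"}

# Published S3 prefixes for us-west-2 that have no explicit route yet.
missing = [p for p in s3_prefixes(SAMPLE, "us-west-2") if p not in routed]
```

Run against the real JSON and the prefixes from routingtables.yml, `missing` would list the candidate routes to add.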

Comment 46

•

10 years ago

I retract my earlier theories about the tools stack after a dive into client.py. I did a bunch of transfers via s3 browser last night from both the stock image and our windows node and found that usw2 is consistently slower than all other regions tested, in some cases slower than 2mb/s, though not consistently that low. I have found reports from other users of random slow upload speeds to s3 buckets in usw2, but no identified fixes. I am spinning up a node in usw2 to test without crossing regions, and I have opened support case 1376320311 with Amazon. In addition I will pull my notes on the missing VPC routes for the use and usw IPs I found and post them here.

Flags: needinfo?(jlund)

Comment 47

•

10 years ago

Still no answer from Amazon; by SLA they have until Monday to respond. The node in usw2 is performing well and builds are completing without hitting the timeout. I will keep an eye out for the rest of the weekend and if things look good I will start the instance type build benchmarking in usw2 and get refocused on getting puppet-configured instances into AWS.

Comment 48

•

10 years ago

Troubleshooting the below response from Amazon:

Further to my last response I have now configured an EC2 Windows 2012 Server in US-East-1 and an S3 bucket in US-West-2. Strangely enough I did not see any performance issues when copying over a 100MB file using s3 browser.

I then went back to the instance id that you supplied and I noticed that you have quite a few rules in terms of routing. I then had a look at the various S3 endpoints to try to determine what underlying IP addresses may be getting used: http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region

When I pinged s3-us-west-2.amazonaws.com I received an endpoint IP address of 10.12.13.162
When I pinged s3-us-west-1.amazonaws.com I received an endpoint IP address of 54.231.232.192

Looking at your routing rules for your instance it seems that traffic going to 10.12.13.162 would be directed via your VPN gateway, whereas traffic to 54.231.232.192 would be directed to an Internet Gateway. I am beginning to think that this may be where the problem originates.

To determine if this theory holds any weight, it is now vitally important that I get the output of tcptraceroute to both destinations or at the very least the output of the tracert command to both destinations, as this will confirm whether or not a different network path is being used.

I would also suggest modifying the routing table to route traffic to S3 endpoints in US-West-2 via an Internet Gateway to see if this improves performance, as this will further help us determine the root cause.

I look forward to seeing your responses and working with you further to find a permanent solution to this issue.

Best regards,
Karl G.
Amazon Web Services
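The routing observation in the response above comes down to longest-prefix matching: the most specific route that contains the destination decides the next hop. The sketch below uses a hypothetical two-entry route table (a 10.0.0.0/8 route to the VPN gateway and a default route to the Internet Gateway) chosen only to reproduce the behaviour described; the real releng routes are not shown in this bug.

```python
import ipaddress


def next_hop(routes, destination):
    """Return the target of the most specific route containing `destination`.

    `routes` maps CIDR strings to target names; returns None if nothing matches.
    """
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix wins, as in a VPC route table.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical route table for illustration only.
ROUTES = {
    "0.0.0.0/0": "internet-gateway",
    "10.0.0.0/8": "vpn-gateway",
}

print(next_hop(ROUTES, "10.12.13.162"))    # the us-west-2 endpoint address from the ping
print(next_hop(ROUTES, "54.231.232.192"))  # the us-west-1 endpoint address from the ping
```

With a table like this, the 10.x address is sent over the VPN gateway while the 54.x address takes the Internet Gateway, which is the asymmetry the support engineer suspects.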

Jordan Lund (:jlund)

Assignee

Comment 49

•

10 years ago

Attached patch 150416_windows_aws-add_10_builders-bbot-cfgs.patch

update here: Q has been able to reliably run jobs with 0004 on usw2 for some time. He has created 10 more hosts and they are now in slavealloc. Let's tell the masters about them.

r+ from Q over irc

on default: https://hg.mozilla.org/build/buildbot-configs/rev/844d30771cef

this will go live with tomorrow's reconfig

Attachment #8593757 - Flags: review+

Attachment #8593757 - Flags: checked-in+

Jordan Lund (:jlund)

Assignee

Comment 50

•

10 years ago

> on default: https://hg.mozilla.org/build/buildbot-configs/rev/844d30771cef
>
> this will go live with tomorrows reconfig

this is now live in production: http://hg.mozilla.org/build/buildbot-configs/rev/6a07c3d3b7eb

Jordan Lund (:jlund)

Assignee

Comment 51

•

10 years ago

update: we have 5 machines (20-24) in usw2 that are running more reliably. Q is continuing to work with AWS to improve network throughput in both regions

fyi: I will be on PTO until May 13th. Please contact coop for any releng requests prior to that date

Updated

•

10 years ago

Blocks: 1159384

Updated

•

10 years ago

Depends on: 1165314

Comment 52

•

10 years ago

New image created in AWS ec2-b-win64-2015-05-21-gpo (ami-38849f50). This ami has the latest tweaks captured from GPO and the new network settings for S3 compatibility. It is currently running in try on machine b-2008-ec2-0002.

Phil Ringnalda (:philor)

Comment 53

•

10 years ago

The last successful build for b-2008-ec2-0001 was on March 26th, disabled in slavealloc.

Carsten Book [:Tomcat]

Comment 54

•

10 years ago

(In reply to Phil Ringnalda (:philor) from comment #53)
> The last successful build for b-2008-ec2-0001 was on March 26th, disabled in
> slavealloc.

seems the disabling didn't work somehow, disabled now again

Comment 55

•

10 years ago

Ran out of disk space after I unlocked it from my master. Rebuilding with a bigger drive.

Chris Cooper [:coop] (he/him)

Comment 56

•

9 years ago

We've switched the nomenclature here, correct? I'm just verifying for bug 1162730.

If we follow the same pattern we had for linux, we'll have:

build
* a handful of long-running b-2008-ec2 instances (for releases, etc.)
* the rest/bulk will be b-2008-spot

try
* all y-2008-spot
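Assuming the naming pattern proposed above is the one adopted, a hostname can be mapped to its pool from its prefix alone. The `POOLS` table and the `classify` helper are hypothetical and simply restate the list in code; they are not taken from slavealloc or buildbot-configs.

```python
# Prefix -> (pool, instance lifetime), restating the proposed nomenclature.
POOLS = {
    "b-2008-ec2": ("build", "long-running"),
    "b-2008-spot": ("build", "spot"),
    "y-2008-spot": ("try", "spot"),
}


def classify(hostname):
    """Return (pool, lifetime) for names like 'b-2008-ec2-0001', or None if unknown."""
    # Split off the trailing numeric suffix: 'b-2008-ec2-0001' -> 'b-2008-ec2'.
    prefix, _, _suffix = hostname.rpartition("-")
    return POOLS.get(prefix)
```

For example, `classify("y-2008-spot-001")` would return `("try", "spot")`, and any name outside the three prefixes returns None.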

Kim Moir [:kmoir] ET

Reporter

Comment 57

•

9 years ago

Think this bug can be closed

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED


Updated

•

7 years ago

Component: Platform Support → Buildduty

Product: Release Engineering → Infrastructure & Operations

Updated

•

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
