setup windows build master for testing building on AWS

RESOLVED FIXED

Status

task
4 years ago
a year ago

People

(Reporter: kmoir, Assigned: jlund)

Attachments

(1 attachment)

Reporter

Description

4 years ago
Master is up here
http://dev-master1.srv.releng.scl3.mozilla.com:8401/ in /builds/buildbot/qfortier
Reporter

Comment 1

4 years ago
I need the names of the AWS and scl3 slaves to connect to it so Q and markco can test the build process
Assignee: nobody → kmoir
Flags: needinfo?(q)
Flags: needinfo?(mcornmesser)
Reporter

Comment 2

4 years ago
Also, the notes on how I set up the master are here: https://releng.etherpad.mozilla.org/windowsmasters
You can use them to set up the test master in a similar way when you get to that stage.
Sorry about the lag on updating. We are still working on builder images that will function in the cloud. Once we have something I will update this bug.
Flags: needinfo?(mcornmesser)

Comment 4

4 years ago
So my AWS GPO clone machines should be:
b-2008-ec2-0001
b-2008-ec2-0002
b-2008-ec2-0003
b-2008-ec2-0004
b-2008-ec2-0005
Flags: needinfo?(q)
Reporter

Comment 5

4 years ago
Are those machines up yet?  I didn't see them as available in the aws console.  

As an aside, I moved your master to dev-master2.bb.releng.use1.mozilla.com because dustin is bringing up a new server for us and the old one is being deprecated.

Comment 6

4 years ago
They have been rebuilt. 0001 is definitely ready to go; I think we should start with that machine and go from there.

Updated

4 years ago
OS: Mac OS X → Windows Server 2008
Reporter

Comment 7

4 years ago
Okay, I can see b-2008-ec2-0001 in the AWS console. However, it's in a different security group than all of the other machines (for instance, other loaners are in the security group tests or build). What is its IP address? I can't see what it has been assigned. I need to be able to connect via ssh and set up the buildbot.tac to talk to the master.

Comment 8

4 years ago
After some discussion, we aren't going to change the build security group to incorporate VNC, RDP, etc. until after the testing. A new, very broad security group titled "build-win-test" has been created and all the slaves have been added. These machines should be able to go into try:

b-2008-ec2-0001 10.134.54.71
b-2008-ec2-0002 10.134.54.68
b-2008-ec2-0003 10.134.54.70
b-2008-ec2-0004 10.134.54.69
b-2008-ec2-0005 10.134.54.71
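
A side note on the list above: as posted, it gives 10.134.54.71 to both 0001 and 0005. A minimal sketch of a sanity check for that kind of duplicate, assuming the hostname/IP pairs are pasted into a string (the helper name `find_duplicate_ips` is made up for illustration and is not part of any releng tool):

```python
from collections import defaultdict

# Hostname/IP pairs exactly as listed in the comment above.
SLAVES = """\
b-2008-ec2-0001 10.134.54.71
b-2008-ec2-0002 10.134.54.68
b-2008-ec2-0003 10.134.54.70
b-2008-ec2-0004 10.134.54.69
b-2008-ec2-0005 10.134.54.71
"""


def find_duplicate_ips(listing):
    """Return {ip: [hostnames]} for every IP claimed by more than one host."""
    hosts_by_ip = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        hostname, ip = parts[0], parts[1]
        hosts_by_ip[ip].append(hostname)
    return {ip: hosts for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


if __name__ == "__main__":
    for ip, hosts in find_duplicate_ips(SLAVES).items():
        print("%s is listed for: %s" % (ip, ", ".join(hosts)))
```

Which of the two hosts actually held that address is not recoverable from this bug; the check only flags the clash.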
myself and Q iterated on this yesterday and today.

some status:

After some trials and tribulations, we were able to ssh into these machines, ensure buildbot was running and trying to connect to a master, and then point it to q's staging master.

After triggering some builds on it, our current issue is:

full job:
http://dev-master2.bb.releng.use1.mozilla.com:8401/builders/WINNT%205.2%20mozilla-central%20build/builds/0

failed step log snippet:
'bash' '-c' 'wget -Orepository_manifest.py --no-check-certificate --tries=10 --waitretry=3 http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/utils/repository_manifest.py'
--18:53:29--  http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/utils/repository_manifest.py
           => `repository_manifest.py'
Resolving hg.mozilla.org... 63.245.215.25
Connecting to hg.mozilla.org|63.245.215.25|:80... failed: Connection timed out.
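
The timeout above can be reproduced without running a full build. A minimal sketch, assuming Python 3 is available on the slave; `can_connect` is a hypothetical helper, not part of buildbot or mozharness, and it only checks that a TCP connection opens:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False


# Example, using the endpoint the failed wget step was trying to reach:
#   can_connect("hg.mozilla.org", 80)
```

A False result from the slave combined with a True result from a machine outside the VPC would point at routing or gateway configuration rather than at the build step itself.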

Q, if you are able to get the flows corrected, feel free to trigger another build on this slave. I've left the master setup with 0002 slave enabled.

navigate to http://dev-master2.bb.releng.use1.mozilla.com:8401/builders/WINNT%205.2%20mozilla-central%20build and click on 'Force Build' at the bottom of this page

Comment 10

4 years ago
So it looks like we have to assign a public IP for our IGW to work and get to hg.mozilla.org. I relaunched the instance with a public IP and the build is running! Thank you for the instructions.

Comment 11

4 years ago
After several successful builds we have new ips for instances that were rebuilt to try to load them up:

b-2008-ec2-0001 10.134.54.21
b-2008-ec2-0002 10.134.54.68 (same)
b-2008-ec2-0003 10.134.54.22
b-2008-ec2-0004 10.134.54.24
b-2008-ec2-0005 10.134.54.18

Comment 12

4 years ago
correction:

b-2008-ec2-0004 10.134.54.23
(In reply to Q from comment #13)
> Rename scripts worked and all five are taking builds:
> 
> http://dev-master2.bb.releng.use1.mozilla.com:8401/builders/WINNT%205.
> 2%20mozilla-central%20build

\o/ http://dev-master2.bb.releng.use1.mozilla.com:8401/one_line_per_build

so, I think we should try out moar builders: debug, nightly, pgo, and some win64 equivalents. I can help kick off the variety. We should have all 5 slaves enabled to speed up the work. /me syncs up with Q over irc
he's on board and state is ready. triggered lots of jobs at the bottom of this page: http://dev-master2.bb.releng.use1.mozilla.com:8401/builders
Reporter

Comment 16

4 years ago
Mark contacted me via email

He would like some new ec2 build instances up to test configuring via puppet. 

I looked at the console, the existing builders such as b-2008-ec2-0001 are r3.xlarge.

Jordan, when you set them up, were they just an image? I assume there are no configs in cloud-tools?

I can set them up if needed.
Flags: needinfo?(jlund)

Comment 17

4 years ago
I will need to roll-out the images for testing. They should go in as b-2008-ec2-0100 - 104
Flags: needinfo?(jlund)
Reporter

Comment 18

4 years ago
Okay Q, I'll wait until you roll out the images before I help setup the puppet test env.  Mark, does that sound good to you?
Flags: needinfo?(mcornmesser)
Sounds good. 


(In reply to Kim Moir [:kmoir] from comment #18)
> Okay Q, I'll wait until you roll out the images before I help setup the
> puppet test env.  Mark, does that sound good to you?
Flags: needinfo?(mcornmesser)
(In reply to Jordan Lund (:jlund) from comment #15)
> he's on board and state is ready. triggered lots of jobs at the bottom of
> this page: http://dev-master2.bb.releng.use1.mozilla.com:8401/builders

quite a number of these failed.

- the win64 ones failed to sendchange because the port defaulted in staging is to point to prod test master. I updated buildbot-configs which included the change to switch win64 builds to mh. and mh defines the port so that should fix that issue

- the win32 nightlies failed due to a credential problem: "auth = (options.username, credentials['balrog_credentials'][options.username])" so I fixed up our creds on the master end.

going to re-trigger both of these now
(In reply to Mark Cornmesser [:markco] from comment #19)
> Sounds good. 
> 
> 
> (In reply to Kim Moir [:kmoir] from comment #18)
> > Okay Q, I'll wait until you roll out the images before I help setup the
> > puppet test env.  Mark, does that sound good to you?

I can help with this when we are ready. feel free to ping me markco and we can put AWS and puppet windows efforts together :D
Assignee: kmoir → jlund
> quite a number of these failed.
> 
> - the win64 ones failed to sendchange because the port defaulted in staging
> is to point to prod test master. I updated buildbot-configs which included
> the change to switch win64 builds to mh. and mh defines the port so that
> should fix that issue
> 
> - the win32 nightlies failed due to a credential problem: "auth =
> (options.username, credentials['balrog_credentials'][options.username])" so
> I fixed up our creds on the master end.
> 
> going to re-trigger both of these now

fwiw - these are now running green (trigger l10n step fails for nightlies but that's expected as I disabled them in staging)
so we decided over irc that we are ready to take these out of staging and try them in production.

So first up, we will try them in Try.

as a prerequisite, we need to:

1) patch buildbot-configs to tell the masters about them
2) patch slave health to let it know about them
3) reconfig

I'll file those bugs now and action them tomorrow
Depends on: 1145510
Depends on: 1145511
been delayed here with other tasks. I updated bug 1145510 and can reconfig once that's reviewed. I think that's the one we care about here. bug 1145511 is not a hard blocker at least not immediately
these have been on try (discussed and enabled over irc) for over a day now.

the bad news is they are not looking too good. I added them to slave health for ease of view:

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0001
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0002
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0003
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0004
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ec2-0005

poking a couple, they seem to compile for the most part fine but stumble on what looks like the taskcluster submission. I suspect it is just a network connection issue or else tc and mh are not configured to handle windows ec2

I can have another look tomorrow at some point

log:
http://buildbot-master75.bb.releng.use1.mozilla.com:8101/builders/WINNT%205.2%20try%20leak%20test%20build/builds/18/steps/run_script/logs/stdio
here is a snippet of what I'm seeing:
22:26:51     INFO - Starting new HTTPS connection (1): taskcluster-public-artifacts.s3-us-west-2.amazonaws.com
22:29:50    DEBUG - Received HTTP Status:  200
22:29:50    DEBUG - Received HTTP Headers: {'content-length': '0', 'x-amz-id-2': 'b12qxFqIH790xPQt77DuFZt2FvM9s5LdYfjv1ABImBB7GVmRrgsX2v53Gy2HGm3R8iq5Ztg2xHM=', 'server': 'AmazonS3', 'x-amz-request-id': 'A33F83110072E572', 'etag': '"961ac2fb46c025cf654bcff7cf97f717"', 'date': 'Fri, 27 Mar 2015 05:26:53 GMT'}
22:29:50    DEBUG - Received HTTP Payload:  (limit 1024 char)
22:29:50     INFO - Setting buildbot property symbolsUrl to https://queue.taskcluster.net/v1/task/kL6QyuXZQBCHi52n0VutZw/artifacts/public/build/firefox-39.0a1.en-US.win32.crashreporter-symbols.zip
22:29:50     INFO - Uploading to S3: filename=c:/builds/moz2_slave/try-w32-d-00000000000000000000/build/src/obj-firefox/dist/firefox-39.0a1.en-US.win32.txt mimetype=text/plain length=59
22:29:50    DEBUG - Found a positional argument: kL6QyuXZQBCHi52n0VutZw
22:29:50    DEBUG - Found a positional argument: 0
22:29:50    DEBUG - Found a positional argument: public/build/firefox-39.0a1.en-US.win32.txt
22:29:50    DEBUG - After processing positional arguments, we have: {u'runId': 0, u'taskId': u'kL6QyuXZQBCHi52n0VutZw', u'name': u'public/build/firefox-39.0a1.en-US.win32.txt'}
22:29:50    DEBUG - After keyword arguments, we have: {u'runId': 0, u'taskId': u'kL6QyuXZQBCHi52n0VutZw', u'name': u'public/build/firefox-39.0a1.en-US.win32.txt'}
22:29:50    DEBUG - Route is: task/kL6QyuXZQBCHi52n0VutZw/runs/0/artifacts/public/build/firefox-39.0a1.en-US.win32.txt
22:29:50    DEBUG - Full URL used is: https://queue.taskcluster.net/v1/task/kL6QyuXZQBCHi52n0VutZw/runs/0/artifacts/public/build/firefox-39.0a1.en-US.win32.txt
22:29:51    DEBUG - Making attempt 0
22:29:51    DEBUG - Making a POST request to https://queue.taskcluster.net/v1/task/kL6QyuXZQBCHi52n0VutZw/runs/0/artifacts/public/build/firefox-39.0a1.en-US.win32.txt
22:29:51    DEBUG - HTTP Headers: {'Content-Type': 'application/json', 'Authorization': 'Hawk id="KHa1Y5wARRGL8R6GAsgW3w", ts="1427434191", nonce="MwOquF", ext="e30=", mac="soQYW+HhKsP7U6Tdqvw3LiFjB6nbnBDxPv3/RcKymE4="'}
22:29:51    DEBUG - HTTP Payload: {"storageType":"s3","expires":"2016-03-25T05:29:50.999000Z","contentType":"text/plain"} (limit 100 char)
22:29:51     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
22:29:52    DEBUG - Received HTTP Status:  409
22:29:52    DEBUG - Received HTTP Headers: {'content-length': '1056', 'via': '1.1 vegur', 'x-powered-by': 'Express', 'server': 'Cowboy', 'access-control-request-method': '*', 'connection': 'keep-alive', 'date': 'Fri, 27 Mar 2015 05:29:52 GMT', 'access-control-allow-origin': '*', 'access-control-allow-methods': 'OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT', 'content-type': 'application/json; charset=utf-8', 'access-control-allow-headers': 'X-Requested-With,Content-Type,Authorization,Accept,Origin'}
22:29:52    DEBUG - Received HTTP Payload: {
  "message": "The given is not running",
  "error": {
    "status": {
      "taskId": "kL6QyuXZQBCHi52n0VutZw",
      "provisionerId": "null-provisioner",
      "workerType": "buildbot-try",
      "schedulerId": "-",
      "taskGroupId": "kL6QyuXZQBCHi52n0VutZw",
      "deadline": "2015-03-27T06:08:18.484Z",
      "expires": "2016-03-27T06:08:18.484Z",
      "retriesLeft": 4,
      "state": "pending",
      "runs": [
        {
          "runId": 0,
          "state": "exception",
          "reasonCreated": "scheduled",
          "scheduled": "2015-03-27T05:08:20.351Z",
          "workerGroup": "buildbot-try",
          "workerId": "buildbot-try",
          "takenUntil": "2015-03-27T05:28:21.965Z",
          "started": "2015-03-27T05:08:22.301Z",
          "reasonResolved": "claim-expired",
          "resolved": "2015-03-27T05:28:27.250Z"
        },
        {
          "runId": 1,
          "state": "pending",
          "reasonCreated": "retry",
          "scheduled": "2015-03-27T05:28:27.250Z"
        }
    (limit 1024 char)
22:29:52    FATAL - Uncaught exception: Traceback (most recent call last):
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\base\script.py", line 1288, in run
22:29:52    FATAL -     self.run_action(action)
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\base\script.py", line 1231, in run_action
22:29:52    FATAL -     self._possibly_run_method("postflight_%s" % method_name)
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\base\script.py", line 1171, in _possibly_run_method
22:29:52    FATAL -     return getattr(self, method_name)()
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1500, in postflight_build
22:29:52    FATAL -     self.upload_files()
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\mozilla\building\buildbase.py", line 1346, in upload_files
22:29:52    FATAL -     tc.create_artifact(task, upload_file)
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\scripts\mozharness\mozilla\taskcluster_helper.py", line 97, in create_artifact
22:29:52    FATAL -     "contentType": mime_type,
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 455, in apiCall
22:29:52    FATAL -     return self._makeApiCall(e, *args, **kwargs)
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 232, in _makeApiCall
22:29:52    FATAL -     return self._makeHttpRequest(entry['method'], route, payload)
22:29:52    FATAL -   File "c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\venv\Lib\site-packages\taskcluster\client.py", line 424, in _makeHttpRequest
22:29:52    FATAL -     superExc=rerr
22:29:52    FATAL - TaskclusterRestFailure: The given is not running
22:29:52    FATAL - Running post_fatal callback...
22:29:52    ERROR - setting return code to 2 because fatal was called
22:29:52    FATAL - Exiting -1
22:29:52     INFO - Running post-run listener: _summarize
22:29:52    ERROR - # TBPL FAILURE #
22:29:52     INFO - #####
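
Reading the timestamps in the log above: run 0 was resolved as claim-expired before the artifact POST that got the 409. A small sketch that does the arithmetic, using only values copied from the payload and the response headers (the helper name `parse_tc_time` is made up for illustration):

```python
from datetime import datetime, timezone


def parse_tc_time(stamp):
    """Parse a UTC timestamp such as 2015-03-27T05:28:21.965Z."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


# Values copied from run 0 in the 409 payload above.
started = parse_tc_time("2015-03-27T05:08:22.301Z")
taken_until = parse_tc_time("2015-03-27T05:28:21.965Z")

# 'date' header of the 409 response: Fri, 27 Mar 2015 05:29:52 GMT.
rejected_at = datetime(2015, 3, 27, 5, 29, 52, tzinfo=timezone.utc)

claim_window = taken_until - started   # how long the claim was valid
overrun = rejected_at - taken_until    # how late the rejected POST was

print("claim window:", claim_window)
print("POST arrived after the claim lapsed by:", overrun)
```

So the claim lasted roughly twenty minutes, and the rejected request arrived about a minute and a half after it lapsed.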
note, I've disabled the 0001-0005 for now
Jordan, have you been looking into root cause here? We want to make sure either you or Q (or someone) are investigating this since it's blocking rolling out windows on AWS.
Flags: needinfo?(q)
Flags: needinfo?(jlund)

Comment 28

4 years ago
Unfortunately, as far as I know this is beyond my current scope. I have been focusing on getting the puppetized version of a builder into AWS. If I need to switch focus I can do so, as I am finalizing the post-MDT captured Puppet builder now.
Flags: needinfo?(q)
I sent pmoore an email asking him to join the Windows meeting tomorrow morning because of the mention of Taskcluster in the trace back.

Jlund would you be down to join us as well? It is at 8:30 am West coast?
So the failure looks like it's because the upload has taken too long. The mozharness script creates a placeholder taskcluster task to contain the uploaded artifacts. Tasks must be periodically claimed to let the taskcluster infra know that the worker hasn't died. If a worker fails to claim a task within the specified time range, the task is marked as failed.

In this case, the task is expiring because we're taking too long to upload. There are a few problems here:
- upload speed seems way too slow. we should double check our AWS routing to make sure we're going from these AWS nodes to S3 properly
- the mozharness taskcluster plugin should periodically reclaim the task
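
The second bullet above (periodically reclaiming the task) could be sketched roughly as follows. This is not the mozharness implementation: `PeriodicReclaimer` is a hypothetical helper, and `reclaim` is a caller-supplied zero-argument callable standing in for whatever request extends the claim.

```python
import threading
import time


class PeriodicReclaimer:
    """Call `reclaim` every `interval` seconds until the block exits.

    `reclaim` is a hypothetical callable; nothing here talks to
    Taskcluster directly.
    """

    def __init__(self, reclaim, interval):
        self._reclaim = reclaim
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop ends promptly when the with-block exits.
        while not self._stop.wait(self._interval):
            self._reclaim()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False  # never swallow exceptions from the upload


if __name__ == "__main__":
    calls = []
    with PeriodicReclaimer(lambda: calls.append(time.time()), interval=0.05):
        time.sleep(0.3)  # stands in for a slow artifact upload
    print("reclaimed %d times during the upload" % len(calls))
```

The interval would need to be comfortably shorter than the claim length reported by the queue. Note that if `reclaim` raises, the background thread stops silently in this sketch; real code would want to log or surface that.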
(In reply to Mark Cornmesser [:markco] from comment #29)
> I sent pmoore an email asking him to join the Windows meeting tomorrow
> morning because of the mention of Taskcluster in the trace back.
> 
> Jlund would you be down to join us as well? It is at 8:30 am West coast?

I'll be there.
> - upload speed seems way too slow. we should double check our AWS routing to
> make sure we're going from these AWS nodes to S3 properly

Q, can you double check this?

> - the mozharness taskcluster plugin should periodically reclaim the task

I can look into not tripping over when the task takes too long
Flags: needinfo?(jlund)
Duplicate of this bug: 1121513
there were a few taskcluster client fixes recently that might provide us with better traceback or maybe even clear the https://bugzilla.mozilla.org/show_bug.cgi?id=1124303#c25 issue. I am re-enabling one AWS Windows slave (0001) in try for a job, to gain some insight and get more production results.

In the mean time, I'll also be adding logic similar to this: https://bugzilla.mozilla.org/show_bug.cgi?id=1149703#c1
for visibility, here is that build, currently being run: http://buildbot-master75.bb.releng.use1.mozilla.com:8101/builders/WINNT%206.1%20x86-64%20try%20build/builds/7

I've gone ahead and disabled 0001 again in case it burns this job and continues to burn all through the night.

Will check back in the morning
(In reply to Jordan Lund (:jlund) from comment #35)
> for this with visibility: here is that build  currently being run:
> http://buildbot-master75.bb.releng.use1.mozilla.com:8101/builders/WINNT%206.
> 1%20x86-64%20try%20build/builds/7

this last build failed in a similar manner. It is something network related. We tried bumping the instance type and some attrs (advanced networking, HVM) but had no such luck.

Q appears to have had a breakthrough with b-2008-ec2-0004:
20:20:14 <Q> jlund: I got 0004 up to 60mbs
20:20:24 <Q> can you enable and test it ?
20:23:11 <Q> https://relops.pastebin.mozilla.org/8828257
20:37:32 <jonasfj> Q, I'm just curious but what was wrong?
20:47:15 <Q> Network Task Offload
20:47:24 <jlund> Q easily the best thing I've heard all day. I'll be at a computer in 10min
20:53:09 <Q> jonasfj: After registry fix. So the keys are using the PV driver for the card and turn off Network Task Offload in the registry no enhanced network settings on the instance or ami

This machine is enabled and connected to bm75. It should pick up a job soon.

note: Friday is a STAT holiday for much of Canada and I will be unavailable until Monday.
Looks like it failed during upload again:
https://treeherder.mozilla.org/logviewer.html#?job_id=6246234&repo=try

Is it possible that these settings got reset after a reboot?
over irc Q confirmed that the reason is still unknown and he is debugging. He would like this machine to stay in try until the end of the day to collect metrics on network performance. He may need my help in comparing jobs once he has a better idea of what is going on.

Comment 39

4 years ago
Continuing to tweak 004: I have changed the EBS settings and re-enabled it in slavealloc. After this I am going to try to remove some QoS multi-tenancy variables (fluctuations in network download speed) by doing a short run as a dedicated instance.

Comment 40

4 years ago
I need some clarification on "taking too long". When does the clock start ticking on the taskcluster timeout? Is it at the start of the build or at some other point? Looking at a job that started at 21:18:50, I see a taskcluster mention at 21:18:53 and then again at 22:42:54 (with the first PUT statement at 22:43:02). From there, there is only a 22 minute gap until the 409 error at 23:04:00. I assume we get the 409 error because the s3 bucket for the task has been reclaimed.
Flags: needinfo?(jlund)
Flags: needinfo?(catlee)
For this particular problem, the clock starts ticking when we initially create the task in taskcluster. Look for the line that says "Making a PUT request to https://queue.taskcluster.net/v1/task/XXXXXXX"

The response further down (in the call to /task/XXXXX/claim/0) indicates how long we have to finish uploading all the artifacts, or to reclaim the task (which we're not doing ATM). At 22:43:06 PT, we got a claim on the task until:
        "takenUntil": "2015-04-09T06:03:07.555Z",

so that looks like a 20min claim.

After that we're uploading all the artifacts in serial. Right at the end you can see we start trying to upload log_warning.log, but we're past the deadline and so the task has timed out.

Contrast that with this log: http://ftp.mozilla.org/pub/mozilla.org/b2g/try-builds/catlee@mozilla.com-19b778c1db25/try-win32_gecko/try-win32_gecko-bm87-try1-build1607.txt.gz

We create the task at 16:50:46, and are done uploading by 16:58:05. This is from a machine in scl3.
Flags: needinfo?(catlee)

Comment 43

4 years ago
I can now confirm that after some routing changes I can get speeds up to 109 mb/s to usw2 using s3 browser. Speed tests also show the same speed. However, the upload through the python process still seems slower than it should be. My current late night questions:
 1) Is it possible that there is something inherent to the windows python distro causing a problem here?
 2) Do we take advantage of multi threaded uploads to S3 for artifacts ? If not is it possible?
 3) Who is our python on windows expert in releng or dev ?
 4) Who owns the vpc routing configs for the releng aws setup ?
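
On question 2, a generic sketch of what uploading several artifacts concurrently could look like in Python. It uses only the standard library; `upload_all` is a hypothetical helper and `upload_one` is a caller-supplied callable taking one path. Whether the Taskcluster client and the S3 upload step tolerate concurrent use is not established in this bug.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(paths, upload_one, max_workers=4):
    """Run upload_one(path) for every path concurrently; return {path: result}.

    Any exception raised by upload_one is re-raised here when that
    upload's result is collected.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, path): path for path in paths}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    import time

    def fake_upload(path):
        time.sleep(0.1)  # pretend network transfer
        return len(path)

    start = time.time()
    upload_all(["a.zip", "b.txt", "c.log", "d.json"], fake_upload)
    print("elapsed: %.2fs" % (time.time() - start))
```

Threads help here only if the bottleneck is per-connection throughput or latency; if the link itself is saturated, parallel uploads would not shorten the total time.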

Comment 44

4 years ago
Further notes: I do see better and more consistent performance in usw1 vs usw2 using s3 browser in single thread mode over the span of 20 or so tests using buckets in both locations.
Our routing configs are here:

https://github.com/mozilla/build-cloud-tools/blob/master/configs/routingtables.yml

and Amazon's published set of IPs is here:
https://ip-ranges.amazonaws.com/ip-ranges.json

Which IPs were missing when you were doing your tests?
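
One way to answer that question mechanically is to compare the published prefixes against the routed CIDRs. A minimal sketch under stated assumptions: the `prefixes` / `ip_prefix` / `region` / `service` keys follow the layout of Amazon's ip-ranges.json, while the routed CIDRs are assumed to have been pulled out of routingtables.yml by hand (the layout of that file is not shown in this bug). `uncovered_prefixes` is a hypothetical helper, and the sample data uses documentation-only address ranges, not real Amazon prefixes.

```python
import ipaddress


def uncovered_prefixes(ip_ranges, routed_cidrs, service="S3",
                       regions=("us-west-2",)):
    """Return published prefixes for service/regions that no routed CIDR contains."""
    routed = [ipaddress.ip_network(cidr) for cidr in routed_cidrs]
    missing = []
    for entry in ip_ranges["prefixes"]:
        if entry["service"] != service or entry["region"] not in regions:
            continue
        prefix = ipaddress.ip_network(entry["ip_prefix"])
        covered = any(
            net.version == prefix.version and prefix.subnet_of(net)
            for net in routed
        )
        if not covered:
            missing.append(entry["ip_prefix"])
    return missing


if __name__ == "__main__":
    # Made-up sample in the ip-ranges.json shape (RFC 5737 test ranges).
    sample = {"prefixes": [
        {"ip_prefix": "192.0.2.0/25", "region": "us-west-2", "service": "S3"},
        {"ip_prefix": "198.51.100.0/24", "region": "us-west-2", "service": "S3"},
        {"ip_prefix": "203.0.113.0/24", "region": "us-east-1", "service": "S3"},
    ]}
    print(uncovered_prefixes(sample, ["192.0.2.0/24"]))
```

`subnet_of` needs Python 3.7 or later. A prefix that is only partially covered by a route is reported as missing in this sketch.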

Comment 46

4 years ago
I retract my earlier theories about the tools stack after a dive into client.py. I did a bunch of transfers via s3 browser last night from both the stock image and our windows node and found that usw2 is consistently slower than all other regions tested, in some cases slower than 2mb/s, though not consistently that low. I have found reports from other users of random slow upload speeds to s3 buckets in usw2, but no identified fixes. I am spinning up a node in usw2 to test without crossing regions, and I have opened support case 1376320311 with Amazon.

In addition I will pull my notes for the missing VPC routes for the use and usw ips I found and post them here.
Flags: needinfo?(jlund)

Comment 47

4 years ago
Still no answer from Amazon; by SLA they have until Monday to respond. The node in usw2 is performing well and builds are completing without hitting the timeout. I will keep an eye on it for the rest of the weekend, and if things look good I will start the instance type build benchmarking in usw2 and get refocused on getting Puppet-configured instances into AWS.

Comment 48

4 years ago
Troubleshooting the below response from Amazon:

Further to my last response I have now configured an EC2 Windows 2012 Server in US-East-1 and an S3 bucket in US-West-2. Strangely enough I did not see any performance issues when copying over a 100MB file using s3 browser.

I then went back to the instance id that you supplied and I noticed that you have quite a few rules in terms of routing. I then had a look at the various S3 endpoints to try to determine what underlying IP addresses may be getting used: http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region

When I pinged s3-us-west-2.amazonaws.com I received an endpoint IP address of 10.12.13.162
When I pinged s3-us-west-1.amazonaws.com I received an endpoint IP address of 54.231.232.192

Looking at your routing rules for your instance it seems that traffic going to 10.12.13.162 would be directed via your VPN gateway, whereas traffic to 54.231.232.192 would be directed to an Internet Gateway. I am beginning to think that this may be where the problem originates.

To determine if this theory holds any weight, it is now vitally important that I get the output of tcptraceroute to both destinations or at the very least the output of the tracert command to both destinations, as this will confirm whether or not a different network path is being used. I would also suggest modifying the routing table to route traffic to S3 endpoints in US-West-2 via an Internet Gateway to see if this improves performance, as this will further help us determine the root cause.

I look forward to seeing your responses and working with you further to find a permanent solution to this issue.

Best regards,
Karl G.
Amazon Web Services
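
The distinction the support reply draws between the two endpoint addresses can be checked quickly. A minimal sketch, assuming the simple rule of thumb implied by the reply (private address space goes over the VPN gateway, everything else over the Internet Gateway); `likely_gateway` is a hypothetical helper and does not read any real routing table:

```python
import ipaddress


def likely_gateway(address):
    """Guess the path for an address from whether it is in private space.

    This mirrors the reasoning in the support reply only; the actual
    route depends on the VPC routing table.
    """
    if ipaddress.ip_address(address).is_private:
        return "vpn-gateway"
    return "internet-gateway"


if __name__ == "__main__":
    # Endpoint addresses as reported in the support reply above.
    endpoints = [
        ("s3-us-west-2.amazonaws.com", "10.12.13.162"),
        ("s3-us-west-1.amazonaws.com", "54.231.232.192"),
    ]
    for name, addr in endpoints:
        print(name, addr, likely_gateway(addr))
```

The tracert output that the reply asks for remains the authoritative check of which path is actually taken.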
update here: Q has been able to reliably run jobs with 004 on usw2 for some time. He has created 10 more hosts and they are now in slavealloc. Let's tell the masters about them. r+ from Q over irc

on default: https://hg.mozilla.org/build/buildbot-configs/rev/844d30771cef

this will go live with tomorrow's reconfig
Attachment #8593757 - Flags: review+
Attachment #8593757 - Flags: checked-in+
> on default: https://hg.mozilla.org/build/buildbot-configs/rev/844d30771cef
> 
> this will go live with tomorrow's reconfig

this is now live in production: http://hg.mozilla.org/build/buildbot-configs/rev/6a07c3d3b7eb
update: we have 5 machines (20-24) in usw2 that are running more reliably. Q is continuing to work with AWS to improve network throughput in both regions

fyi: I will be on PTO until May 13th. Please contact coop for any releng requests prior to that date

Updated

4 years ago
Blocks: 1159384

Updated

4 years ago
Depends on: 1165314

Comment 52

4 years ago
New image created in AWS ec2-b-win64-2015-05-21-gpo (ami-38849f50). This ami has the latest tweaks captured from GPO and the new network settings for S3 compatibility. It is currently running in try on machine b-2008-ec2-0002.
The last successful build for b-2008-ec2-0001 was on March 26th, disabled in slavealloc.
(In reply to Phil Ringnalda (:philor) from comment #53)
> The last successful build for b-2008-ec2-0001 was on March 26th, disabled in
> slavealloc.

seems the disabling didn't work somehow, disabled now again

Comment 55

4 years ago
Ran out of disk space after I unlocked it from my master. Rebuilding with a bigger drive.
We've switched the nomenclature here, correct? I'm just verifying for bug 1162730. 

If we follow the same pattern we had for linux, we'll have:

build
* a handful of long-running b-2008-ec2 instances (for releases, etc.)
* the rest/bulk will be b-2008-spot

try
* all y-2008-spot
Reporter

Comment 57

3 years ago
Think this bug can be closed
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations