Closed Bug 1152506 Opened 10 years ago Closed 10 years ago

[provisioner] switch all production jobs to aws-provisioner-v1

Categories

(Taskcluster :: Services, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: jhford)

Details

Attachments

(1 file)

We should move to the new provisioner. This is a bug to keep track of all the switches.
This is a nit, but if it's not too late, could we call the new provisioner "aws-provisioner-v1"? Then next time we do something like this, it's obvious that it should be aws-provisioner-v2. I like that there is some sanity in the names we use :) Note: identifiers like this are limited to 22 characters from [a-zA-Z_-].
I'm finding it nearly impossible to get emulator-ics instances to boot, as the spot bid would be >$1.8 and the old max is $0.4... The dolphin, flame-kk and mulet-linux builds appear to be very solid in my testing.

For management, there's a basic provisioner UI which lists the configured worker types as well as basic stats on the number of 'capacity units' we have in each state and the number of pending jobs. For now, capacity units can be assumed to be equivalent to instances, but that won't be forever.

Right now, to edit workerTypes, you'll need to use the command line client:

  git clone https://github.com/taskcluster/aws-provisioner
  cd aws-provisioner

and then you can use Manage.js to manage the files. The fetch command will grab a workerType down as a JSON file, and the create command with the filename will create or update the workerType on the server.

This is currently being tested in https://treeherder.allizom.org/#/jobs?repo=try&revision=65207e92ddc3
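A minimal sketch of that workflow (the exact Manage.js subcommand syntax here is an assumption; check the repository's README for the real invocation):

  # clone the command line client
  git clone https://github.com/taskcluster/aws-provisioner
  cd aws-provisioner

  # hypothetical usage: fetch a workerType definition as a JSON file,
  # edit it, then create/update it on the server
  node Manage.js fetch emulator-ics        # assumed to write emulator-ics.json
  # ... edit emulator-ics.json (e.g. raise the maximum spot bid) ...
  node Manage.js create emulator-ics.json  # creates or updates the workerType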
http://docs.taskcluster.net/tools/task-inspector/#YOJXrHi6RpO-vSdOyC1Oog/2 Did this fail? It looks like it is passing from the UI, but the logs suggest that it didn't pass... How should I interpret this?
Flags: needinfo?(jopsen)
Flags: needinfo?(garndt)
Mulet and flame-kk look good; not sure about the dolphin one.
Looks like that task really failed but the script it was running didn't return the right exit code to signal for reportFailed to be called.
Flags: needinfo?(garndt)
(In reply to Greg Arndt [:garndt] from comment #5)
> Looks like that task really failed but the script it was running didn't
> return the right exit code to signal for reportFailed to be called.

Hmm, so that definitely feels like a valid bug... I wonder who owns this script.
I believe wander worked on the Dolphin bits.
Hi Wander, I was wondering if you have any thoughts regarding the return value passed in this script? It seems to be returning a 0 exit code for a failed build, which means that the job is showing as passed instead of red. I'm not sure if this is provisioner related, but regardless it means that failing jobs in production with the other provisioner might also be incorrectly reporting as green.
Flags: needinfo?(wcosta)
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #8)
> Hi Wander, I was wondering if you have any thoughts regarding the return
> value passed in this script? It seems to be returning a 0 exit code for a
> failed build, which means that the job is showing as passed instead of red.
> I'm not sure if this is provisioner related, but regardless it means that
> failing jobs in production with the other provisioner might also be
> incorrectly reporting as green.

Hey, this script is owned by :kli (Bug 1144463), but it is based on build-phone.sh, which I originally wrote. Anyway, it feels like the issue is that build-phone.sh doesn't have execute permission (at least in my local tree, it doesn't). Moreover, the reviewed patch [1] does not seem to be the patch that was applied [2].

[1] https://bugzilla.mozilla.org/attachment.cgi?id=8587497
[2] https://hg.mozilla.org/integration/b2g-inbound/rev/2727e5aa21cd
Flags: needinfo?(wcosta) → needinfo?(kli)
(In reply to Wander Lairson Costa [:wcosta] from comment #9)
> Hey, this script is owned by :kli (Bug 1144463), but it is based on
> build-phone.sh, which I originally wrote. Anyway, it feels like the issue is
> that build-phone.sh doesn't have execute permission (at least in my local
> tree, it doesn't). Moreover, the reviewed patch [1] does not seem to be the
> patch that was applied [2].

Well, there are three issues here:

1. build-phone.sh does not have exec permissions
2. failing to run build-phone.sh does not exit != 0
3. patches aren't being landed as reviewed

Where should I file bugs for the first and second issues? For the third issue, I wonder if there was any reason for the changes without a quick review...
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #10)
> (In reply to Wander Lairson Costa [:wcosta] from comment #9)
> > Hey, this script is owned by :kli (Bug 1144463), but it is based on
> > build-phone.sh, which I originally wrote. Anyway, it feels like the issue is
> > that build-phone.sh doesn't have execute permission (at least in my local
> > tree, it doesn't). Moreover, the reviewed patch [1] does not seem to be the
> > patch that was applied [2].
>
> Well, there are three issues here:
>
> 1. build-phone.sh does not have exec permissions
> 2. failing to run build-phone.sh does not exit != 0

s/build-phone/build-dolphin/

> 3. patches aren't being landed as reviewed
>
> Where should I file bugs for the first and second issues?

Good question. The first issue seems very straightforward. For the second, my guess is that it is a docker-worker issue; :garndt is a better person to ask, however.

> For the third issue, I wonder if there was any reason for the changes
> without a quick review...

Maybe this was just a mistake, a matter of backing out the landed patch and pushing the correct one. I think this can be done together with the first issue.
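A minimal sketch of what fixing the first two issues could look like (the file names follow the comments above; the wrapper logic is an assumption about how the build script is invoked):

  # issue 1: give the build script execute permission in the tree
  chmod +x build-phone.sh

  # issue 2 (hypothetical wrapper logic in build-dolphin.sh): make a failure
  # to run build-phone.sh propagate a non-zero exit status, so docker-worker
  # ends up calling reportFailed instead of reportCompleted
  if ! ./build-phone.sh; then
      echo "build-phone.sh failed" >&2
      exit 1
  fi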
Yeah, it looks like a task bug, i.e. not provisioner side. Perhaps docker-worker should also complain if folders it's told to upload don't exist.
Flags: needinfo?(jopsen)
Summary: [provisioner] switch all production jobs to aws-provisioner2 → [provisioner] switch all production jobs to aws-provisioner-v1
I've been thinking for a while now that we should be failing a task if artifacts are not there, since these are assumed to be the results of a successful task completion. Also, downstream tasks might be relying on those artifacts being there. That should perhaps be debated under a new bug.
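A rough illustration of that idea (not actual docker-worker code; the variable names are placeholders):

  # before reporting a task as completed, verify the expected artifact
  # directories actually exist; a missing one marks the task as failed
  for dir in $EXPECTED_ARTIFACT_DIRS; do
      if [ ! -d "$dir" ]; then
          echo "expected artifact directory '$dir' is missing" >&2
          exit 1
      fi
  done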
(In reply to Wander Lairson Costa [:wcosta] from comment #9)
> Hey, this script is owned by :kli (Bug 1144463), but it is based on
> build-phone.sh, which I originally wrote. Anyway, it feels like the issue is
> that build-phone.sh doesn't have execute permission (at least in my local
> tree, it doesn't). Moreover, the reviewed patch [1] does not seem to be the
> patch that was applied [2].
>
> [1] https://bugzilla.mozilla.org/attachment.cgi?id=8587497
> [2] https://hg.mozilla.org/integration/b2g-inbound/rev/2727e5aa21cd

Yes, this is a mistake. The landed patch is the one I pushed to try [1]. I landed it after try was green. I'm really sorry for the trouble. Should I back it out and land the reviewed one, or submit a follow-up fix to update the differing parts? Thanks!

[1] https://hg.mozilla.org/try/raw-rev/7fb1450e5395
Flags: needinfo?(kli) → needinfo?(wcosta)
(In reply to Kai-Zhen Li [:kli][:seinlin] from comment #14)
> (In reply to Wander Lairson Costa [:wcosta] from comment #9)
> > Hey, this script is owned by :kli (Bug 1144463), but it is based on
> > build-phone.sh, which I originally wrote. Anyway, it feels like the issue is
> > that build-phone.sh doesn't have execute permission (at least in my local
> > tree, it doesn't). Moreover, the reviewed patch [1] does not seem to be the
> > patch that was applied [2].
> >
> > [1] https://bugzilla.mozilla.org/attachment.cgi?id=8587497
> > [2] https://hg.mozilla.org/integration/b2g-inbound/rev/2727e5aa21cd
>
> Yes, this is a mistake. The landed patch is the one I pushed to try [1]. I
> landed it after try was green. I'm really sorry for the trouble. Should I
> back it out and land the reviewed one, or submit a follow-up fix to update
> the differing parts? Thanks!
>
> [1] https://hg.mozilla.org/try/raw-rev/7fb1450e5395

I don't have an answer here, so whatever approach you take is fine with me. When you fix this, would you mind fixing the execute permission issue in build-dolphin?
Flags: needinfo?(wcosta)
Today I accidentally went to the old provisioner web interface (http://aws-provisioner.taskcluster.net/worker-type) instead of the new one (https://tools.taskcluster.net/aws-provisioner/). This caused some confusion. Should we either bring down the old web interface, or place a big warning banner on it saying it is the old provisioner and that there is a new one (for people who don't know there is a new one)? There were only 3 instances running - 2 x emulator-x86-kk and 1 x gaia. Maybe this is not worth doing if we are migrating all jobs over anyway, since then presumably we'll bring the old one down.
Hopefully the old one won't have anything left soon. Then we can silently take it down, and jhford can get the domain for his new provisioner :) Though the UI will remain on tools.taskcluster.net.
Component: TaskCluster → AWS-Provisioner
Product: Testing → Taskcluster
The old provisioner is offline, so if anything isn't using the new ID, it's silently failing and no one cares.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → jhford
Component: AWS-Provisioner → Services