Closed Bug 1752375 Opened 2 years ago Closed 2 years ago

linux test image for gcp

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

References

Details

(Whiteboard: [relops-gcp])

Attachments

(3 files, 2 obsolete files)

+++ This bug was initially created as a clone of Bug #1752371 +++

it has been a while since we created the base test image for instances in gcp

Blocks: 1752378

comment (when not linked to two issues in jira)

hmm, unassigned in jira despite being assigned here. maybe something needs to be updated on my user to match the account

comment 1 appeared in jira 1 minute later. that's faster than I expected

and that last comment appeared within about a minute of the comment creation also :thumbsup:

test removing assignee

Assignee: dhouse → relops

and testing re-assigning to me

Assignee: relops → dhouse

great. the integration adds a comment to the jira issue for the first removal (and who the assignee becomes)

k, it shows the new assignee also. but it doesn't set the assignee on the issue

Assignee: dhouse → mcornmesser
Assignee: mcornmesser → dhouse
Assignee: dhouse → relops
Assignee: relops → dhouse
Flags: needinfo?(dhouse)
Group: mozilla-employee-confidential
Flags: needinfo?(dhouse)
Summary: new linux test image for gcp → linux test image for gcp

comment in secret

comment in public (when making public)

Group: mozilla-employee-confidential

gcp bionic base requires kdump-tools also now. I've not copied an image into a staging project yet, but the image building is working.

I have a current-state image (latest at docker-firefoxci-gcp-l1-googlecompute-2022-03-31t20-06-40z) and I'm working through testing it

kdump and v4l are the only needed changes I have distilled it down to

The build from monopacker had a broken video loopback, despite the work I'd done. It compiled etc, but doesn't work.

So in the last two weeks I worked on getting an ubuntu1404 install and alternatively a disk image from the aws ami:

install: My first attempts at ubuntu 14 installs failed because googlebot (gce can download isos from urls in a tsv) is blocked on the ubuntu download and archive sites. And I didn't want to deal with uploading the iso unless I had to.
Tried ubuntu16, and the video loopback install fails. So I didn't try further.

disk image: I first tried the ami import, and hit many permissions blockers (and could not decode the encoded import error messages because of further obscure permissions errors).
So then I tried a dd image copy, uploaded to s3, and then downloaded and extracted+written to a blank disk. With this, something was wrong with the partition table that prevented me from flagging the boot partition as bootable. It should have been the same and set bootable, but since it is from a vm (and not fully virtualized like qemu?) there may have been some magic missing for the partition table on a virtual disk in gce.
So, I tried the ami import again. I had to re-try this many times, working through adding+correcting the service user, roles, s3 bucket perms, etc. Finally I got a good ami export last week. But it failed on first import boot (the gce import adds keys etc) because it could not get the correct apt repos set to install the google cloud tools.
I brought up the ami in aws and manually added the repos and installed the google cloud tools. Then I took a snapshot of this updated ami, and re-did the import. After re-trying this a number of times (it seems to always need re-trying because of timeouts etc between the clouds), I got the export+import to work Friday/Saturday and brought up an instance, but I have been wrestling with startup issues for the last 4 days.

On first start..
docker-worker scratch volume mounting: if there is no secondary disk, startup fails and panics and the instance is shut down. If there is a persistent disk attached, it is formatted and mounted. If the local ssd is attached, it is formatted and mounted. (if both, they are combined into a scratch volume)
startup-script: taskcluster does allow a metadata "userdata"/startup-script to be passed through from the worker-manager (https://github.com/taskcluster/taskcluster/blob/main/services/worker-manager/src/providers/google.js#L279), but the pool editor has formatting restrictions that prevent sending more than a single line string.
start-docker-worker: appears to correctly get the workerConfig from the metadata
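
The scratch volume behaviour described above can be sketched as a small decision function. This is only an illustration of the observed behaviour, not the actual startup script; the names `ScratchPlan` and `plan_scratch_volume` are hypothetical.

```python
from enum import Enum


class ScratchPlan(Enum):
    """Hypothetical labels for the outcomes observed at first start."""
    FAIL = "no secondary disk: startup fails and the instance shuts down"
    SINGLE = "one disk: formatted and mounted"
    COMBINED = "both disks: combined into one scratch volume"


def plan_scratch_volume(has_persistent_disk: bool, has_local_ssd: bool) -> ScratchPlan:
    """Return what startup is expected to do for a given disk layout."""
    if has_persistent_disk and has_local_ssd:
        return ScratchPlan.COMBINED
    if has_persistent_disk or has_local_ssd:
        return ScratchPlan.SINGLE
    return ScratchPlan.FAIL


print(plan_scratch_volume(False, False).value)
```

So a pool definition for this image needs at least one secondary disk (persistent or local ssd), or the instance will not survive startup.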

I am currently sorting out what docker-worker is doing at startup and why it is shutting down the instances within 10s (there are tasks in the queue). Docker is running and un-tarring the docker-worker archives and gets aborted by the start-docker-worker script because the start-worker ends after 15s (the previous idle setting in the config ... so maybe the config is updated after startup?). It may just be something with the claimTask credentials.

docker-worker was attempting to get the aws metadata instead. so that's why the config didn't get updated:

 Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker 2022/04/26 17:10:53 Configuring with provider aws
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker 2022/04/26 17:10:53 Could not query user data: (Permanent) HTTP response code 404
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker HTTP/1.1 404 Not Found#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Content-Length: 1577#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Content-Type: text/html; charset=UTF-8#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Date: Tue, 26 Apr 2022 17:10:53 GMT#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Metadata-Flavor: Google#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Server: Metadata Server for VM#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker X-Frame-Options: SAMEORIGIN#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker X-Xss-Protection: 0#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker #015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker <!DOCTYPE html>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker <html lang=en>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <meta charset=utf-8>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <title>Error 404 (Not Found)!!1</title>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <style>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker     *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_col
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker or_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   </style>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <p><b>404.</b> <ins>That’s an error.</ins>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <p>The requested URL <code>/latest/user-data</code> was not found on this server.  <ins>That’s all we know.</ins>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Worker exited. Shutting down...
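
The 404 above fits with the two clouds exposing user data at different paths on the metadata address. A minimal sketch of the difference, using only the Python standard library: the aws path is the one from the log, the gce path and required header are the documented GCE metadata conventions, and the attribute name `taskcluster` and the function name are assumptions for illustration (this is not docker-worker's code).

```python
from urllib.request import Request

# aws-style user data path (the one the log shows being requested)
AWS_USER_DATA_URL = "http://169.254.169.254/latest/user-data"
# gce instance attributes live under computeMetadata/v1 instead
GCE_ATTRIBUTES_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"
)


def user_data_request(provider: str, attribute: str = "taskcluster") -> Request:
    """Build (but do not send) the metadata request for a provider.

    `attribute` is an assumed metadata key name, used for illustration only.
    """
    if provider == "aws":
        return Request(AWS_USER_DATA_URL)
    if provider == "google":
        # gce rejects metadata requests that lack this header
        return Request(
            GCE_ATTRIBUTES_URL + attribute,
            headers={"Metadata-Flavor": "Google"},
        )
    raise ValueError(f"unknown provider: {provider}")


print(user_data_request("aws").full_url)
print(user_data_request("google").full_url)
```

With the provider set to aws on a gce instance, the request reaches the gce metadata server (note the "Metadata-Flavor: Google" response header in the log), which has nothing at /latest/user-data.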

switched the on-image config to 'google' (https://github.com/taskcluster/taskcluster/blob/429371f0393080d47abb8dc865f8a3ec0357f6db/services/worker-manager/src/providers/index.js#L43)
with a startup-script like:
"value": "#!/usr/bin/awk BEGIN{print"a";system("sed\ -ibak\ s/aws/google/\ /etc/start-worker.yml;sleep\ 3600");print"e"}"
Then docker-worker pulled the userdata/metadata and set the workerConfig and started.
But, using worker-runner for secrets, it found none for the pool. So I had to set secrets for it to find.
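
For reference, the effect of the `sed s/aws/google/` in that startup-script can be reproduced on a string like this. The sample config content is hypothetical (the real /etc/start-worker.yml is not shown in this bug), and `switch_provider` is a made-up name.

```python
import re


def switch_provider(config_text: str) -> str:
    """Mimic `sed s/aws/google/`: replace the first "aws" on each line."""
    return "\n".join(
        re.sub("aws", "google", line, count=1)
        for line in config_text.split("\n")
    )


# Hypothetical config snippet, for illustration only.
sample = "provider:\n    providerType: aws\n"
print(switch_provider(sample))
```

The pattern is unanchored, so any line containing "aws" gets rewritten. That is fine for a temporary workaround, but a golden image should carry the right provider in the config directly.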

Now it runs, still not claimTask'ing but doing garbage collection and other metrics and then stopping with:
"Taskcluster Credentials are expiring in 30s; stopping worker"
The time is correct on the image+instances. Maybe there is an expired secret on the disk (but it would expire on instances in aws too?).

got tasks running yesterday:
https://firefox-ci-tc.services.mozilla.com/tasks/a1uXn38nThanO5SNcVALOw/runs/0

after extending the reregistrationTimeout so that a worker stop does not interrupt tasks,
tasks are succeeding:
https://firefox-ci-tc.services.mozilla.com/tasks/JhHnCIROR0OU5LMGDqHdmw/runs/5/logs/public/logs/live.log
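
A rough way to reason about the timeout, assuming the worker's credentials live for about the reregistrationTimeout and the worker stops 30s before they expire (the margin from the message above). The function name and the example numbers are illustrative, not taken from the pool config.

```python
def task_fits_credentials(
    task_seconds: int,
    credential_lifetime_seconds: int,
    stop_margin_seconds: int = 30,
) -> bool:
    """True if a task started right after registration can finish
    before the worker stops for expiring credentials."""
    return task_seconds <= credential_lifetime_seconds - stop_margin_seconds


# Illustrative numbers only: a 1 hour task against 15 minute and 4 day lifetimes.
print(task_fits_credentials(3600, 15 * 60))
print(task_fits_credentials(3600, 4 * 24 * 3600))
```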

also changed the logging to collect the cloudinit log+output so that we can debug that when needed without catching the serial console output (which is lost after the instance is deleted and at next boot) or ssh'ing into the machine.

:masterwayz is working on new pool definitions

once we've confirmed the video+audio loopbacks and other pieces are working correctly (tests green), I'll take a new golden image not requiring the current startup-script configuration changes

Attachment #9273857 - Attachment is obsolete: true
Pushed by mgoossens@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/c614e06b226f
Test temporarily fix for GCP testing image r=releng-reviewers,hneiva

I copied some tasks from gecko-3/t-linux-xlarge and they're running on this new image (applied the startup-script changes; no need for the startup-script now) in a pool on gcp.

Attachment #9274159 - Attachment is obsolete: true

5 tasks succeeded that I'd cloned from gecko-3/t-linux-xlarge. So the new image looks good to start with from 1 and 3.

Flags: needinfo?(mgoossens)
Flags: needinfo?(mgoossens)

Is there anything left to do here, or can we resolve this?

Flags: needinfo?(mgoossens)
Flags: needinfo?(dhouse)

I think we are done, but let's wait for dhouse to verify

Flags: needinfo?(mgoossens)

I agree this is done. For any adjustments or fixes, we can track those separately.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(dhouse)
Resolution: --- → FIXED