Closed Bug 1752375 Opened 2 years ago Closed 2 years ago

linux test image for gcp

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

Details

(Whiteboard: [relops-gcp])

Attachments

(3 files, 2 obsolete files)

last_successful_u1404-ami_import.log, 2 years ago, :dhouse, 39.53 KB, text/plain
draft_testing_gecko-t-gcp_pool-config.json, 2 years ago, :dhouse, 1.82 KB, text/plain
gecko-t-linux-xlarge.json provider:fxci-test-gcp, 2 years ago, :dhouse, 2.05 KB, text/plain
Bug 1752375 - Test temporarily fix for GCP testing image r=#releng-reviewers, 2 years ago, Michelle Goossens [:masterwayz], 48 bytes, text/x-phabricator-request
worker pool testing new d-w image from fxci-level3-gcp.json, 2 years ago, :dhouse, 1.68 KB, text/plain

:dhouse

Assignee

Description

•

2 years ago

+++ This bug was initially created as a clone of Bug #1752371 +++

It has been a while since we created the base test image for instances in GCP.

Jira Integration Bot

Updated

•

2 years ago

See Also: → https://mozilla-hub.atlassian.net/browse/RELOPS-72

:dhouse

Assignee

Updated

•

2 years ago

Blocks: 1752378

:dhouse

Assignee

Comment 1

•

2 years ago

comment (when not linked to two issues in jira)

:dhouse

Assignee

Comment 2

•

2 years ago

hmm, it is unassigned in Jira despite being assigned here. Maybe something needs to be updated on my user to match the account.

:dhouse

Assignee

Comment 3

•

2 years ago

comment 1 appeared in jira 1 minute later. that's faster than I expected

:dhouse

Assignee

Comment 4

•

2 years ago

and that last comment appeared within about a minute of the comment creation also :thumbsup:

:dhouse

Assignee

Comment 5

•

2 years ago

test removing assignee

Assignee: dhouse → relops

:dhouse

Assignee

Comment 6

•

2 years ago

and testing re-assigning to me

Assignee: relops → dhouse

:dhouse

Assignee

Comment 7

•

2 years ago

great. the integration adds a comment to the jira issue for the first removal (and who the assignee becomes)

:dhouse

Assignee

Comment 8

•

2 years ago

k, it shows the new assignee also, but it doesn't set the assignee on the issue

:dhouse

Assignee

Updated

•

2 years ago

Assignee: dhouse → mcornmesser

:dhouse

Assignee

Updated

•

2 years ago

Assignee: mcornmesser → dhouse

:dhouse

Assignee

Updated

•

2 years ago

Assignee: dhouse → relops

:dhouse

Assignee

Updated

•

2 years ago

Assignee: relops → dhouse

:dhouse

Assignee

Updated

•

2 years ago

Flags: needinfo?(dhouse)

:dhouse

Assignee

Updated

•

2 years ago

Group: mozilla-employee-confidential

Flags: needinfo?(dhouse)

:dhouse

Assignee

Updated

•

2 years ago

Summary: new linux test image for gcp → linux test image for gcp

:dhouse

Assignee

Comment 9

•

2 years ago

comment in secret

:dhouse

Assignee

Comment 10

•

2 years ago

comment in public (when making public)

Group: mozilla-employee-confidential

:dhouse

Assignee

Comment 11

•

2 years ago

The GCP bionic base image now also requires kdump-tools. I have not copied an image into a staging project yet, but the image build is working.

:dhouse

Assignee

Comment 12

•

2 years ago

I have a current-state image (latest at docker-firefoxci-gcp-l1-googlecompute-2022-03-31t20-06-40z) and I'm working through testing it.

kdump and v4l are the only required changes I have narrowed it down to.

:dhouse

Assignee

Comment 13

•

2 years ago

The build from monopacker had a broken video loopback despite the work I'd done: it compiles, but it doesn't work.

So over the last two weeks I worked on two alternatives, a fresh Ubuntu 14.04 install and a disk image taken from the AWS AMI:

install: My first attempts at 14.04 installs failed because Googlebot (GCE can download ISOs from URLs listed in a TSV) is blocked on the Ubuntu download and archive sites, and I didn't want to deal with uploading the ISO unless I had to.
I tried Ubuntu 16, but the video loopback install fails there, so I didn't try further.

disk image: I first tried the AMI import and hit many permissions blockers (and I could not decode the encoded import error messages, because that also ran into obscure permissions errors).
So then I tried a dd image copy: uploaded it to S3, then downloaded it and extracted and wrote it to a blank disk. With this, something was wrong with the partition table that prevented me from flagging the boot disk's partition as bootable. It should have been identical and already set bootable, but since it came from a VM (and not a fully virtualized one like qemu?), there may have been some magic missing for the partition table on a virtual disk in GCE.
So I tried the AMI import again. I had to retry this many times while adding and correcting the service user, roles, S3 bucket permissions, etc. Finally I got a good AMI export last week. But it failed on the first import boot (the GCE import adds keys etc.) because it could not get the correct apt repos set up to install the Google Cloud tools.
I brought up the AMI in AWS, manually added the repos, and installed the Google Cloud tools. Then I took a snapshot of this updated AMI and redid the import. After retrying this a number of times (it seems it always needs retrying because of timeouts etc. between the clouds), I got the export+import to work Friday/Saturday and brought up an instance, but I have been wrestling with startup issues for the last 4 days.

On first start:
docker-worker scratch volume mounting: if there is no secondary disk, startup fails and panics, and the instance is shut down. If a persistent disk is attached, it is formatted and mounted. If a local SSD is attached, it is formatted and mounted. If both are attached, they are combined into one scratch volume.
startup-script: Taskcluster does allow a metadata "userdata"/startup-script to be passed through from the worker-manager (https://github.com/taskcluster/taskcluster/blob/main/services/worker-manager/src/providers/google.js#L279), but the pool editor has formatting restrictions that prevent sending more than a single-line string.
start-docker-worker: appears to correctly get the workerConfig from the metadata.
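
The scratch-volume behaviour described above can be summarised as a small decision function. This is only a sketch of the observed behaviour, not docker-worker's actual code; the function name `plan_scratch_volume` and the device labels are made up for illustration.

```python
def plan_scratch_volume(has_persistent_disk: bool, has_local_ssd: bool) -> list[str]:
    """Return the devices that would be formatted and mounted as scratch space.

    Hypothetical helper: it mirrors the behaviour seen on first start,
    it is not taken from docker-worker.
    """
    devices = []
    if has_persistent_disk:
        devices.append("persistent-disk")
    if has_local_ssd:
        devices.append("local-ssd")
    if not devices:
        # Observed: with no secondary disk, startup fails and the instance shuts down.
        raise RuntimeError("no secondary disk attached; startup would fail")
    # With two entries, the devices are combined into a single scratch volume.
    return devices
```

So a pool definition needs at least one secondary disk (persistent or local SSD) for the worker to survive startup.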

I am currently sorting out what docker-worker is doing at startup and why it shuts down the instances within 10s (there are tasks in the queue). Docker is running and un-tarring the docker-worker archives, and it gets aborted by the start-docker-worker script because start-worker ends after 15s (the previous idle setting in the config ... so maybe the config is updated after startup?). It may just be something with the claimTask credentials.

:dhouse

Assignee

Comment 14

•

2 years ago

Attached file last_successful_u1404-ami_import.log

https://console.cloud.google.com/cloud-build/builds;region=us-central1/cd424a00-bd7b-4635-ba78-a17ea3ddbc1d?project=taskcluster-imaging

:dhouse

Assignee

Comment 15

•

2 years ago

Attached file draft_testing_gecko-t-gcp_pool-config.json (obsolete)

:dhouse

Assignee

Comment 16

•

2 years ago

docker-worker was attempting to fetch the AWS metadata instead, which is why the config didn't get updated:

 Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker 2022/04/26 17:10:53 Configuring with provider aws
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker 2022/04/26 17:10:53 Could not query user data: (Permanent) HTTP response code 404
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker HTTP/1.1 404 Not Found#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Content-Length: 1577#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Content-Type: text/html; charset=UTF-8#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Date: Tue, 26 Apr 2022 17:10:53 GMT#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Metadata-Flavor: Google#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Server: Metadata Server for VM#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker X-Frame-Options: SAMEORIGIN#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker X-Xss-Protection: 0#015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker #015
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker <!DOCTYPE html>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker <html lang=en>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <meta charset=utf-8>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <title>Error 404 (Not Found)!!1</title>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <style>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker     *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_col
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker or_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   </style>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <p><b>404.</b> <ins>That’s an error.</ins>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker   <p>The requested URL <code>/latest/user-data</code> was not found on this server.  <ins>That’s all we know.</ins>
Apr 26 17:10:53 gecko-t-linux-xlarge-f7qqmwqdtg6rtnj1cjnmfg docker-worker Worker exited. Shutting down...
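
The 404 response above carries `Metadata-Flavor: Google` and `Server: Metadata Server for VM`, so the request for the AWS-style path `/latest/user-data` was answered by the GCE metadata server. A minimal sketch of a check for that mismatch, assuming the response headers are available as a dict; the helper name is hypothetical, and this is not how docker-worker or worker-runner choose a provider:

```python
def answered_by_gce_metadata_server(headers: dict[str, str]) -> bool:
    """Return True if the response headers carry 'Metadata-Flavor: Google'."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    return normalised.get("metadata-flavor", "").lower() == "google"


# Headers copied from the 404 response in the log above.
response_headers = {
    "Metadata-Flavor": "Google",
    "Server": "Metadata Server for VM",
}
if answered_by_gce_metadata_server(response_headers):
    print("running on GCE, but the worker is configured with provider aws")
```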

I switched the on-image provider config to 'google' (https://github.com/taskcluster/taskcluster/blob/429371f0393080d47abb8dc865f8a3ec0357f6db/services/worker-manager/src/providers/index.js#L43)
with a startup-script like this:
"value": "#!/usr/bin/awk BEGIN{print"a";system("sed\ -ibak\ s/aws/google/\ /etc/start-worker.yml;sleep\ 3600");print"e"}"
Then docker-worker pulled the userdata/metadata, set the workerConfig, and started.
But since it uses worker-runner for secrets and there were none for the pool, I had to set secrets for it to find.
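
For reference, the `sed` substitution in that startup-script replaces the first `aws` on each line of /etc/start-worker.yml with `google` (and `-ibak` keeps the original as a backup with a `bak` suffix). The same edit as a Python sketch; the sample config text is an assumed shape for illustration only, not the real file contents:

```python
def switch_provider(config_text: str, old: str = "aws", new: str = "google") -> str:
    """Mimic sed 's/old/new/': replace only the first match on each line."""
    return "\n".join(line.replace(old, new, 1) for line in config_text.split("\n"))


# Hypothetical excerpt; the real /etc/start-worker.yml is not shown in this bug.
sample = "provider:\n  providerType: aws\n"
print(switch_provider(sample))
```

Because the substitution is a plain text match, it would also rewrite any other line that happens to contain `aws`, which is worth checking before baking the change into a golden image.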

Now it runs. It is still not calling claimTask, but it does garbage collection and reports other metrics, and then stops with:
"Taskcluster Credentials are expiring in 30s; stopping worker"
The time is correct on the image and the instances. Maybe there is an expired secret on the disk (but then it would have expired on instances in AWS too?).

:dhouse

Assignee

Comment 17

•

2 years ago

Attached file gecko-t-linux-xlarge.json provider:fxci-test-gcp (obsolete)

got tasks running yesterday:
https://firefox-ci-tc.services.mozilla.com/tasks/a1uXn38nThanO5SNcVALOw/runs/0

After extending the reregistrationTimeout so that a worker stop no longer interrupts tasks,
tasks are succeeding:
https://firefox-ci-tc.services.mozilla.com/tasks/JhHnCIROR0OU5LMGDqHdmw/runs/5/logs/public/logs/live.log

I also changed the logging to collect the cloud-init log and output, so that we can debug it when needed without having to catch the serial console output (which is lost after the instance is deleted and at the next boot) or SSH into the machine.

:masterwayz is working on new pool definitions

Once we've confirmed that the video+audio loopbacks and other pieces are working correctly (tests green), I'll take a new golden image that does not require the current startup-script configuration changes.

Attachment #9273857 - Attachment is obsolete: true

Michelle Goossens [:masterwayz]

Comment 18

•

2 years ago

Attached file Bug 1752375 - Test temporarily fix for GCP testing image r=#releng-reviewers

Pulsebot

Comment 19

•

2 years ago

Pushed by mgoossens@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/c614e06b226f
Test temporarily fix for GCP testing image r=releng-reviewers,hneiva

:dhouse

Assignee

Comment 20

•

2 years ago

Attached file worker pool testing new d-w image from fxci-level3-gcp.json

I copied some tasks from gecko-3/t-linux-xlarge and they're running on this new image (with the startup-script changes applied to the image, so the startup-script is no longer needed) in a pool on GCP.

Attachment #9274159 - Attachment is obsolete: true

:dhouse

Assignee

Comment 21

•

2 years ago

Five tasks that I'd cloned from gecko-3/t-linux-xlarge succeeded. So the new image looks good to start with from 1 and 3.

Flags: needinfo?(mgoossens)

Michelle Goossens [:masterwayz]

Updated

•

2 years ago

Blocks: 1772714

Michelle Goossens [:masterwayz]

Comment 22

•

2 years ago

See https://bugzilla.mozilla.org/show_bug.cgi?id=1772714#c1

Michelle Goossens [:masterwayz]

Updated

•

2 years ago

Flags: needinfo?(mgoossens)

Andrew Halberstadt [:ahal]

Comment 23

•

2 years ago

Is there anything left to do here, or can we resolve this?

Flags: needinfo?(mgoossens)

Flags: needinfo?(dhouse)

Michelle Goossens [:masterwayz]

Comment 24

•

2 years ago

I think we are done, but let's wait for dhouse to verify

Flags: needinfo?(mgoossens)

:dhouse

Assignee

Comment 25

•

2 years ago

I agree this is done. For any adjustments or fixes, we can track those separately.

Status: NEW → RESOLVED

Closed: 2 years ago

Flags: needinfo?(dhouse)

Resolution: --- → FIXED
