Bug 1291249 (Closed) - Opened 8 years ago, Closed 8 years ago

Logs uploaded by generic-worker to public-artifacts.taskcluster.net aren't using gzip

Categories: Taskcluster :: Workers
Type: defect
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: emorley, Assigned: pmoore

Attachments: 2 files

Compare:

1) curl -IL "https://queue.taskcluster.net/v1/task/Kf6jXsCGRGqu9N-6QLmhdA/runs/0/artifacts/public%2Flogs%2Flive_backing.log"
2) curl -IL "https://queue.taskcluster.net/v1/task/BNw5W2zzQdifz6iQS0Hc8g/runs/0/artifacts/public%2Flogs%2Flive_backing.log"

For both, the initial redirect (presumably served by cloud-mirror) is identical (and points to public-artifacts.taskcluster.net).

However the response from public-artifacts.taskcluster.net differs between the two:

For (1):
 - Content-Type: text/plain; charset=utf-8
 - Content-Encoding: gzip
 - Server: AmazonS3
 - X-Cache: Miss from cloudfront
 - Via: 1.1 0e74db3db7a48d7f6e7f64382ca5c629.cloudfront.net (CloudFront)
 - x-amz-version-id: TE5s3plkjKgBZb2FRdukrukeN940HqDg
 - X-Amz-Cf-Id: nAnd-_UToIJvatX0HRIPa6-kUru-aCE8RXwjwsBG5S9nQR7LlUboNw==

For (2):
 - Content-Type: text/plain
 - (No `Content-Encoding` header)
 - Server: AmazonS3
 - X-Cache: Miss from cloudfront
 - Via: 1.1 4222b2a73c8078ae05f5cfa25b5cd0ab.cloudfront.net (CloudFront)
 - x-amz-version-id: clGXtTUi3qcrhVzN8H2RzKxyCyz0wI3P
 - X-Amz-Cf-Id: E0QD4YOLEaD66JBz9KHPOzpUceR4xALaSlxZAN8VmxP6rk_0lmzTRg==

i.e. for the 2nd, the response isn't gzipped.

Are these coming from separate S3 buckets, where one has gzip enabled and the other doesn't?

The problem is that this makes transfers slower, and it currently breaks the Treeherder log parser, since it assumes the response will be gzipped. (That's something we could fix on our side, e.g. along the lines of the sketch below, though then we wouldn't spot gzip regressions, so I'm on the fence as to whether we should.)
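Purely for illustration, here is a minimal sketch of that defensive option: check Content-Encoding and only gunzip when the artifact is actually compressed. It's written in Go rather than Treeherder's actual Python, and the function name and structure are my own assumptions, not Treeherder code.

  package main

  import (
      "compress/gzip"
      "fmt"
      "io"
      "log"
      "net/http"
  )

  // fetchLog downloads an artifact and returns its decoded bytes,
  // whether or not it was stored with Content-Encoding: gzip.
  func fetchLog(url string) ([]byte, error) {
      req, err := http.NewRequest(http.MethodGet, url, nil)
      if err != nil {
          return nil, err
      }
      // Setting Accept-Encoding ourselves disables Go's transparent
      // decompression, so we can inspect Content-Encoding below.
      req.Header.Set("Accept-Encoding", "gzip")

      resp, err := http.DefaultClient.Do(req)
      if err != nil {
          return nil, err
      }
      defer resp.Body.Close()

      var body io.Reader = resp.Body
      if resp.Header.Get("Content-Encoding") == "gzip" {
          gz, err := gzip.NewReader(resp.Body)
          if err != nil {
              return nil, err
          }
          defer gz.Close()
          body = gz
      }
      return io.ReadAll(body)
  }

  func main() {
      // Task (2) from the comparison above (currently served without gzip).
      data, err := fetchLog("https://queue.taskcluster.net/v1/task/BNw5W2zzQdifz6iQS0Hc8g/runs/0/artifacts/public%2Flogs%2Flive_backing.log")
      if err != nil {
          log.Fatal(err)
      }
      fmt.Printf("fetched %d bytes of decoded log\n", len(data))
  }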

Many thanks :-)
Flags: needinfo?(jhford)
Flags: needinfo?(garndt)
It appears the difference here is that #1 is a docker-worker task, which gzips the logs, and #2 is a generic-worker task, which does not.

Pete, is there any reason we do not gzip the logs that generic worker uploads?
Flags: needinfo?(garndt) → needinfo?(pmoore)
Ah thank you (I didn't even know there were two different types of workers).
Component: Integration → Generic-Worker
Summary: Some logs served from public-artifacts.taskcluster.net aren't using gzip → Logs uploaded by generic-worker public-artifacts.taskcluster.net aren't using gzip
Summary: Logs uploaded by generic-worker public-artifacts.taskcluster.net aren't using gzip → Logs uploaded by generic-worker to public-artifacts.taskcluster.net aren't using gzip
Whoops! I didn't realise I needed to gzip - let me create a bug for that...
Flags: needinfo?(pmoore)
ah, let's just use this bug....
Assignee: nobody → pmoore
Attachment #8777374 - Flags: review?(garndt)
Flags: needinfo?(jhford)
Comment on attachment 8777374 [details] [review]
Github Pull Request for generic-worker

Looks good. I added some comments, but nothing I think should block this (although the one log handle reference should be updated for the case where the live log can't be created).
Attachment #8777374 - Flags: review?(garndt) → review+
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/a6a4df0c928f57db883ff6a887e44dcf8d52f05a
Bug 1291249 - gzip log artifacts and set Content-Encoding in PUT to S3

https://github.com/taskcluster/generic-worker/commit/62632d655076d0b5160398c77033f1414780c4c3
Merge pull request #16 from taskcluster/bug1291249

Bug 1291249 - gzip log artifacts and set Content-Encoding in PUT to S3
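For anyone reading along, a rough sketch of the idea behind the fix: gzip the log before upload and set Content-Encoding on the S3 PUT, so S3/CloudFront serve it back with that header. generic-worker is written in Go, but the package/function names, the pre-signed URL parameter and the error handling below are illustrative assumptions, not the actual generic-worker code.

  // Sketch: gzip a log file and PUT it to a pre-signed S3 URL with
  // Content-Encoding: gzip, so downloads are served with that header.
  package artifacts

  import (
      "bytes"
      "compress/gzip"
      "fmt"
      "net/http"
      "os"
  )

  func uploadGzippedLog(logPath, signedPutURL string) error {
      raw, err := os.ReadFile(logPath)
      if err != nil {
          return err
      }

      // Compress the log contents in memory.
      var buf bytes.Buffer
      gz := gzip.NewWriter(&buf)
      if _, err := gz.Write(raw); err != nil {
          return err
      }
      if err := gz.Close(); err != nil {
          return err
      }

      // Declare the encoding on the PUT so S3/CloudFront echo it back
      // as a Content-Encoding response header on download.
      req, err := http.NewRequest(http.MethodPut, signedPutURL, &buf)
      if err != nil {
          return err
      }
      req.Header.Set("Content-Type", "text/plain; charset=utf-8")
      req.Header.Set("Content-Encoding", "gzip")

      resp, err := http.DefaultClient.Do(req)
      if err != nil {
          return err
      }
      defer resp.Body.Close()
      if resp.StatusCode != http.StatusOK {
          return fmt.Errorf("PUT %s: unexpected status %s", logPath, resp.Status)
      }
      return nil
  }

In the real worker the pre-signed PUT URL comes from the queue when the artifact is declared, and (assuming those headers are part of the signature) the Content-Type/Content-Encoding headers need to match what is actually sent.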
This updates the following manifests:

  * userdata/Manifest/win10.json
  * userdata/Manifest/win2012.json
  * userdata/Manifest/win7.json

@grenade - will this also update worker type gecko-1-b-win2012?

Not sure if we need to do something else to get that worker type in this repo?
Attachment #8778345 - Flags: review?(rthijssen)
Rob, do you know when you might be able to look at this review?
This change is needed to fix the log parsing errors we're seeing on Treeherder.

Many thanks :-)
Flags: needinfo?(rthijssen)
Hi Amy - I don't suppose you know if Rob's on vacation at the moment, and if so when he'll be back? Or alternatively if there's someone else that can take this review? (It's been open 10 days)

Many thanks :-)
Flags: needinfo?(arich)
rob is back as of today.
Flags: needinfo?(arich)
Attachment #8778345 - Flags: review?(rthijssen) → review+
Thank you for merging.

What are the next steps for getting this deployed? (I'm still seeing exceptions in production)

Thanks!
Flags: needinfo?(rthijssen)
the merge results in an auto deploy (a tc github job is triggered by the merge).
the amis for gecko-1-b-win2012 and gecko-1-t-win7-32 were updated here:
https://tools.taskcluster.net/task-group-inspector/#PaQfiwpxRz-a7isqfVDH8w/

since we're still getting exceptions, maybe i can take a look at the errors and the generic-worker logs in pete's absence and see if i can spot something? there are also some non-gecko windows workers (win2012r2, nss-win2012r2) that aren't auto-updated so it would be good to know where we see exceptions.
Flags: needinfo?(rthijssen)
A recent log that is causing exceptions:

$ curl -sSIL https://queue.taskcluster.net/v1/task/1u02dPbhRxSGDSdnagRhQA/runs/0/artifacts/public%2Flogs%2Flive_backing.log | grep 'Content-Encoding'
$

One that is fine:

$ curl -sSIL https://queue.taskcluster.net/v1/task/BnVu6P_TSvWCvGG2YQaEww/runs/0/artifacts/public%2Flogs%2Flive_backing.log | grep 'Content-Encoding'
Content-Encoding: gzip
$
ok, that's helpful.

it looks like that first log is using worker type nss-win2012 which is still on generic-worker 5.0.0 (shown at the beginning of the log).

this one, using worker type gecko-1-b-win2012 and g-w 5.1.0, appears to be correctly gzipped: https://public-artifacts.taskcluster.net/EV5pj7XdSxmCy4mo9mp-BQ/0/public/logs/live_backing.log

i believe either ttaubert or pmoore runs a manual script to update the win2012r2 and nss-win2012r2 worker type amis.
Pete, when you're back from PTO could you update the nss-win2012 AMI (and any others) to use the new generic-worker release? If we could at some point automate those, that would be awesome too :-)
Flags: needinfo?(pmoore)
We have potentially four options to consider (that I can think of):

1) Release a new nss-win2012r2 worker type that uses generic worker v5.1.0 or later (create new AMIs)
2) Point nss-win2012r2 worker type to use the same AMIs as gecko-1-b-win2012 worker type
3) Update nss-win2012r2 to automatically be in sync with gecko-1-b-win2012 worker type (e.g. add to OpenCloudConfig)
4) Switch nss to use gecko-1-b-win2012 worker type

The choice depends on how flexible we want to be when rolling out changes in future.

1) means the nss worker type is independent from gecko, and changes made to gecko won't impact nss tasks
2) is a way to share the AMI set up, but have separate worker types, so nss workers are in a separate pool to gecko workers, so no caches are shared etc
3) this is like 2) but means changes will automatically get rolled out, when those AMI changes are made
4) means we're literally using the same workers, so a worker could run both nss jobs and gecko jobs, and potentially caches are shared

I lean towards option 1) or 3). 3) is a little bit scary in that somebody might update something for gecko and forget it impacts nss, but on the plus side it is entirely automated. 1) carries the risk that nss and gecko slowly fork, so maintenance could become more work. This would essentially move the ownership/maintenance of the worker type over to the RelOps team.

Let's hold off on this until ttaubert returns, as it is quite an important decision that will fundamentally affect maintenance/operations/ownership.
Flags: needinfo?(pmoore) → needinfo?(ttaubert)
Number 3 sounds like the best option to me. Firefox will always also have to build NSS and so I think it's very unlikely that the AMI would be modified such that NSS would break. Especially because we use tooltool for the toolchain, so that's separate.
Flags: needinfo?(ttaubert)
Depends on: 1298864
This will need a bit of work in OpenCloudConfig, created bug 1298864 for this.
Tim, if you want to test against the existing gecko worker type AMIs produced by OpenCloudConfig, I can manually update the nss-win2012r2 worker type to point to those AMIs. When would be a good time to try out this change, when there is low traffic?

I'm thinking this will be easier than setting up a separate staging worker type just for testing, as that would require various roles being added and/or modified.
Flags: needinfo?(ttaubert)
(In reply to Pete Moore [:pmoore][:pete] from comment #21)
> Tim, if you want to test against the existing gecko worker type AMIs
> produced by OpenCloudConfig, I can manually update the nss-win2012r2 worker
> type to point to those AMIs. When would be a good time to try out this
> change, when there is low traffic?

Traffic is quite low at the moment, feel free to give this a try today or tomorrow.
Flags: needinfo?(ttaubert)
Do you know when you might get a chance to update the AMIs? We're still seeing errors due to this on Treeherder.
Flags: needinfo?(pmoore)
Would it be ok if I rolled this out in a couple of weeks? I'm in the midst of adding scope-protected caches and prepopulated folders to the worker, and it might be nice to roll everything out in one go. If it's urgent though, I can set some time aside to roll out twice.
Flags: needinfo?(pmoore)
At the moment none of the NSS Windows logs are being parsed by Treeherder as a result of this, though if no-one is using them, perhaps it doesn't matter short term.

Other than that, it's just added noise in New Relic which would be good to clean up (since we can then lower our error rate thresholds to something that will actually catch real issues), but a week or two more isn't a deal-breaker.

Longer term we need a better plan for automating these - a 6-week turnaround for infra issues that break ingestion is pretty long.

Thank you for your help with this anyway :-)
I've completed all my testing, and am planning to roll this out tomorrow morning, local time (i.e. in around 14 hours from now).

Sorry for the delay!
Sounds great, thank you! :-)
It looks like the new worker is causing some bustage, see:

https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=9972938c3bd408857da5cabb50e00d550aa5115c

I've spoken to Franziskus, and he says it is ok for me to perform some troubleshooting (no immediate requirement to have the builds passing)...

Looking into it now...
I found the problem.

https://github.com/taskcluster/generic-worker/blob/8f7b5dce6cd863cd63bb3408e71c78c1720cfc98/worker_types/nss-win2012r2/userdata#L126

tries to install virtualenv 15.1.0 - however the latest available version is 15.0.3, so the whole upgrade command fails. As a result the pypiwin32 installation did not take place, and the win32 package isn't available to Python.

I'll fix this and roll some new AMIs out...

Pete
The userdata is now fixed:

  https://github.com/taskcluster/generic-worker/commit/b88b1a8c540f1ea7000db8334a909d55e2862247

Creating new AMIs now to include this fix...
I also rolled back the worker type definition to the previous working version, and terminated all the bad instances - so tasks should be working again. When the new AMIs are ready, I'll update the worker type definition again, and check that all is ok.
The new AMIs have been created, and the nss-win2012r2 worker type has been updated. However, there are still some instances running with the old AMI. I propose we leave things as they are, and let the tasks naturally roll over to the new worker type.

I'll keep my eye on nss-try today, once some new workers have been provisioned.

Note - workers will stay alive until they don't claim a task for an hour - at which point they will terminate themselves. Therefore if there is a constant flow of jobs, it may be that the old workers stay around for quite a while.

The provisioner shows which AMI a worker is created from here:
https://tools.taskcluster.net/aws-provisioner/#nss-win2012r2/resources

The new AMIs are:

us-east-1: ami-b387fca4
us-west-1: ami-9f3779ff
us-west-2: ami-a14d91c1

The other way to tell is that the first few lines of the task log will report generic-worker version 5.3.1 if the worker is running with the new AMI.
(In reply to Pete Moore [:pmoore][:pete] from comment #33)
> Try push to test:
> https://treeherder.mozilla.org/#/jobs?repo=nss-
> try&revision=f8ab952ad38d2b0caa885d4605777e6738619884
> 
> All workers are now on one of the new AMIs...

Third time lucky?

https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=38a1fd6433133edf2e2172b00cba8e9284e9aecd

The previous try push was accidentally made against a different NSS branch, and wasn't enabled in taskcluster! Whoops! :-)

Watching the new try push...
\o/ curl -L 'https://queue.taskcluster.net/v1/task/ARAMIxXcQ-CD7y4_oldVww/runs/0/artifacts/public%2Flogs%2Flive_backing.log' | gunzip
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Many thanks!
Component: Generic-Worker → Workers