Closed Bug 1638921 Opened 5 years ago Closed 4 years ago

In Firefox CI, expireartifacts, expiretask, and expiretaskdependency are not completing

Categories

(Taskcluster :: Services, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: brian, Assigned: dustin)

References

Details

Attachments

(2 files)

expireartifacts, expiretask, and expiretaskdependency all run daily, and k8s is configured to kill them if they take longer than a day. In the Firefox CI deployment they have been hitting that limit since 2020-04-17, which is one day before we switched from Azure storage to Postgres.

expiretask, expiretaskdependency logs have the error

 "ResourceNotFoundError: Resource not found
    at makeError (/app/libraries/entities/src/index.js:61:15)
    at make404 (/app/libraries/entities/src/index.js:68:23)
    at subClass.remove (/app/libraries/entities/src/index.js:150:13)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async Promise.all (index 0)
    at async handleResults (/app/libraries/entities/src/index.js:499:9)
    at async handleResults (/app/libraries/entities/src/index.js:502:18)
    at async handleResults (/app/libraries/entities/src/index.js:502:18)
    at async handleResults (/app/libraries/entities/src/index.js:502:18)
    at async Function.scan (/app/libraries/entities/src/index.js:505:17)
    at async Function.TaskDependency.expire (/app/services/queue/src/data.js:641:3)
    at async /app/services/queue/src/main.js:531:23
    at async Monitor.oneShot (/app/libraries/monitor/src/monitor.js:191:7)"   

I don't see any errors in the expireartifacts logs.

See Also: → 1640597
Assignee: nobody → dustin

The pagination we use for expiration is based on limit/offset, but all of these tables are fast-moving. So basically they're running:

  • select * from tasks where expires < somedate offset 0 limit 1000
  • delete each of those individually: delete from tasks where partition_key=$1
  • go on to the next page: select * from tasks where expires < somedate offset 1000 limit 1000

The problem is that the offset in the third step skips 1000 expired tasks, since the first batch of 1000 is no longer in the DB.

The catch is that the failure mode there would be tasks living on past their expiration date while the expiration crontasks run quickly to completion -- which is the opposite of what we're seeing in this bug. (Note that it is what we're seeing in https://github.com/taskcluster/taskcluster/issues/3034!)

I can't imagine why task.remove(true) would be getting a 404. This is the only place that tasks are deleted, and scan() can't return rows that didn't exist when it was called. So the only thing I can think of is that this error occurs when two crontasks run at once.

It appears that k8s is, indeed, running multiple copies of this crontask at the same time, judging by the pod names.

[there was an image here but bugzilla eated it]

Also, ConcurrencyPolicy is set to Allow.

This is pretty wacky -- there's a Pod, taskcluster-queue-expiretask-1591315200-qdfmf, created Jun 7, 2020, 7:29:12 PM local time, or June 7 at 23:29:12 GMT. These cron tasks are set to start jobs at midnight UTC, so this job either jumped the gun by 31 minutes or was 23.5 hours late. Its controller is taskcluster-queue-expiretask-1591315200, and deciphering that timestamp gives June 5 00:00:00, so apparently the pod was started 47.5 hours late? And, that pod is still running today on the 15th!

So overall, I am guessing that something happened in April that caused a pod to hang, or just take a long time, and that since that time k8s has been happily scheduling all the pods it can, all of which have just been erroring out.

So, fixes (sketched below) probably include:

  • ConcurrencyPolicy: Forbid
  • RestartPolicy: Never
  • reduce the activeDeadlineSeconds to a minute shy of 24 hours just so there's time to kill the pod if it does run the full duration
  • kill all of the running jobs with fire
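
For concreteness, here's a rough sketch of what those CronJob settings might look like in a manifest -- the name, image, args, and schedule are illustrative, not the actual Helm-rendered config:

apiVersion: batch/v1beta1                  # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: taskcluster-queue-expiretask       # illustrative name
spec:
  schedule: "0 0 * * *"                    # midnight UTC
  concurrencyPolicy: Forbid                # never start a new Job while one is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never             # a killed pod is not respawned
          activeDeadlineSeconds: 86340     # a minute shy of 24 hours
          containers:
            - name: expiretask
              image: taskcluster/taskcluster    # illustrative
              args: ["queue/expire-task"]       # illustrative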

Brian, can we try the last step now, to see if it helps? Also, it might be good to run a quick query to see how many tasks are past their expiration date: select date_trunc('day', expires) as expires_day, count(*) from tasks where expires < now() group by date_trunc('day', expires).

If that number is large -- and I bet it is -- then we may need to implement the other fixes in order to successfully start bringing the number down.

Flags: needinfo?(bpitts)

This is interesting and forcing me to dig into some bits of k8s I haven't looked at much before.

I think the fix you did should work but may not be the best one.

I believe the primary issue with concurrent execution is that you are specifying activeDeadlineSeconds in the pod spec instead of in the job spec. This means that if the pod has not finished within a day it will be killed, but there's no actual limit on the Job's duration. Thus the following happens:

  • The CronJob starts Job 1.
  • Job 1 starts Pod A.

24 hours later:

  • Pod A is killed for exceeding activeDeadlineSeconds. Since it had RestartPolicy: OnFailure, it is restarted as Pod B.
  • The CronJob starts Job 2, even though Job 1 is still running, because of concurrencyPolicy: Allow.
  • Job 2 starts Pod C.

So at this point we have two Jobs and two containers running, and if the Pods take longer than 48 hours we'll end up with 3 jobs and 3 containers, and so on.
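
In manifest terms, this is roughly the shape of the problematic configuration (a sketch; the field placement is the point, the values are assumed):

spec:
  schedule: "0 0 * * *"
  concurrencyPolicy: Allow               # new Jobs start even if the previous one is still running
  jobTemplate:
    spec:                                # JobSpec: no activeDeadlineSeconds here,
      template:                          # so the Job itself never times out
        spec:
          restartPolicy: OnFailure       # killed pods are replaced
          activeDeadlineSeconds: 86400   # only each individual pod is bounded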

By setting the pod spec to RestartPolicy: Never and making activeDeadlineSeconds slightly shorter than a day, you have prevented this scenario from happening. Now the pod will be killed and not restarted, and the Job will be marked as failed before the CronJob tries to start a new Job. By setting the CronJob's concurrencyPolicy: Forbid you've added extra assurance that there won't be concurrent runs, by preventing scheduling of new Jobs if the old one is somehow still active.

However, I think setting RestartPolicy: Never in the pod spec isn't the best idea, since it means that if something happens to interrupt the Job (e.g. a node failure), it will not resume until the next scheduled run. Instead, I would set it back to RestartPolicy: OnFailure but move activeDeadlineSeconds from the pod spec to the job spec. Then the pod can be restarted within the activeDeadlineSeconds window, but at the end of that window any pod still running is killed and the Job is marked failed.
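
In other words (again just a sketch, same caveats as above), move the deadline up one level so it bounds the whole Job rather than each pod:

spec:
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      activeDeadlineSeconds: 86340       # Job-level: bounds the whole run, including retries
      template:
        spec:
          restartPolicy: OnFailure       # pods may be retried within that window

That way a node failure mid-run just means a replacement pod, while a genuinely stuck run is still killed and marked failed before the next scheduled Job.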

bpitts@cloudshell:~ (moz-fx-taskcluster-prod-4b87)$ kubectl get jobs --namespace=firefoxcitc-taskcluster | grep '0/1'
taskcluster-queue-expireartifacts-1591660800                     0/1           7d14h      7d14h
taskcluster-queue-expireartifacts-1591747200                     0/1           6d14h      6d14h
taskcluster-queue-expireartifacts-1591833600                     0/1           5d14h      5d14h
taskcluster-queue-expireartifacts-1591920000                     0/1           4d14h      4d14h
taskcluster-queue-expireartifacts-1592006400                     0/1           3d14h      3d14h
taskcluster-queue-expireartifacts-1592092800                     0/1           2d14h      2d14h
taskcluster-queue-expireartifacts-1592179200                     0/1           38h        38h
taskcluster-queue-expireartifacts-1592265600                     0/1           14h        14h
taskcluster-queue-expiretask-1592092800                          0/1           2d14h      2d14h
taskcluster-queue-expiretask-1592179200                          0/1           38h        38h
taskcluster-queue-expiretask-1592265600                          0/1           14h        14h
taskcluster-queue-expiretaskdependency-1591920000                0/1           4d14h      4d14h
taskcluster-queue-expiretaskdependency-1592092800                0/1           2d14h      2d14h
taskcluster-queue-expiretaskdependency-1592179200                0/1           38h        38h
taskcluster-queue-expiretaskdependency-1592265600                0/1           14h        14h
taskcluster-queue-expiretaskgroupmembers-1592092800              0/1           2d14h      2d14h
taskcluster-queue-expiretaskgroupmembers-1592265600              0/1           14h        14h

I've deleted all those jobs that were running more than 24 hours.

Dustin, want to take another look at that query? There is no tasks table.

Flags: needinfo?(bpitts)

Sorry, that should have been queue_tasks_entities.

I'll read the rest of your message after the plenary but thank you for the in-depth analysis!

OK, actually tested:

dustindev=> select date_trunc('day', (value ->> 'expires')::timestamp) as expires_day, count(*) from queue_tasks_entities group by date_trunc('day', (value ->> 'expires')::timestamp);
     expires_day     | count 
---------------------+-------
 2020-07-31 00:00:00 |     3
 2020-08-04 00:00:00 |     1
 2020-11-12 00:00:00 |     1
 2020-11-23 00:00:00 |     2
 2021-04-10 00:00:00 |     1
 2021-04-17 00:00:00 |     6
 2021-04-20 00:00:00 |     1
 2021-06-04 00:00:00 |     1
(8 rows)
Attached file task_expires_days.txt

These fixes are being deployed in 31.0.0, so let's see if they help. If not, the postgres phase-2 work will help!

See Also: → 1650411

These fixes are now on firefox-ci-tc.

I made another pass at killing jobs after the latest deploy.

Are these crontasks running to completion now?

I think your viewer access in the GCP project lets you see the cluster workloads.

taskcluster-queue-expiretask is still exceeding the deadline, but the others are completing.

OK, let's check back after the TCW this weekend (as it rewrites all of this stuff) and revisit if it's still an issue. If you have a cloud SQL window open and can repeat comment 8 that'd be great, but mostly for curiosity so don't go out of your way.

I fired off the query, but it takes a while.

I'm optimistic things will be better in firefox-ci-tc after the TCW changes, because they made a dramatic difference in the runtime of expiretask and expireartifacts in community-tc. Looks like expiretask there went from 4 hours to 4 minutes.

Interesting, still 59k expired tasks! I'm glad to hear the task now runs a lot more quickly.

I'll be interested to see how this is doing in v35.

Brian, while I'm waiting for the cloud shell to initialize, can you point me to how to see the timelines for historical runs of these cron jobs?

If a task completed and emitted a monitor.periodic log, we have metrics for it.

https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals?viewPanel=15&orgId=1&from=now-6h&to=now

https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals?viewPanel=16&orgId=1&from=now-6h&to=now

If you need info on jobs that aren't logging, kubectl should (somehow) be able to get you details on when the pods for a cronjob started and stopped. You can also see it in the GKE console if you drill down enough.

From those graphs, the times all look to be in minutes, so I think we're OK.

It's curious that the times are all in multiples of 251.7 seconds. That seems unlikely for the job itself -- is that some kind of quantization artifact of the analysis?

...but that means this bug is (finally) done :)

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED