Closed Bug 1638921 Opened 5 years ago Closed 4 years ago

In Firefox CI, expireartifacts, expiretask, and expiretaskdependency are not completing

Categories

(Taskcluster :: Services, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: brian, Assigned: dustin)

References

Details

Attachments

(2 files)

expireartifacts, expiretask, and expiretaskdependency all run daily, and k8s is configured to kill them if they take longer than a day. In the Firefox CI deployment they have been hitting that limit since 2020-04-17, which is one day before we switched from Azure storage to Postgres.

expiretask, expiretaskdependency logs have the error

 "ResourceNotFoundError: Resource not found
    at makeError (/app/libraries/entities/src/index.js:61:15)
    at make404 (/app/libraries/entities/src/index.js:68:23)
    at subClass.remove (/app/libraries/entities/src/index.js:150:13)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async Promise.all (index 0)
    at async handleResults (/app/libraries/entities/src/index.js:499:9)
    at async handleResults (/app/libraries/entities/src/index.js:502:18)
    at async handleResults (/app/libraries/entities/src/index.js:502:18)
    at async handleResults (/app/libraries/entities/src/index.js:502:18)
    at async Function.scan (/app/libraries/entities/src/index.js:505:17)
    at async Function.TaskDependency.expire (/app/services/queue/src/data.js:641:3)
    at async /app/services/queue/src/main.js:531:23
    at async Monitor.oneShot (/app/libraries/monitor/src/monitor.js:191:7)"   

I don't see any errors in the expireartifacts logs.

See Also: → 1640597
Assignee: nobody → dustin

The pagination we use for expiration is based on limit/offset, but all of these tables are fast-moving. So basically they're running:

  • select * from tasks where expires < somedate offset 0 limit 1000
  • delete each of those individually: delete from tasks where partition_key=$1
  • go on to the next page: select * from tasks where expires < somedate offset 1000 limit 1000

The problem is that the offset in the third step skips 1000 expired tasks, since the first batch of 1000 is no longer in the DB.

The catch is that the failure mode there would be tasks living on past their expiration date while the expiration crontasks run quickly to completion -- which is the opposite of what we're seeing in this bug. (Note that it is what we're seeing in https://github.com/taskcluster/taskcluster/issues/3034!)

I can't imagine why task.remove(true) would be getting a 404. This is the only place that tasks are deleted, and scan() can't return rows that didn't exist when it was called. So the only thing I can think of is that this error occurs when two crontasks run at once.

It appears that k8s is, indeed, running multiple copies of this crontask at the same time, judging by the pod names.

[there was an image here but bugzilla eated it]

Also, ConcurrencyPolicy is set to Allow.

This is pretty wacky -- there's a Pod, taskcluster-queue-expiretask-1591315200-qdfmf, created Jun 7, 2020, 7:29:12 PM local time, or June 7 at 23:29:12 GMT. These cron tasks are set to start jobs at midnight UTC, so this job either jumped the gun by 31 minutes or was 23.5 hours late. Its controller is taskcluster-queue-expiretask-1591315200, and deciphering that timestamp gives June 5 00:00:00, so apparently the pod was started 47.5 hours late? And, that pod is still running today on the 15th!

So overall, I am guessing that something happened in April that caused a pod to hang, or just take a long time, and that since that time k8s has been happily scheduling all the pods it can, all of which have just been erroring out.

So, fixes (sketched below) probably include:

  • ConcurrencyPolicy: Forbid
  • RestartPolicy: Never
  • reduce the activeDeadlineSeconds to a minute shy of 24 hours just so there's time to kill the pod if it does run the full duration
  • kill all of the running jobs with fire
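
For concreteness, here's a rough sketch of what those CronJob settings might look like in a manifest -- the name, image, args, and schedule are illustrative, not the actual Helm-rendered config:

apiVersion: batch/v1beta1                  # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: taskcluster-queue-expiretask       # illustrative name
spec:
  schedule: "0 0 * * *"                    # midnight UTC
  concurrencyPolicy: Forbid                # never start a new Job while one is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never             # a killed pod is not respawned
          activeDeadlineSeconds: 86340     # a minute shy of 24 hours
          containers:
            - name: expiretask
              image: taskcluster/taskcluster    # illustrative
              args: ["queue/expire-task"]       # illustrative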

Brian, can we try the last step now, to see if it helps? Also, it might be good to run a quick query to see how many tasks are past their expiration date: select date_trunc('day', expires) as expires_day, count(*) from tasks where expires < now() group by date_trunc('day', expires).

If that number is large -- and I bet it is -- then we may need to implement the other fixes in order to successfully start bringing the number down.

Flags: needinfo?(bpitts)

This is interesting and forcing me to dig into some bits of k8s I haven't looked at much before.

I think the fix you did should work but may not be the best one.

I believe the primary issue with concurrent execution is that you are specifying activeDeadlineSeconds in the pod spec instead of in the job spec. This means that if the pod has not finished within a day it will be killed, but there's no actual limit on the Job's duration. Thus the following happens:

  • The CronJob starts Job 1.
  • Job 1 starts Pod A.

24 hours later:

  • Pod A is killed for exceeding activeDeadlineSeconds. Since it had RestartPolicy: OnFailure, it is restarted as Pod B.
  • The CronJob starts Job 2, even though Job 1 is still running, because of concurrencyPolicy: Allow.
  • Job 2 starts Pod C.

So at this point we have two Jobs and two containers running, and if the Pods take longer than 48 hours we'll end up with 3 jobs and 3 containers, and so on.
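
In manifest terms, this is roughly the shape of the problematic configuration (a sketch; the field placement is the point, the values are assumed):

spec:
  schedule: "0 0 * * *"
  concurrencyPolicy: Allow               # new Jobs start even if the previous one is still running
  jobTemplate:
    spec:                                # JobSpec: no activeDeadlineSeconds here,
      template:                          # so the Job itself never times out
        spec:
          restartPolicy: OnFailure       # killed pods are replaced
          activeDeadlineSeconds: 86400   # only each individual pod is bounded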

By setting the pod spec to RestartPolicy: Never and making activeDeadlineSeconds slightly shorter than a day, you have prevented this scenario from happening. Now the pod will be killed and not restarted, and the Job will be marked as failed before the CronJob tries to start a new Job. By setting the CronJob's concurrencyPolicy: Forbid you've added extra assurance that there won't be concurrent runs, by preventing scheduling of new Jobs if the old one is somehow still active.

However, I think setting RestartPolicy: Never in the pod spec isn't the best idea, since it means that if something happens to interrupt the Job (e.g. a node failure), it will not resume until the next scheduled run. Instead, I would set it back to RestartPolicy: OnFailure but move activeDeadlineSeconds from the pod spec to the job spec. Then the pod can be restarted within the activeDeadlineSeconds window, but at the end of that window any pod still running is killed and the Job is marked failed.
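
In other words (again just a sketch, same caveats as above), move the deadline up one level so it bounds the whole Job rather than each pod:

spec:
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      activeDeadlineSeconds: 86340       # Job-level: bounds the whole run, including retries
      template:
        spec:
          restartPolicy: OnFailure       # pods may be retried within that window

That way a node failure mid-run just means a replacement pod, while a genuinely stuck run is still killed and marked failed before the next scheduled Job.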

bpitts@cloudshell:~ (moz-fx-taskcluster-prod-4b87)$ kubectl get jobs --namespace=firefoxcitc-taskcluster | grep '0/1'
taskcluster-queue-expireartifacts-1591660800                     0/1           7d14h      7d14h
taskcluster-queue-expireartifacts-1591747200                     0/1           6d14h      6d14h
taskcluster-queue-expireartifacts-1591833600                     0/1           5d14h      5d14h
taskcluster-queue-expireartifacts-1591920000                     0/1           4d14h      4d14h
taskcluster-queue-expireartifacts-1592006400                     0/1           3d14h      3d14h
taskcluster-queue-expireartifacts-1592092800                     0/1           2d14h      2d14h
taskcluster-queue-expireartifacts-1592179200                     0/1           38h        38h
taskcluster-queue-expireartifacts-1592265600                     0/1           14h        14h
taskcluster-queue-expiretask-1592092800                          0/1           2d14h      2d14h
taskcluster-queue-expiretask-1592179200                          0/1           38h        38h
taskcluster-queue-expiretask-1592265600                          0/1           14h        14h
taskcluster-queue-expiretaskdependency-1591920000                0/1           4d14h      4d14h
taskcluster-queue-expiretaskdependency-1592092800                0/1           2d14h      2d14h
taskcluster-queue-expiretaskdependency-1592179200                0/1           38h        38h
taskcluster-queue-expiretaskdependency-1592265600                0/1           14h        14h
taskcluster-queue-expiretaskgroupmembers-1592092800              0/1           2d14h      2d14h
taskcluster-queue-expiretaskgroupmembers-1592265600              0/1           14h        14h

I've deleted all those jobs that were running more than 24 hours.

Dustin, want to take another look at that query? There is no tasks table.

Flags: needinfo?(bpitts)

Sorry, that should have been queue_tasks_entities.

I'll read the rest of your message after the plenary but thank you for the in-depth analysis!

OK, actually tested:

dustindev=> select date_trunc('day', (value ->> 'expires')::timestamp) as expires_day, count(*) from queue_tasks_entities group by date_trunc('day', (value ->> 'expires')::timestamp);
     expires_day     | count 
---------------------+-------
 2020-07-31 00:00:00 |     3
 2020-08-04 00:00:00 |     1
 2020-11-12 00:00:00 |     1
 2020-11-23 00:00:00 |     2
 2021-04-10 00:00:00 |     1
 2021-04-17 00:00:00 |     6
 2021-04-20 00:00:00 |     1
 2021-06-04 00:00:00 |     1
(8 rows)
Attached file task_expires_days.txt

These fixes are being deployed in 31.0.0, so let's see if they help. If not, the postgres phase-2 work will help!

See Also: → 1650411

These fixes are now on firefox-ci-tc.

I made another pass at killing jobs after the latest deploy.

Are these crontasks running to completion now?

I think your viewer access in the GCP project lets you see the cluster workloads.

taskcluster-queue-expiretask is still exceeding the deadline, but the others are completing.

OK, let's check back after the TCW this weekend (as it rewrites all of this stuff) and revisit if it's still an issue. If you have a cloud SQL window open and can repeat comment 8 that'd be great, but mostly for curiosity so don't go out of your way.

I fired off the query, but it takes a while.

I'm optimistic things will be better in firefox-ci-tc after the TCW changes, because they made a dramatic difference in the runtime of expiretask and expireartifacts in community-tc. Looks like expiretask there went from 4 hours to 4 minutes.

Interesting, still 59k expired tasks! I'm glad to hear the task now runs a lot more quickly.

I'll be interested to see how this is doing in v35.

Brian, while I'm waiting for the cloud shell to initialize, can you point me to how to see the timelines for historical runs of these cron jobs?

If a task completed and emitted a monitor.periodic log, we have metrics for it.

https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals?viewPanel=15&orgId=1&from=now-6h&to=now

https://earthangel-b40313e5.influxcloud.net/d/kqryhOpWk/taskcluster-visuals?viewPanel=16&orgId=1&from=now-6h&to=now

If you need info on jobs that aren't logging, kubectl should (somehow) be able to get you details on when the pods for a cronjob started and stopped. You can also see it in the GKE console if you drill down enough.

From those graphs, the times all look to be in minutes, so I think we're OK.

It's curious that the times are all in multiples of 251.7 seconds. That seems unlikely for the job itself -- is that some kind of quantization artifact of the analysis?

...but that means this bug is (finally) done :)

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED