Airflow task crash_symbolication.top_signatures_correlations failing on 2022-06-22
Categories
(Data Platform and Tools :: General, defect, P2)
Tracking
(Not tracked)
People
(Reporter: akomar, Assigned: willkg)
References
Details
(Whiteboard: [airflow-triage])
Attachments
(1 file)
Comment 1•3 years ago
The top signatures correlations task processes the release channel first, then beta, then nightly. The release channel has by far the most crash reports. In the logs I've looked at, when this task runs out of memory, it's while processing the release channel crash reports.
Here's a log of one of the failed runs for this bug:
I see error lines like this:
22/06/22 09:12:00 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 4 on
top-signatures-correlations-2022-06-21-w-1.c.airflow-dataproc.internal: Container killed by YARN for exceeding
memory limits. 12.8 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
12 GB is twice what we had when I was looking at bug #1774415 last week; in one of those PRs, I upgraded the instance type. It seems like this job is now using twice the memory it was using before.
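The error message suggests boosting spark.yarn.executor.memoryOverhead. If we went that route, the change would be something like this in the job's Spark session setup (just a sketch: the overhead value is a guess, and I haven't checked where the job actually builds its session; on Dataproc the same property could also be passed as a job property at submit time):

```python
from pyspark.sql import SparkSession

# Sketch only: give each executor more off-heap headroom so YARN doesn't kill
# the container at the 12 GB limit. The property name comes from the error
# message above; the 2048 MiB value is a guess, not a tested number.
spark = (
    SparkSession.builder
    .appName("top-signatures-correlations")
    .config("spark.yarn.executor.memoryOverhead", "2048")
    .getOrCreate()
)
```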
Here are links to some other recent successful runs:
- 2022-06-13 (before the previous fail): https://console.cloud.google.com/dataproc/jobs/top-signatures-correlations_357f7736/monitoring?region=us-west1&project=airflow-dataproc
- 2022-06-15 (first successful run after I "fixed" it in bug #1774415): https://console.cloud.google.com/dataproc/jobs/top-signatures-correlations_bd738cb3/monitoring?region=us-west1&project=airflow-dataproc
- 2022-06-17: https://console.cloud.google.com/dataproc/jobs/top-signatures-correlations_502e6cb2/monitoring?region=us-west1&project=airflow-dataproc
- 2022-06-20: https://console.cloud.google.com/dataproc/jobs/top-signatures-correlations_3c40155b/monitoring?region=us-west1&project=airflow-dataproc
Here's the number of crash reports being looked at for the release channel per run:
| date | outcome | memory | total top signatures | total crashes |
|---|---|---|---|---|
| 2022-06-13 | success | ? | 300 | 89920 |
| 2022-06-15 | success | ? | 250 | 66759 |
| 2022-06-17 | success, but has memory errors in logs | 12 GB | 250 | 75833 |
| 2022-06-20 | success | ? | 250 | 91403 |
| 2022-06-22 | fail | 12 GB | 250 | 116816 |
It's worth pointing out that there are "YARN killed this thing because MEMORY!" errors in the 2022-06-17 log, even though that run was marked as a success.
So the 2022-06-22 run is looking at about 25,000 more crash reports than the 2022-06-20 run (116816 vs. 91403). Maybe the top 250 crash signatures are accounting for more crashes? Maybe we've had a big spike recently in one of the signatures?
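To check whether a single signature spiked, something like this against the public Super Search API would show it (a rough sketch; the dates and the top-5 cutoff are just examples):

```python
import requests

# Rough sketch: pull the biggest release-channel signatures for a couple of
# days and eyeball whether one of them jumped. The dates and the facet size
# are just examples.
API = "https://crash-stats.mozilla.org/api/SuperSearch/"

def top_signatures(start, end, size=5):
    resp = requests.get(API, params={
        "product": "Firefox",
        "release_channel": "release",
        "date": [f">={start}", f"<{end}"],
        "_facets": "signature",
        "_facets_size": size,
        "_results_number": 0,
    })
    resp.raise_for_status()
    return [(f["term"], f["count"]) for f in resp.json()["facets"]["signature"]]

print("2022-06-20:", top_signatures("2022-06-20", "2022-06-21"))
print("2022-06-22:", top_signatures("2022-06-22", "2022-06-23"))
```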
I think what we should do at this point is investigate memory usage in the job and figure out whether we can calculate the same correlations in a more memory-efficient way. That's not something I can do right now.
The other option I can think of is to try dropping the number of top signatures from 250 back to 200 and see if that gets us over this hump. 200 is the original number we used before we increased it to 1000 and then dropped it to 300 back in March 2022. I'm going to work on that next.
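For reference, the kind of change I'm talking about is just turning down the top-N knob in the job, roughly like this (illustrative only; the actual constant and function names in the job are probably different):

```python
from pyspark.sql import DataFrame

# Illustrative only: the real constant and function names in the job likely
# differ. Fewer top signatures means fewer groups to correlate, which should
# shrink the working set per executor.
NUM_TOP_SIGNATURES = 200  # was 250

def get_top_signatures(crash_reports: DataFrame) -> DataFrame:
    """Return the NUM_TOP_SIGNATURES most common signatures."""
    return (
        crash_reports
        .groupBy("signature")
        .count()
        .orderBy("count", ascending=False)
        .limit(NUM_TOP_SIGNATURES)
    )
```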
Comment 2•3 years ago
Comment 3•3 years ago
willkg merged PR #375: "bug 1775444: drop top signatures from 250 to 200" in 6a9e4e4.
I'll let this settle and then rerun the tasks.
Comment 4•3 years ago
I cleared the DAG and everything ran fine. Marking as FIXED.