Unsuccessful task run with exit code: 137 completed in X seconds
Categories
(Release Engineering :: Firefox-CI Administration, defect, P3)
Tracking
(firefox104 fixed)
Tracking | Status | |
---|---|---|
firefox104 | --- | fixed |
People
(Reporter: NarcisB, Unassigned)
References
Details
(Keywords: intermittent-failure, leave-open)
Attachments
(3 files, 2 obsolete files)
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-beta+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-beta+
|
Details | Review |
Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293933882&repo=autoland&lineNumber=122
[fetches 2020-03-19T20:07:15.071Z] Extracting /builds/worker/fetches/android-ndk.tar.xz to /builds/worker/fetches
[fetches 2020-03-19T20:07:17.135Z] /builds/worker/fetches/android-gradle-dependencies.tar.xz extracted in 26.889s
[fetches 2020-03-19T20:07:17.135Z] Removing /builds/worker/fetches/android-gradle-dependencies.tar.xz
[fetches 2020-03-19T20:07:36.518Z] http://taskcluster/api/queue/v1/task/SBZuExGLRVGdZnLaYSsaig/artifacts/project/gecko/android-sdk/android-sdk-linux.tar.xz resolved to 321341423 bytes with sha256 40fde7d48c5c71a5afea101e430b8934f26541a6a3ef7a9f45a614b0b863b639 in 48.882s
[fetches 2020-03-19T20:07:36.518Z] Extracting /builds/worker/fetches/android-sdk-linux.tar.xz to /builds/worker/fetches
[taskcluster 2020-03-19 20:07:50.070Z] === Task Finished ===
[taskcluster 2020-03-19 20:07:50.073Z] Artifact "public/build/maven" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview/maven/"
[taskcluster 2020-03-19 20:07:50.074Z] Artifact "public/build/geckoview_example.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview_example/outputs/apk/withGeckoBinaries/debug/geckoview_example-withGeckoBinaries-debug.apk"
[taskcluster 2020-03-19 20:07:50.075Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2020-03-19 20:07:50.076Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2020-03-19 20:07:50.077Z] Artifact "public/build/geckoview-androidTest.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview/outputs/apk/androidTest/withGeckoBinaries/debug/geckoview-withGeckoBinaries-debug-androidTest.apk"
[taskcluster 2020-03-19 20:07:50.314Z] Unsuccessful task run with exit code: 137 completed in 78.134 seconds
Comment hidden (Intermittent Failures Robot) |
Comment 27•4 years ago
|
||
Recent failures here are being investigated in bug 1668111.
Comment 37•4 years ago
|
||
In the last 7 days, there have been 53 occurrences, mostly on linux1804-64 debug and opt.
Recent failure: https://treeherder.mozilla.org/logviewer?job_id=322531461&repo=autoland&lineNumber=820
Comment 56•3 years ago
|
||
Comment 57•3 years ago
|
||
Moving to General as this is a target for intermittent filing and not an actionable bug.
Comment 122•2 years ago
|
||
There have been 41 failures in the last 7 days.
Happens on:
- linux1804-64-asan-qr opt
- linux1804-64-qr debug and opt
- linux1804-64-tsan-qr opt
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=381481869&repo=autoland&lineNumber=1540
Comment 127•2 years ago
|
||
Hi Dave, could you please take a look or assign this to someone?
Thank you.
There are 76 total failures in the last 7 days on
- linux1804-64-asan-qr opt
- linux1804-64-qr opt and debug
- linux1804-64-tsan-qr opt
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=382082644&repo=autoland&lineNumber=52637
[task 2022-06-22T00:02:23.891Z] 00:02:23 INFO - [Parent 24414, IPDL Background] WARNING: quota manager shutdown step: '0.008251s: stopCrashBrowserTimer', file /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:3792
[task 2022-06-22T00:02:23.895Z] 00:02:23 INFO - DEBUG: Starting phase profile-before-change-telemetry
[task 2022-06-22T00:02:23.897Z] 00:02:23 INFO - DEBUG: Spinning the event loop
[task 2022-06-22T00:02:23.899Z] 00:02:23 INFO - [Child 24550, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.906Z] 00:02:23 INFO - [Child 24550, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.915Z] 00:02:23 INFO - [Child 24550, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:23.942Z] 00:02:23 INFO - [Child 24473, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.948Z] 00:02:23 INFO - [Child 24473, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.961Z] 00:02:23 INFO - [Child 24473, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:23.972Z] 00:02:23 INFO - [Child 24497, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.976Z] 00:02:23 INFO - [Child 24497, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.983Z] 00:02:23 INFO - [Child 24497, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:24.099Z] 00:02:24 INFO - DEBUG: Adding blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.125Z] 00:02:24 INFO - DEBUG: Adding blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.144Z] 00:02:24 INFO - DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.148Z] 00:02:24 INFO - DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.166Z] 00:02:24 INFO - DEBUG: Completed blocker TelemetryController: shutting down for phase profile-before-change-telemetry
[task 2022-06-22T00:02:24.167Z] 00:02:24 INFO - DEBUG: Finished phase profile-before-change-telemetry
[task 2022-06-22T00:02:24.168Z] 00:02:24 INFO - DEBUG: Starting phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.169Z] 00:02:24 INFO - DEBUG: Spinning the event loop
[task 2022-06-22T00:02:24.173Z] 00:02:24 INFO - DEBUG: Completed blocker OS.File: flush pending requests, warn about unclosed files, shut down service. for phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.174Z] 00:02:24 INFO - DEBUG: Finished phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.178Z] 00:02:24 INFO - DEBUG: Starting phase web-workers-shutdown
[task 2022-06-22T00:02:24.179Z] 00:02:24 INFO - DEBUG: Spinning the event loop
[task 2022-06-22T00:02:24.183Z] 00:02:24 INFO - DEBUG: Finished phase web-workers-shutdown
[task 2022-06-22T00:02:24.198Z] 00:02:24 INFO - [Parent 24414, IPDL Background] WARNING: IPC Connection Error: [Parent][PBackgroundParent] RunMessage(msgname=PRemoteWorkerService::Msg___delete__) Channel closing: too late to send/recv, messages will be lost: file /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1876
[task 2022-06-22T00:02:24.812Z] 00:02:24 INFO - [Parent 24414, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:24.823Z] 00:02:24 INFO - [Parent 24414, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:24.841Z] 00:02:24 INFO - [Parent 24414, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:24.844Z] 00:02:24 INFO - nsStringStats
[task 2022-06-22T00:02:24.845Z] 00:02:24 INFO - => mAllocCount: 71579
[task 2022-06-22T00:02:24.846Z] 00:02:24 INFO - => mReallocCount: 0
[task 2022-06-22T00:02:24.846Z] 00:02:24 INFO - => mFreeCount: 71578 -- LEAKED 1 !!!
[task 2022-06-22T00:02:24.846Z] 00:02:24 INFO - => mShareCount: 50252
[task 2022-06-22T00:02:24.847Z] 00:02:24 INFO - => mAdoptCount: 1679
[task 2022-06-22T00:02:24.847Z] 00:02:24 INFO - => mAdoptFreeCount: 1771
[task 2022-06-22T00:02:24.847Z] 00:02:24 INFO - => Process ID: 24414, Thread ID: 139628161734528
[task 2022-06-22T00:02:25.205Z] 00:02:25 INFO - DEBUG: Adding blocker PermissionManager: Flushing data for phase xpcom-will-shutdown
[taskcluster 2022-06-22 00:02:26.463Z] === Task Finished ===
[taskcluster 2022-06-22 00:02:26.538Z] Artifact "public/logs" not found at "/builds/worker/workspace/logs/"
[taskcluster 2022-06-22 00:02:26.540Z] Artifact "public/test" not found at "/builds/worker/artifacts/"
[taskcluster 2022-06-22 00:02:26.541Z] Artifact "public/test_info" not found at "/builds/worker/workspace/build/blobber_upload_dir/"
[taskcluster 2022-06-22 00:02:26.600Z] Unsuccessful task run with exit code: 137 completed in 1259.169 seconds
Comment 131•2 years ago
|
||
Does the frequency increase for these docker crashes align with worker changes?
Comment 132•2 years ago
|
||
The jobs that I see in there all run on GCP, yes (from what I remember), and the timing matches.
That's not ideal.
Comment 136•2 years ago
|
||
What's the next step here to bring the failure frequency down, and who owns it?
Comment 137•2 years ago
|
||
Well, not knowing what is going wrong, I have no clue myself. Maybe ahal knows who we could forward it to.
Comment 139•2 years ago
|
||
After some quick Googling, it looks like exit code 137 means we ran out of memory. Are there particular suites this happens with? If so, maybe we can bump those up to the xlarge pool. If it is happening in lots of places, maybe we need to increase the instance type of the large pool?
Comment 141•2 years ago
•
|
||
There aren't any particular test suites where this is happening: they're mochitest, xpcshell, crashtest, gtest, marionette. The only thing in common is that this is a Linux-only failure:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-05-31&endday=2022-06-30&tree=trunk&bug=1623747
It was a pretty steady increase until last week, when it spiked considerably; in the last few days it has been failing very often.
Comment 151•2 years ago
|
||
Hey Michelle, could you take a look at how much RAM the AWS instances have vs GCP? Could we bump the RAM up in the large pool in GCP without switching to a new instance?
I suspect this bug is going to be the top priority w.r.t. the GCP migration until it is fixed.
Comment 152•2 years ago
|
||
Both AWS and GCP instances appear to have 8GB of memory from what I can see: n2-standard-2 vs m5.large.
Comment 153•2 years ago
|
||
I wonder if something in the image is causing us to use more memory than the AWS pools. Can we increase memory for GCP regardless to see if it works in the short term?
Maybe longer term we can look into improving the image.
Comment 154•2 years ago
|
||
Given that this is the #1 intermittent, we should prioritize investigating this. Here is a case where the failure happens in <4 minutes; typically it is 10+ minutes.
Do we know for a fact this is memory? I imagine it is, based on the error code, but how can we tell (is there a console for GCP instances)? Is memory managed on GCP the same way as on AWS? I assume trying larger instances or something with more memory could reduce this problem.
Comment 155•2 years ago
|
||
No, it's not a guarantee that it's OOM. From the Exit code 137 heading of this article, when a docker container exits with this error it means it got a SIGKILL. Though from that article and from others, if the SIGKILL was not initiated manually, it usually comes from the Docker daemon itself killing it due to OOM.
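As a side note on the arithmetic behind that reading: by the usual shell and container-runtime convention, an exit status above 128 means the process was terminated by signal (status - 128), so 137 corresponds to signal 9, SIGKILL. Below is a minimal Python sketch of that decoding on a POSIX system. The helper name describe_exit_code is purely illustrative and is not part of any Taskcluster or Docker API.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a process exit status to a short human-readable description.

    Assumes the common convention that statuses above 128 encode
    "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"unknown signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

The exit status only says how the container died (SIGKILL), not who sent the signal, which is why the code alone cannot confirm an OOM kill.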
Comment 156•2 years ago
|
||
On this try push:
https://treeherder.mozilla.org/jobs?repo=try&revision=0817494e1d36f4f75806ef28cfe261bd704811bd
there are:
1747 total test jobs
99 failed jobs
21 Exit code 137 failures
1.2% of the test jobs result in Exit code 137.
How can we test on a different instance type?
Comment 157•2 years ago
•
|
||
The fastest way would be to change the {alias} in this line to a hardcoded t-linux-xlarge-gcp. This will make all the things configured for the large workertype use the xlarge one instead. Then do a try push as normal.
The t-linux-xlarge workertype has 16GB of RAM instead of 8.
To solve this outside of a try push, we can either switch the instance type from n2-standard-2 to n2-highmem-2 (also has 16GB of RAM), or we can create a custom instance with e.g. 12GB of RAM if that's all we need. These changes will need to happen in ci-configuration.
Comment 158•2 years ago
|
||
Since this is getting out of hand with failures, I am making this the number one priority on my plate.
A try push is baking to see if more RAM fixes this.
Comment 159•2 years ago
|
||
Try push: https://treeherder.mozilla.org/jobs?repo=try&revision=5a0ff2119a63053e18bb1b08616daca2c712e264
ahal, jmaher: does that push look any better with respect to exit code 137?
It runs mochitest-browser-chrome on xlarge, but not -plain for some reason; still, it should be a start!
Comment 160•2 years ago
•
|
||
I did a push as well with mochitest-plain (not yet landed) using xlarge and it reduced the failure rate, but out of ~1800 tasks I had 6 exit code 137 errors. One of them happened in 34 seconds (still setting up the py venv stuff):
https://firefoxci.taskcluster-artifacts.net/W43QEQmIQxqErW_4BmL5OQ/0/public/logs/live_backing.log
This hints strongly that something else is going on: OOM isn't our single problem. Either we need to look elsewhere, or we have multiple problems.
It is promising that we have had a reduction of 137 errors with the xlarge instances; in fact, there was almost a 50% reduction in other intermittents on the try push as well.
For people that like to hack in redash, here is a query I put together in a few minutes (caveat: this would probably fail an interview for SQL skills) to show failures given a pushid (found via the devtools network panel while loading a try push and finding the URL with pushid=XXXXXXXX):
/* If more than one assignment is active, the last one wins. */
set @PUSHID=1091826; /* jmaher mochitest-plain with xlarge */
/* set @PUSHID=1090928; /* masterwayz mochitest-plain */
set @PUSHID=1091837; /* masterwayz browser-chrome with xlarge */
select
  oom.counter as OOM,
  failures.counter as total_failures,
  count(j.id) as total_jobs
from
  /* failed Linux test jobs whose log contains the exit code 137 line */
  (select
     count(tle.line) as counter
   from
     job j,
     job_type jt,
     text_log_error tle
   where
     j.push_id=@PUSHID AND
     j.job_type_id=jt.id AND
     jt.name like 'test-linux1804-64-%'
     and j.result='testfailed'
     and tle.job_id = j.id
     and tle.line like 'Unsuccessful task run with exit code: 137%') as oom,
  /* all failed Linux test jobs on the push */
  (select
     count(j.id) as counter
   from
     job_type jt,
     job j
   where
     j.push_id=@PUSHID AND
     j.job_type_id=jt.id AND
     jt.name like 'test-linux1804-64-%'
     and j.result='testfailed'
  ) as failures,
  /* all Linux test jobs on the push, counted by the outer select */
  job_type jt,
  job j
where
  j.push_id=@PUSHID AND
  j.job_type_id=jt.id AND
  jt.name like 'test-linux1804-64-%'
Comment 162•2 years ago
|
||
this hints strongly that something else is going on, OOM isn't our single problem
I think something else might be going on, but that it's likely still memory related. Maybe there's a limit on the per-container memory allocation here that didn't exist with the other pool.
Dave, do we configure docker in the image? If so is it possible we're restricting the amount of memory available to running containers to a greater degree than we are in the AWS image?
Comment 163•2 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #162)
this hints strongly that something else is going on, OOM isn't our single problem
I think something else might be going on, but that it's likely still memory related. Maybe there's a limit on the per-container memory allocation here that didn't exist with the other pool.
Dave, do we configure docker in the image? If so is it possible we're restricting the amount of memory available to running containers to a greater degree than we are in the AWS image?
This is a direct disk image from aws. So the config should be the same except for what is available on the vm when it boots the disk. I'll verify to see what memory is available to docker on a running instance.
Do the task logs possibly record how much memory is used or available? We might compare with aws runs to see how much is being used.
Since the larger instance type reduced the failures, could we run on 2xlarge or greater to work around this for now?
Comment 164•2 years ago
|
||
I think we should test on other instances. Are we sure each instance gets the memory allocated, or is there a collection of instances that share from a pool of memory? I find it near impossible to believe that the example I shared above failed due to OOM during setup. If it were that close to OOM, then we would hit this >30% of the time during setup, let alone the first Firefox browser launch.
Comment 165•2 years ago
|
||
Comment 167•2 years ago
|
||
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/e33cc022c0df Temporarily fix for exit code 137 failures r=releng-reviewers,ahal
Comment 169•2 years ago
|
||
Michelle's fix went live a day ago. We should pay attention to how this affects the intermittent rate.
Comment 170•2 years ago
|
||
Just parsing the data from treeherder, I see:
date: count
2022-07-14: 36
2022-07-13: 72
2022-07-12: 146
2022-07-11: 91
The fix went live, I assume, about 25% of the way through 07-13, and we are about halfway through 07-14. I am doing another try push to see the statistics compared to previous try pushes.
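To compare the partial day with the full days above, the count can be scaled by the fraction of the day that has elapsed. This is only a rough sketch: it assumes failures arrive evenly through the day, which CI load does not guarantee, and the 0.5 fraction is the "about halfway" estimate from this comment. The function name is hypothetical.

```python
def projected_full_day(count_so_far: int, fraction_of_day_elapsed: float) -> float:
    """Linearly extrapolate a partial-day count to a full day."""
    if not 0 < fraction_of_day_elapsed <= 1:
        raise ValueError("fraction_of_day_elapsed must be in (0, 1]")
    return count_so_far / fraction_of_day_elapsed


# Full-day counts copied from the list above.
full_days = {"2022-07-11": 91, "2022-07-12": 146}

# 2022-07-14 had 36 failures with roughly half the day gone.
projection = projected_full_day(36, 0.5)
for day, count in full_days.items():
    print(f"{day}: {count} (projected 2022-07-14 is {projection / count:.0%} of this)")
```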
Comment 171•2 years ago
|
||
I did a push of linux mochitest-plain with --rebuild 10:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=e8c4bd66be522b6843a25b64456d896c1a747e48
This is running on n2-highmem. I ran the query from comment 160 (mod: set @PUSHID=1093896;). Overall there are:
1878 total jobs
117 failures
15 OOM
This doesn't yield a lot of confidence: high mem is probably reducing our failures, but by 30-40%; I doubt we are close to a 50% reduction. This is more evidence that this isn't related to memory at all, or, if it is, that we are not allocated a stable, fixed amount of memory per instance/container.
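For reference, the 30-40% figure can be reproduced from the job counts quoted in comment 156 and in this comment. The sketch below is only a rough comparison: the two try pushes ran different sets of jobs, so the rates are not strictly like for like.

```python
def failure_rate(hits: int, total_jobs: int) -> float:
    """Fraction of jobs that ended with exit code 137."""
    return hits / total_jobs


# Counts copied from the comments in this bug.
before = failure_rate(21, 1747)  # comment 156, large pool
after = failure_rate(15, 1878)   # this comment, n2-highmem

# Relative reduction in the exit code 137 rate between the two pushes.
reduction = 1 - after / before

print(f"before: {before:.2%}, after: {after:.2%}, reduction: {reduction:.0%}")
```

With these counts the relative reduction lands inside the 30-40% range mentioned above.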
Comment 172•2 years ago
|
||
I'd guess it's related to memory, but not to the total memory available to the host.
Comment 173•2 years ago
|
||
Can we configure these exit code 137 errors to be TBPL_RETRY? Maybe we can switch from n2-highmem to xlarge?
Comment 179•2 years ago
|
||
Nothing solid yet from the system side:
I was hopeful when I saw an oom_adj deprecation notice,
Jul 20 16:42:24 dhouse-gecko-t-xlarge-mem-check-image-2 kernel: [ 30.189261] start-worker (2281): /proc/2281/oom_adj is deprecated, please use /proc/2281/oom_score_adj instead.
but we see the same on aws:
Jul 20 16:47:33 ip-10-145-79-109 kernel [ 70.056394] start-worker (2360): /proc/2360/oom_adj is deprecated, please use /proc/2360/oom_score_adj instead.
Comment 181•2 years ago
|
||
From what I can find, docker-worker reports as much memory.total on the GCP instances as on AWS, or more.
gecko-t.t-linux-large-gcp 16827727872
vs
gecko-t.t-linux-large.m5large 8105631744
and
gecko-t.t-linux-xlarge-gcp 16827449344
vs
gecko-t.t-linux-xlarge.m5axlarge 16431046656
I checked from logs in papertrail for just over 24h to get the total memory available that docker-worker sees on gcp and aws (averaged, all are within a few bytes):
gecko-t.misc.c5dxlarge 8031952896
gecko-t.misc.m5dxlarge 16428949504
gecko-t.misc.r5dxlarge 33252298752
gecko-t.t-linux-large-gcp 16827727872
gecko-t.t-linux-large.m5large 8105631744
gecko-t.t-linux-metal.m5metal 405176481792
gecko-t.t-linux-metal.r5metal 811050600448
gecko-t.t-linux-xlarge-gcp 16827449344
gecko-t.t-linux-xlarge-source.c5xlarge 8031952896
gecko-t.t-linux-xlarge-source.m5axlarge 16431046656
gecko-t.t-linux-xlarge-source.m5dxlarge 16428949504
gecko-t.t-linux-xlarge-source.m5xlarge 16428949504
gecko-t.t-linux-xlarge.c5xlarge 8031952896
gecko-t.t-linux-xlarge.m5axlarge 16431046656
gecko-t.t-linux-xlarge.m5dxlarge 16428949504
gecko-t.t-linux-xlarge.m5xlarge 16428949504
The log lines look like:
Jul 18 20:15:40 gecko-t-t-linux-xlarge-gcp-tojjarhtrie7im-weuaocq docker-worker: 2022/07/18 20:15:40 {"EnvVersion":"2.0","Fields":{"key":"memory.total","v":1,"val":16827449344},"Hostname":"gecko-t-t-linux-xlarge-gcp-tojjarhtrie7im-weuaocq","Logger":"taskcluster.docker-worker.gecko-t.t-linux-xlarge-gcp.projects/887720501152/machineTypes/n2-standard-4","Pid":2435,"Severity":6,"Timestamp":1658175340287000000,"Type":"monitor.measure","serviceContext":{"service":"docker-worker"},"severity":"INFO"}
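A minimal sketch of how such a line could be parsed, in case someone wants to repeat the comparison: it takes everything from the first "{" as the JSON payload and reads the Fields and Logger entries. The log line is the one quoted above; the function name parse_measure is hypothetical, and the assumption that every line carries exactly one JSON object after the syslog prefix is mine.

```python
import json

# The example line quoted above, split only for readability.
LOG_LINE = (
    'Jul 18 20:15:40 gecko-t-t-linux-xlarge-gcp-tojjarhtrie7im-weuaocq docker-worker: '
    '2022/07/18 20:15:40 {"EnvVersion":"2.0","Fields":{"key":"memory.total","v":1,"val":16827449344},'
    '"Hostname":"gecko-t-t-linux-xlarge-gcp-tojjarhtrie7im-weuaocq",'
    '"Logger":"taskcluster.docker-worker.gecko-t.t-linux-xlarge-gcp.projects/887720501152/machineTypes/n2-standard-4",'
    '"Pid":2435,"Severity":6,"Timestamp":1658175340287000000,"Type":"monitor.measure",'
    '"serviceContext":{"service":"docker-worker"},"severity":"INFO"}'
)


def parse_measure(line: str):
    """Return (logger, key, value) for a monitor.measure line, else None."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    if record.get("Type") != "monitor.measure":
        return None
    fields = record.get("Fields", {})
    return record.get("Logger"), fields.get("key"), fields.get("val")


logger, key, val = parse_measure(LOG_LINE)
machine_type = logger.rsplit("/", 1)[-1]
print(f"{machine_type}: {key} = {val / 2**30:.2f} GiB")
```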
Comment 182•2 years ago
|
||
This is good info, dhouse. Can we check the allocated memory used in the docker container? Possibly there is some setting which is using more memory on GCP than on AWS.
Comment 184•2 years ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #182)
This is good info, dhouse. Can we check the allocated memory used in the docker container? Possibly there is some setting which is using more memory on GCP than on AWS.
Maybe we could check this from a task, and re-run it many times to see if we catch when a task gets this failure? I took the total from memory metrics recorded by docker-worker. But checking inside the task could give us more information.
I'll collect the memory.free from the logs for different worker types and look for low/min to compare.
Also, I'll look more for logs on some of the specific instances/tasks with failures from the intermittent failures search/view on this bug. There could be something we're missing.
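One way the in-task check could look, as a rough sketch only: read /proc/meminfo for what the kernel reports and the cgroup memory limit for what the container is allowed. The two cgroup paths are the standard v1 and v2 locations, but whether they are mounted at those paths inside these task containers is an assumption, and the function names are illustrative.

```python
from pathlib import Path

# Standard cgroup v1 and v2 locations of the memory limit (assumed mounts).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
    Path("/sys/fs/cgroup/memory.max"),                     # cgroup v2
)


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into {field: bytes} for kB-valued fields."""
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB":
            result[name.strip()] = int(parts[0]) * 1024
    return result


def read_cgroup_limit():
    """Return the memory limit in bytes, or None if not set or not readable."""
    for path in CGROUP_LIMIT_FILES:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue
        # cgroup v2 reports "max" when there is no limit.
        return int(raw) if raw.isdigit() else None
    return None


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        info = parse_meminfo(meminfo.read_text())
        print("MemTotal:", info.get("MemTotal"))
        print("MemAvailable:", info.get("MemAvailable"))
    print("cgroup limit:", read_cgroup_limit())
```

Note that cgroup v1 reports a very large number rather than "max" when no limit is set, so a huge value there should also be read as "unlimited". Logging these values at task start on both GCP and AWS would show whether the container sees a different limit on the two clouds.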
Comment 188•2 years ago
|
||
Comment 191•2 years ago
|
||
Pushed by mgoossens@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/317a9ea4c9ad Run large tests on xlarge to reduce error rates r=ahal,jmaher
Comment 192•2 years ago
|
||
Backed out for causing reftest failures.
Backout link: https://hg.mozilla.org/integration/autoland/rev/1b40798587e3691910e9feab3925e4756d4ba8d2
Failure log: https://treeherder.mozilla.org/logviewer?job_id=385532058&repo=autoland&lineNumber=11285
Comment 195•2 years ago
|
||
I'd be surprised if this patch actually caused that test to fail. AFAICT, it's just a refactor and shouldn't have any impact on the task definitions. I'd guess the test is permafail (or very highly intermittent) and we just haven't noticed that fact yet. I ran a backfill on Michelle's push to confirm.
Comment 196•2 years ago
|
||
so far this looks to be 100% reproduced on that push and previous pushes are not showing any failures. I sanity checked log files. Looking at reftest viewer the reference image is "timed out after 2000ms", so possibly the larger machine changes the timing?
Comment 197•2 years ago
|
||
I believe we can set this to retry via:
https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/transforms/job/mozharness.py#210
It looks to be an array, so maybe setting it to [4, 137] would work?
Comment 198•2 years ago
|
||
so far this looks to be 100% reproduced on that push and previous pushes are not showing any failures. I sanity checked log files. Looking at reftest viewer the reference image is "timed out after 2000ms", so possibly the larger machine changes the timing?
Ok that makes sense. There's an existing intermittent on file here (bug 1731862). I mistakenly thought the patch that landed was simply a cleanup and the switch to larger instance had already happened.
I believe we can set this to retry via:
https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/transforms/job/mozharness.py#210
If the worker is crashing due to running out of memory, I'm not sure that mozharness will still be running to do the retry. I know Taskcluster has a retry mechanism built-in, we'd probably need to use that instead. Something like this:
https://searchfox.org/mozilla-central/source/taskcluster/ci/release-final-verify/kind.yml#25
Comment 199•2 years ago
|
||
That is the same as the link I had for the mozharness transform. Basically, we need to set the task definition to accept exit code 137 as a retry code:
# Retry if mozharness returns TBPL_RETRY
worker["retry-exit-status"] = [4, 137]
Maybe :masterwayz could do some try pushes to see if we can retry on the 137 errors, and then consider switching back to the regular instances. Then we can keep tabs on the % cpu usage over time and make sure it isn't increasing >2% of our total.
As for the math: would the cost of xlarge everywhere outstrip the cost of large + 2%?
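A rough way to frame that cost question, under loud assumptions: every task runs for the same time on either instance type, a retried task costs one extra full run, and the 2% is read as the fraction of tasks that get retried. The price ratio below is a made-up placeholder, not a real figure.

```python
def relative_cost(price_ratio, retry_fraction):
    """Cost of the bigger instance everywhere, divided by the cost of the
    smaller instance plus retries.

    price_ratio: bigger-instance price / smaller-instance price (hypothetical).
    retry_fraction: share of tasks that need one extra run on the smaller one.
    A result above 1.0 means the bigger instance is the more expensive option.
    """
    return price_ratio / (1.0 + retry_fraction)


# Hypothetical: the bigger instance costs twice as much per hour.
print(round(relative_cost(2.0, 0.02), 2))  # prints 1.96
```

Under these assumptions the break-even point is a price ratio of 1 + retry_fraction, so with 2% retries the bigger instance only pays off if it costs less than 2% more than the smaller one.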
Comment 200•2 years ago
|
||
I'll work on a patch with that and try things out!
Comment 202•2 years ago
|
||
Comment 203•2 years ago
|
||
Backed out changeset e33cc022c0df (bug 1623747) as it is no longer needed.
Comment 204•2 years ago
|
||
Pushed by mgoossens@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/0732258f3f67 Backed out 1 changesets (bug 1623747) r=releng-reviewers,ahal
Comment 205•2 years ago
|
||
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6d36f9426a5f retry task on exit code 137. r=ahal
Comment 207•2 years ago
|
||
bugherder |
Comment 208•2 years ago
|
||
This is still happening
Comment 209•2 years ago
|
||
linting/source-test specific tasks
Comment 211•2 years ago
|
||
This was not supposed to end up being closed.
Comment 212•2 years ago
|
||
Sounds like we need this for source-test and lint, then an uplift to mozilla-beta. That won't get everything, but it will cover the large majority. If we want everything, there are instances in update-generate-sources* which affect windows/osx.
Comment 213•2 years ago
|
||
Comment on attachment 9287705 [details]
Bug 1623747 - retry task on exit code 137. r=ahal!
Beta/Release Uplift Approval Request
- User impact if declined: n/a
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): this cleans up some infrastructure failures in CI by setting them to auto_retry which seems to solve the problem!
- String changes made/needed:
- Is Android affected?: No
Comment 214•2 years ago
|
||
Comment 216•2 years ago
|
||
Comment on attachment 9287705 [details]
Bug 1623747 - retry task on exit code 137. r=ahal!
Approved for 104.0b5
Comment 217•2 years ago
|
||
bugherder uplift |
Comment 218•2 years ago
|
||
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/683d337f92f2 Auto retry source-test jobs on exit 137. r=ahal
Comment 219•2 years ago
|
||
bugherder |
Comment 220•2 years ago
|
||
Comment on attachment 9288035 [details]
Bug 1623747 - Auto retry source-test jobs on exit 137. r=ahal!
Beta/Release Uplift Approval Request
- User impact if declined: n/a
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): helps retry tasks that fail on linux with a known infrastructure error.
- String changes made/needed:
- Is Android affected?: No
Comment 222•2 years ago
|
||
Comment on attachment 9288035 [details]
Bug 1623747 - Auto retry source-test jobs on exit 137. r=ahal!
Approved for 104.0b6
Comment 223•2 years ago
|
||
bugherder uplift |