Open Bug 1623747 Opened 4 years ago Updated 3 days ago

Unsuccessful task run with exit code: 137 completed in X seconds

Categories

(Release Engineering :: Firefox-CI Administration, defect, P3)

Tracking

(firefox104 fixed)

REOPENED
Tracking Status
firefox104 --- fixed

People

(Reporter: NarcisB, Unassigned)

References

Details

(Keywords: intermittent-failure, leave-open)

Attachments

(3 files, 2 obsolete files)

https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=293933882&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&revision=f753bf2c8d70cd31970e42dad254c54b17705da7&searchStr=android%2C5.0%2Caarch64%2Copt%2Cbuild-android-aarch64%2Fopt%2C%28b%29

Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293933882&repo=autoland&lineNumber=122

[fetches 2020-03-19T20:07:15.071Z] Extracting /builds/worker/fetches/android-ndk.tar.xz to /builds/worker/fetches
[fetches 2020-03-19T20:07:17.135Z] /builds/worker/fetches/android-gradle-dependencies.tar.xz extracted in 26.889s
[fetches 2020-03-19T20:07:17.135Z] Removing /builds/worker/fetches/android-gradle-dependencies.tar.xz
[fetches 2020-03-19T20:07:36.518Z] http://taskcluster/api/queue/v1/task/SBZuExGLRVGdZnLaYSsaig/artifacts/project/gecko/android-sdk/android-sdk-linux.tar.xz resolved to 321341423 bytes with sha256 40fde7d48c5c71a5afea101e430b8934f26541a6a3ef7a9f45a614b0b863b639 in 48.882s
[fetches 2020-03-19T20:07:36.518Z] Extracting /builds/worker/fetches/android-sdk-linux.tar.xz to /builds/worker/fetches
[taskcluster 2020-03-19 20:07:50.070Z] === Task Finished ===
[taskcluster 2020-03-19 20:07:50.073Z] Artifact "public/build/maven" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview/maven/"
[taskcluster 2020-03-19 20:07:50.074Z] Artifact "public/build/geckoview_example.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview_example/outputs/apk/withGeckoBinaries/debug/geckoview_example-withGeckoBinaries-debug.apk"
[taskcluster 2020-03-19 20:07:50.075Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2020-03-19 20:07:50.076Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2020-03-19 20:07:50.077Z] Artifact "public/build/geckoview-androidTest.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview/outputs/apk/androidTest/withGeckoBinaries/debug/geckoview-withGeckoBinaries-debug-androidTest.apk"
[taskcluster 2020-03-19 20:07:50.314Z] Unsuccessful task run with exit code: 137 completed in 78.134 seconds

See Also: → 1654892

Recent failures here are investigated in bug 1668111.

Whiteboard: [stockwell disable-recommended]

In the last 7 days, there have been 53 occurrences, most on linux1804-64 debug and opt.

Recent failure: https://treeherder.mozilla.org/logviewer?job_id=322531461&repo=autoland&lineNumber=820

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INACTIVE

Moving to General as this is a target for intermittent filing and not an actionable bug.

Component: Workers → General
Product: Taskcluster → Firefox

There have been 41 failures in the last 7 days.

Happens on:

  • linux1804-64-asan-qr opt
  • linux1804-64-qr debug and opt
  • linux1804-64-tsan-qr opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=381481869&repo=autoland&lineNumber=1540

Hi Dave, could you please take a look or assign this to someone?
Thank you.

There have been 76 total failures in the last 7 days on:

  • linux1804-64-asan-qr opt
  • linux1804-64-qr opt and debug
  • linux1804-64-tsan-qr opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=382082644&repo=autoland&lineNumber=52637

[task 2022-06-22T00:02:23.891Z] 00:02:23     INFO -  [Parent 24414, IPDL Background] WARNING: quota manager shutdown step: '0.008251s: stopCrashBrowserTimer', file /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:3792
[task 2022-06-22T00:02:23.895Z] 00:02:23     INFO -  DEBUG: Starting phase profile-before-change-telemetry
[task 2022-06-22T00:02:23.897Z] 00:02:23     INFO -  DEBUG: Spinning the event loop
[task 2022-06-22T00:02:23.899Z] 00:02:23     INFO -  [Child 24550, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.906Z] 00:02:23     INFO -  [Child 24550, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.915Z] 00:02:23     INFO -  [Child 24550, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:23.942Z] 00:02:23     INFO -  [Child 24473, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.948Z] 00:02:23     INFO -  [Child 24473, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.961Z] 00:02:23     INFO -  [Child 24473, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:23.972Z] 00:02:23     INFO -  [Child 24497, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.976Z] 00:02:23     INFO -  [Child 24497, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.983Z] 00:02:23     INFO -  [Child 24497, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:24.099Z] 00:02:24     INFO -  DEBUG: Adding blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.125Z] 00:02:24     INFO -  DEBUG: Adding blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.144Z] 00:02:24     INFO -  DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.148Z] 00:02:24     INFO -  DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.166Z] 00:02:24     INFO -  DEBUG: Completed blocker TelemetryController: shutting down for phase profile-before-change-telemetry
[task 2022-06-22T00:02:24.167Z] 00:02:24     INFO -  DEBUG: Finished phase profile-before-change-telemetry
[task 2022-06-22T00:02:24.168Z] 00:02:24     INFO -  DEBUG: Starting phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.169Z] 00:02:24     INFO -  DEBUG: Spinning the event loop
[task 2022-06-22T00:02:24.173Z] 00:02:24     INFO -  DEBUG: Completed blocker OS.File: flush pending requests, warn about unclosed files, shut down service. for phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.174Z] 00:02:24     INFO -  DEBUG: Finished phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.178Z] 00:02:24     INFO -  DEBUG: Starting phase web-workers-shutdown
[task 2022-06-22T00:02:24.179Z] 00:02:24     INFO -  DEBUG: Spinning the event loop
[task 2022-06-22T00:02:24.183Z] 00:02:24     INFO -  DEBUG: Finished phase web-workers-shutdown
[task 2022-06-22T00:02:24.198Z] 00:02:24     INFO -  [Parent 24414, IPDL Background] WARNING: IPC Connection Error: [Parent][PBackgroundParent] RunMessage(msgname=PRemoteWorkerService::Msg___delete__) Channel closing: too late to send/recv, messages will be lost: file /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1876
[task 2022-06-22T00:02:24.812Z] 00:02:24     INFO -  [Parent 24414, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:24.823Z] 00:02:24     INFO -  [Parent 24414, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:24.841Z] 00:02:24     INFO -  [Parent 24414, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:24.844Z] 00:02:24     INFO -  nsStringStats
[task 2022-06-22T00:02:24.845Z] 00:02:24     INFO -   => mAllocCount:          71579
[task 2022-06-22T00:02:24.846Z] 00:02:24     INFO -   => mReallocCount:            0
[task 2022-06-22T00:02:24.846Z] 00:02:24     INFO -   => mFreeCount:           71578  --  LEAKED 1 !!!
[task 2022-06-22T00:02:24.846Z] 00:02:24     INFO -   => mShareCount:          50252
[task 2022-06-22T00:02:24.847Z] 00:02:24     INFO -   => mAdoptCount:           1679
[task 2022-06-22T00:02:24.847Z] 00:02:24     INFO -   => mAdoptFreeCount:       1771
[task 2022-06-22T00:02:24.847Z] 00:02:24     INFO -   => Process ID: 24414, Thread ID: 139628161734528
[task 2022-06-22T00:02:25.205Z] 00:02:25     INFO -  DEBUG: Adding blocker PermissionManager: Flushing data for phase xpcom-will-shutdown
[taskcluster 2022-06-22 00:02:26.463Z] === Task Finished ===
[taskcluster 2022-06-22 00:02:26.538Z] Artifact "public/logs" not found at "/builds/worker/workspace/logs/"
[taskcluster 2022-06-22 00:02:26.540Z] Artifact "public/test" not found at "/builds/worker/artifacts/"
[taskcluster 2022-06-22 00:02:26.541Z] Artifact "public/test_info" not found at "/builds/worker/workspace/build/blobber_upload_dir/"
[taskcluster 2022-06-22 00:02:26.600Z] Unsuccessful task run with exit code: 137 completed in 1259.169 seconds
Flags: needinfo?(dtownsend)
Whiteboard: [stockwell needswork:owner]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

Does the increase in frequency of these Docker crashes align with the worker changes?

Flags: needinfo?(dtownsend) → needinfo?(mgoossens)

Yes, the jobs I see in there all run on GCP (from what I remember), and the timing matches.
That's not ideal.

Flags: needinfo?(mgoossens)

What's the next step here to bring the failure frequency down, and who owns it?

Flags: needinfo?(mgoossens)

Well, not knowing what is going wrong, I have no clue myself; maybe ahal knows who we could forward it to.

Flags: needinfo?(mgoossens) → needinfo?(ahal)

After some quick Googling, it looks like exit code 137 means we ran out of memory. Are there particular suites this happens with? If so, maybe we can bump those up to the xlarge pool. If it is happening in lots of places, maybe we need to increase the instance type of the large pool?
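
(As an aside on where the number 137 comes from, a minimal illustration rather than anything from our CI code: on Linux, shells and container runtimes report a process killed by signal N as exit status 128 + N, and SIGKILL is 9, so 128 + 9 = 137.)

import signal
import subprocess

# Sketch only: demonstrate the 128 + signal encoding behind exit code 137.
print(128 + signal.SIGKILL)  # 137

# A child shell that SIGKILLs itself is reported the same way.
proc = subprocess.run(["sh", "-c", "kill -KILL $$"])
print(proc.returncode)        # Python reports -9 (negative signal number)
print(128 - proc.returncode)  # 137, the value docker-worker logs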

Component: General → Firefox-CI Administration
Flags: needinfo?(ahal)
Product: Firefox → Release Engineering
QA Contact: mgoossens
Version: Trunk → unspecified
Severity: normal → S3
Priority: -- → P3

There aren't any particular test suites where this is happening; it affects mochitest, xpcshell, crashtests, gtests, and marionette. The only thing they have in common is that this is a Linux-only failure:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-05-31&endday=2022-06-30&tree=trunk&bug=1623747
It was a pretty steady increase until last week, when it spiked considerably; in the last few days it has been failing very often.

Hey Michelle, could you take a look at how much RAM the AWS instances have vs. GCP? Could we bump the RAM up in the large pool in GCP without switching to a new instance type?

I suspect this bug is going to be the top priority w.r.t. the GCP migration until it is fixed.

Flags: needinfo?(mgoossens)

Both the AWS and GCP instances appear to have 8GB of memory from what I can see (n2-standard-2 vs. m5.large).

Flags: needinfo?(mgoossens)

I wonder if something in the image is causing us to use more memory than on the AWS pools. Can we increase memory for GCP regardless, to see if it works in the short term?

Maybe longer term we can look into improving the image.

Given that this is the #1 intermittent, we should prioritize investigating it. Here is a case where the failure happens in under 4 minutes; typically it takes 10+ minutes.

Do we know for a fact that this is memory? I imagine it is, based on the error code, but how can we tell (is there a console for GCP instances)? Is memory managed on GCP the same way as on AWS? I assume trying larger instances or something with more memory could reduce this problem.

No, it's not guaranteed that it's OOM. Per the "Exit code 137" heading of this article, when a Docker container exits with this code it means it received a SIGKILL. From that article and others, though, if the SIGKILL was not initiated manually, it usually comes from the Docker daemon itself killing the container due to OOM.
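
For anyone with shell access to a worker host, one way to confirm the OOM theory for a specific container is Docker's own OOMKilled flag. A rough sketch, assuming the docker CLI is available on the host and the container still exists (so not something a task can normally run on itself):

import json
import subprocess

def was_oom_killed(container_id):
    """Return True if Docker marked this container as killed by the OOM killer."""
    # "docker inspect" prints a JSON array with one object per container.
    out = subprocess.run(
        ["docker", "inspect", container_id],
        capture_output=True, text=True, check=True,
    ).stdout
    state = json.loads(out)[0]["State"]
    # ExitCode is 137 for any SIGKILL; OOMKilled distinguishes the kernel's
    # OOM killer from a manual kill.
    return state.get("ExitCode") == 137 and bool(state.get("OOMKilled"))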

On this try push:
https://treeherder.mozilla.org/jobs?repo=try&revision=0817494e1d36f4f75806ef28cfe261bd704811bd

there are:
1747 total test jobs
99 failed jobs
21 Exit code 137 failures

1.2% of the test jobs result in Exit code 137.

How can we test on a different instance type?

Flags: needinfo?(ahal)

The fastest way would be to change the {alias} in this line to a hardcoded t-linux-xlarge-gcp. This will make everything configured for the large workertype use the xlarge one instead. Then do a try push as normal.

The t-linux-xlarge workertype has 16GB of RAM instead of 8.

To solve this outside of a try push, we can either switch the instance type from n2-standard-2 to n2-highmem-2 (which also has 16GB of RAM), or we can create a custom instance with e.g. 12GB of RAM if that's all we need. These changes will need to happen in ci-configuration.

Flags: needinfo?(ahal)

Since the failures are getting out of hand, I am prioritizing this as the number one item on my plate.
A try push is baking to see if more RAM fixes this.

Assignee: nobody → mgoossens
Status: REOPENED → ASSIGNED
Priority: P3 → P2

Try push: https://treeherder.mozilla.org/jobs?repo=try&revision=5a0ff2119a63053e18bb1b08616daca2c712e264
ahal, jmaher: does that push look any better with respect to exit code 137?
It runs mochitest-browser-chrome on xlarge (mochitest-plain isn't, for some reason), but it should be a start!

Flags: needinfo?(jmaher)
Flags: needinfo?(ahal)

I did a push as well with mochitest-plain (not yet landed) using xlarge, and it reduced the failure rate, but out of ~1800 tasks I still had 6 exit code 137 errors, one of them after 34 seconds (still setting up the Python venv):
https://firefoxci.taskcluster-artifacts.net/W43QEQmIQxqErW_4BmL5OQ/0/public/logs/live_backing.log

This hints strongly that something else is going on; OOM isn't our only problem. Either we need to look elsewhere, or we have multiple problems.

It is promising that we have had a reduction in exit code 137 errors with the xlarge instances; in fact, there was almost a 50% reduction in other intermittents on the try push as well.

For people who like to hack in Redash, here is a query I put together in a few minutes (caveat: it would probably fail an interview for SQL skills) to show failures given a push ID (found via the DevTools Network panel while loading a try push and looking for the URL with pushid=XXXXXXXX):

set @PUSHID=1091826; /* jmaher mochitest-plain with xlarge */
/* set @PUSHID=1090928; masterwayz mochitest-plain */
set @PUSHID=1091837; /* masterwayz browser-chrome with xlarge */

select
  oom.counter as OOM,
  failures.counter as total_failures,
  count(j.id) as total_jobs
from
  (select
   count(tle.line) as counter
  from
   job j,
   job_type jt,
   text_log_error tle
  where
    j.push_id=@PUSHID AND
    j.job_type_id=jt.id AND
    jt.name like 'test-linux1804-64-%'
    and result='testfailed'
    and tle.job_id = j.id
    and tle.line like 'Unsuccessful task run with exit code: 137%') as oom,
  (select
    count(j.id) as counter
   from
     job_type jt,
     job j
   where
     j.push_id=@PUSHID AND
     j.job_type_id=jt.id AND
     jt.name like 'test-linux1804-64-%'
     and result='testfailed'
  ) as failures,
  job_type jt,
  job j
where
  j.push_id=@PUSHID AND
  j.job_type_id=jt.id AND
  jt.name like 'test-linux1804-64-%'
Flags: needinfo?(jmaher)

This hints strongly that something else is going on; OOM isn't our only problem.

I think something else might be going on, but that it's likely still memory related. Maybe there's a limit on the per-container memory allocation here that didn't exist with the other pool.

Dave, do we configure docker in the image? If so is it possible we're restricting the amount of memory available to running containers to a greater degree than we are in the AWS image?

Flags: needinfo?(ahal) → needinfo?(dhouse)

(In reply to Andrew Halberstadt [:ahal] from comment #162)

This hints strongly that something else is going on; OOM isn't our only problem.

I think something else might be going on, but that it's likely still memory related. Maybe there's a limit on the per-container memory allocation here that didn't exist with the other pool.

Dave, do we configure docker in the image? If so is it possible we're restricting the amount of memory available to running containers to a greater degree than we are in the AWS image?

This is a direct disk image from AWS, so the config should be the same except for what is available on the VM when it boots the disk. I'll verify what memory is available to Docker on a running instance.

Do the task logs record how much memory is used or available? We could compare with AWS runs to see how much is being used.

Since the larger instance type reduced the failures, could we run on 2xlarge or greater to work around this for now?

I think we should test on other instances. Are we sure each instance gets the memory allocated to it, or is there a collection of instances sharing from a pool of memory? I find it nearly impossible to believe that the example I shared above failed due to OOM during setup; if we were that close to OOM, we would hit this >30% of the time during setup, let alone at the first Firefox browser launch.

Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/e33cc022c0df
Temporarily fix for exit code 137 failures r=releng-reviewers,ahal

Michelle's fix went live a day ago. We should pay attention to how this affects the intermittent rate.

Just parsing the data from treeherder, I see:
date: count
2022-07-14: 36
2022-07-13: 72
2022-07-12: 146
2022-07-11: 91

The fix went live, I assume, about 25% of the way through 07-13, and we are about halfway through 07-14. I am doing another try push to compare the statistics against previous try pushes.

I did a push of linux mochitest-plain with --rebuild 10:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=e8c4bd66be522b6843a25b64456d896c1a747e48

This is running on n2-highmem. I ran the query from comment 160 (modified: set @PUSHID=1093896;). Overall there are:
1878 total jobs
117 failures
15 OOM

This doesn't inspire a lot of confidence: highmem is probably reducing our failures, but only by 30-40%; I doubt we are close to a 50% reduction. That is more evidence that this isn't related to memory at all, or, if it is, that we are not allocated a stable, fixed amount of memory per instance/container.

I'd guess it's related to memory, but not to the total memory available to the host.

Can we configure these exit code 137 errors to be TBPL_RETRY? Maybe we can switch from n2-highmem to xlarge?

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Assignee: mgoossens → nobody
Status: ASSIGNED → NEW

Nothing solid yet from the system side:

I was hopeful when I saw an oom_adj deprecation notice,

Jul 20 16:42:24 dhouse-gecko-t-xlarge-mem-check-image-2 kernel: [   30.189261] start-worker (2281): /proc/2281/oom_adj is deprecated, please use /proc/2281/oom_score_adj instead.

but we see the same on AWS:

Jul 20 16:47:33 ip-10-145-79-109 kernel [   70.056394] start-worker (2360): /proc/2360/oom_adj is deprecated, please use /proc/2360/oom_score_adj instead.

From what I can find, the GCP instances have as much or more memory.total, as reported by docker-worker, compared to AWS.

gecko-t.t-linux-large-gcp 16827727872
vs
gecko-t.t-linux-large.m5large 8105631744

and
gecko-t.t-linux-xlarge-gcp 16827449344
vs
gecko-t.t-linux-xlarge.m5axlarge 16431046656

I checked logs in Papertrail for just over 24h to get the total memory that docker-worker sees on GCP and AWS (averaged; all values are within a few bytes):

gecko-t.misc.c5dxlarge 8031952896
gecko-t.misc.m5dxlarge 16428949504
gecko-t.misc.r5dxlarge 33252298752
gecko-t.t-linux-large-gcp 16827727872
gecko-t.t-linux-large.m5large 8105631744
gecko-t.t-linux-metal.m5metal 405176481792
gecko-t.t-linux-metal.r5metal 811050600448
gecko-t.t-linux-xlarge-gcp 16827449344
gecko-t.t-linux-xlarge-source.c5xlarge 8031952896
gecko-t.t-linux-xlarge-source.m5axlarge 16431046656
gecko-t.t-linux-xlarge-source.m5dxlarge 16428949504
gecko-t.t-linux-xlarge-source.m5xlarge 16428949504
gecko-t.t-linux-xlarge.c5xlarge 8031952896
gecko-t.t-linux-xlarge.m5axlarge 16431046656
gecko-t.t-linux-xlarge.m5dxlarge 16428949504
gecko-t.t-linux-xlarge.m5xlarge 16428949504

The log lines look like:

Jul 18 20:15:40 gecko-t-t-linux-xlarge-gcp-tojjarhtrie7im-weuaocq docker-worker: 2022/07/18 20:15:40 {"EnvVersion":"2.0","Fields":{"key":"memory.total","v":1,"val":16827449344},"Hostname":"gecko-t-t-linux-xlarge-gcp-tojjarhtrie7im-weuaocq","Logger":"taskcluster.docker-worker.gecko-t.t-linux-xlarge-gcp.projects/887720501152/machineTypes/n2-standard-4","Pid":2435,"Severity":6,"Timestamp":1658175340287000000,"Type":"monitor.measure","serviceContext":{"service":"docker-worker"},"severity":"INFO"}
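
For anyone who wants to repeat this, here is a rough sketch of how lines like the one above can be tallied once exported from Papertrail. The field names come from the sample line; the export file name is hypothetical.

import json
import re
from collections import defaultdict

# Each docker-worker line ends with a JSON payload; memory metrics appear as
# Fields.key == "memory.total" (or "memory.free") with the value in Fields.val.
PAYLOAD = re.compile(r"docker-worker: \S+ \S+ (\{.*\})\s*$")

def memory_totals_by_worker_type(log_path):
    """Map each Logger (worker type) string to the memory.total values seen."""
    seen = defaultdict(set)
    with open(log_path) as fh:
        for line in fh:
            match = PAYLOAD.search(line)
            if not match:
                continue
            record = json.loads(match.group(1))
            fields = record.get("Fields", {})
            if fields.get("key") == "memory.total":
                seen[record.get("Logger", "unknown")].add(fields["val"])
    return dict(seen)

# Hypothetical usage:
# print(memory_totals_by_worker_type("papertrail-export.log"))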

This is good info, dhouse. Can we check the memory allocated to and used by the Docker container? Possibly there is some setting that is using more memory on GCP than on AWS.

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #182)

This is good info, dhouse. Can we check the memory allocated to and used by the Docker container? Possibly there is some setting that is using more memory on GCP than on AWS.

Maybe we could check this from a task, and re-run it many times to see if we catch when a task gets this failure? I took the total from memory metrics recorded by docker-worker. But checking inside the task could give us more information.

I'll collect the memory.free from the logs for different worker types and look for low/min to compare.

Also, I'll look more for logs on some of the specific instances/tasks with failures from the intermittent failures search/view on this bug. There could be something we're missing.
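
A sketch of what a task-side check could look like: read the cgroup memory limit the container actually received plus the host's MemTotal, and log both. The paths are the common cgroup v1/v2 locations; this is illustrative, not something the tasks run today.

import pathlib

def container_memory_limit_bytes():
    """Best-effort read of the memory limit applied to this container.

    Checks the usual cgroup v2 then v1 paths; returns None if nothing readable
    is found. Note cgroup v1 reports a very large sentinel when unlimited.
    """
    for path in ("/sys/fs/cgroup/memory.max",                     # cgroup v2
                 "/sys/fs/cgroup/memory/memory.limit_in_bytes"):  # cgroup v1
        p = pathlib.Path(path)
        if p.exists():
            raw = p.read_text().strip()
            if raw != "max":
                return int(raw)
    return None

def host_memtotal_bytes():
    """MemTotal from /proc/meminfo, converted from kB to bytes."""
    with open("/proc/meminfo") as fh:
        for line in fh:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    return None

print("container memory limit:", container_memory_limit_bytes())
print("host MemTotal:         ", host_memtotal_bytes())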

Assignee: nobody → mgoossens
Status: NEW → ASSIGNED

No, bad bot.

Assignee: mgoossens → nobody
Status: ASSIGNED → NEW
Assignee: nobody → mgoossens
Attachment #9286865 - Attachment description: Bug 1623747 - Run large tests on xlarge to reduce errors rates r=ahal!,jmaher! → Bug 1623747 - Run large tests on xlarge to reduce error rates r=ahal!,jmaher!
Status: NEW → ASSIGNED
Pushed by mgoossens@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/317a9ea4c9ad
Run large tests on xlarge to reduce error rates r=ahal,jmaher

I'd be surprised if this patch actually caused that test to fail. Afaict, it's just a refactor and shouldn't have any impact on the task definitions. I'd guess the test is permafail (or very highly intermittent) and we just haven't noticed that yet. I ran a backfill on Michelle's push to confirm.

So far this looks to be 100% reproducible on that push, and previous pushes are not showing any failures. I sanity-checked the log files. Looking at the reftest viewer, the reference image "timed out after 2000ms", so possibly the larger machine changes the timing?

I believe we can set this to retry via:
https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/transforms/job/mozharness.py#210

It looks to be an array, so maybe setting it to [4, 137] would work?

See Also: → 1731862

So far this looks to be 100% reproducible on that push, and previous pushes are not showing any failures. I sanity-checked the log files. Looking at the reftest viewer, the reference image "timed out after 2000ms", so possibly the larger machine changes the timing?

OK, that makes sense. There's an existing intermittent on file here (bug 1731862). I mistakenly thought the patch that landed was simply a cleanup and that the switch to the larger instances had already happened.

I believe we can set this to retry via:
https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/transforms/job/mozharness.py#210

If the worker is crashing due to running out of memory, I'm not sure mozharness will still be running to do the retry. I know Taskcluster has a built-in retry mechanism; we'd probably need to use that instead. Something like this:
https://searchfox.org/mozilla-central/source/taskcluster/ci/release-final-verify/kind.yml#25

That is the same as the link I had for the mozharness transform; basically, we need to set the task definition to accept exit code 137 as a retry code:

    # Retry if mozharness returns TBPL_RETRY
    worker["retry-exit-status"] = [4, 137]

Maybe :masterwayz could do some try pushes to see if we can retry on the exit code 137 errors, and then we could consider switching back to the regular instances. Then we can keep tabs on the % CPU usage over time and make sure it isn't increasing by more than 2% of our total.

For the math: would the cost of xlarge everywhere outstrip the cost of large + 2%?
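
For reference, a minimal sketch of what appending 137 to the worker's retry codes could look like in a gecko_taskgraph transform; the helper name and call site are illustrative, not the actual patch.

# Sketch only -- the real patch may structure this differently.
def allow_retry_on_exit_137(worker):
    """Let the task be retried when docker-worker reports exit code 137."""
    retry_codes = worker.setdefault("retry-exit-status", [])
    if 137 not in retry_codes:
        retry_codes.append(137)

# Hypothetical call site, where the transform builds the worker payload:
# allow_retry_on_exit_137(taskdesc["worker"])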

I'll work on a patch with that and try things out!

Attachment #9286865 - Attachment is obsolete: true

Backed out changeset e33cc022c0df (bug 1623747) as it is no longer needed;

Pushed by mgoossens@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/0732258f3f67
Backed out 1 changesets (bug 1623747) r=releng-reviewers,ahal
Flags: needinfo?(mgoossens)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago2 years ago
Resolution: --- → FIXED

This is still happening

Status: RESOLVED → REOPENED
Flags: needinfo?(mgoossens)
Resolution: FIXED → ---

linting/source-test specific tasks

Attachment #9285172 - Attachment is obsolete: true

This was not supposed to end up being closed.

Flags: needinfo?(mgoossens)
Keywords: leave-open
Whiteboard: [stockwell disable-recommended]

Sounds like we need this for source-test and lint, then an uplift to mozilla-beta. That won't get everything, but it will get the large majority. If we want everything, there are also instances in update-generate-sources* which affect Windows/macOS.

Comment on attachment 9287705 [details]
Bug 1623747 - retry task on exit code 137. r=ahal!

Beta/Release Uplift Approval Request

  • User impact if declined: n/a
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This cleans up some infrastructure failures in CI by setting them to auto_retry, which seems to solve the problem!
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9287705 - Flags: approval-mozilla-beta?
Attachment #9287772 - Flags: approval-mozilla-beta?

Comment on attachment 9287705 [details]
Bug 1623747 - retry task on exit code 137. r=ahal!

Approved for 104.0b5

Attachment #9287705 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9287772 - Flags: approval-mozilla-beta?
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/683d337f92f2
Auto retry source-test jobs on exit 137. r=ahal

Comment on attachment 9288035 [details]
Bug 1623747 - Auto retry source-test jobs on exit 137. r=ahal!

Beta/Release Uplift Approval Request

  • User impact if declined: n/a
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Helps retry tasks that fail on Linux with a known infrastructure error.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9288035 - Flags: approval-mozilla-beta?

Comment on attachment 9288035 [details]
Bug 1623747 - Auto retry source-test jobs on exit 137. r=ahal!

Approved for 104.0b6

Attachment #9288035 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Whiteboard: [stockwell disable-recommended]
Assignee: mgoossens → nobody
Priority: P2 → P3
Flags: needinfo?(dhouse)
