Open Bug 1623747 Opened 4 years ago Updated 3 days ago

Unsuccessful task run with exit code: 137 completed in X seconds

Categories

(Release Engineering :: Firefox-CI Administration, defect, P3)

Tracking

(firefox104 fixed)

REOPENED
Tracking Status
firefox104 --- fixed

People

(Reporter: NarcisB, Unassigned)

References

Details

(Keywords: intermittent-failure, leave-open)

Attachments

(3 files, 2 obsolete files)

https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=293933882&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&revision=f753bf2c8d70cd31970e42dad254c54b17705da7&searchStr=android%2C5.0%2Caarch64%2Copt%2Cbuild-android-aarch64%2Fopt%2C%28b%29

Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293933882&repo=autoland&lineNumber=122

[fetches 2020-03-19T20:07:15.071Z] Extracting /builds/worker/fetches/android-ndk.tar.xz to /builds/worker/fetches
[fetches 2020-03-19T20:07:17.135Z] /builds/worker/fetches/android-gradle-dependencies.tar.xz extracted in 26.889s
[fetches 2020-03-19T20:07:17.135Z] Removing /builds/worker/fetches/android-gradle-dependencies.tar.xz
[fetches 2020-03-19T20:07:36.518Z] http://taskcluster/api/queue/v1/task/SBZuExGLRVGdZnLaYSsaig/artifacts/project/gecko/android-sdk/android-sdk-linux.tar.xz resolved to 321341423 bytes with sha256 40fde7d48c5c71a5afea101e430b8934f26541a6a3ef7a9f45a614b0b863b639 in 48.882s
[fetches 2020-03-19T20:07:36.518Z] Extracting /builds/worker/fetches/android-sdk-linux.tar.xz to /builds/worker/fetches
[taskcluster 2020-03-19 20:07:50.070Z] === Task Finished ===
[taskcluster 2020-03-19 20:07:50.073Z] Artifact "public/build/maven" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview/maven/"
[taskcluster 2020-03-19 20:07:50.074Z] Artifact "public/build/geckoview_example.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview_example/outputs/apk/withGeckoBinaries/debug/geckoview_example-withGeckoBinaries-debug.apk"
[taskcluster 2020-03-19 20:07:50.075Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2020-03-19 20:07:50.076Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2020-03-19 20:07:50.077Z] Artifact "public/build/geckoview-androidTest.apk" not found at "/builds/worker/workspace/obj-build/gradle/build/mobile/android/geckoview/outputs/apk/androidTest/withGeckoBinaries/debug/geckoview-withGeckoBinaries-debug-androidTest.apk"
[taskcluster 2020-03-19 20:07:50.314Z] Unsuccessful task run with exit code: 137 completed in 78.134 seconds

See Also: → 1654892

Recent failures here are investigated in bug 1668111.

Whiteboard: [stockwell disable-recommended]

In the last 7 days, there have been 53 occurrences, most on linux1804-64 debug and opt.

Recent failure: https://treeherder.mozilla.org/logviewer?job_id=322531461&repo=autoland&lineNumber=820

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INACTIVE

Moving to General as this is a target for intermittent filing and not an actionable bug.

Component: Workers → General
Product: Taskcluster → Firefox

There have been 41 failures in the last 7 days.

Happens on:

  • linux1804-64-asan-qr opt
  • linux1804-64-qr debug and opt
  • linux1804-64-tsan-qr opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=381481869&repo=autoland&lineNumber=1540

Hi Dave, could you please take a look or assign this to someone?
Thank you.

There have been 76 total failures in the last 7 days on:

  • linux1804-64-asan-qr opt
  • linux1804-64-qr opt and debug
  • linux1804-64-tsan-qr opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=382082644&repo=autoland&lineNumber=52637

[task 2022-06-22T00:02:23.891Z] 00:02:23     INFO -  [Parent 24414, IPDL Background] WARNING: quota manager shutdown step: '0.008251s: stopCrashBrowserTimer', file /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:3792
[task 2022-06-22T00:02:23.895Z] 00:02:23     INFO -  DEBUG: Starting phase profile-before-change-telemetry
[task 2022-06-22T00:02:23.897Z] 00:02:23     INFO -  DEBUG: Spinning the event loop
[task 2022-06-22T00:02:23.899Z] 00:02:23     INFO -  [Child 24550, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.906Z] 00:02:23     INFO -  [Child 24550, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.915Z] 00:02:23     INFO -  [Child 24550, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:23.942Z] 00:02:23     INFO -  [Child 24473, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.948Z] 00:02:23     INFO -  [Child 24473, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.961Z] 00:02:23     INFO -  [Child 24473, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:23.972Z] 00:02:23     INFO -  [Child 24497, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:23.976Z] 00:02:23     INFO -  [Child 24497, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:23.983Z] 00:02:23     INFO -  [Child 24497, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:24.099Z] 00:02:24     INFO -  DEBUG: Adding blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.125Z] 00:02:24     INFO -  DEBUG: Adding blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.144Z] 00:02:24     INFO -  DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.148Z] 00:02:24     INFO -  DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-06-22T00:02:24.166Z] 00:02:24     INFO -  DEBUG: Completed blocker TelemetryController: shutting down for phase profile-before-change-telemetry
[task 2022-06-22T00:02:24.167Z] 00:02:24     INFO -  DEBUG: Finished phase profile-before-change-telemetry
[task 2022-06-22T00:02:24.168Z] 00:02:24     INFO -  DEBUG: Starting phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.169Z] 00:02:24     INFO -  DEBUG: Spinning the event loop
[task 2022-06-22T00:02:24.173Z] 00:02:24     INFO -  DEBUG: Completed blocker OS.File: flush pending requests, warn about unclosed files, shut down service. for phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.174Z] 00:02:24     INFO -  DEBUG: Finished phase xpcom-will-shutdown
[task 2022-06-22T00:02:24.178Z] 00:02:24     INFO -  DEBUG: Starting phase web-workers-shutdown
[task 2022-06-22T00:02:24.179Z] 00:02:24     INFO -  DEBUG: Spinning the event loop
[task 2022-06-22T00:02:24.183Z] 00:02:24     INFO -  DEBUG: Finished phase web-workers-shutdown
[task 2022-06-22T00:02:24.198Z] 00:02:24     INFO -  [Parent 24414, IPDL Background] WARNING: IPC Connection Error: [Parent][PBackgroundParent] RunMessage(msgname=PRemoteWorkerService::Msg___delete__) Channel closing: too late to send/recv, messages will be lost: file /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1876
[task 2022-06-22T00:02:24.812Z] 00:02:24     INFO -  [Parent 24414, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3359
[task 2022-06-22T00:02:24.823Z] 00:02:24     INFO -  [Parent 24414, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/obj-build/dist/include/mozilla/Preferences.h:129
[task 2022-06-22T00:02:24.841Z] 00:02:24     INFO -  [Parent 24414, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4569
[task 2022-06-22T00:02:24.844Z] 00:02:24     INFO -  nsStringStats
[task 2022-06-22T00:02:24.845Z] 00:02:24     INFO -   => mAllocCount:          71579
[task 2022-06-22T00:02:24.846Z] 00:02:24     INFO -   => mReallocCount:            0
[task 2022-06-22T00:02:24.846Z] 00:02:24     INFO -   => mFreeCount:           71578  --  LEAKED 1 !!!
[task 2022-06-22T00:02:24.846Z] 00:02:24     INFO -   => mShareCount:          50252
[task 2022-06-22T00:02:24.847Z] 00:02:24     INFO -   => mAdoptCount:           1679
[task 2022-06-22T00:02:24.847Z] 00:02:24     INFO -   => mAdoptFreeCount:       1771
[task 2022-06-22T00:02:24.847Z] 00:02:24     INFO -   => Process ID: 24414, Thread ID: 139628161734528
[task 2022-06-22T00:02:25.205Z] 00:02:25     INFO -  DEBUG: Adding blocker PermissionManager: Flushing data for phase xpcom-will-shutdown
[taskcluster 2022-06-22 00:02:26.463Z] === Task Finished ===
[taskcluster 2022-06-22 00:02:26.538Z] Artifact "public/logs" not found at "/builds/worker/workspace/logs/"
[taskcluster 2022-06-22 00:02:26.540Z] Artifact "public/test" not found at "/builds/worker/artifacts/"
[taskcluster 2022-06-22 00:02:26.541Z] Artifact "public/test_info" not found at "/builds/worker/workspace/build/blobber_upload_dir/"
[taskcluster 2022-06-22 00:02:26.600Z] Unsuccessful task run with exit code: 137 completed in 1259.169 seconds
Flags: needinfo?(dtownsend)
Whiteboard: [stockwell needswork:owner]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

Does the increase in frequency of these Docker crashes align with the worker changes?

Flags: needinfo?(dtownsend) → needinfo?(mgoossens)

Yes, the jobs I see in there all run on GCP (from what I remember), and the timing matches.
That's not ideal.

Flags: needinfo?(mgoossens)

What's the next step here to bring the failure frequency down, and who owns it?

Flags: needinfo?(mgoossens)

Well, not knowing what is going wrong, I have no clue myself; maybe ahal knows who we could forward it to.

Flags: needinfo?(mgoossens) → needinfo?(ahal)

After some quick Googling, it looks like exit code 137 means we ran out of memory. Are there particular suites this happens with? If so, maybe we can bump those up to the xlarge pool. If it is happening in lots of places, maybe we need to increase the instance type of the large pool?
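
(As an aside on where the number 137 comes from, a minimal illustration rather than anything from our CI code: on Linux, shells and container runtimes report a process killed by signal N as exit status 128 + N, and SIGKILL is 9, so 128 + 9 = 137.)

import signal
import subprocess

# Sketch only: demonstrate the 128 + signal encoding behind exit code 137.
print(128 + signal.SIGKILL)  # 137

# A child shell that SIGKILLs itself is reported the same way.
proc = subprocess.run(["sh", "-c", "kill -KILL $$"])
print(proc.returncode)        # Python reports -9 (negative signal number)
print(128 - proc.returncode)  # 137, the value docker-worker logs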

Component: General → Firefox-CI Administration
Flags: needinfo?(ahal)
Product: Firefox → Release Engineering
QA Contact: mgoossens
Version: Trunk → unspecified
Severity: normal → S3
Priority: -- → P3

There aren't any particular test suites where this is happening; it affects mochitest, xpcshell, crashtests, gtests, and marionette. The only thing they have in common is that this is a Linux-only failure:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-05-31&endday=2022-06-30&tree=trunk&bug=1623747
It was a pretty steady increase until last week, when it spiked considerably; in the last few days it has been failing very often.

Hey Michelle, could you take a look at how much RAM the AWS instances have vs. GCP? Could we bump the RAM up in the large pool in GCP without switching to a new instance type?

I suspect this bug is going to be the top priority w.r.t. the GCP migration until it is fixed.

Flags: needinfo?(mgoossens)

Both the AWS and GCP instances appear to have 8GB of memory from what I can see (n2-standard-2 vs. m5.large).

Flags: needinfo?(mgoossens)

I wonder if something in the image is causing us to use more memory than on the AWS pools. Can we increase memory for GCP regardless, to see if it works in the short term?

Maybe longer term we can look into improving the image.

Given that this is the #1 intermittent, we should prioritize investigating it. Here is a case where the failure happens in under 4 minutes; typically it takes 10+ minutes.

Do we know for a fact that this is memory? I imagine it is, based on the error code, but how can we tell (is there a console for GCP instances)? Is memory managed on GCP the same way as on AWS? I assume trying larger instances or something with more memory could reduce this problem.

No, it's not guaranteed that it's OOM. Per the "Exit code 137" heading of this article, when a Docker container exits with this code it means it received a SIGKILL. From that article and others, though, if the SIGKILL was not initiated manually, it usually comes from the Docker daemon itself killing the container due to OOM.
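
For anyone with shell access to a worker host, one way to confirm the OOM theory for a specific container is Docker's own OOMKilled flag. A rough sketch, assuming the docker CLI is available on the host and the container still exists (so not something a task can normally run on itself):

import json
import subprocess

def was_oom_killed(container_id):
    """Return True if Docker marked this container as killed by the OOM killer."""
    # "docker inspect" prints a JSON array with one object per container.
    out = subprocess.run(
        ["docker", "inspect", container_id],
        capture_output=True, text=True, check=True,
    ).stdout
    state = json.loads(out)[0]["State"]
    # ExitCode is 137 for any SIGKILL; OOMKilled distinguishes the kernel's
    # OOM killer from a manual kill.
    return state.get("ExitCode") == 137 and bool(state.get("OOMKilled"))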

On this try push:
https://treeherder.mozilla.org/jobs?repo=try&revision=0817494e1d36f4f75806ef28cfe261bd704811bd

there are:
1747 total test jobs
99 failed jobs
21 Exit code 137 failures

1.2% of the test jobs result in Exit code 137.

How can we test on a different instance type?

Flags: needinfo?(ahal)

The fastest way would be to change the {alias} in this line to a hardcoded t-linux-xlarge-gcp. This will make everything configured for the large workertype use the xlarge one instead. Then do a try push as normal.

The t-linux-xlarge workertype has 16GB of RAM instead of 8.

To solve this outside of a try push, we can either switch the instance type from n2-standard-2 to n2-highmem-2 (which also has 16GB of RAM), or we can create a custom instance with e.g. 12GB of RAM if that's all we need. These changes will need to happen in ci-configuration.

Flags: needinfo?(ahal)

Since the failures are getting out of hand, I am prioritizing this as the number one item on my plate.
A try push is baking to see if more RAM fixes this.

Assignee: nobody → mgoossens
Status: REOPENED → ASSIGNED
Priority: P3 → P2

Try push: https://treeherder.mozilla.org/jobs?repo=try&revision=5a0ff2119a63053e18bb1b08616daca2c712e264
ahal, jmaher: does that push look any better with respect to exit code 137?
It runs mochitest-browser-chrome on xlarge (mochitest-plain isn't, for some reason), but it should be a start!

Flags: needinfo?(jmaher)
Flags: needinfo?(ahal)

I did a push as well with mochitest-plain (not yet landed) using xlarge, and it reduced the failure rate, but out of ~1800 tasks I still had 6 exit code 137 errors, one of them after 34 seconds (still setting up the Python venv):
https://firefoxci.taskcluster-artifacts.net/W43QEQmIQxqErW_4BmL5OQ/0/public/logs/live_backing.log

This hints strongly that something else is going on; OOM isn't our only problem. Either we need to look elsewhere, or we have multiple problems.

It is promising that we have had a reduction in exit code 137 errors with the xlarge instances; in fact, there was almost a 50% reduction in other intermittents on the try push as well.

For people who like to hack in Redash, here is a query I put together in a few minutes (caveat: it would probably fail an interview for SQL skills) to show failures given a push ID (found via the DevTools Network panel while loading a try push and looking for the URL with pushid=XXXXXXXX):

set @PUSHID=1091826; /* jmaher mochitest-plain with xlarge */
/* set @PUSHID=1090928; masterwayz mochitest-plain */
set @PUSHID=1091837; /* masterwayz browser-chrome with xlarge */

select
  oom.counter as OOM,
  failures.counter as total_failures,
  count(j.id) as total_jobs
from
  (select
   count(tle.line) as counter
  from
   job j,
   job_type jt,
   text_log_error tle
  where
    j.push_id=@PUSHID AND
    j.job_type_id=jt.id AND
    jt.name like 'test-linux1804-64-%'
    and result='testfailed'
    and tle.job_id = j.id
    and tle.line like 'Unsuccessful task run with exit code: 137%') as oom,
  (select
    count(j.id) as counter
   from
     job_type jt,
     job j
   where
     j.push_id=@PUSHID AND
     j.job_type_id=jt.id AND
     jt.name like 'test-linux1804-64-%'
     and result='testfailed'
  ) as failures,
  job_type jt,
  job j
where
  j.push_id=@PUSHID AND
  j.job_type_id=jt.id AND
  jt.name like 'test-linux1804-64-%'
Flags: needinfo?(jmaher)

This hints strongly that something else is going on; OOM isn't our only problem.

I think something else might be going on, but that it's likely still memory related. Maybe there's a limit on the per-container memory allocation here that didn't exist with the other pool.

Dave, do we configure docker in the image? If so is it possible we're restricting the amount of memory available to running containers to a greater degree than we are in the AWS image?

Flags: needinfo?(ahal) → needinfo?(dhouse)

(In reply to Andrew Halberstadt [:ahal] from comment #162)

This hints strongly that something else is going on; OOM isn't our only problem.

I think something else might be going on, but that it's likely still memory related. Maybe there's a limit on the per-container memory allocation here that didn't exist with the other pool.

Dave, do we configure docker in the image? If so is it possible we're restricting the amount of memory available to running containers to a greater degree than we are in the AWS image?

This is a direct disk image from AWS, so the config should be the same except for what is available on the VM when it boots the disk. I'll verify what memory is available to Docker on a running instance.

Do the task logs record how much memory is used or available? We could compare with AWS runs to see how much is being used.

Since the larger instance type reduced the failures, could we run on 2xlarge or greater to work around this for now?

I think we should test on other instances. Are we sure each instance gets the memory allocated to it, or is there a collection of instances sharing from a pool of memory? I find it nearly impossible to believe that the example I shared above failed due to OOM during setup; if we were that close to OOM, we would hit this >30% of the time during setup, let alone at the first Firefox browser launch.

Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/e33cc022c0df
Temporarily fix for exit code 137 failures r=releng-reviewers,ahal

Michelle's fix went live a day ago. We should pay attention to how this affects the intermittent rate.

Just parsing the data from treeherder, I see:
date: count
2022-07-14: 36
2022-07-13: 72
2022-07-12: 146
2022-07-11: 91

The fix went live, I assume, about 25% of the way through 07-13, and we are about halfway through 07-14. I am doing another try push to compare the statistics against previous try pushes.

I did a push of linux mochitest-plain with --rebuild 10:
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=e8c4bd66be522b6843a25b64456d896c1a747e48

This is running on n2-highmem. I ran the query from comment 160 (modified: set @PUSHID=1093896;). Overall there are:
1878 total jobs
117 failures
15 OOM

This doesn't inspire a lot of confidence: highmem is probably reducing our failures, but only by 30-40%; I doubt we are close to a 50% reduction. That is more evidence that this isn't related to memory at all, or, if it is, that we are not allocated a stable, fixed amount of memory per instance/container.

I'd guess it's related to memory, but not to the total memory available to the host.

Can we configure these exit code 137 errors to be TBPL_RETRY? Maybe we can switch from n2-highmem to xlarge?

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Assignee: mgoossens → nobody
Status: ASSIGNED → NEW

Nothing solid yet from the system side:

I was hopeful when I saw an oom_adj deprecation notice,

Jul 20 16:42:24 dhouse-gecko-t-xlarge-mem-check-image-2 kernel: [   30.189261] start-worker (2281): /proc/2281/oom_adj is deprecated, please use /proc/2281/oom_score_adj instead.

but we see the same on AWS:

Jul 20 16:47:33 ip-10-145-79-109 kernel [   70.056394] start-worker (2360): /proc/2360/oom_adj is deprecated, please use /proc/2360/oom_score_adj instead.

From what I can find, the GCP instances have as much or more memory.total, as reported by docker-worker, compared to AWS.

gecko-t.t-linux-large-gcp 16827727872
vs
gecko-t.t-linux-large.m5large 8105631744

and
gecko-t.t-linux-xlarge-gcp 16827449344
vs
gecko-t.t-linux-xlarge.m5axlarge 16431046656

I checked logs in Papertrail for just over 24h to get the total memory that docker-worker sees on GCP and AWS (averaged; all values are within a few bytes):

gecko-t.misc.c5dxlarge 8031952896
gecko-t.misc.m5dxlarge 16428949504
gecko-t.misc.r5dxlarge 33252298752
gecko-t.t-linux-large-gcp 16827727872
gecko-t.t-linux-large.m5large 8105631744
gecko-t.t-linux-metal.m5metal 405176481792
gecko-t.t-linux-metal.r5metal 811050600448
gecko-t.t-linux-xlarge-gcp 16827449344
gecko-t.t-linux-xlarge-source.c5xlarge 8031952896
gecko-t.t-linux-xlarge-source.m5axlarge 16431046656
gecko-t.t-linux-xlarge-source.m5dxlarge 16428949504
gecko-t.t-linux-xlarge-source.m5xlarge 16428949504
gecko-t.t-linux-xlarge.c5xlarge 8031952896
gecko-t.t-linux-xlarge.m5axlarge 16431046656
gecko-t.t-linux-xlarge.m5dxlarge 16428949504
gecko-t.t-linux-xlarge.m5xlarge 16428949504

The log lines look like:

Jul 18 20:15:40 gecko-t-t-linux-xlarge-gcp-tojjarhtrie7im-weuaocq docker-worker: 2022/07/18 20:15:40 {"EnvVersion":"2.0","Fields":{"key":"memory.total","v":1,"val":16827449344},"Hostname":"gecko-t-t-linux-xlarge-gcp-tojjarhtrie7im-weuaocq","Logger":"taskcluster.docker-worker.gecko-t.t-linux-xlarge-gcp.projects/887720501152/machineTypes/n2-standard-4","Pid":2435,"Severity":6,"Timestamp":1658175340287000000,"Type":"monitor.measure","serviceContext":{"service":"docker-worker"},"severity":"INFO"}
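
For anyone who wants to repeat this, here is a rough sketch of how lines like the one above can be tallied once exported from Papertrail. The field names come from the sample line; the export file name is hypothetical.

import json
import re
from collections import defaultdict

# Each docker-worker line ends with a JSON payload; memory metrics appear as
# Fields.key == "memory.total" (or "memory.free") with the value in Fields.val.
PAYLOAD = re.compile(r"docker-worker: \S+ \S+ (\{.*\})\s*$")

def memory_totals_by_worker_type(log_path):
    """Map each Logger (worker type) string to the memory.total values seen."""
    seen = defaultdict(set)
    with open(log_path) as fh:
        for line in fh:
            match = PAYLOAD.search(line)
            if not match:
                continue
            record = json.loads(match.group(1))
            fields = record.get("Fields", {})
            if fields.get("key") == "memory.total":
                seen[record.get("Logger", "unknown")].add(fields["val"])
    return dict(seen)

# Hypothetical usage:
# print(memory_totals_by_worker_type("papertrail-export.log"))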

This is good info, dhouse. Can we check the memory allocated to and used by the Docker container? Possibly there is some setting that is using more memory on GCP than on AWS.

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #182)

This is good info, dhouse. Can we check the memory allocated to and used by the Docker container? Possibly there is some setting that is using more memory on GCP than on AWS.

Maybe we could check this from a task, and re-run it many times to see if we catch when a task gets this failure? I took the total from memory metrics recorded by docker-worker. But checking inside the task could give us more information.

I'll collect the memory.free from the logs for different worker types and look for low/min to compare.

Also, I'll look more for logs on some of the specific instances/tasks with failures from the intermittent failures search/view on this bug. There could be something we're missing.
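
A sketch of what a task-side check could look like: read the cgroup memory limit the container actually received plus the host's MemTotal, and log both. The paths are the common cgroup v1/v2 locations; this is illustrative, not something the tasks run today.

import pathlib

def container_memory_limit_bytes():
    """Best-effort read of the memory limit applied to this container.

    Checks the usual cgroup v2 then v1 paths; returns None if nothing readable
    is found. Note cgroup v1 reports a very large sentinel when unlimited.
    """
    for path in ("/sys/fs/cgroup/memory.max",                     # cgroup v2
                 "/sys/fs/cgroup/memory/memory.limit_in_bytes"):  # cgroup v1
        p = pathlib.Path(path)
        if p.exists():
            raw = p.read_text().strip()
            if raw != "max":
                return int(raw)
    return None

def host_memtotal_bytes():
    """MemTotal from /proc/meminfo, converted from kB to bytes."""
    with open("/proc/meminfo") as fh:
        for line in fh:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    return None

print("container memory limit:", container_memory_limit_bytes())
print("host MemTotal:         ", host_memtotal_bytes())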

Assignee: nobody → mgoossens
Status: NEW → ASSIGNED

No, bad bot.

Assignee: mgoossens → nobody
Status: ASSIGNED → NEW
Assignee: nobody → mgoossens
Attachment #9286865 - Attachment description: Bug 1623747 - Run large tests on xlarge to reduce errors rates r=ahal!,jmaher! → Bug 1623747 - Run large tests on xlarge to reduce error rates r=ahal!,jmaher!
Status: NEW → ASSIGNED
Pushed by mgoossens@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/317a9ea4c9ad
Run large tests on xlarge to reduce error rates r=ahal,jmaher

I'd be surprised if this patch actually caused that test to fail. Afaict, it's just a refactor and shouldn't have any impact on the task definitions. I'd guess the test is permafail (or very highly intermittent) and we just haven't noticed that yet. I ran a backfill on Michelle's push to confirm.

So far this looks to be 100% reproducible on that push, and previous pushes are not showing any failures. I sanity-checked the log files. Looking at the reftest viewer, the reference image "timed out after 2000ms", so possibly the larger machine changes the timing?

I believe we can set this to retry via:
https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/transforms/job/mozharness.py#210

It looks to be an array, so maybe setting it to [4, 137] would work?

See Also: → 1731862

So far this looks to be 100% reproducible on that push, and previous pushes are not showing any failures. I sanity-checked the log files. Looking at the reftest viewer, the reference image "timed out after 2000ms", so possibly the larger machine changes the timing?

OK, that makes sense. There's an existing intermittent on file here (bug 1731862). I mistakenly thought the patch that landed was simply a cleanup and that the switch to the larger instances had already happened.

I believe we can set this to retry via:
https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/transforms/job/mozharness.py#210

If the worker is crashing due to running out of memory, I'm not sure mozharness will still be running to do the retry. I know Taskcluster has a built-in retry mechanism; we'd probably need to use that instead. Something like this:
https://searchfox.org/mozilla-central/source/taskcluster/ci/release-final-verify/kind.yml#25

That is the same as the link I had for the mozharness transform; basically, we need to set the task definition to accept exit code 137 as a retry code:

    # Retry if mozharness returns TBPL_RETRY
    worker["retry-exit-status"] = [4, 137]

Maybe :masterwayz could do some try pushes to see if we can retry on the exit code 137 errors, and then we could consider switching back to the regular instances. Then we can keep tabs on the % CPU usage over time and make sure it isn't increasing by more than 2% of our total.

For the math: would the cost of xlarge everywhere outstrip the cost of large + 2%?
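
For reference, a minimal sketch of what appending 137 to the worker's retry codes could look like in a gecko_taskgraph transform; the helper name and call site are illustrative, not the actual patch.

# Sketch only -- the real patch may structure this differently.
def allow_retry_on_exit_137(worker):
    """Let the task be retried when docker-worker reports exit code 137."""
    retry_codes = worker.setdefault("retry-exit-status", [])
    if 137 not in retry_codes:
        retry_codes.append(137)

# Hypothetical call site, where the transform builds the worker payload:
# allow_retry_on_exit_137(taskdesc["worker"])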

I'll work on a patch with that and try things out!

Attachment #9286865 - Attachment is obsolete: true

Backed out changeset e33cc022c0df (bug 1623747) as it is no longer needed;

Pushed by mgoossens@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/0732258f3f67
Backed out 1 changesets (bug 1623747) r=releng-reviewers,ahal
Flags: needinfo?(mgoossens)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago2 years ago
Resolution: --- → FIXED

This is still happening

Status: RESOLVED → REOPENED
Flags: needinfo?(mgoossens)
Resolution: FIXED → ---

linting/source-test specific tasks

Attachment #9285172 - Attachment is obsolete: true

This was not supposed to end up being closed.

Flags: needinfo?(mgoossens)
Keywords: leave-open
Whiteboard: [stockwell disable-recommended]

Sounds like we need this for source-test and lint, then an uplift to mozilla-beta. That won't get everything, but it will get the large majority. If we want everything, there are also instances in update-generate-sources* which affect Windows/macOS.

Comment on attachment 9287705 [details]
Bug 1623747 - retry task on exit code 137. r=ahal!

Beta/Release Uplift Approval Request

  • User impact if declined: n/a
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This cleans up some infrastructure failures in CI by setting them to auto_retry, which seems to solve the problem!
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9287705 - Flags: approval-mozilla-beta?
Attachment #9287772 - Flags: approval-mozilla-beta?

Comment on attachment 9287705 [details]
Bug 1623747 - retry task on exit code 137. r=ahal!

Approved for 104.0b5

Attachment #9287705 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9287772 - Flags: approval-mozilla-beta?
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/683d337f92f2
Auto retry source-test jobs on exit 137. r=ahal

Comment on attachment 9288035 [details]
Bug 1623747 - Auto retry source-test jobs on exit 137. r=ahal!

Beta/Release Uplift Approval Request

  • User impact if declined: n/a
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Helps retry tasks that fail on Linux with a known infrastructure error.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9288035 - Flags: approval-mozilla-beta?

Comment on attachment 9288035 [details]
Bug 1623747 - Auto retry source-test jobs on exit 137. r=ahal!

Approved for 104.0b6

Attachment #9288035 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Whiteboard: [stockwell disable-recommended]
Assignee: mgoossens → nobody
Priority: P2 → P3
Flags: needinfo?(dhouse)
