Closed Bug 1803831 Opened 1 year ago Closed 1 year ago

Migrate valgrind task from AWS -> GCP

Categories

(Firefox Build System :: Task Configuration, task)

Tracking

(firefox110 fixed)

RESOLVED FIXED
110 Branch

People

(Reporter: ahal, Assigned: jcristau)

This bug will track migrating the valgrind task to GCP.

Keywords: leave-open

When I run this on try, the task fails:
https://firefox-ci-tc.services.mozilla.com/tasks/SxEpRKsKR6mPgUqz9b2tJA

Log snippet:

[task 2022-12-02T17:06:41.092Z] 17:06:41     INFO -  --29980-- memcheck GC: 31966 nodes, 31853 survivors (99.6%)
[task 2022-12-02T17:06:41.092Z] 17:06:41     INFO -  --29980-- memcheck GC: 45206 new table size (stepup)
[task 2022-12-02T17:06:42.243Z] 17:06:42     INFO -  --29980-- WARNING: Serious error when reading debug info
[task 2022-12-02T17:06:42.244Z] 17:06:42     INFO -  --29980-- When reading debug info from /memfd:mozilla-ipc (deleted):
[task 2022-12-02T17:06:42.244Z] 17:06:42     INFO -  --29980-- failed to stat64/stat this file
[task 2022-12-02T17:06:47.545Z] 17:06:47     INFO -  PERFHERDER_DATA: {"framework": {"name": "build_metrics"}, "suites": [{"name": "valgrind", "value": 41.65303177700025, "lowerIsBetter": true, "shouldAlert": false, "subtests": [], "extraOptions": ["taskcluster-projects/887720501152/machineTypes/n2-custom-16-73728"]}]}
[task 2022-12-02T17:06:47.545Z] 17:06:47     INFO -  TEST-PASS | valgrind-test | valgrind found no errors
[task 2022-12-02T17:06:47.545Z] 17:06:47     INFO -  TEST-UNEXPECTED-FAIL | valgrind-test | non-zero exit code from Valgrind: -11
[task 2022-12-02T17:06:48.029Z] 17:06:48    ERROR - Return code: 2

I'm not sure what this means, or what could have changed in the host image to cause it. Mike, your name comes up the most in blame for this task; do you know what's going on, or who I could ping to help me debug?
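One possible reading of the `-11` in the log above, assuming the test harness reports the raw return code from Python's `subprocess` module (the log does not confirm this): on POSIX, a negative return code means the child was terminated by that signal number, and signal 11 on Linux is SIGSEGV. A minimal sketch of that decoding follows; `describe_exit_code` is a hypothetical helper, not part of the harness.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a subprocess-style return code (hypothetical helper)."""
    if code < 0:
        # POSIX convention used by Python's subprocess module:
        # a negative value -N means "terminated by signal N".
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            return f"killed by unknown signal {-code}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-11))
```

Under that assumption, the failure would be valgrind itself segfaulting rather than valgrind reporting an error in Firefox, which fits the "valgrind found no errors" line printed just before it.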

Flags: needinfo?(mh+mozilla)

Do we know the kernel version on the AWS workers? Is it possible that it's too old to support memfd, so we take a different code path there?

Never mind: ignoring the stat failure on memfd doesn't actually stop valgrind from crashing.

I've reproduced this on an interactive task. Unfortunately, even after attaching gdb to the valgrind process, I don't get useful information:

Attaching to process 19563
Reading symbols from /usr/libexec/valgrind/memcheck-amd64-linux...
Reading symbols from /usr/lib/debug/.build-id/9b/1fa60c727acfa38c726ec45680af7bf2edd433.debug...
0x00000000580c2b17 in get_slowcase (img=0x100c75fcf0, off=<optimized out>) at m_debuginfo/image.c:810
810     m_debuginfo/image.c: No such file or directory.
(gdb) c
Continuing.
[Detaching after fork from child process 19589]

Program received signal SIGSEGV, Segmentation fault.
0x000000100ba1ba57 in ?? ()
(gdb) bt
#0  0x000000100ba1ba57 in ?? ()
#1  0x0000001008fadf30 in ?? ()
#2  0x0000001008fadf18 in ?? ()
#3  0x0000001008fadf30 in ?? ()
#4  0x0000000000001c10 in ?? ()
#5  0x0000000000000001 in ?? ()
#6  0x0000001009819db0 in ?? ()
#7  0x0000000000000000 in ?? ()

Mike or Julian, any advice on how to figure this out?

Flags: needinfo?(jseward)

I suggest trying with valgrind-3.20.0. 3.19 has (severe) problems reading
DWARF5 debuginfo, and what seems to have happened here is a crash
in the debuginfo reader. DWARF5 support is much improved in 3.20.

Flags: needinfo?(jseward)

I tried valgrind-3.20.0 per Julian's advice; unfortunately, that didn't improve things.

Then I tried running the task on a different worker type (gecko-t/t-linux-kvm-gcp instead of gecko-1/b-linux-gcp), and that appears to work.

Some differences between those pools:

  • different VM image; I can't tell what the actual changes are
  • machine-type n2-standard-16 (t-linux-kvm-gcp) vs n2-custom-16-73728 (b-linux-gcp)
  • kvm and nested virtualization enabled in t-linux-kvm-gcp
  • different disk configuration, hopefully irrelevant
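
The machine type a given run actually used can be cross-checked against the `PERFHERDER_DATA` line the task logs, whose `extraOptions` entry ends in the machine type. The sketch below pulls that value out of the line quoted earlier in this bug; `machine_type_from_log_line` is a hypothetical helper, and the assumption that the machine type is always the last path component of an `extraOptions` entry containing `/machineTypes/` is based only on that one log line.

```python
import json

# Log line copied from the failing task's log earlier in this bug.
LOG_LINE = (
    '[task 2022-12-02T17:06:47.545Z] 17:06:47     INFO -  PERFHERDER_DATA: '
    '{"framework": {"name": "build_metrics"}, "suites": [{"name": "valgrind", '
    '"value": 41.65303177700025, "lowerIsBetter": true, "shouldAlert": false, '
    '"subtests": [], "extraOptions": '
    '["taskcluster-projects/887720501152/machineTypes/n2-custom-16-73728"]}]}'
)


def machine_type_from_log_line(line):
    """Return the machine type named in a PERFHERDER_DATA line, or None."""
    marker = "PERFHERDER_DATA: "
    start = line.find(marker)
    if start == -1:
        return None
    # Everything after the marker is expected to be a single JSON object.
    data = json.loads(line[start + len(marker):])
    for suite in data.get("suites", []):
        for option in suite.get("extraOptions", []):
            if "/machineTypes/" in option:
                # Assumed layout: .../machineTypes/<machine-type>
                return option.rsplit("/", 1)[-1]
    return None


if __name__ == "__main__":
    print(machine_type_from_log_line(LOG_LINE))
```

For the failing run this yields the custom machine type listed for b-linux-gcp above.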
Assignee: ahal → jcristau
Flags: needinfo?(mh+mozilla)

For some reason when running on b-linux-gcp workers, valgrind crashes, but it
runs OK on t-linux-kvm-gcp, so use that.

Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5a65f0968b51
[ci] Migrate 'valgrind' tasks from AWS -> GCP, r=MasterWayZ,ahal,glandium
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 110 Branch