Open Bug 1819763 Opened 2 years ago Updated 1 day ago

Intermittent test failure with fatal "Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!"

Categories

(Testing :: General, defect)

Default
defect

Tracking

(Not tracked)

People

(Reporter: dholbert, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

We've got a handful of intermittent-test-failure bugs that actually include this as the specific source of the error (which seems to trigger an abort):

Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!

It looks like this isn't specific to us; if I search the web for this message, I find e.g. this message from 2019 from someone launching several Chromium instances via puppeteer:
https://github.com/puppeteer/puppeteer/issues/2207#issuecomment-541550628

It looks like there's a launchpad issue filed about it, as a glibc bug, too:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1863162

...and as noted there, it's fixed in glibc 2.34, though I think Ubuntu 22.04 is the first LTS release that includes a version as new as that. (https://launchpad.net/ubuntu/focal/+source/glibc says Ubuntu 20.04 has version 2.31, and our test runners are on Ubuntu 18.04, which is presumably using an even older version.)

So I think this issue will persist until either:
(1) we update our test runners to Ubuntu version 22.04 (or newer)
or:
(2) the glibc patch gets uplifted to 18.04 (as requested on launchpad, though I don't know how likely it is to happen)
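
To check whether a given runner is on a glibc older than the fixed release, something like the following sketch could be run on the worker. It assumes the 2.34 threshold quoted from the Launchpad report above; the helper names (`parse_glibc_version`, `is_affected`, `current_glibc_version`) are made up for illustration and are not part of any existing harness.

```python
import os

# Assumption: 2.34 is the first glibc release with the fix, as stated in the
# Launchpad report linked above.
FIXED_IN = (2, 34)


def parse_glibc_version(text):
    """Parse a string such as 'glibc 2.31' into a tuple like (2, 31)."""
    version = text.split()[-1]
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_affected(version):
    """Return True if the given (major, minor) version predates the fix."""
    return version < FIXED_IN


def current_glibc_version():
    """Return the running glibc version, or None if it cannot be determined."""
    try:
        return parse_glibc_version(os.confstr("CS_GNU_LIBC_VERSION"))
    except (AttributeError, OSError, TypeError, ValueError):
        # Not a glibc system, or confstr is unavailable on this platform.
        return None


if __name__ == "__main__":
    version = current_glibc_version()
    if version is None:
        print("Could not determine the glibc version on this system.")
    elif is_affected(version):
        print("glibc %d.%d predates 2.34; the assertion may still be hit." % version)
    else:
        print("glibc %d.%d is at or past 2.34." % version)
```

This only compares upstream version numbers, so it would not notice a distro backport of the patch (option 2 above).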

Duplicate of this bug: 1819731

Here's one sample log where we hit this issue, btw (from dupe bug 1819731):
https://treeherder.mozilla.org/logviewer?job_id=407460102&repo=autoland

Duplicate of this bug: 1753108

bug 1747488 comment 0 and bug 1808652 comment 0 are instances of this; though their more recent failure-reports seem to be different issues, so I'm not duping them here.

(I found those via Bugzilla search, which was lucky enough to match the log snippet that we include automatically in the first comment of intermittent bugs. We've probably hit this lots of other times and starred it as some known test crash, too, though it's harder to find those instances since we don't automatically copy-paste the log onto Bugzilla when the bug's already been filed.)
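
For anyone who wants to look for other instances in logs they have already downloaded, here is a rough sketch of a scanner for this failure signature. The pattern is built from the assertion message quoted in comment 0; the function name and the choice to ignore the exact source line number are my own assumptions, not an existing tool.

```python
import re

# Match the ld.so assertion from comment 0, without pinning the source line
# number (481), since that could differ between glibc builds.
SIGNATURE = re.compile(
    r"Inconsistency detected by ld\.so: .*dl-tls\.c: \d+: "
    r"_dl_allocate_tls_init: Assertion"
)


def find_tls_assertion_lines(log_text):
    """Return (line_number, line) pairs for log lines matching the signature."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if SIGNATURE.search(line)
    ]


if __name__ == "__main__":
    sample = (
        "INFO - starting test\n"
        "Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: "
        "_dl_allocate_tls_init: Assertion "
        "`listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!\n"
        "INFO - process exited\n"
    )
    for number, line in find_tls_assertion_lines(sample):
        print("line %d: %s" % (number, line))
```

Feeding it the text of a saved log would list the matching lines, which might help decide whether a failure starred under another bug is really this one.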

Duplicate of this bug: 1778121
Duplicate of this bug: 1771772
Duplicate of this bug: 1822791
Duplicate of this bug: 1822912
Duplicate of this bug: 1831892
Duplicate of this bug: 1831885
Duplicate of this bug: 1836194
Duplicate of this bug: 1837566
Duplicate of this bug: 1839042
Duplicate of this bug: 1838610
Duplicate of this bug: 1847241
Duplicate of this bug: 1854178
Duplicate of this bug: 1860410
Duplicate of this bug: 1861245

(In reply to Daniel Holbert [:dholbert] from comment #0)
> So I think this issue will persist until either:
> (1) we update our test runners to Ubuntu version 22.04 (or newer)
> or:
> (2) the glibc patch gets uplifted to 18.04 (as requested on launchpad, though I don't know how likely it is to happen)

Hi Andrew, do we already have a bug that tracks the work for the 22.04 workers that we could make this bug dependent on? Thanks.

Flags: needinfo?(aerickson)
Duplicate of this bug: 1873804
Duplicate of this bug: 1834433

We have 22.04 gui/wayland worker pools up and running now.

# runs as the task_XYZ user
https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Ft-linux-2204-wayland-experimental
# if you need root access (we can't make l3 pools with this)
https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Ft-linux-2204-wayland-root-exp  

We don't have a Bugzilla ticket, but have a Jira tracking this at https://mozilla-hub.atlassian.net/browse/RELOPS-793.

Flags: needinfo?(aerickson)

Andrew, this is good to know. Did these actually already replace the formerly used Wayland workers or is this some work that still needs to be done?

The old VirtualBox-based Wayland pool is still around and usable, but ideally it won't be receiving updates, and tasks should be migrated to these new non-VirtualBox pools. I haven't done enough testing to see whether all of the tasks on the VirtualBox-based pool can be moved to the new pools above. (I know we hit some issues with the task user not being in the audio group, which is needed to enable PulseAudio use, but I think that can be resolved in the task payload.)

Releng and users manage task assignments to pools; RelSRE focuses on providing worker images and hardware instances that can be used in pools. RelSRE doesn't have the bandwidth or expertise to manage task assignment. Worker pool creation and management is currently shared between Releng and RelSRE.

Long answer... please let me know if I can clarify anything.

No, that's totally understandable and the separation of work makes sense. It was just a general question.

Joel, maybe you are the right person to ask then. Is there work already planned to move existing test suites to the new Wayland workers? If yes, on which bug is that tracked? This issue seems to be mostly triggered by Mochitests, and I would also be interested in the remote tests being moved, to check whether bug 1861933 still occurs there.

Flags: needinfo?(jmaher)

I don't have anything on my radar, but it is something we should do. Right now the worker pool looks to be experimental, so we can try a few things there. I suspect the migration will fall more into later this month or early March, assuming we get the OS issues sorted out.

Flags: needinfo?(jmaher)
Duplicate of this bug: 1881844
Duplicate of this bug: 1821413
Duplicate of this bug: 1937043
Duplicate of this bug: 1945696
Duplicate of this bug: 1940190
Duplicate of this bug: 1959887
Duplicate of this bug: 1950004
Duplicate of this bug: 1947578
Duplicate of this bug: 1917122
Duplicate of this bug: 1907750
Duplicate of this bug: 1906209
Duplicate of this bug: 1904834
Duplicate of this bug: 1903653
Duplicate of this bug: 1902573
Duplicate of this bug: 1900994
Duplicate of this bug: 1897236