Closed Bug 1641619 Opened 5 years ago Closed 5 years ago

Intermittent segfault in Python testing

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: mdroettboom, Assigned: mdroettboom)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

Link to GitHub pull-request: https://github.com/mozilla/glean/pull/945 5 years ago GitHub Bugzilla PR Linker 41 bytes, text/x-github-pull-request
Link to GitHub pull-request: https://github.com/mozilla/glean/pull/1011 5 years ago GitHub Bugzilla PR Linker 42 bytes, text/x-github-pull-request
Link to GitHub pull-request: https://github.com/mozilla/glean/pull/1018 5 years ago GitHub Bugzilla PR Linker 42 bytes, text/x-github-pull-request

Michael Droettboom [:mdroettboom]

Assignee

Description

•

5 years ago

https://circleci.com/gh/mozilla/glean/65637

glean-core/python/tests/test_dispatcher.py ........                      [  7%]
glean-core/python/tests/test_glean.py .......s...s....ss..............   [ 35%]
glean-core/python/tests/test_loader.py make: *** [Makefile:68: test-python] Segmentation fault (core dumped)

Michael Droettboom [:mdroettboom]

Assignee

Updated

•

5 years ago

Priority: P3 → P2

Whiteboard: [telemetry:glean-rs:m?]

Jan-Erik Rediger [:janerik]

Comment 2

•

5 years ago

Another crash: https://app.circleci.com/pipelines/github/mozilla/glean/4605/workflows/9658e5a8-88f0-4d49-9cf7-d220fcdcc2c3/jobs/66708/steps

It might be time we reproduce this.

Michael Droettboom [:mdroettboom]

Assignee

Updated

•

5 years ago

Assignee: nobody → mdroettboom

GitHub Bugzilla PR Linker

Comment 3

•

5 years ago

Attached file: Link to GitHub pull-request: https://github.com/mozilla/glean/pull/945

Michael Droettboom [:mdroettboom]

Assignee

Comment 4

•

5 years ago

Added saving coredump artifacts to CI. That alone seems to have fixed it, since I haven't seen this since ;)
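
For anyone trying to capture similar crash data locally, here is a minimal Python sketch of an analogous setup. It is not the change made in the linked pull request, which adjusted the CI configuration. The helper name `enable_crash_diagnostics` is hypothetical; it assumes a Unix-like system where the standard `resource` module is available.

```python
import faulthandler
import resource


def enable_crash_diagnostics(stream):
    """Hypothetical helper: raise the core-dump limit and enable faulthandler.

    `stream` must be an open file with a real file descriptor and must stay
    open for as long as faulthandler is enabled.
    """
    # Raise the soft core-file size limit up to the hard limit, so the
    # kernel is allowed to write a core file when the process segfaults
    # (if the hard limit itself is 0, core dumps stay disabled).
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    # On fatal signals such as SIGSEGV, dump the Python-level traceback of
    # every thread to `stream`.
    faulthandler.enable(file=stream, all_threads=True)

    return resource.getrlimit(resource.RLIMIT_CORE)
```

Called once at the start of a test run (for example with `sys.stderr` as the stream), this should make a native crash leave behind both a Python traceback and, where the system allows it, a core file.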

Michael Droettboom [:mdroettboom]

Assignee

Updated

•

5 years ago

Assignee: mdroettboom → nobody

Michael Droettboom [:mdroettboom]

Assignee

Comment 5

•

5 years ago

Here's a recent Python testing segfault: https://app.circleci.com/pipelines/github/mozilla/glean/4937/workflows/820ec954-f0b1-481e-baee-d35fc0fad978/jobs/73951/steps

GitHub Bugzilla PR Linker

Comment 6

•

5 years ago

Attached file: Link to GitHub pull-request: https://github.com/mozilla/glean/pull/1011

Michael Droettboom [:mdroettboom]

Assignee

Updated

•

5 years ago

Assignee: nobody → mdroettboom

Michael Droettboom [:mdroettboom]

Assignee

Comment 7

•

5 years ago

Here's another segfault, this time with a coredump: https://app.circleci.com/pipelines/github/mozilla/glean/4958/workflows/0cd2547e-a819-4c33-a102-f3218d4926dd/jobs/74356

Michael Droettboom [:mdroettboom]

Assignee

Comment 8

•

5 years ago

Here's the backtrace:

#0  0x00007f04d8d9ea0d in mdb_env_reader_dest (ptr=0x7f04da61c080) at /home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/lmdb-rkv-sys-0.9.6/lmdb/libraries/liblmdb/mdb.c:4483
#1  0x00007f04dae9d1a1 in __nptl_deallocate_tsd () at pthread_create.c:301
#2  0x00007f04dae9dfc4 in __nptl_deallocate_tsd () at pthread_create.c:256
#3  start_thread (arg=<optimized out>) at pthread_create.c:497
#4  0x00007f04dac414cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Michael Droettboom [:mdroettboom]

Assignee

Comment 9

•

5 years ago

I guess, ouch?

:jan-erik, flagging you in case you have any ideas about where to look next, or any ideas based on what you may know about lmdb.

Flags: needinfo?(jrediger)

Michael Droettboom [:mdroettboom]

Assignee

Comment 10

•

5 years ago

At least there's some logic to the backtrace -- the segfaults started to appear around the time that threading was added to the Python bindings.

There's some evidence that this kind of thing occurs if the parent thread exits before joining on the child thread: https://stackoverflow.com/questions/26308066/segmentation-fault-in-pthread-create

The Glean Python bindings already join the child thread before exiting, but there might be a race condition somewhere around that. In any event, cleaning up the threading details may be a legitimate fix, short of figuring out what the heck is going on in lmdb.
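
As an illustration of the join-before-exit pattern mentioned here, a minimal Python sketch follows. `Worker`, `submit`, and `shutdown` are hypothetical names chosen for this example; this is not the Glean dispatcher and makes no claim about how the bindings are implemented.

```python
import queue
import threading


class Worker:
    """Hypothetical background worker, for illustration only."""

    _STOP = None  # sentinel telling the worker thread to finish

    def __init__(self):
        self.results = []
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Process tasks in order until the stop sentinel arrives.
        while True:
            task = self._queue.get()
            if task is self._STOP:
                break
            self.results.append(task())

    def submit(self, task):
        self._queue.put(task)

    def shutdown(self, timeout=5.0):
        """Ask the worker to stop, then wait for it to finish.

        Returns True if the thread has exited within `timeout` seconds.
        """
        self._queue.put(self._STOP)
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

Registering `worker.shutdown` with `atexit.register` is one way to make sure the join happens before the interpreter tears down, so any thread-local cleanup in native code runs while the process state is still intact.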

Michael Droettboom [:mdroettboom]

Assignee

Comment 11

•

5 years ago

Possible issue with mdb_env_reader_dest: https://www.openldap.org/lists/openldap-bugs/201809/msg00009.html

Michael Droettboom [:mdroettboom]

Assignee

Comment 12

•

5 years ago

Also related: https://github.com/erthink/ReOpenLDAP/issues/48

Jan-Erik Rediger [:janerik]

Comment 13

•

5 years ago

I know Firefox encountered some other crashes with LMDB in the past.
:vporof, have you seen the above issue before?

Flags: needinfo?(jrediger) → needinfo?(vporof)

Michael Droettboom [:mdroettboom]

Assignee

Comment 14

•

5 years ago

Another data point -- this is quite likely a testing-environment-only problem. In the unit tests we create/destroy the LMDB environment with every test. In normal use, this only happens once per process so it would be hard to hit.
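
To make the difference concrete, here is a small sketch of the per-test create/destroy pattern using the standard `unittest` module. `FakeEnvironment` is a hypothetical stand-in that only counts open and close calls; it is not LMDB, rkv, or any Glean API.

```python
import io
import unittest


class FakeEnvironment:
    """Hypothetical stand-in for a storage environment; only counts calls."""

    opened = 0
    closed = 0

    def __init__(self):
        type(self).opened += 1

    def close(self):
        type(self).closed += 1


class PerTestEnvironment(unittest.TestCase):
    def setUp(self):
        # A fresh environment is created for every single test...
        self.env = FakeEnvironment()
        # ...and torn down again as soon as that test finishes.
        self.addCleanup(self.env.close)

    def test_first(self):
        self.assertIsNotNone(self.env)

    def test_second(self):
        self.assertIsNotNone(self.env)

    def test_third(self):
        self.assertIsNotNone(self.env)


def run_suite():
    """Run the three tests quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PerTestEnvironment)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

One test process here goes through three create/destroy cycles, whereas an application following the once-per-process pattern would go through one. Any teardown race therefore gets many more chances to trigger under test.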

GitHub Bugzilla PR Linker

Comment 15

•

5 years ago

Attached file: Link to GitHub pull-request: https://github.com/mozilla/glean/pull/1018

Victor Porof [:vporof][:vp]

Comment 16

•

5 years ago

I haven't seen that issue yet. Because of the prevalence of these new crashes, the current plan is to move all stores that use RKV to the safe-mode storage driver and away from LMDB. The work is happening in https://bugzilla.mozilla.org/show_bug.cgi?id=1612550

I'll make this block rkv-perf-mode in the meantime though.

Flags: needinfo?(vporof)

Victor Porof [:vporof][:vp]

Updated

•

5 years ago

Blocks: rkv-perf-mode

Michael Droettboom [:mdroettboom]

Assignee

Comment 17

•

5 years ago

We haven't seen this crash in a long while, so it may now be less prevalent given recent Python threading changes. In addition, I'm confident this is a testing-framework-only thing.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → WONTFIX
