Closed Bug 1641619 Opened 4 years ago Closed 4 years ago

Intermittent segfault in Python testing

Categories

(Data Platform and Tools :: Glean: SDK, defect, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mdroettboom, Assigned: mdroettboom)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

https://circleci.com/gh/mozilla/glean/65637

glean-core/python/tests/test_dispatcher.py ........                      [  7%]
glean-core/python/tests/test_glean.py .......s...s....ss..............   [ 35%]
glean-core/python/tests/test_loader.py make: *** [Makefile:68: test-python] Segmentation fault (core dumped)
Priority: P3 → P2
Whiteboard: [telemetry:glean-rs:m?]
Assignee: nobody → mdroettboom

I added saving of core dump artifacts to CI. That alone seems to have fixed it, since I haven't seen the crash since ;)

Assignee: mdroettboom → nobody
Assignee: nobody → mdroettboom

Here's the backtrace:

#0  0x00007f04d8d9ea0d in mdb_env_reader_dest (ptr=0x7f04da61c080) at /home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/lmdb-rkv-sys-0.9.6/lmdb/libraries/liblmdb/mdb.c:4483
#1  0x00007f04dae9d1a1 in __nptl_deallocate_tsd () at pthread_create.c:301
#2  0x00007f04dae9dfc4 in __nptl_deallocate_tsd () at pthread_create.c:256
#3  start_thread (arg=<optimized out>) at pthread_create.c:497
#4  0x00007f04dac414cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I guess, ouch?

:jan-erik Flagging you in case you have any ideas of where to look next, or any ideas based on things you may know about LMDB.

Flags: needinfo?(jrediger)

At least there's some logic to the backtrace -- the segfaults started to appear around the time that threading was added to the Python bindings.

There's some evidence that this kind of thing occurs if the parent thread exits before joining on the child thread: https://stackoverflow.com/questions/26308066/segmentation-fault-in-pthread-create

The Glean Python bindings actually do that already, but there might be a race condition around that somewhere. In any event, it's possible that cleaning up the threading details may be a legitimate fix to this short of figuring out what the heck is going on in lmdb.
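As a rough illustration of the "join before exit" pattern discussed above, here is a minimal sketch using only the Python standard library. The `Worker` class, its `_STOP` sentinel, and the `shutdown` method are hypothetical names made up for this example; they are not the Glean bindings' actual dispatcher. The idea is simply that the owning thread signals the worker to stop and then waits for it, so any per-thread cleanup runs before the resources it depends on are torn down.

```python
import queue
import threading


class Worker:
    """Hypothetical stand-in for a background worker thread (not Glean's API)."""

    _STOP = object()  # sentinel telling the thread to leave its loop

    def __init__(self):
        self.completed = []
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        # Process tasks in order until the stop sentinel arrives.
        while True:
            task = self._tasks.get()
            if task is self._STOP:
                break
            self.completed.append(task())

    def submit(self, task):
        self._tasks.put(task)

    def shutdown(self, timeout=5.0):
        # Signal the thread to stop, then wait for it to finish before
        # the caller goes on to release anything the thread was using.
        self._tasks.put(self._STOP)
        self._thread.join(timeout)
        return not self._thread.is_alive()


worker = Worker()
worker.submit(lambda: 1 + 1)
stopped = worker.shutdown()
```

Because the sentinel is queued after the submitted task, the task finishes before the thread exits. Checking the return value of `shutdown` lets a test fail loudly if the thread is still alive when the storage would be closed, which is the window where a race like the one suspected here could occur.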

I know Firefox encountered some other crashes with LMDB in the past.
:vporof, have you seen the above issue before?

Flags: needinfo?(jrediger) → needinfo?(vporof)

Another data point -- this is quite likely a testing-environment-only problem. In the unit tests we create/destroy the LMDB environment with every test. In normal use, this only happens once per process so it would be hard to hit.

I haven't seen that issue yet. Because of the prevalence of these new crashes, the current plan is to move all stores that use RKV to the safe-mode storage driver and away from LMDB. The work is happening in https://bugzilla.mozilla.org/show_bug.cgi?id=1612550

I'll make this block rkv-perf-mode in the meantime though.

Flags: needinfo?(vporof)

We haven't seen this crash in a long while, so it may now be less prevalent given recent Python threading changes. In addition, I'm confident this is a testing-framework-only thing.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX