Closed Bug 1634310 Opened 5 years ago Closed 5 years ago

No database found error for Glean Python SDK version 28.0.0 on Linux

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: raphael, Assigned: mdroettboom)

References

Details

(Whiteboard: [telemetry:glean-rs:m13][glean-py])

Attachments

(1 file)

Log messages

I'm seeing the following error when submitting pings from Python on Linux:

cli2        | DEBUG:burnham.missions:Completed mission 'MISSION G: FIVE WARPS, FOUR JUMPS'
cli2        | DEBUG:burnham.missions:Submitting ping for mission 'MISSION G: FIVE WARPS, FOUR JUMPS'
cli2        | ERROR:glean._dispatcher:Timeout sending Glean telemetry
cli2        | thread '<unnamed>' panicked at 'No database found', glean-core/src/lib.rs:453:10
cli2        | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
cli2        | [2020-04-30T09:14:54Z ERROR ffi_support::error] Caught a panic calling rust code: "No database found"
cli2        | thread '<unnamed>' panicked at 'No database found', glean-core/src/lib.rs:453:10
cli2        | [2020-04-30T09:14:54Z ERROR ffi_support::error] Caught a panic calling rust code: "No database found"
cli2        | thread '<unnamed>' panicked at 'assertion failed: error.get_code().is_success()', glean-core/ffi/src/handlemap_ext.rs:125:9
cli2        | fatal runtime error: failed to initiate panic, error 5

Steps to reproduce:

Clone https://github.com/hackebrot/burnham and run:

docker-compose up --build

The client, which produces the error, runs the following inside Docker:

burnham \
--verbose \
--telemetry \
--platform http://platform:5000 \
--test-run "11111111-aaaa-bbbb-cccc-123455555555" \
--test-name "test_cli1" \
--spore-drive "tardigrade-dna" \
"MISSION A: ONE WARP" \
"MISSION B: TWO WARPS" \
"MISSION D: TWO JUMPS" \
"MISSION E: ONE JUMP, ONE METRIC ERROR" \
"MISSION F: TWO WARPS, ONE JUMP" \
"MISSION G: FIVE WARPS, FOUR JUMPS"

Here is the race condition I've found:

Glean currently has two atexit handlers: (a) one that makes sure the worker thread completes all of its tasks, and (b) one that (among other things) deletes the data directory if it's a tmpdir. atexit handlers run sequentially on the main thread, in reverse order of registration, and the order in which Glean registers them is somewhat non-deterministic.

If (b) runs before (a), the data directory is deleted, and then any operations still waiting in the thread queue will fail with "No database found".
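To make the ordering problem concrete, here is a minimal sketch that models the two handlers with plain Python. It does not use Glean's actual code: `simulate_exit`, `flush_worker` and `delete_data_dir` are hypothetical names, and the `gate` event only exists to keep a task pending until "shutdown" starts, so the outcome is deterministic.

```python
import os
import queue
import shutil
import tempfile
import threading


def simulate_exit(handlers_in_run_order):
    """Run the two shutdown handlers in the given order.

    Returns True if the pending task still found the data directory,
    False if the directory was already gone when the task ran.
    """
    data_dir = tempfile.mkdtemp()
    tasks = queue.Queue()
    gate = threading.Event()
    outcome = []

    def worker():
        gate.wait()  # hold the queued task back until shutdown begins
        while True:
            task = tasks.get()
            if task is None:  # sentinel: no more work
                return
            task()

    def pending_task():
        # Stand-in for an operation that needs the on-disk database.
        outcome.append(os.path.isdir(data_dir))

    thread = threading.Thread(target=worker)
    thread.start()
    tasks.put(pending_task)

    def flush_worker():  # models handler (a)
        gate.set()
        tasks.put(None)
        thread.join()

    def delete_data_dir():  # models handler (b)
        shutil.rmtree(data_dir, ignore_errors=True)

    handlers = {"a": flush_worker, "b": delete_data_dir}
    for name in handlers_in_run_order:
        handlers[name]()
    return outcome[0]
```

Calling `simulate_exit(["a", "b"])` lets the task see its directory, while `simulate_exit(["b", "a"])` reproduces the failure mode: the task runs after the directory has been removed.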

The fix is to combine the atexit handlers into one, and join on the thread queue before deleting the tempdir.
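A rough sketch of that approach, under stated assumptions: `TelemetryRuntime` and its methods are hypothetical stand-ins for illustration, not the Glean SDK's real classes or the actual patch. The point is only that a single registered handler fixes the order: drain the queue first, then delete the tempdir.

```python
import atexit
import os
import queue
import shutil
import tempfile
import threading


class TelemetryRuntime:
    """Hypothetical stand-in for the SDK state, not Glean's real API."""

    def __init__(self):
        self.data_dir = tempfile.mkdtemp()
        self.seen_data_dir = []
        self._tasks = queue.Queue()
        self._closed = False
        # Daemon thread: a non-daemon thread blocked on the queue would
        # keep the interpreter from ever reaching the atexit handlers.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        # One handler instead of two, so the order is explicit.
        atexit.register(self.shutdown)

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: stop the worker
                return
            task()

    def submit(self, task):
        self._tasks.put(task)

    def shutdown(self):
        if self._closed:  # safe to call more than once
            return
        self._closed = True
        # Step 1: let the worker finish everything that is queued.
        self._tasks.put(None)
        self._thread.join()
        # Step 2: only now remove the temporary data directory.
        shutil.rmtree(self.data_dir, ignore_errors=True)
```

With this shape, a task submitted just before exit still finds `data_dir` on disk, because the directory is removed only after `join()` returns.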

This raised another issue in my mind: using a tempdir by default is probably not a good choice. We are seeing this bug in burnham only because burnham doesn't override the data dir (as other "real" apps, such as mozregression, have done). Changing the default to a retained directory probably makes sense, and I created bug 1634410 to track that work.

Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m13][glean-py]
Assignee: nobody → mdroettboom
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
See Also: → 1646173