No database found error for Glean Python SDK version 28.0.0 on Linux
Categories
(Data Platform and Tools :: Glean: SDK, defect, P3)
Tracking
(Not tracked)
People
(Reporter: raphael, Assigned: mdroettboom)
References
Details
(Whiteboard: [telemetry:glean-rs:m13][glean-py])
Attachments
(1 file)
Log messages
I'm seeing the following error when submitting pings from Python on Linux:
cli2 | DEBUG:burnham.missions:Completed mission 'MISSION G: FIVE WARPS, FOUR JUMPS'
cli2 | DEBUG:burnham.missions:Submitting ping for mission 'MISSION G: FIVE WARPS, FOUR JUMPS'
cli2 | ERROR:glean._dispatcher:Timeout sending Glean telemetry
cli2 | thread '<unnamed>' panicked at 'No database found', glean-core/src/lib.rs:453:10
cli2 | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
cli2 | [2020-04-30T09:14:54Z ERROR ffi_support::error] Caught a panic calling rust code: "No database found"
cli2 | thread '<unnamed>' panicked at 'No database found', glean-core/src/lib.rs:453:10
cli2 | [2020-04-30T09:14:54Z ERROR ffi_support::error] Caught a panic calling rust code: "No database found"
cli2 | thread '<unnamed>' panicked at 'assertion failed: error.get_code().is_success()', glean-core/ffi/src/handlemap_ext.rs:125:9
cli2 | fatal runtime error: failed to initiate panic, error 5
Steps to reproduce:
Clone https://github.com/hackebrot/burnham and run:
docker-compose up --build
The client, which produces the error, runs the following inside Docker:
burnham \
    --verbose \
    --telemetry \
    --platform http://platform:5000 \
    --test-run "11111111-aaaa-bbbb-cccc-123455555555" \
    --test-name "test_cli1" \
    --spore-drive "tardigrade-dna" \
    "MISSION A: ONE WARP" \
    "MISSION B: TWO WARPS" \
    "MISSION D: TWO JUMPS" \
    "MISSION E: ONE JUMP, ONE METRIC ERROR" \
    "MISSION F: TWO WARPS, ONE JUMP" \
    "MISSION G: FIVE WARPS, FOUR JUMPS"
Comment 1 (Assignee) • 5 years ago
Here is the race condition I've found:
Glean currently has two atexit handlers: (a) one that makes sure the worker thread completes all of its tasks, and (b) one that, among other things, deletes the data directory if it's a tmpdir. atexit handlers run sequentially on the main thread, but their ordering depends on the order in which they were registered, which is somewhat non-deterministic in Glean.
If (b) runs before (a), the data directory is deleted, and any operations still waiting in the thread queue then fail with "No database found".
The fix is to combine the atexit handlers into one, and join on the thread queue before deleting the tempdir.
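A minimal sketch of that fix, using only the Python standard library. This is not Glean's actual code: Worker, write_ping, and shutdown are hypothetical names chosen for illustration, and the real dispatcher is more involved. The point it shows is that a single exit handler makes the ordering explicit: drain the queue and join the thread first, then remove the temporary directory.

```python
import atexit
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class Worker:
    """Hypothetical stand-in for a dispatcher: a worker thread plus a tmpdir."""

    def __init__(self):
        self.data_dir = Path(tempfile.mkdtemp())
        self.tasks = queue.Queue()
        self.results = []
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:  # sentinel: no more work is coming
                break
            self.results.append(task(self.data_dir))

    def submit(self, task):
        self.tasks.put(task)

    def shutdown(self):
        """Single exit handler: finish queued work, then delete the tmpdir."""
        self.tasks.put(None)
        self.thread.join()
        shutil.rmtree(self.data_dir, ignore_errors=True)


def write_ping(data_dir):
    # Raises if the data directory has already been removed.
    path = data_dir / "ping.json"
    path.write_text("{}")
    return path.exists()


worker = Worker()
# One handler instead of two, so the order no longer depends on registration.
atexit.register(worker.shutdown)
```

With two separately registered handlers, the directory could be removed while write_ping was still queued. Here the cleanup cannot start until the join returns.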
This raised another issue in my mind: using a tempdir by default is probably not a good choice. We are seeing this bug in burnham only because burnham doesn't override the data dir (as other "real" apps, such as mozregression, have done). Changing the default to a retained directory probably makes sense, and I created bug 1634410 to track that work.