Closed Bug 1646402 Opened 5 years ago Closed 5 years ago

mozregression isn't uploading inside of mach or console

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Assigned: mdroettboom)

Details

(Whiteboard: [telemetry:glean-rs:m?])

Attachments

(1 file)

:wlach pointed out in bug 1646173:

"""
Not sure if this is related but there has been no reported mozregression usage on mach (which I gather would be using the latest glean sdk, unlike the GUI which uses a hardcoded version-- 30.1.0 currently) since May 27:

https://sql.telemetry.mozilla.org/queries/70610#177730
"""

I'm not sure this is related to bug 1646173, so breaking it out into its own bug.

Summary: mozregression isn't uploading inside of mach → mozregression isn't uploading inside of mach or console

This looks to be happening with both console and mach, actually. There have been almost no pings from mozregression versions later than 4.05 (the release that came out around the time of the Glean SDK):

https://sql.telemetry.mozilla.org/queries/72098/source

Running things on the command-line, I'm not seeing any Glean messages being emitted on the mozregression console... there's this bit of indirection which forks off a process, but I'm still not seeing anything even with that code commented out:

https://github.com/mozilla/mozregression/blob/dc2a6498c33d57cca5940dc6d5a35ccf75929ae4/mozregression/telemetry.py#L43

I'm having trouble reproducing this. Here's what I did:

  1. Made a fresh venv
  2. Installed mozregression's dependencies and mozregression into it
  3. Upgraded glean_sdk to make sure I was getting the broken 31.1.1 version
  4. Modified mozregression to send a tagged ping by adding ping_tag="foo" to the glean.config.Configuration constructor in mozregression's telemetry.py.
  5. Ran the GUI from the command line using python gui/build.py run
  6. Ran a single run from the GUI
  7. Pings show up in the debug pings viewer

So... something is different about my environment from yours, but I wonder what?

Flags: needinfo?(wlachance)

I should add -- this is on Linux at the command line. Maybe this is platform-specific?

Ah, I see -- If I do

mozregression -b 2020-06-16 -g 2020-06-17

pings don't seem to be sent. Now that I have something to reproduce, looking further...

(In reply to Michael Droettboom [:mdroettboom] from comment #4)

> Ah, I see -- If I do
>
> mozregression -b 2020-06-16 -g 2020-06-17
>
> pings don't seem to be sent. Now that I have something to reproduce, looking further...

Yes! The GUI seems to be working fine. Sorry, I forgot to point that out (I guess I figured it was implied...).

Flags: needinfo?(wlachance)

It's looking like this is the issue:

https://stackoverflow.com/questions/34506638/how-to-register-atexit-function-in-pythons-multiprocessing-subprocess

Glean uses an atexit handler to make sure that all of its threaded work completes before shutting down the process. It turns out that these atexit handlers are not called in a child process started with multiprocessing.

Replacing _send_telemetry_ping_oop with the following does resolve the issue.

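import atexit

def _send_telemetry_ping_oop(variant, appname, upload_enabled):
    # Runs in a child process started via multiprocessing.
    try:
        initialize_telemetry(upload_enabled, allow_multiprocessing=True)
        if upload_enabled:
            _send_telemetry_ping(variant, appname)
    finally:
        # multiprocessing children skip atexit handlers, so run them by hand
        # to let Glean finish its threaded work. _run_exitfuncs is private.
        atexit._run_exitfuncs()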
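
For anyone wanting to check the underlying behaviour without Glean or mozregression involved, here is a minimal sketch. The helper names (_child, handler_ran_in_child) and the marker file are made up for illustration, and it assumes a POSIX system, since it asks for the "fork" start method explicitly. The child registers an atexit handler that writes a marker file; the parent then checks whether the file appeared.

```python
import atexit
import multiprocessing
import os
import shutil
import tempfile


def _child(marker_path, run_handlers_manually):
    # Register an exit handler inside the child, as a library might do.
    def write_marker():
        with open(marker_path, "w") as f:
            f.write("handler ran")

    atexit.register(write_marker)

    if run_handlers_manually:
        # Private CPython helper; the same call as the workaround above.
        atexit._run_exitfuncs()


def handler_ran_in_child(run_handlers_manually):
    # Assumption: "fork" is available (POSIX only).
    ctx = multiprocessing.get_context("fork")
    tmp_dir = tempfile.mkdtemp()
    try:
        marker = os.path.join(tmp_dir, "marker")
        proc = ctx.Process(target=_child, args=(marker, run_handlers_manually))
        proc.start()
        proc.join()
        # The marker exists only if the child's handler actually ran.
        return os.path.exists(marker)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)


if __name__ == "__main__":
    print("handler ran without workaround:", handler_ran_in_child(False))
    print("handler ran with workaround:", handler_ran_in_child(True))
```

This only shows the mechanism; it says nothing about why the timing changed between Glean SDK releases.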

I'm not sure this is the best solution, however. As the SO post points out, other things could be using atexit that might interfere. Glean should probably grow a public API to call for this case. It's not great: it requires documenting "if you're using multiprocessing, also make sure you do this other thing to shut things down cleanly".

Another possibility is to turn off multithreading when inside of a multiprocessing process (assuming that's detectable). That wouldn't put this burden on our users.
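
On the "assuming that's detectable" part, a rough sketch of one way to detect it, assuming Python 3.8 or later, where multiprocessing.parent_process() exists. The function names and the "threaded"/"synchronous" labels are hypothetical and are not part of Glean.

```python
import multiprocessing


def in_multiprocessing_child():
    # parent_process() returns None in the main process and a process
    # object when running inside a multiprocessing child (Python 3.8+).
    return multiprocessing.parent_process() is not None


def choose_dispatch_mode():
    # Hypothetical policy: avoid background threads in a multiprocessing
    # child, since its atexit handlers will not run to flush them.
    return "synchronous" if in_multiprocessing_child() else "threaded"


if __name__ == "__main__":
    print(choose_dispatch_mode())
```

Note that this only catches children started through multiprocessing; a process created with a bare os.fork() or subprocess would still look like a main process here.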

Another thing to attack would be the reason mozregression is using multiprocessing in the first place -- because mach may need its own instance of Glean and we don't want the data intermingled / going to different endpoints in each context etc. An original design assumption was that components always send their data "as if" coming from the app, but that's not the use case we want for mozregression.

I'm not sure why it broke with this particular revision -- but given that it's kind of race-conditiony, I think something with the timing / mutexes changed enough to hit this. It could also explain (possibly) the lower mach numbers if it were flaky all along.

Assignee: nobody → mdroettboom
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED