Closed Bug 1734262 Opened 3 years ago Closed 3 years ago

Rewrite pingsender as a Gecko Background Task

Categories

(Toolkit :: Telemetry, task)

task

Tracking

()

RESOLVED FIXED
95 Branch
Tracking Status
firefox95 --- fixed

People

(Reporter: bholley, Assigned: bholley)

References

(Blocks 1 open bug, )

Details

Attachments

(7 files)

Most telemetry gets submitted through the Gecko networking stack. However, in a few cases — when Firefox is shutting down, or after it has crashed — we submit telemetry through a standalone tool called pingsender, which uses the OS networking stack.

The status quo with pingsender is suboptimal for two reasons:
(1) Using the OS networking stack is not ideal from a reliability and privacy standpoint.
(2) The pingsender code is a mess of platform-specific C++ code, and it would be nice if it were more maintainable (this was the motivation for bug 1339035).

Over the past year or so, the installer and updater team built out a capability for running standalone lightweight background tasks on top of libXUL. We opted for this strategy in part to provide a reusable architecture for other use-cases such as this one. Now that we have it, we should be able to rewrite pingsender in a few lines of JS and invoke it via a dedicated launch mode from the main firefox binary.

Nick, is there any context that someone would need to do this beyond what's written in the docs?

Flags: needinfo?(nalexander)

How would this work for the crash reporter client? Would we need to load and initialize libxul from it? If we're sufficiently unlucky we might crash again if we hit the same issue that took down the browser in the first place.

This is a very interesting idea from a couple of directions.

  1. We have recent evidence showing that using the OS networking stack is undesirable if only because Gecko's networking is so much newer and is kept more up-to-date on older OSes (like, say, Windows 7). This is no longer a theory: we can measure how important using Gecko networking is to getting Firefox Telemetry pings quickly. See bug 1733953 (( this isn't news for everyone, but I'm putting it here in the interests of having all the context nearby ))
  2. Whatever effort we put into this will unlikely be useful for FOG and would be removed when the rest of Firefox Telemetry is removed after the migration to Glean. Unlike Firefox Telemetry, Glean doesn't tend to send pings at shutdown (which is where a pingsender truly shines in reducing the latency of data transmission).
    • For "crash" kinds of pings on the other hand we'll need some sort of solution. I've added a sub-item to the Crash Reporting All Hands Agenda to discuss this.

I think we should (after figuring out how "crash" ping reporting would be affected) size this problem before proceeding. We have some data on how many pings are received from pingsenders running with less-than-ideal network stacks: we should evaluate if a) the size is large enough, and b) the distribution across populations is important enough (ie, if this affects Windows 7 clients more heavily, are Windows 7 clients especially important to the business?) to be worth the effort of changing the status quo.

(In reply to Bobby Holley (:bholley) from comment #1)

Nick, is there any context that someone would need to do this beyond what's written in the docs?

No, I believe that the basics are there. And we have some signal from prior art: Valentin corrected a small docs issue but otherwise was able to build a --backgroundtask PoC for purging some caches in Bug 1705676.

My belief is that --backgroundtask pingsender would not need any data from the profile of the crashing process, although if it did the background task could be provided the path to the correct profile on its command line. That greatly simplifies things.

After that, the tricky part is if we want operational metrics and telemetry via Glean from the background task. There is prior art but no reusable pattern established in the background update task.

(In reply to Gabriele Svelto [:gsvelto] from comment #2)

How would this work for the crash reporter client? Would we need to load and initialize libxul from it? If we're sufficiently unlucky we might crash again if we hit the same issue that took down the browser in the first place.

Yes, --backgroundtask invocations load and initialize libxul; yes, very early crashes can impact background tasks. This is a trade-off between the maintenance burden and limitations of a separate pingsender executable and the robustness of multiple independent implementations. In the specific case of a (pingsender) background task, we should collect crash reports for crashing background tasks (Bug 1679440). We don't specifically annotate crash reports for background tasks yet (Bug 1697875), so it's possible we see background update task crashes that aren't sufficiently high volume to be surfaced to the team. Or the mechanism could not work in the wild :/

I will flag that there is a risk of background task recursion, where a crashing pingsender task tries to invoke itself to report its own crash. Presumably that can be avoided easily.

Flags: needinfo?(nalexander)

(In reply to Chris H-C :chutten from comment #3)

I think we should (after figuring out how "crash" ping reporting would be affected) size this problem before proceeding. We have some data on how many pings are received from pingsenders running with less-than-ideal network stacks: we should evaluate if a) the size is large enough, and b) the distribution across populations is important enough (ie, if this affects Windows 7 clients more heavily, are Windows 7 clients especially important to the business?) to be worth the effort of changing the status quo.

Well, the idea is that it's a very easy change to make. I wrote some patches this morning, which I'll post shortly.

Assignee: nobody → bholley

(In reply to Gabriele Svelto [:gsvelto] from comment #2)

How would this work for the crash reporter client? Would we need to load and initialize libxul from it? If we're sufficiently unlucky we might crash again if we hit the same issue that took down the browser in the first place.

As various folks have pointed out, there would be some lost redundancy in using this for crash pings, and we'd risk losing pings from libXUL instances that couldn't get through startup. There are a few possible ways to approach this:

(1) Decide it's an acceptable trade-off for the reduced maintenance burden (we'd still get crash reports through the crash reporter's orthogonal network stack).
(2) Fold pingsender into the crash reporter and run it on top of the crash report's network stack.
(3) Continue using the standalone pingsender for crash pings only.

Given that the implementation of a BackgroundTask-based pingsender is trivial, it doesn't introduce much in the way of additional maintenance burden, and so we can still start using it for everything else to ensure our telemetry goes through Necko. The crash ping question then just becomes a tactical trade-off between effort and redundancy.

I ended up writing more or less exactly this before discovering it.

The submission was always failing regardless of host because it was trying to
submit a ping that had already been deleted. Separating this segment out
into its own test makes it easier to spot to mistake.

Unfortunately, even with this fix, the test doesn't properly fail when
the logic-under-test is missing, because the connections to example.com
cause gecko to crash during testing (because MOZ_DISABLE_NONLOCAL_CONNECTIONS is set),
and the process machinery doesn't distinguish a crash from an exception.
I haven't thought of a good solution to this — suggestions welcome — but
I think it's probably not worth spending more time on.

Unclear why it was disabled before, but the rewrite appears to fix it.

Depends on: 1734750

(In reply to Bobby Holley (:bholley) from comment #11)

Unclear why it was disabled before, but the rewrite appears to fix it.

libcurl was not installed in our VM images and I was unfamiliar with how they were configured at the time so I just disabled the test

Pushed by bholley@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/845c7749c2c3 Hoist standalone telemetry ping helper from PingCentre to TelemetrySend. r=nanj https://hg.mozilla.org/integration/autoland/rev/19eb67d15f2a Implement pingsender as a background task. r=chutten,nalexander https://hg.mozilla.org/integration/autoland/rev/1e4cafb39d62 Partially fix broken test for pingsender url allowlist. r=gsvelto https://hg.mozilla.org/integration/autoland/rev/a6ecb9d9697a Switch telemetry submission to use the background task pingsender. r=chutten,nalexander https://hg.mozilla.org/integration/autoland/rev/c3b0edc896e9 Reenable pingsender test on linux32. r=chutten

Looks like running the background task instance under TSAN is unconvering a latent race.

Here's a separate try push: https://treeherder.mozilla.org/jobs?repo=try&revision=fc7362396ec9383a64bb1da3e0108095f34c790f

And here's the relevant part of the log:

[task 2021-10-13T18:59:49.511Z] 18:59:49 INFO - PID 18212 | ==================
[task 2021-10-13T18:59:49.512Z] 18:59:49 INFO - PID 18212 | WARNING: ThreadSanitizer: use of an invalid mutex (e.g. uninitialized or destroyed) (pid=18212)
[task 2021-10-13T18:59:49.512Z] 18:59:49 INFO - PID 18212 | #0 pthread_mutex_lock /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4233:3 (xpcshell+0x755a6)
[task 2021-10-13T18:59:49.513Z] 18:59:49 INFO - PID 18212 | #1 mutexLock /builds/worker/checkouts/gecko/mozglue/misc/Mutex_posix.cpp:96:3 (xpcshell+0x130f7b)
[task 2021-10-13T18:59:49.513Z] 18:59:49 INFO - PID 18212 | #2 mozilla::detail::MutexImpl::lock() /builds/worker/checkouts/gecko/mozglue/misc/Mutex_posix.cpp:118:43 (xpcshell+0x130f7b)
[task 2021-10-13T18:59:49.514Z] 18:59:49 INFO - PID 18212 | #3 Lock /builds/worker/workspace/obj-build/dist/include/mozilla/BaseProfilerDetail.h:54:35 (libxul.so+0x66f61f4)
[task 2021-10-13T18:59:49.514Z] 18:59:49 INFO - PID 18212 | #4 BaseProfilerAutoLock /builds/worker/workspace/obj-build/dist/include/mozilla/BaseProfilerDetail.h:101:12 (libxul.so+0x66f61f4)
[task 2021-10-13T18:59:49.515Z] 18:59:49 INFO - PID 18212 | #5 PSAutoLock /builds/worker/checkouts/gecko/tools/profiler/core/platform.cpp:294:9 (libxul.so+0x66f61f4)
[task 2021-10-13T18:59:49.516Z] 18:59:49 INFO - PID 18212 | #6 mozilla::profiler::ThreadRegistry::Unregister(mozilla::profiler::ThreadRegistration::OnThreadRef) /builds/worker/checkouts/gecko/tools/profiler/core/platform.cpp:5422:14 (libxul.so+0x66f61f4)
[task 2021-10-13T18:59:49.516Z] 18:59:49 INFO - PID 18212 | #7 mozilla::profiler::ThreadRegistration::~ThreadRegistration() /builds/worker/checkouts/gecko/tools/profiler/core/ProfilerThreadRegistration.cpp:75:5 (libxul.so+0x66f5ee4)
[task 2021-10-13T18:59:49.517Z] 18:59:49 INFO - PID 18212 | #8 mozilla::profiler::ThreadRegistration::UnregisterThread() /builds/worker/checkouts/gecko/tools/profiler/core/ProfilerThreadRegistration.cpp:163:5 (libxul.so+0x66f6ba8)
[task 2021-10-13T18:59:49.518Z] 18:59:49 INFO - PID 18212 | #9 profiler_unregister_thread() /builds/worker/checkouts/gecko/tools/profiler/core/platform.cpp:5388:3 (libxul.so+0x66e7d92)
[task 2021-10-13T18:59:49.518Z] 18:59:49 INFO - PID 18212 | #10 ~AutoProfilerRegisterThread /builds/worker/workspace/obj-build/dist/include/GeckoProfiler.h:554:35 (libxul.so+0xc9c27f)
[task 2021-10-13T18:59:49.519Z] 18:59:49 INFO - PID 18212 | #11 ~MaybeStorage /builds/worker/workspace/obj-build/dist/include/mozilla/Maybe.h:268:25 (libxul.so+0xc9c27f)
[task 2021-10-13T18:59:49.519Z] 18:59:49 INFO - PID 18212 | #12 nsProcess::Monitor(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsProcessCommon.cpp:190:1 (libxul.so+0xc9c27f)
[task 2021-10-13T18:59:49.520Z] 18:59:49 INFO - PID 18212 | #13 _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5 (libnspr4.so+0x44cd6)
[task 2021-10-13T18:59:49.520Z] 18:59:49 INFO - PID 18212 | Location is global 'PSAutoLock::gPSMutex' of size 56 at 0x7f92aa324db8 (libxul.so+0x00000bc1fdb8)
[task 2021-10-13T18:59:49.520Z] 18:59:49 INFO - PID 18212 | Mutex M0 (0x7f92aa324db8) created at:
[task 2021-10-13T18:59:49.521Z] 18:59:49 INFO - PID 18212 | [failed to restore the stack]
[task 2021-10-13T18:59:49.521Z] 18:59:49 INFO - PID 18212 | SUMMARY: ThreadSanitizer: use of an invalid mutex (e.g. uninitialized or destroyed) /builds/worker/checkouts/gecko/mozglue/misc/Mutex_posix.cpp:96:3 in mutexLock

Gerald — any thoughts?

Flags: needinfo?(gsquelart)

Quick first thoughts, just looking at the report and code:

That's really strange, the mutex being locked is PSAutoLock::gPSMutex, which is static!

Does this thread end during/after static objects are destroyed? (Or even before during/before static initialization? But that seems even more unlikely.)
If that's the case, I'm not sure how to solve this. 🤔

But I'll keep investigating...

Debugging with printfs, the log shows:

12:21:26.621 **** [17866] ~PSMutex@0x7f99711c5db8()
12:21:26.622 **** [17866] nsProcess@0x7b2400052b90::Monitor() about to unregister
12:21:26.622 ==================
12:21:26.623 WARNING: ThreadSanitizer: use of an invalid mutex (e.g. uninitialized or destroyed) (pid=17866)
12:21:26.623     #0 pthread_mutex_lock /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4233:3 (xpcshell+0x755a6)
...

As I suspected, the static gPSMutex gets destroyed, and later nsProcess::Monitor exits and tries to unregister its thread, which requires the gPSMutex. Boom!

I didn't look into why this thread exits so late, or why these patches make this happen, you'll probably have a better idea.

If you don't see how to make the Monitor exit earlier (or unregister earlier, or not register at all?), I guess one way to solve this would be to allocate the mutex on the heap, and intentionally leak it, so it never gets destroyed.
Other ideas?
Please let me know if you'd like me to implement that fix, or look deeper... Good luck!

Flags: needinfo?(gsquelart)

Oh, and if we needed to make the profiler mutex live longer/forever, it would not be the end, because the list of registered thread is also a static object with its own static mutex. Hopefully that's all.

These fancy things happening after main() ends are quite dangerous! 😅

Thanks Gerald! I think we should just not register this thread with the profiler. I'll attach a patch with an explanation as to the failure mode and why this is the right fix.

Flags: needinfo?(bholley)

This thread waits on a syscall for the completion of an
asynchronously-spawned process. Since there's no general guarantee that
Gecko will outlive the proccesses it spawns, this thread can continue
running all the way up until the process terminates. And while we
generally endeavor to have teardown logic join any spawned threads, we
don't have an easy way to interrupt the platform-specific routines that
the thread might be waiting on.

That leaves us with instrumenting the profiler code to handle APIs
arbitrarily late during shutdown (i.e. after static destructors have
run), or simply not registering the thread with the profiler. The latter
is much simpler, and doesn't cost us much because these threads never do
any interesting computation, and thus aren't likely to be relevant to
the profiler. This registration was added in bug 1323100, which was just
a grep-driven activity which didn't identify any particular need for
profiling this thread.

For the same reasons discussed in the previous commit, it's impractical
to join these threads on shutdown, and so we should suppress thread leak
reports for them.

Pushed by bholley@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e1b4e111ed5b Hoist standalone telemetry ping helper from PingCentre to TelemetrySend. r=chutten,nanj https://hg.mozilla.org/integration/autoland/rev/51e70fc95fad Implement pingsender as a background task. r=chutten,nalexander https://hg.mozilla.org/integration/autoland/rev/f2c30c3abfa6 Partially fix broken test for pingsender url allowlist. r=gsvelto https://hg.mozilla.org/integration/autoland/rev/e1f824e84f03 Switch telemetry submission to use the background task pingsender. r=chutten,nalexander https://hg.mozilla.org/integration/autoland/rev/ee6b89941d51 Reenable pingsender test on linux32. r=chutten https://hg.mozilla.org/integration/autoland/rev/06a1e5c71b52 Stop registering the async process monitor thread with the profiler. r=gerald https://hg.mozilla.org/integration/autoland/rev/c0e719d87617 Suppress thread leak reports for the monitor thread. r=decoder
Regressions: 1736179

Bobby, was those console.log calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?

It's spamming the terminal quite a lot when closing Firefox

Flags: needinfo?(bholley)

In https://phabricator.services.mozilla.com/D127715, I don't think the defineModuleGetter() line is correct. It's importing TelemetrySend but sendStandalonePing() isn't a method on this class, it is a standalone function. Either the import should reference this function or the function should become a member of the TelemetrySend class. In any case, the current patch throws errors in Nightly:

TelemetrySend.sendStandalonePing is not a function PingCentre.jsm:112
    _sendStandalonePing resource:///modules/PingCentre.jsm:112
    sendStructuredIngestionPing resource:///modules/PingCentre.jsm:140
    sendStructuredIngestionEvent resource://activity-stream/lib/TelemetryFeed.jsm:816
    sendDiscoveryStreamLoadedContent resource://activity-stream/lib/TelemetryFeed.jsm:503
    forEach self-hosted:205
    sendDiscoveryStreamLoadedContent resource://activity-stream/lib/TelemetryFeed.jsm:496
    endSession resource://activity-stream/lib/TelemetryFeed.jsm:418
    onAction resource://activity-stream/lib/TelemetryFeed.jsm:978
    _middleware resource://activity-stream/lib/Store.jsm:63
    method resource://activity-stream/lib/Store.jsm:39
    onActionFromContent resource://activity-stream/lib/ActivityStreamMessageChannel.jsm:108
    onNewTabUnload resource://activity-stream/lib/ActivityStreamMessageChannel.jsm:315
    onNewTabUnload self-hosted:1175
    callListeners resource://gre/modules/remotepagemanager/MessagePort.jsm:36
    portMessageReceived resource://gre/modules/remotepagemanager/RemotePageManagerParent.jsm:89
    portMessageReceived self-hosted:1175
    callListeners resource://gre/modules/remotepagemanager/MessagePort.jsm:36
    observe resource://gre/modules/remotepagemanager/RemotePageManagerParent.jsm:237
Depends on: 1736438

(In reply to William Durand [:willdurand] from comment #27)

In https://phabricator.services.mozilla.com/D127715, I don't think the defineModuleGetter() line is correct. It's importing TelemetrySend but sendStandalonePing() isn't a method on this class, it is a standalone function. Either the import should reference this function or the function should become a member of the TelemetrySend class. In any case, the current patch throws errors in Nightly:

Thanks! Filed bug 1736438 with a patch.

(In reply to Nicolas Chevobbe [:nchevobbe] from comment #26)

Bobby, was those console.log calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?

It's spamming the terminal quite a lot when closing Firefox

Nick, what do you recommend here? Should I be using Log.jsm instead?

Flags: needinfo?(bholley) → needinfo?(nalexander)

(In reply to Bobby Holley (:bholley) from comment #29)

(In reply to Nicolas Chevobbe [:nchevobbe] from comment #26)

Bobby, was those console.log calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?

It's spamming the terminal quite a lot when closing Firefox

Nick, what do you recommend here? Should I be using Log.jsm instead?

You certainly can use Log.jsm, but you'll need to mirror a pref from the default profile in order to be able to configure the setting. See https://searchfox.org/mozilla-central/rev/483cfde5a54f6c1cd94c6295564993aeb4f10980/toolkit/mozapps/update/BackgroundTask_backgroundupdate.jsm#238. Not sure it's worth the effort; you might just drop the logging, or make it depend on an environment variable.

Flags: needinfo?(nalexander)
See Also: → 1736466

(In reply to Nicolas Chevobbe [:nchevobbe] from comment #26)

Bobby, was those console.log calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?

It's spamming the terminal quite a lot when closing Firefox

Can you say more about the configuration? Is this about added logging noise for debug builds, or logging in situation where we were previously quiet?

I noticed that background tasks unconditionally print a message on startup to the terminal [1]. Just getting rid of the console.log invocations won't fix that.

[1] https://searchfox.org/mozilla-central/rev/483cfde5a54f6c1cd94c6295564993aeb4f10980/toolkit/xre/nsAppRunner.cpp#3581

Flags: needinfo?(nchevobbe)

(In reply to Bobby Holley (:bholley) from comment #31)

(In reply to Nicolas Chevobbe [:nchevobbe] from comment #26)

Bobby, was those console.log calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?

It's spamming the terminal quite a lot when closing Firefox

Can you say more about the configuration? Is this about added logging noise for debug builds, or logging in situation where we were previously quiet?

On a regular build, when closing (or restarting), it prints more than 200 000 characters. So it does add noise to sdtout where it was previously quiet (note that I have devtools.console.stdout.chrome set to true). It doesn't have any impact on my workflow for now, it's just quite verbose (and would make detecting error printed to stdout while Firefox is closing a bit harder I say?)

Flags: needinfo?(nchevobbe)
Regressions: 1736524
Depends on: 1736623

(In reply to Nicolas Chevobbe [:nchevobbe] from comment #32)

On a regular build, when closing (or restarting), it prints more than 200 000 characters. So it does add noise to sdtout where it was previously quiet (note that I have devtools.console.stdout.chrome set to true). It doesn't have any impact on my workflow for now, it's just quite verbose (and would make detecting error printed to stdout while Firefox is closing a bit harder I say?)

Filed bug 1736623 for this with a patch.

Regressions: 1736796
Regressions: 1736189
Regressions: 1736910
Regressions: 1722777
No longer regressions: 1722777
Blocks: 1746983
No longer regressions: 1736910
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: