Rewrite pingsender as a Gecko Background Task
Categories
(Toolkit :: Telemetry, task)
Tracking
()
Tracking | Status | |
---|---|---|
firefox95 | --- | fixed |
People
(Reporter: bholley, Assigned: bholley)
References
(Blocks 1 open bug, )
Details
Attachments
(7 files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
Most telemetry gets submitted through the Gecko networking stack. However, in a few cases — when Firefox is shutting down, or after it has crashed — we submit telemetry through a standalone tool called pingsender, which uses the OS networking stack.
The status quo with pingsender is suboptimal for two reasons:
(1) Using the OS networking stack is not ideal from a reliability and privacy standpoint.
(2) The pingsender code is a mess of platform-specific C++ code, and it would be nice if it were more maintainable (this was the motivation for bug 1339035).
Over the past year or so, the installer and updater team built out a capability for running standalone lightweight background tasks on top of libXUL. We opted for this strategy in part to provide a reusable architecture for other use-cases such as this one. Now that we have it, we should be able to rewrite pingsender in a few lines of JS and invoke it via a dedicated launch mode from the main firefox binary.
Assignee | ||
Comment 1•3 years ago
|
||
Nick, is there any context that someone would need to do this beyond what's written in the docs?
Comment 2•3 years ago
|
||
How would this work for the crash reporter client? Would we need to load and initialize libxul from it? If we're sufficiently unlucky we might crash again if we hit the same issue that took down the browser in the first place.
Comment 3•3 years ago
|
||
This is a very interesting idea from a couple of directions.
- We have recent evidence showing that using the OS networking stack is undesirable if only because Gecko's networking is so much newer and is kept more up-to-date on older OSes (like, say, Windows 7). This is no longer a theory: we can measure how important using Gecko networking is to getting Firefox Telemetry pings quickly. See bug 1733953 (( this isn't news for everyone, but I'm putting it here in the interests of having all the context nearby ))
- Whatever effort we put into this will unlikely be useful for FOG and would be removed when the rest of Firefox Telemetry is removed after the migration to Glean. Unlike Firefox Telemetry, Glean doesn't tend to send pings at shutdown (which is where a pingsender truly shines in reducing the latency of data transmission).
- For "crash" kinds of pings on the other hand we'll need some sort of solution. I've added a sub-item to the Crash Reporting All Hands Agenda to discuss this.
I think we should (after figuring out how "crash" ping reporting would be affected) size this problem before proceeding. We have some data on how many pings are received from pingsenders running with less-than-ideal network stacks: we should evaluate if a) the size is large enough, and b) the distribution across populations is important enough (ie, if this affects Windows 7 clients more heavily, are Windows 7 clients especially important to the business?) to be worth the effort of changing the status quo.
Comment 4•3 years ago
|
||
(In reply to Bobby Holley (:bholley) from comment #1)
Nick, is there any context that someone would need to do this beyond what's written in the docs?
No, I believe that the basics are there. And we have some signal from prior art: Valentin corrected a small docs issue but otherwise was able to build a --backgroundtask
PoC for purging some caches in Bug 1705676.
My belief is that --backgroundtask pingsender
would not need any data from the profile of the crashing process, although if it did the background task could be provided the path to the correct profile on its command line. That greatly simplifies things.
After that, the tricky part is if we want operational metrics and telemetry via Glean from the background task. There is prior art but no reusable pattern established in the background update task.
(In reply to Gabriele Svelto [:gsvelto] from comment #2)
How would this work for the crash reporter client? Would we need to load and initialize libxul from it? If we're sufficiently unlucky we might crash again if we hit the same issue that took down the browser in the first place.
Yes, --backgroundtask
invocations load and initialize libxul; yes, very early crashes can impact background tasks. This is a trade-off between the maintenance burden and limitations of a separate pingsender
executable and the robustness of multiple independent implementations. In the specific case of a (pingsender) background task, we should collect crash reports for crashing background tasks (Bug 1679440). We don't specifically annotate crash reports for background tasks yet (Bug 1697875), so it's possible we see background update task crashes that aren't sufficiently high volume to be surfaced to the team. Or the mechanism could not work in the wild :/
I will flag that there is a risk of background task recursion, where a crashing pingsender task tries to invoke itself to report its own crash. Presumably that can be avoided easily.
Assignee | ||
Comment 5•3 years ago
|
||
(In reply to Chris H-C :chutten from comment #3)
I think we should (after figuring out how "crash" ping reporting would be affected) size this problem before proceeding. We have some data on how many pings are received from pingsenders running with less-than-ideal network stacks: we should evaluate if a) the size is large enough, and b) the distribution across populations is important enough (ie, if this affects Windows 7 clients more heavily, are Windows 7 clients especially important to the business?) to be worth the effort of changing the status quo.
Well, the idea is that it's a very easy change to make. I wrote some patches this morning, which I'll post shortly.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 6•3 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #2)
How would this work for the crash reporter client? Would we need to load and initialize libxul from it? If we're sufficiently unlucky we might crash again if we hit the same issue that took down the browser in the first place.
As various folks have pointed out, there would be some lost redundancy in using this for crash pings, and we'd risk losing pings from libXUL instances that couldn't get through startup. There are a few possible ways to approach this:
(1) Decide it's an acceptable trade-off for the reduced maintenance burden (we'd still get crash reports through the crash reporter's orthogonal network stack).
(2) Fold pingsender into the crash reporter and run it on top of the crash report's network stack.
(3) Continue using the standalone pingsender for crash pings only.
Given that the implementation of a BackgroundTask-based pingsender is trivial, it doesn't introduce much in the way of additional maintenance burden, and so we can still start using it for everything else to ensure our telemetry goes through Necko. The crash ping question then just becomes a tactical trade-off between effort and redundancy.
Assignee | ||
Comment 7•3 years ago
|
||
I ended up writing more or less exactly this before discovering it.
Assignee | ||
Comment 8•3 years ago
|
||
Assignee | ||
Comment 9•3 years ago
|
||
Assignee | ||
Comment 10•3 years ago
|
||
The submission was always failing regardless of host because it was trying to
submit a ping that had already been deleted. Separating this segment out
into its own test makes it easier to spot to mistake.
Unfortunately, even with this fix, the test doesn't properly fail when
the logic-under-test is missing, because the connections to example.com
cause gecko to crash during testing (because MOZ_DISABLE_NONLOCAL_CONNECTIONS is set),
and the process machinery doesn't distinguish a crash from an exception.
I haven't thought of a good solution to this — suggestions welcome — but
I think it's probably not worth spending more time on.
Assignee | ||
Comment 11•3 years ago
|
||
Unclear why it was disabled before, but the rewrite appears to fix it.
Comment 12•3 years ago
|
||
(In reply to Bobby Holley (:bholley) from comment #11)
Unclear why it was disabled before, but the rewrite appears to fix it.
libcurl was not installed in our VM images and I was unfamiliar with how they were configured at the time so I just disabled the test
Comment 13•3 years ago
|
||
Comment 14•3 years ago
|
||
Backed out 5 changesets (Bug 1734262) for causing failures in test_HealthPing.js CLOSED TREE
Log: https://treeherder.mozilla.org/logviewer?job_id=354537782&repo=autoland&lineNumber=2832
TV failures: https://treeherder.mozilla.org/logviewer?job_id=354533999&repo=autoland&lineNumber=2738
Backout: https://hg.mozilla.org/integration/autoland/rev/816fcfe32b402026374054bd15064258c9ff6388
Assignee | ||
Comment 15•3 years ago
|
||
Looks like running the background task instance under TSAN is unconvering a latent race.
Here's a separate try push: https://treeherder.mozilla.org/jobs?repo=try&revision=fc7362396ec9383a64bb1da3e0108095f34c790f
And here's the relevant part of the log:
[task 2021-10-13T18:59:49.511Z] 18:59:49 INFO - PID 18212 | ==================
[task 2021-10-13T18:59:49.512Z] 18:59:49 INFO - PID 18212 | WARNING: ThreadSanitizer: use of an invalid mutex (e.g. uninitialized or destroyed) (pid=18212)
[task 2021-10-13T18:59:49.512Z] 18:59:49 INFO - PID 18212 | #0 pthread_mutex_lock /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4233:3 (xpcshell+0x755a6)
[task 2021-10-13T18:59:49.513Z] 18:59:49 INFO - PID 18212 | #1 mutexLock /builds/worker/checkouts/gecko/mozglue/misc/Mutex_posix.cpp:96:3 (xpcshell+0x130f7b)
[task 2021-10-13T18:59:49.513Z] 18:59:49 INFO - PID 18212 | #2 mozilla::detail::MutexImpl::lock() /builds/worker/checkouts/gecko/mozglue/misc/Mutex_posix.cpp:118:43 (xpcshell+0x130f7b)
[task 2021-10-13T18:59:49.514Z] 18:59:49 INFO - PID 18212 | #3 Lock /builds/worker/workspace/obj-build/dist/include/mozilla/BaseProfilerDetail.h:54:35 (libxul.so+0x66f61f4)
[task 2021-10-13T18:59:49.514Z] 18:59:49 INFO - PID 18212 | #4 BaseProfilerAutoLock /builds/worker/workspace/obj-build/dist/include/mozilla/BaseProfilerDetail.h:101:12 (libxul.so+0x66f61f4)
[task 2021-10-13T18:59:49.515Z] 18:59:49 INFO - PID 18212 | #5 PSAutoLock /builds/worker/checkouts/gecko/tools/profiler/core/platform.cpp:294:9 (libxul.so+0x66f61f4)
[task 2021-10-13T18:59:49.516Z] 18:59:49 INFO - PID 18212 | #6 mozilla::profiler::ThreadRegistry::Unregister(mozilla::profiler::ThreadRegistration::OnThreadRef) /builds/worker/checkouts/gecko/tools/profiler/core/platform.cpp:5422:14 (libxul.so+0x66f61f4)
[task 2021-10-13T18:59:49.516Z] 18:59:49 INFO - PID 18212 | #7 mozilla::profiler::ThreadRegistration::~ThreadRegistration() /builds/worker/checkouts/gecko/tools/profiler/core/ProfilerThreadRegistration.cpp:75:5 (libxul.so+0x66f5ee4)
[task 2021-10-13T18:59:49.517Z] 18:59:49 INFO - PID 18212 | #8 mozilla::profiler::ThreadRegistration::UnregisterThread() /builds/worker/checkouts/gecko/tools/profiler/core/ProfilerThreadRegistration.cpp:163:5 (libxul.so+0x66f6ba8)
[task 2021-10-13T18:59:49.518Z] 18:59:49 INFO - PID 18212 | #9 profiler_unregister_thread() /builds/worker/checkouts/gecko/tools/profiler/core/platform.cpp:5388:3 (libxul.so+0x66e7d92)
[task 2021-10-13T18:59:49.518Z] 18:59:49 INFO - PID 18212 | #10 ~AutoProfilerRegisterThread /builds/worker/workspace/obj-build/dist/include/GeckoProfiler.h:554:35 (libxul.so+0xc9c27f)
[task 2021-10-13T18:59:49.519Z] 18:59:49 INFO - PID 18212 | #11 ~MaybeStorage /builds/worker/workspace/obj-build/dist/include/mozilla/Maybe.h:268:25 (libxul.so+0xc9c27f)
[task 2021-10-13T18:59:49.519Z] 18:59:49 INFO - PID 18212 | #12 nsProcess::Monitor(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsProcessCommon.cpp:190:1 (libxul.so+0xc9c27f)
[task 2021-10-13T18:59:49.520Z] 18:59:49 INFO - PID 18212 | #13 _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5 (libnspr4.so+0x44cd6)
[task 2021-10-13T18:59:49.520Z] 18:59:49 INFO - PID 18212 | Location is global 'PSAutoLock::gPSMutex' of size 56 at 0x7f92aa324db8 (libxul.so+0x00000bc1fdb8)
[task 2021-10-13T18:59:49.520Z] 18:59:49 INFO - PID 18212 | Mutex M0 (0x7f92aa324db8) created at:
[task 2021-10-13T18:59:49.521Z] 18:59:49 INFO - PID 18212 | [failed to restore the stack]
[task 2021-10-13T18:59:49.521Z] 18:59:49 INFO - PID 18212 | SUMMARY: ThreadSanitizer: use of an invalid mutex (e.g. uninitialized or destroyed) /builds/worker/checkouts/gecko/mozglue/misc/Mutex_posix.cpp:96:3 in mutexLock
Gerald — any thoughts?
Quick first thoughts, just looking at the report and code:
That's really strange, the mutex being locked is PSAutoLock::gPSMutex
, which is static
!
Does this thread end during/after static objects are destroyed? (Or even before during/before static initialization? But that seems even more unlikely.)
If that's the case, I'm not sure how to solve this. 🤔
But I'll keep investigating...
Debugging with printfs, the log shows:
12:21:26.621 **** [17866] ~PSMutex@0x7f99711c5db8()
12:21:26.622 **** [17866] nsProcess@0x7b2400052b90::Monitor() about to unregister
12:21:26.622 ==================
12:21:26.623 WARNING: ThreadSanitizer: use of an invalid mutex (e.g. uninitialized or destroyed) (pid=17866)
12:21:26.623 #0 pthread_mutex_lock /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4233:3 (xpcshell+0x755a6)
...
As I suspected, the static
gPSMutex
gets destroyed, and later nsProcess::Monitor
exits and tries to unregister its thread, which requires the gPSMutex
. Boom!
I didn't look into why this thread exits so late, or why these patches make this happen, you'll probably have a better idea.
If you don't see how to make the Monitor exit earlier (or unregister earlier, or not register at all?), I guess one way to solve this would be to allocate the mutex on the heap, and intentionally leak it, so it never gets destroyed.
Other ideas?
Please let me know if you'd like me to implement that fix, or look deeper... Good luck!
Oh, and if we needed to make the profiler mutex live longer/forever, it would not be the end, because the list of registered thread is also a static object with its own static mutex. Hopefully that's all.
These fancy things happening after main() ends are quite dangerous! 😅
Assignee | ||
Comment 19•3 years ago
•
|
||
Thanks Gerald! I think we should just not register this thread with the profiler. I'll attach a patch with an explanation as to the failure mode and why this is the right fix.
Assignee | ||
Comment 20•3 years ago
|
||
This thread waits on a syscall for the completion of an
asynchronously-spawned process. Since there's no general guarantee that
Gecko will outlive the proccesses it spawns, this thread can continue
running all the way up until the process terminates. And while we
generally endeavor to have teardown logic join any spawned threads, we
don't have an easy way to interrupt the platform-specific routines that
the thread might be waiting on.
That leaves us with instrumenting the profiler code to handle APIs
arbitrarily late during shutdown (i.e. after static destructors have
run), or simply not registering the thread with the profiler. The latter
is much simpler, and doesn't cost us much because these threads never do
any interesting computation, and thus aren't likely to be relevant to
the profiler. This registration was added in bug 1323100, which was just
a grep-driven activity which didn't identify any particular need for
profiling this thread.
Assignee | ||
Comment 21•3 years ago
|
||
Assignee | ||
Comment 22•3 years ago
|
||
Assignee | ||
Comment 23•3 years ago
|
||
For the same reasons discussed in the previous commit, it's impractical
to join these threads on shutdown, and so we should suppress thread leak
reports for them.
Comment 24•3 years ago
|
||
Comment 25•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/e1b4e111ed5b
https://hg.mozilla.org/mozilla-central/rev/51e70fc95fad
https://hg.mozilla.org/mozilla-central/rev/f2c30c3abfa6
https://hg.mozilla.org/mozilla-central/rev/e1f824e84f03
https://hg.mozilla.org/mozilla-central/rev/ee6b89941d51
https://hg.mozilla.org/mozilla-central/rev/06a1e5c71b52
https://hg.mozilla.org/mozilla-central/rev/c0e719d87617
Comment 26•3 years ago
|
||
Bobby, was those console.log
calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?
It's spamming the terminal quite a lot when closing Firefox
Comment 27•3 years ago
|
||
In https://phabricator.services.mozilla.com/D127715, I don't think the defineModuleGetter()
line is correct. It's importing TelemetrySend
but sendStandalonePing()
isn't a method on this class, it is a standalone function. Either the import should reference this function or the function should become a member of the TelemetrySend
class. In any case, the current patch throws errors in Nightly:
TelemetrySend.sendStandalonePing is not a function PingCentre.jsm:112
_sendStandalonePing resource:///modules/PingCentre.jsm:112
sendStructuredIngestionPing resource:///modules/PingCentre.jsm:140
sendStructuredIngestionEvent resource://activity-stream/lib/TelemetryFeed.jsm:816
sendDiscoveryStreamLoadedContent resource://activity-stream/lib/TelemetryFeed.jsm:503
forEach self-hosted:205
sendDiscoveryStreamLoadedContent resource://activity-stream/lib/TelemetryFeed.jsm:496
endSession resource://activity-stream/lib/TelemetryFeed.jsm:418
onAction resource://activity-stream/lib/TelemetryFeed.jsm:978
_middleware resource://activity-stream/lib/Store.jsm:63
method resource://activity-stream/lib/Store.jsm:39
onActionFromContent resource://activity-stream/lib/ActivityStreamMessageChannel.jsm:108
onNewTabUnload resource://activity-stream/lib/ActivityStreamMessageChannel.jsm:315
onNewTabUnload self-hosted:1175
callListeners resource://gre/modules/remotepagemanager/MessagePort.jsm:36
portMessageReceived resource://gre/modules/remotepagemanager/RemotePageManagerParent.jsm:89
portMessageReceived self-hosted:1175
callListeners resource://gre/modules/remotepagemanager/MessagePort.jsm:36
observe resource://gre/modules/remotepagemanager/RemotePageManagerParent.jsm:237
Assignee | ||
Comment 28•3 years ago
|
||
(In reply to William Durand [:willdurand] from comment #27)
In https://phabricator.services.mozilla.com/D127715, I don't think the
defineModuleGetter()
line is correct. It's importingTelemetrySend
butsendStandalonePing()
isn't a method on this class, it is a standalone function. Either the import should reference this function or the function should become a member of theTelemetrySend
class. In any case, the current patch throws errors in Nightly:
Thanks! Filed bug 1736438 with a patch.
Assignee | ||
Comment 29•3 years ago
|
||
(In reply to Nicolas Chevobbe [:nchevobbe] from comment #26)
Bobby, was those
console.log
calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?It's spamming the terminal quite a lot when closing Firefox
Nick, what do you recommend here? Should I be using Log.jsm instead?
Comment 30•3 years ago
|
||
(In reply to Bobby Holley (:bholley) from comment #29)
(In reply to Nicolas Chevobbe [:nchevobbe] from comment #26)
Bobby, was those
console.log
calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?It's spamming the terminal quite a lot when closing Firefox
Nick, what do you recommend here? Should I be using Log.jsm instead?
You certainly can use Log.jsm
, but you'll need to mirror a pref from the default profile in order to be able to configure the setting. See https://searchfox.org/mozilla-central/rev/483cfde5a54f6c1cd94c6295564993aeb4f10980/toolkit/mozapps/update/BackgroundTask_backgroundupdate.jsm#238. Not sure it's worth the effort; you might just drop the logging, or make it depend on an environment variable.
Assignee | ||
Comment 31•3 years ago
|
||
(In reply to Nicolas Chevobbe [:nchevobbe] from comment #26)
Bobby, was those
console.log
calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?It's spamming the terminal quite a lot when closing Firefox
Can you say more about the configuration? Is this about added logging noise for debug builds, or logging in situation where we were previously quiet?
I noticed that background tasks unconditionally print a message on startup to the terminal [1]. Just getting rid of the console.log invocations won't fix that.
Comment 32•3 years ago
|
||
(In reply to Bobby Holley (:bholley) from comment #31)
(In reply to Nicolas Chevobbe [:nchevobbe] from comment #26)
Bobby, was those
console.log
calls meant to be in the commit https://searchfox.org/mozilla-central/rev/5122357c497684e01c5bb2d4a9bf8be1fe97a413/toolkit/components/telemetry/pings/BackgroundTask_pingsender.jsm#32,40 ?It's spamming the terminal quite a lot when closing Firefox
Can you say more about the configuration? Is this about added logging noise for debug builds, or logging in situation where we were previously quiet?
On a regular build, when closing (or restarting), it prints more than 200 000 characters. So it does add noise to sdtout where it was previously quiet (note that I have devtools.console.stdout.chrome
set to true). It doesn't have any impact on my workflow for now, it's just quite verbose (and would make detecting error printed to stdout while Firefox is closing a bit harder I say?)
Assignee | ||
Comment 33•3 years ago
|
||
(In reply to Nicolas Chevobbe [:nchevobbe] from comment #32)
On a regular build, when closing (or restarting), it prints more than 200 000 characters. So it does add noise to sdtout where it was previously quiet (note that I have
devtools.console.stdout.chrome
set to true). It doesn't have any impact on my workflow for now, it's just quite verbose (and would make detecting error printed to stdout while Firefox is closing a bit harder I say?)
Filed bug 1736623 for this with a patch.
Description
•