Closed Bug 1665041 Opened 2 years ago Closed 2 years ago

Hundreds of "Processing ping at [..}" log messages on start up

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sebastian, Assigned: janerik)

References

Details

Attachments

(4 files)

When launching Fenix Nightly (200914) I get a ton of these "Processing ping at" log messages. The following snippet is only a tiny subset of them. Is this normal? This happens on every app start.

I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/381af193-f305-4c3c-83df-10cce57aa008
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/1989c625-c3de-48ab-a38a-5ab47b80b14c
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/15f3f276-a9c8-4acb-86fe-428c608d1cac
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/872633aa-2be4-4fb2-8df0-713e796dbffc
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/03ed122e-2a15-4af9-9399-ca47d4fe9abb
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/c88a0954-07ae-40e0-ba51-7a238ccfc151
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/1c5b46cf-5c90-4436-82fb-e37eedc1b03f
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/8bb66eb2-1952-457a-a241-6d11a0f61393
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/c8a0bd29-41c5-40d7-86e0-02f963b419e3
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/81de3cac-5bb5-4af8-847b-89d61036ed53
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/b84b6c54-2a93-4ab0-9e0f-b5719686ec8d
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/9ace92c9-a14c-41e2-a6a5-924ef7d4d04d
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/5a306116-772b-4a53-88f4-1404b3900a2f
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/aa14b29e-b6b1-48c1-811f-0f18f6b9e0b7
I  glean_core::upload::directory: Processing ping at: /data/user/0/org.mozilla.fenix/glean_data/pending_pings/55ea7f30-5a1a-458d-ba3c-07628a4c2e90
Priority: -- → P3
Whiteboard: [telemetry:glean-rs:m?]

Every time Glean is initialized it scans the pending pings directories, and each scanned ping gets logged, but it's not really expected that these directories would contain hundreds of pings. Is this a testing app or something of the sort that would have more pending pings than usual?

Assignee: nobody → tlong
Priority: P3 → P1

Sebastian, are there any other Glean log messages from these runs, and is this from a simulator or an actual device?

Flags: needinfo?(s.kaspari)

Looks like I'm able to reproduce this on the simulator. Every time I start the app, it adds 3 pings to the list in the log.

Those are from my actual device (Pixel 3, Android 11) and I am actively using Fenix Nightly as my browser. Let me grab a full log.

Attached file startup.txt
Flags: needinfo?(s.kaspari)
Whiteboard: [telemetry:glean-rs:m?]

Thanks Sebastian, I'm able to confirm and reproduce this consistently now. Now to figure out what's going wrong and fix it. To clarify what I have found so far: it appears that Glean is "processing" a baseline ping with reason:foreground but not submitting it.

At first I thought this had something to do with processing the pending pings directory in a background thread without synchronizing, but I'm not sure that's the cause. I looked at the overall volume of pings by Glean version, and while the new release is very fresh, it looked pretty much like a normal uptake curve so far.

I tried rolling this back to a previous version of Glean and kept running into issues on the emulator, so I finally switched to testing on my Android tablet, which appeared to work. However, my tablet is only running Android SDK 26, while in the emulator on SDK 28 I continue to see the problem.

On my emulator, it really appears like the WorkManager task is never being executed.

After staring at this while my machine slowed down further and further, I decided to stop digging for today; hopefully Jan-Erik can shed a little more light on this in the EU morning.

Flags: needinfo?(jrediger)

Yesterday I downloaded a Nightly apk (x86):

versionCode='2015764021'
versionName='Nightly 200915 06:08'

I ran that yesterday in the emulator, and I could accumulate pings without seeing them get uploaded.
I tried again today, and I can accumulate pending pings by opening the app and then force-closing it.
Most of the time, if I then open the app again and let it sit, it correctly processes the pending pings and uploads them.

Now, after trying again, it doesn't process them. Even triggering a new ping by putting Fenix into the background and bringing it back right away doesn't launch the uploader.

I will next build Fenix locally again in order to be able to add additional logging.

Flags: needinfo?(jrediger)

Accumulating a whole bunch of pings (30-40) seems to reliably cause it to stall completely (i.e. not uploading pings anymore), even in my local build.

I added additional logging to Glean to better understand what's going on. I am running commit 67fa67acc0650520137878ad2a533d70e3a9b3be of mozilla/glean.
Everything was run using the glean-sample-app in the repository, running on the emulator.

First I opened the app and force-closed it quickly. I redid that a couple of times. This accumulates pings in /data/user/0/org.mozilla.samples.gleancore/glean_data/pending_pings.
(I saved those to be quicker to re-do this analysis locally: adb pull /data/user/0/org.mozilla.samples.gleancore/glean_data/pending_pings /tmp)

I then opened the app, moved it to the background (not killing it), and opened it again. This triggers a baseline ping, which should trigger an upload.
It starts uploading the first 15 pings quickly and then gets throttled (policy: maximum of 15 pings in 60 seconds).
The upload worker returns RETRY, which means it follows the backoff policy (wait 10s, then retry, in a loop with exponential backoff).
Glean won't let it upload again before the 60s are over.
If I then let the app sit idle, it will eventually upload pings, get throttled again, wait 60s more, then upload again.

Force-closing the app and then opening it again triggers a baseline ping again, and thus an upload task.
This picks up the pending pings just fine and starts uploading again until it is throttled.

This means that as pending pings accumulate, new pings can get delayed heavily because the worker never catches up.

Now when I run the Fenix Nightly build I have and let it sit, it also never launches the upload task; even after more than 60s nothing happens.
Triggering a baseline ping (backgrounding, then foregrounding) submits that ping but doesn't trigger the upload.

Looking at Sebastian's log, it lists over 5000 pending pings.
With the above I guess accumulating THAT many pings is possible, but I also don't see a way that we would ever be able to process them all, so we might need to add a failsafe.
We already limit the pending pings directory to 10 MB; maybe we need to limit it to a certain number of files too.

Sebastian, could you do me a favor and:

  1. Launch Fenix and let it sit idle for more than 60s.
  2. Trigger a baseline ping (backgrounding, then foregrounding)
    • You should see a line like glean_core: The ping 'baseline' was submitted and will be sent as soon as possible in the log
    • You might see a line that says it uploaded a ping and got a HTTP response 200
  3. Check how big the pending pings directory is: adb shell du -h /data/user/0/org.mozilla.fenix/glean_data/pending_pings
Flags: needinfo?(s.kaspari)
Attached file submit.txt
  1. Launch Fenix and let it sit idle for more than 60s.
  2. Trigger a baseline ping (backgrounding, then foregrounding)
    • You should see a line like glean_core: The ping 'baseline' was submitted and will be sent as soon as possible in the log
    • You might see a line that says it uploaded a ping and got a HTTP response 200

I attached the log, starting from before backgrounding the app, waiting some seconds after the foregrounding. I see multiple "submitted" log statements. I don't think I see any actual upload, at least in that part of the log.

  3. Check how big the pending pings directory is: adb shell du -h /data/user/0/org.mozilla.fenix/glean_data/pending_pings

I'm unfortunately not able to do that on my production phone (no root) with a "production" build (no debug) of Firefox Nightly.

$ adb shell du -h /data/user/0/org.mozilla.fenix/glean_data/pending_pings                                                                        4s
du: /data/user/0/org.mozilla.fenix/glean_data/pending_pings: Permission denied
$ adb shell
blueline:/ $ run-as org.mozilla.fenix
run-as: package not debuggable: org.mozilla.fenix
Flags: needinfo?(s.kaspari)

Thanks!
I forgot that looking into that directory requires root (or debuggable app access).
Given that I can reproduce it locally we have something to work with nonetheless.

Together with :travis I think we figured out why pings are sent with a delay:
With Glean v32.2.0 we moved the wait/retry functionality into the Rust core.
The rate limiter throttles to no more than 15 pings in a 60s interval. If throttled, it tells the uploader to wait.
We leverage the default behavior of the WorkManager and, on wait, return a Result.retry(), which causes it to follow the default backoff policy:
it waits for 10s, then runs the upload worker again.
If that happens again, it waits for 20s, runs again, then 40s, and so on with exponential backoff.

Now with that many pending pings, a single upload worker will always send a burst of pings in a loop before it returns wait.
With every wait it suspends the task for longer and longer.
If the WorkManager task isn't killed, the backoff can keep increasing for a very long time, so the worker never catches up with the pending pings (especially as each backgrounding/foregrounding creates new pings).

There are now two things we need to deal with:

Dealing with too many pending pings

We need to handle cases like Sebastian's in some way. There's no point in uploading 5000 pings just to catch up.
We should limit it to a reasonable number and delete the rest, oldest first.

Handling backoff

What we really want is for the WorkManager to just run our logic:
upload until throttled, then wait 60s (the default throttle time) and ask again, with no additional backoff.

(In reply to Jan-Erik Rediger [:janerik] from comment #15)

There are now two things we need to deal with:

Dealing with too many pending pings

We need to handle cases like Sebastian's in some way. There's no point in uploading 5000 pings just to catch up.
We should limit it to a reasonable number and delete the rest, oldest first.

Maybe we need to lower our 10 MB pending ping quota? Because I think that's exactly what it's for :) How much space do 5k pings take?

One note:

With Glean v32.2.0 we moved wait/retry functionality into the Rust core.

In theory that shouldn't have changed the actual behaviour: when the max number of wait attempts is reached, the WorkManager returns with a Success result. Before that release it would return with a Failure result in this case, but both results have the same effect (stopping the WorkManager) as far as I remember.

Which means this might be a problem that has been happening since we limited wait attempts.

Regarding [:Dexter]'s comment: the amount of space 5k pings take depends on the contents of the pings, which is probably highly variable. Maybe limiting on either the number of pings or the actual size would be best.

(In reply to Alessio Placitelli [:Dexter] from comment #16)

Maybe we need to lower our 10 MB pending ping quota? Because I think that's exactly what it's for :) How much space do 5k pings take?

~600 bytes for a baseline ping x 5000 = ~3 MB.

(In reply to Beatriz Rizental from comment #17)

One note:

With Glean v32.2.0 we moved wait/retry functionality into the Rust core.

In theory that shouldn't have changed the actual behaviour: when the max number of wait attempts is reached, the WorkManager returns with a Success result. Before that release it would return with a Failure result in this case, but both results have the same effect (stopping the WorkManager) as far as I remember.

Which means this might be a problem that has been happening since we limited wait attempts.

Right, I was thinking about other functionality. So yeah, more likely the introduction of throttling itself caused this.

I have tested Jan-Erik's idea of using a fixed sleep rather than relying on the WorkManager backoff policy, well enough to feel comfortable with it. From what I can tell, the 60s sleep works pretty well to clean out the pings and keep them cleaned out.

Regarding Sebastian's case, where a large number of pings have already been collected, I would be in favor of limiting the number of pending pings rather than limiting by the size of the directory. This also has the added benefit of being enforceable at the point of ping submission rather than at ping collection. And maybe it's a good idea to do a little of both? 10 MB obviously accounts for more pings than we wanted, so what if we limited it instead to 100 pings and 1 MB? To facilitate clearing out 100 pings, I would also suggest changing the rate limit to 20 pings with a retry of 5 waits (or something else that makes sense and would clean out a full set of pending pings in a single WorkManager invocation).

As for Sebastian specifically, he could always toggle telemetry off and back on to clear out those 5k pings in a sort of "manual" way.

(In reply to Sebastian Kaspari (:sebastian; :pocmo) from comment #20)

Note that you can also change the backoff policy of workmanager:
https://developer.android.com/reference/androidx/work/WorkRequest.Builder#setBackoffCriteria(androidx.work.BackoffPolicy,%20long,%20java.util.concurrent.TimeUnit)

That won't help. It only allows switching between linear and exponential backoff; the delay will still increase with every retry.
The problem here is that we already enforce the backoff internally by throttling. If we also rely on the WorkManager, we have two backoff policies that add up (the reason we do this internally now is to get consistent behavior across implementations).
We therefore can't rely on WorkManager's backoff/retry strategy.

I merged the PR without thinking and already pushed a release.
Luckily that doesn't mean it lands in products right away; I'm holding those back until data review is done.
Sorry for rushing this.

Comment on attachment 9176254 [details]
data-review-request-1665041.txt

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry, so it can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jan-Erik Rediger is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9176254 - Flags: data-review?(liuche) → data-review+
Assignee: tlong → jrediger
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED