Android: Add Sync Telemetry Ping

RESOLVED FIXED in Firefox 55

Status

()

P1
normal
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: adavis, Assigned: Grisha)

Tracking

(Depends on: 1 bug, Blocks: 1 bug)

Firefox 44
Firefox 55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55 fixed)

Details

(Whiteboard: [send-tab-funnel])

User Story

As the product manager of sync, I would like to measure the health of sync on Android because it will help me better prioritize (and understand the impact of) new features and enhancements.

Further more, this is a prerequisite for the Sync team to start using event telemetry to measure feature funnels and product performance across devices.

Attachments

(26 attachments)

59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
Details
Comment hidden (empty)
(Assignee)

Comment 1

2 years ago
Meta bug tracking the first round of this work: Bug 1180321.
See Also: → bug 1180321
(Reporter)

Updated

2 years ago
User Story: (updated)
(Reporter)

Updated

2 years ago
Blocks: 1324597
(Assignee)

Comment 2

2 years ago
Sync ping docs: http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/sync-ping.html

Java telemetry entry point & "how to add a new ping type": https://dxr.mozilla.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryDispatcher.java#41-51

Steps, roughly:
- define a new ping type & ping builder
- define an upload scheduler
- figure out & use the new ping builder throughout sync
(Assignee)

Updated

2 years ago
Blocks: 1343985
(Assignee)

Updated

2 years ago
Component: Metrics → Android Sync
Product: Firefox for Android → Android Background Services
Version: 44 Branch → Firefox 44
(Assignee)

Updated

2 years ago
Assignee: nobody → gkruglov
Status: NEW → ASSIGNED
Priority: -- → P1
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1196478
(Assignee)

Updated

2 years ago
Blocks: 1180321
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Comment 8

2 years ago
Nick, you probably thought the most about sync telemetry - judging from blame, anyway. Could you you take a brief look at the WIP patches above, to double check if the approach I'm taking makes sense?

Telemetry collection itself is very much a WIP and probably will look a bit differently (although it's not that far off).
I think LBS approach for separating out different sides (sender/receiver) should work out well, but perhaps I'm not aware of some weird pitfalls. The contract between them is likely going to change though.
Flags: needinfo?(nalexander)

Comment 9

2 years ago
mozreview-review
Comment on attachment 8857292 [details]
Bug 1308337 - Part 1: Instrument named sync stages and broadcast collected telemetry

https://reviewboard.mozilla.org/r/129248/#review132096

Marking r- just since it's not done.

Fundamentally, is the telemetry data you care about an event stream or is it a summary report?  There are trade-offs to both, but streams are definitely easier to implement and are more general.  If you think it's a stream, then I don't think this callback based approach adds value -- just stick the "send event" calls in the appropriate places.  (And maybe tag events with a "Sync id", so that you can know with confidence which Sync is happening.  Be interesting to see if Syncs overlap in the wild... that shouldn't happen due to Android scheduling, but what do you bet we got some threading things wrong?  It's my belief that the "new telemetry" systems are intended to be event stream based, so this would agree with the pings you'll be creating.

If you're confident you want a single report per Sync, then this might make sense, where you roll up telemetry for an entire sync and shuttle it off to the uploader all at once.  In that case, though, what are you gaining by having the uploader listen in `GeckoApplication` and using `LocalBroadcastManager`?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/GlobalSyncStage.java:84
(Diff revision 1)
>      private Stage(final String name) {
>        this.repositoryName = name;
>      }
>    }
>  
> +  // TODO we probably don't need a non-instrumented `execute`, except for tests perhaps

Eh, an explicit dummy collector, or a mocked out collector, should be good.
Attachment #8857292 - Flags: review-
(Assignee)

Comment 10

2 years ago
Some additional things to consider adding, besides what's in the docs:
- top level list of failed stages in case of errors, to make querying easier
- top level storage node id
(Assignee)

Comment 11

2 years ago
(In reply to Nick Alexander :nalexander from comment #9)
> Fundamentally, is the telemetry data you care about an event stream or is it
> a summary report?

Sync ping itself is a summary report. One might treat the process of building it as a stream of events produced during a sync which then get rolled up into a report and scheduled for upload. I feel that this will be both conceptually and implementationally more complex. A sync is a process which has an explicit beginning and an end, and so I think imposing stream semantics on top might unnecessary complicate things on the receiving end of that telemetry before its uploaded.

> (And maybe tag events with a
> "Sync id", so that you can know with confidence which Sync is happening.  Be
> interesting to see if Syncs overlap in the wild... that shouldn't happen due
> to Android scheduling, but what do you bet we got some threading things
> wrong?

With the current approach, one should be able to figure this out by looking at the `begin` and `took` timestamps across sync pings, and see if there's an overlap. Sync ping is emitted via SessionCallback, and so we'll skip recording things like "not syncing because already in progress". Although that class of events would be interesting to see as well.

> It's my belief that the "new telemetry" systems are intended to be
> event stream based, so this would agree with the pings you'll be creating.

I don't think it's unreasonable to think of individual syncs as "events", either.

> In that case, though, what are you gaining by
> having the uploader listen in `GeckoApplication` and using
> `LocalBroadcastManager`?

This is simply a by-product of not being able to access our Java-side telemetry stuff directly from the background services. I've opted for a broadcast approach as a seemingly better alternative to doing introspection.
(In reply to :Grisha Kruglov from comment #11)
> (In reply to Nick Alexander :nalexander from comment #9)
> > Fundamentally, is the telemetry data you care about an event stream or is it
> > a summary report?
> 
> Sync ping itself is a summary report. One might treat the process of
> building it as a stream of events produced during a sync which then get
> rolled up into a report and scheduled for upload.

FWIW, desktop works roughly like:

* watch for observer events which report events during the sync (eg, engine starts, engine reports record counts, engine completes with some status).
* On the sync-finished notification, stash the collected report away - don't yet upload.

So up until here, it is roughly a "stream" that's being collected into a single representation of a sync.

* Occasionally we'll also get notifications of "events" - these aren't part of a single sync and are much closer to a true stream and are appended to a queue of things to submit.

* Ever 12 hours, or after shutdown, or after we detect an account change, submit all the recorded syncs and events in a single ping. We did this at the request of the telemetry team as submitting one ping per sync was causing some grief.

It's not clear if that helps the above discussion, but thought I'd mention it anyway.
(Assignee)

Comment 13

2 years ago
(In reply to Mark Hammond [:markh] from comment #12)
> * watch for observer events which report events during the sync (eg, engine
> starts, engine reports record counts, engine completes with some status).
> * On the sync-finished notification, stash the collected report away - don't
> yet upload.

One benefit of this approach that what I'm currently doing lacks is that you'll be able to record syncs which got interrupted by a crash, or some other "unscheduled interruption" which falls outside of the regular flow of events.

> * Occasionally we'll also get notifications of "events" - these aren't part
> of a single sync and are much closer to a true stream and are appended to a
> queue of things to submit.

When you say "events", do you mean the client commands?

> * Ever 12 hours, or after shutdown, or after we detect an account change,
> submit all the recorded syncs and events in a single ping. We did this at
> the request of the telemetry team as submitting one ping per sync was
> causing some grief.

That's good to know. My current approach is "upload right away", which would work well for the client, but I'll need to re-assess this.
(In reply to :Grisha Kruglov from comment #13)
> When you say "events", do you mean the client commands?

In practice, both that and the repair process.

"events" in general are the same as used in the main ping - some basic docs at https://docs.google.com/document/d/1hNuS9lUJMvMqgntZXbFA6xZBU9zBpQgo7x73-sXKRpI/edit. Desktop sync usage is basically https://dxr.mozilla.org/mozilla-central/search?q=recordTelemetryEvent&redirect=true

> That's good to know. My current approach is "upload right away", which would
> work well for the client, but I'll need to re-assess this.

Firefox already had the submission handling for we could reuse for free - if that doesn't already exist on Fennec it might be possible to chat with the telemetry team and agree to do better batching in a followup.
(In reply to :Grisha Kruglov from comment #11)
> (In reply to Nick Alexander :nalexander from comment #9)
> > Fundamentally, is the telemetry data you care about an event stream or is it
> > a summary report?
> 
> Sync ping itself is a summary report. One might treat the process of
> building it as a stream of events produced during a sync which then get
> rolled up into a report and scheduled for upload. I feel that this will be
> both conceptually and implementationally more complex. A sync is a process
> which has an explicit beginning and an end, and so I think imposing stream
> semantics on top might unnecessary complicate things on the receiving end of
> that telemetry before its uploaded.

So I just saw email from Georg Fritzsche (https://groups.google.com/d/msg/mozilla.dev.platform/jZxJUqYRPw4/hP0tj945DwAJ) talking about "Sync events".  It's my belief that you will have more success streaming events (with batch uploading) than you will with roll-up reports.  Streaming events have less of a schema and using them pushes processing (the roll-up part!) to the server, where we can do lots of things "at our leisure".  I'd much rather have a very simple event stream, and then derive value from it server side, than have the various clients all push subtly different Sync telemetry reports that they roll-up themselves.

Phrased another way: if you're debugging a problem, would you rather have an event log (like, for example, the logging I do during push registration) or a summary report like the one you're generating?  I would kill for a detailed log, remotely available.  Let's not give up on that right away -- unless we have real concerns (performance hits to collect and store data on device, size of data over the wire, privacy violations).

> > (And maybe tag events with a
> > "Sync id", so that you can know with confidence which Sync is happening.  Be
> > interesting to see if Syncs overlap in the wild... that shouldn't happen due
> > to Android scheduling, but what do you bet we got some threading things
> > wrong?
> 
> With the current approach, one should be able to figure this out by looking
> at the `begin` and `took` timestamps across sync pings, and see if there's
> an overlap. Sync ping is emitted via SessionCallback, and so we'll skip
> recording things like "not syncing because already in progress". Although
> that class of events would be interesting to see as well.
> 
> > It's my belief that the "new telemetry" systems are intended to be
> > event stream based, so this would agree with the pings you'll be creating.
> 
> I don't think it's unreasonable to think of individual syncs as "events",
> either.
> 
> > In that case, though, what are you gaining by
> > having the uploader listen in `GeckoApplication` and using
> > `LocalBroadcastManager`?
> 
> This is simply a by-product of not being able to access our Java-side
> telemetry stuff directly from the background services. I've opted for a
> broadcast approach as a seemingly better alternative to doing introspection.

OK, I knew this (know this!) and still have to think through the division of concerns.  Thanks!
Flags: needinfo?(nalexander)
(In reply to Nick Alexander :nalexander from comment #15)
> So I just saw email from Georg Fritzsche
> (https://groups.google.com/d/msg/mozilla.dev.platform/jZxJUqYRPw4/
> hP0tj945DwAJ) talking about "Sync events".  It's my belief that you will
> have more success streaming events (with batch uploading) than you will with
> roll-up reports.  Streaming events have less of a schema and using them
> pushes processing (the roll-up part!) to the server, where we can do lots of
> things "at our leisure".  I'd much rather have a very simple event stream,
> and then derive value from it server side, than have the various clients all
> push subtly different Sync telemetry reports that they roll-up themselves.

IIUC though, it will be a real pain to merge events with the existing data sets we collect on desktop for the sync ping. Ideally we are not measuring the same thing in 2 different ways. I don't want our basic sync pings to be different across platforms and then stored differently.

Don't get me wrong, we do like the future of events telemetry. We plan to use events for new data types like measuring bookmark repair and even "send tab" but for backward compatibility, the roll-ups are likely to be a lot simpler for us.
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1216344
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Updated

2 years ago
Blocks: 1362208
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Updated

2 years ago
Blocks: 1363924
(Assignee)

Updated

2 years ago
Duplicate of this bug: 891600
(Assignee)

Updated

2 years ago
See Also: → bug 1366045
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 45

2 years ago
mozreview-review
Comment on attachment 8857291 [details]
Bug 1308337 - Pre: clean up wrong annotation

https://reviewboard.mozilla.org/r/129246/#review146640
Attachment #8857291 - Flags: review?(nalexander) → review+

Comment 46

2 years ago
mozreview-review
Comment on attachment 8857294 [details]
Bug 1308337 - Post: Remove old background telemetry code

https://reviewboard.mozilla.org/r/129252/#review146686
Attachment #8857294 - Flags: review?(nalexander) → review+

Comment 47

2 years ago
mozreview-review
Comment on attachment 8863964 [details]
Bug 1308337 - Pre: More granular tracking of record flow between repositories

https://reviewboard.mozilla.org/r/135692/#review146642

OK, this basically makes sense.  r- just so I remember to look at your response!

::: commit-message-4cb71:4
(Diff revision 3)
> +Bug 1308337 - Pre 2: More granular tracking of record flow between repositories r=nalexander
> +
> +This patch:
> +- introduces a way to signal that a record has been reconciled; this is not a "flow control"

What exactly does reconciled mean in the context of abstract repositories, sessions, and delegates?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:82
(Diff revision 3)
>    private final RecordsChannelDelegate delegate;
>    private long fetchEnd = -1;
>  
>    private volatile ReflowIsNecessaryException reflowException;
>  
> -  protected final AtomicInteger numFetched = new AtomicInteger();
> +  private final AtomicInteger numFetched = new AtomicInteger();

Is there an invariant here, like:

`attempted = failed + reconciled`

It's true that `reconciled <= attempted`, right?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:84
(Diff revision 3)
>  
>    private volatile ReflowIsNecessaryException reflowException;
>  
> -  protected final AtomicInteger numFetched = new AtomicInteger();
> -  protected final AtomicInteger numFetchFailed = new AtomicInteger();
> -  protected final AtomicInteger numStored = new AtomicInteger();
> +  private final AtomicInteger numFetched = new AtomicInteger();
> +  private final AtomicInteger numFetchFailed = new AtomicInteger();
> +  private final AtomicInteger numStoredAttempted = new AtomicInteger();

nit: `numStoresAttempted`?  `numStoreNeeded`?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:141
(Diff revision 3)
>     * Get the number of store attempts (successful or not) so far.
>     *
>     * @return number of stores attempted.
>     */
> -  public int getStoreCount() {
> -    return numStored.get();
> +  public int getStoreAttemptedCount() {
> +    return numStoredAttempted.get();

You can see the mismatch between the member name and the accessor name here.  I prefer the accessor name.
Attachment #8863964 - Flags: review?(nalexander) → review-

Comment 48

2 years ago
mozreview-review
Comment on attachment 8864703 [details]
Bug 1308337 - Pre: Don't throw away store and fetch exceptions as they're encountered

https://reviewboard.mozilla.org/r/136344/#review146644

I see from unrelated patches that the test change snuck in early.

In your later comments you link to bugs about reporting more errors, etc, etc.  Can you respond to my substantive comment, helping me understand the situation as it is right now and where you want it to get to?  It seems like you're not doing enough in this patch at the moment.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/SynchronizerSession.java:84
(Diff revision 3)
> -  protected final AtomicInteger numOutboundRecords = new AtomicInteger(-1);
> +  private final AtomicInteger numInboundRecordsStored = new AtomicInteger(-1);
> +  private final AtomicInteger numInboundRecordsFailed = new AtomicInteger(-1);
> +  private final AtomicInteger numOutboundRecords = new AtomicInteger(-1);
> +  private final AtomicInteger numOutboundRecordsStored = new AtomicInteger(-1);
> +  private final AtomicInteger numOutboundRecordsFailed = new AtomicInteger(-1);
> +  private final AtomicInteger numInboundRecordsReconciled = new AtomicInteger(-1);

nit: group with the other inbounds?  Or does this represent the flow?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/SynchronizerSession.java:205
(Diff revision 3)
>        }
>  
>        @Override
>        public void onFlowStoreFailed(RecordsChannel recordsChannel, Exception ex, String recordGuid) {
>          Logger.warn(LOG_TAG, "First RecordsChannel onFlowStoreFailed. Logging local store error.", ex);
> +        storeFailedCauseException = ex;

This is invoked for every failing record, and thus potentially for every record.  Are you intentionally keeping only one causing exception (the last one witnessed)?  If you are, name your exceptions as such.  `examplarExceptionCausingStoreFail` or `representativeExceptionCausingStoreFail`.  (I'll see this as I read more of the patch.)

Is this reasonable?  I expect it is: most syncs will fail due to some catatastrophic failure (full disk or dropped connection), and all the errors will be IO errors or network errors or ...

Consider categorizing your failures based on the underlying exceptions, so that you can count in some way.

::: mobile/android/tests/background/junit4/src/org/mozilla/gecko/tokenserver/test/TestTokenServerClient.java:57
(Diff revision 3)
>  
>    public static final String TEST_TOKEN_RESPONSE = "{\"api_endpoint\": \"https://stage-aitc1.services.mozilla.com/1.0/1659259\"," +
>        "\"duration\": 300," +
>        "\"id\": \"eySHORTENED\"," +
>        "\"key\": \"-plSHORTENED\"," +
> +      "\"hashed_fxa_uid\": \"rAnD0MU1D\"," +

This is unrelated.  Is it necessary?
Attachment #8864703 - Flags: review?(nalexander) → review-

Comment 49

2 years ago
mozreview-review
Comment on attachment 8870971 [details]
Bug 1308337 - Pre: Read hashedFxaUid from the token server

https://reviewboard.mozilla.org/r/142534/#review146652

This more logically comes as Pre 1, but that's fine -- I often don't order my Pre: commits ;)  A hint in the commit message about why it's used would be nice.

::: commit-message-80eeb:3
(Diff revision 2)
> +Bug 1308337 - Pre 4: Read hashedFxaUid from the token server r=nalexander
> +
> +This is what we (and other platforms) use in place of either our local FxA Device ID or the sync client ID.

It's not clear _why_ you need this in this patch series, but I'll presumably get to it.

Migrate the snippet that got included in tests earlier into this patch.

Include a link with a Git hash to the tokenserver documentation, saying "Not documented as of ...".  Are you certain this is deployed everywhere you care about?  Changes to docs should be easy to get accepted.
Attachment #8870971 - Flags: review?(nalexander) → review+

Comment 50

2 years ago
mozreview-review
Comment on attachment 8857292 [details]
Bug 1308337 - Part 1: Instrument named sync stages and broadcast collected telemetry

https://reviewboard.mozilla.org/r/129248/#review146656

I didn't see a link to the telemetry ping format documentation.  Is that in Part 2, which I can't engage with this afternoon?  Consider pulling it forward, to the `build()` method.

Your commit message is excellent, thank you for it.

I do have two substantive issues (`ServerSyncStage`, `addDevice`).  Both would be clarified with tests (respectively: including "clients" and, say, "tabs"; showing what's expected on incremental syncs that do/do not witness an incoming client record).  (Aalthough again, perhaps Part 2 fills the void.)

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/sync/FxAccountSyncAdapter.java:671
(Diff revision 6)
>  
>              onSessionTokenStateReached(context, fxAccount);
>  
> -            final SessionCallback sessionCallback = new SessionCallback(syncDelegate, schedulePolicy);
> +            final SessionCallback sessionCallback = new InstrumentedSessionCallback(
> +                    telemetryCollector,
> +                    LocalBroadcastManager.getInstance(context),

This could leak the context, although I guess the scope of the callback is strictly less than the scope of context already.  Is this reasoning correct?  Or are you confident the context is the application context here, or should you work harder to get the application context?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/ServerSyncStage.java:543
(Diff revision 6)
>      final String name = getEngineName();
>      Logger.debug(LOG_TAG, "Starting execute for " + name);
>  
>      stageStartTimestamp = SystemClock.elapsedRealtime();
>  
> +    telemetryStageCollector.started = stageStartTimestamp;

This makes sense for the stages that use the synchronizer apparatus, but not for stages that don't.  (Clients, for example, does it's own thing.)  I don't see how you're catching the non-`ServerSyncStage` stages, now and in the future.  (You might not want to handle them, although you'll drop data if you don't -- you could see errors in i/c and m/g processing, for example.)

I think started/finished logic might be better handled in `GlobalSyncSession`.  It's parallel to constructing the `TelemetryStageCollector`, isn't it?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/SyncClientsEngineStage.java:279
(Diff revision 6)
>            handleDownloadedLocalRecord(r);
>          } else {
>            // Only need to store record if it isn't our local one.
>            wipeAndStore(r);
>            addCommands(r);
> +          telemetryStageCollector.getSyncCollector().addDevice(r);

Huh, this is not what I expected.  So we include in the telemetry ping whenever we fetch a new remote client record?

That seems ... hard to analyze.  The server will see a stream of Sync pings, most of which have no additional devices (and no information about our local device, IIUC) and then intermittent "new records", most of which will just be updated records.  How will you get the information you want, which is probably either summary statistics about the device constellation size or the exact constellation (device types, etc) a particular client with a particular error is witnessing in error circumstances?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryCollector.java:26
(Diff revision 6)
> +import org.mozilla.gecko.sync.repositories.StoreFailedException;
> +import org.mozilla.gecko.sync.repositories.domain.ClientRecord;
> +
> +import java.io.IOException;
> +import java.io.Serializable;
> +import java.nio.charset.StandardCharsets;

This is API level 19: https://developer.android.com/reference/java/nio/charset/StandardCharsets.html#US_ASCII.  We generally just use "US-ASCII"... but we really prefer "UTF-8".  Are you sure you want ASCII?  (My guess is you will always be handling strings where the ASCII and UTF-8 encodings are identical.)

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryCollector.java:53
(Diff revision 6)
> +    private String hashedDeviceID;
> +    private final ArrayList<Bundle> devices = new ArrayList<>();
> +
> +    private long started;
> +
> +    public TelemetryStageCollector collectorFor(String stageName) {

As a consumer I would expect this to get me the _same instance_ if I called it twice.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryCollector.java:85
(Diff revision 6)
> +        this.started = time;
> +    }
> +
> +    // In the current sync-ping parlance, "device" is really just a sync client.
> +    // At some point we will start recording actual FxA devices.
> +    public void addDevice(final ClientRecord client) {

Would you ever capture more than one device?  I don't think so.

Consider making this just a `deviceBundle` rather than a sequence of devices, since right now the API doesn't make clear that you should only call this once per client record.

If you really want multiple devices, I think you should key a map by client GUID.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryCollector.java:94
(Diff revision 6)
> +
> +        final Bundle device = new Bundle();
> +        device.putString(TelemetryContract.KEY_DEVICE_OS, client.os);
> +        device.putString(TelemetryContract.KEY_DEVICE_VERSION, client.version);
> +
> +        final String clientAndUid = client.guid.concat(this.hashedUID);

This is weird.  Client GUIDs are chosen by the clients, and historically they have not always been 8 or 9 "safe" characters.  The absence of a separator outside of the valid GUID and UID set means that there's a possibility for conflict across users.  Is that acceptable?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryCollector.java:157
(Diff revision 6)
> +        public ExtendedJSONObject build() {
> +            final ExtendedJSONObject error = new ExtendedJSONObject();
> +
> +            if (lastException instanceof CollectionConcurrentModificationException) {
> +                error.put("name", "httperror");
> +                error.put("code", 412);

In general, I prefer `Long` to `Integer` in EJO -- leads to less surprise with timestamps.  But as long as this agrees with the exact type of `getStatusCode()` below, we're good.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryCollector.java:181
(Diff revision 6)
> +            } else if (lastException instanceof StoreFailedException) {
> +                if (isNetworkError(storeException)) {
> +                    error.put("name", "networkerror");
> +                    error.put("error", "store:" + storeException.getClass().getSimpleName());
> +
> +                // Currently we only have access to one exception, the last one that happened. However, there

Ah, I think this addresses one of my earlier concerns.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryCollector.java:213
(Diff revision 6)
> +                    error.put("name", "httperror");
> +                    error.put("code", response.getStatusCode());
> +                }
> +
> +                try {
> +                    error.put("error", response.body().trim());

You're echoing unsanitized data, which could potentially leak PII (if the server is badly behaved).  Do we really want to echo the server body?  Wouldn't the HTTP error code be sufficient?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryStageCollector.java:12
(Diff revision 6)
> +import org.mozilla.gecko.sync.ExtendedJSONObject;
> +
> +/**
> + * Gathers telemetry details about an individual sync stage.
> + * Implementation note: there are no getters/setters to avoid unnecessary verboseness.
> + * This data expected to be write-only from within SyncStages, and read-only from TelemetryCollector.

nit: you might conclude that concurrency/locking isn't necessary.
Attachment #8857292 - Flags: review?(nalexander) → review-

Comment 51

2 years ago
mozreview-review
Comment on attachment 8857293 [details]
Bug 1308337 - Part 8: Receive sync telemetry, construct and upload sync pings

https://reviewboard.mozilla.org/r/129250/#review132098

Lots of comments, but this is looking pretty good.

I want to see some tests for the Sync pings bit; it will be much easier to visualize what you expect to happen if you mock out these things, show me stuff getting written to disk, show me the produced JSON, etc.

Good work!

::: commit-message-960af:4
(Diff revision 6)
> +Bug 1308337 - Part 2: Receive sync telemetry, construct and upload sync pings r=nalexander
> +
> +This patch includes some "pre" work, which should have been a separate patch (my apologies!):
> +- telemetry ping is (needlessly) coupled with information about where it should

nit: "be uploaded".

I'm confused: you talk about "where to upload", which I interpret to mean "HTTP endpoint that you will POST to"; and then you talk about "bundling" multiple pings.  Why does that change _where_ they get uploaded?  If they really do hit different endpoints, state that you introduce a member for the endpoint or whatever.

(Perhaps this will become clear as I read further.)

::: commit-message-960af:24
(Diff revision 6)
> +  which is persisted on disk.
> +- upload of the bundled "sync ping" is attempted
> +- as individual "local pings" are processed into outgoing "Sync ping" bundles, they are removed
> +  from disk
> +
> +Hooks for the upcoming event telemtry data are established, to make that work easier.

nit: "telemtry" is missing an "e".

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:38
(Diff revision 6)
> +import java.util.Set;
> +
> +/**
> + * Receives and processes telemetry broadcasts from background services, namely Sync.
> + * Nomenclature:
> + * - Bundled Sync Ping: a Sync Ping as documented at http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/sync-ping.html

Generally I prefer a concrete version/git hash, at least in the commit message.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:66
(Diff revision 6)
> +
> +    private static final String PREF_FILE_BACKGROUND_TELEMETRY = AppConstants.ANDROID_PACKAGE_NAME + ".telemetry.background";
> +    private static final String PREF_IDS = "ids";
> +    private static final String PREF_LAST_ATTEMPTED_UPLOADED = "last_attempted_upload";
> +
> +    // We don't currently support passing profile along with the background telemetry.

If you don't pass it, why do you need this?  I expect you want to say that "Firefox for Desktop identifies the current profile in its telemetry pings.  Firefox for Android assumes the Gecko profile is the default profile."

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:94
(Diff revision 6)
> +            throw new IllegalStateException("Received a background telemetry broadcast without type.");
> +        }
> +
> +        ThreadUtils.postToBackgroundThread(new Runnable() {
> +            @Override
> +            public void run() {

Worth commenting that you don't want to catch exceptions here 'cuz you want to crash if there's an error here.  If we silently catch exceptions, we'll never see that our telemetry collection code is busted.

It's very dangerous, though, 'cuz you're doing a bunch of key extraction/data munging and it's easy to make an error...

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:126
(Diff revision 6)
> +                        if (uid != null) {
> +                            localPingBuilder.withUID(uid);
> +                        }
> +
> +                        if (deviceID != null) {
> +                            localPingBuilder.withDeviceID(deviceID);

I expect builder patterns to use `with` when they return a new self and to use `set` when they modify in place.  Does your `with` both return self and modify in place?  That's unusual.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:195
(Diff revision 6)
> +                    final Set<String> localEventTelemetryToRemove = eventTelemetryStore.getStoredIDs();
> +
> +                    // Bundle up all that we have in our telemetry stores.
> +                    final TelemetryOutgoingPing syncPing = new TelemetrySyncPingBundleBuilder()
> +                            .withSyncStore(syncTelemetryStore)
> +                            .withEventStore(eventTelemetryStore)

It's my understanding that there's "event" telemetry in general and "Sync event" telemetry in particular.  Is the latter a subset of the former?  Distinct from the former?

In any case, let's not camp the general name -- make this `syncEvent...` throughout.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:203
(Diff revision 6)
> +
> +                    // Persist a Sync Ping Bundle.
> +                    boolean bundledSyncPingPersisted = true;
> +                    try {
> +                        syncPingStore.storePing(syncPing);
> +                    } catch (IOException e) {

Are you confident that _only_ `IOException` is relevant?  I might catch more here.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:204
(Diff revision 6)
> +                    // Persist a Sync Ping Bundle.
> +                    boolean bundledSyncPingPersisted = true;
> +                    try {
> +                        syncPingStore.storePing(syncPing);
> +                    } catch (IOException e) {
> +                        // If we fail to persist a bundled sync ping, we can either attempt to upload it,

Isn't the most likely error that the ping bundle is large and storage space is small?  And doesn't this imply that we'll just continuously fail, potentially filling the remainder of storage with small pings until there's actually zero storage space and even those pings fail to persist?

If we never upload and clear, we'll never make progress.

Processes that can consume storage in this unbounded manner are dangerous.  The simplest thing might be to cull pings older than some cutoff; or remove all pings once the total size of the pings exceeds some bound.

I'd be happy to address this in follow-up or see an explanation of why this can't happen in the wild.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:211
(Diff revision 6)
> +                        // In short, current approach is to skip an upload.
> +                        //
> +                        // If we choose to upload a Sync Ping that failed to persist locally, it becomes
> +                        // possible to upload the same telemetry multiple times. Since currently we do
> +                        // not have IDs as part of our sync and event telemetry objects, it is impossible
> +                        // to guarantee idempotence on the receiver's end. As such, we must take care of

nit: care to not upload ...

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:231
(Diff revision 6)
> +                        //
> +                        // Yet another solution is to alter server-side processing of this data such that
> +                        // a unique ID may be included alongside each sync/event telemetry object. This
> +                        // will result in a very straightforward client implementation with much better
> +                        // consistency guarantees.
> +                        Log.e(LOG_TAG, "Unable to write bundled sync ping to disk. Skipping upload.");

Include exception in log.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:248
(Diff revision 6)
> +                // Sync Ping Bundles. Otherwise, we'll attempt another upload next time telemetry is
> +                // processed.
> +                // If we have already have some persisted Sync Pings, that means a previous upload
> +                // failed - or, less likely, is in progress and did not yet succeed. It should be safe to
> +                // upload. Even if we raced with ourselves and uploaded some of the bundled sync pings more
> +                // than once, it's possible to guarantee idempotence on the receiver's end since we

If I understand correctly: once you've persisted, the bundle has a unique ID, which all uploads will include.  Therefore, multiple uploads are "safe": the receiver can de-duplicate.

However, as you collect the pings above, you don't want to maintain the sets of IDs of the constituent pings, _and_ you don't have a persistent unique bundle ID, so the receiver can't de-duplicate.

One alternate approach would be to make the bundle ID a deterministic function of the input bundle IDs, so you don't need to store a random bundle ID across invocations.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:253
(Diff revision 6)
> +                // than once, it's possible to guarantee idempotence on the receiver's end since we
> +                // include a unique ID with each ping.
> +                // Not uploading here means possibly delaying an already once-failed upload for a long
> +                // time. See Bug 1366045 for exploring scheduling options.
> +                if (reasonToUpload != null || syncPingStore.getCount() > 0) {
> +                    // Bump the "last-uploaded" timestamp. Even though we might fail to

This comment looks like a line got lost.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:269
(Diff revision 6)
> +            }
> +        });
> +    }
> +
> +    // There's no "scheduler" in a classic sense, and so we might end up not uploading pings at all
> +    // if there haven't been new incoming telemetry data. See Bug 1366045.

nit: s/haven't been/has been no/

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:291
(Diff revision 6)
> +        if (lastUploadAttempt == 0L) {
> +            return TelemetrySyncPingBundleBuilder.UPLOAD_REASON_FIRST;
> +        }
> +
> +        // Wall clock changed significantly; upload because we can't be sure of our timing anymore.
> +        if (lastUploadAttempt >= now) {

Clocks wiggle a little all the time -- NTP doesn't say what to do around leap seconds, IIUC.  Can we have a little buffer here, maybe millis, maybe a second?

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetrySyncUploadService.java:1
(Diff revision 6)
> +package org.mozilla.gecko.telemetry;

This is unused -- presumably you meant to remove it?

::: mobile/android/base/java/org/mozilla/gecko/telemetry/pingbuilders/TelemetrySyncPingBuilder.java:1
(Diff revision 6)
> +package org.mozilla.gecko.telemetry.pingbuilders;

nit: license header, throughout.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/pingbuilders/TelemetrySyncPingBundleBuilder.java:69
(Diff revision 6)
> +        return this;
> +    }
> +
> +    @Override
> +    public TelemetryOutgoingPing build() {
> +        final DateFormat pingCreationDateFormat = new SimpleDateFormat(

I see no guidance in the documentaiton for `creationDate`:

``` 
creationDate: <ISO date>, // the date the ping was generated
```

However, for `sessionStartDate`, I see more guidance:

```
sessionStartDate: <ISO date>, // hourly precision, ISO date in local time
```

Where are you pulling the UTC time zone from here?

::: mobile/android/base/java/org/mozilla/gecko/telemetry/pingbuilders/TelemetrySyncPingBundleBuilder.java:80
(Diff revision 6)
> +        payload.put("id", docID);
> +        payload.put("creationDate", pingCreationDateFormat.format(new Date()));
> +
> +        final ExtendedJSONObject application = new ExtendedJSONObject();
> +        // First ABI is the preferred one.
> +        if (Build.SUPPORTED_ABIS.length > 0) {

`SUPPORTED_ABIS` is SDK 21+.  I'm concerned that the services code isn't getting linted (in the IDE) correctly.  If you're not getting lint warnings with these issues, please file a ticket so that we can try to improve the Gradle configuration.
Attachment #8857293 - Flags: review?(nalexander) → review-
(Assignee)

Comment 52

2 years ago
mozreview-review
Comment on attachment 8863964 [details]
Bug 1308337 - Pre: More granular tracking of record flow between repositories

https://reviewboard.mozilla.org/r/135692/#review147140

::: commit-message-4cb71:4
(Diff revision 3)
> +Bug 1308337 - Pre 2: More granular tracking of record flow between repositories r=nalexander
> +
> +This patch:
> +- introduces a way to signal that a record has been reconciled; this is not a "flow control"

That's a good question. It might not mean anything concrete in various cases. I'm mixing up abstraction levels here, but I'm not sure we have a better immediate avenue for encoding these ideas into the flow.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:82
(Diff revision 3)
>    private final RecordsChannelDelegate delegate;
>    private long fetchEnd = -1;
>  
>    private volatile ReflowIsNecessaryException reflowException;
>  
> -  protected final AtomicInteger numFetched = new AtomicInteger();
> +  private final AtomicInteger numFetched = new AtomicInteger();

- attempted = accepted + failed.
- reconciled <= accepted <= attempted.
- reconciled = accepted - {new}, where {new} is implied.
(Assignee)

Comment 53

2 years ago
mozreview-review
Comment on attachment 8864703 [details]
Bug 1308337 - Pre: Don't throw away store and fetch exceptions as they're encountered

https://reviewboard.mozilla.org/r/136344/#review147186

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/SynchronizerSession.java:205
(Diff revision 3)
>        }
>  
>        @Override
>        public void onFlowStoreFailed(RecordsChannel recordsChannel, Exception ex, String recordGuid) {
>          Logger.warn(LOG_TAG, "First RecordsChannel onFlowStoreFailed. Logging local store error.", ex);
> +        storeFailedCauseException = ex;

Categorizing and counting these errors is my intention in Bug 1362208. You're right that this patch doesn't really achieve what it could have set out to do.

Right now we're just recording one error, which isn't as useful as reporting an aggregate view into errors which we have encountered - but that is what other platforms do, and that is what the telemetry pipeline assumes.

I'm somewhat concerned that this should be reporting the _first_ error encountered, as opposed to the last (due to overrides). For example, this exception might frequently end up as Server15PreviousPostFailedException, masking the original problem.
(Assignee)

Comment 54

2 years ago
mozreview-review-reply
Comment on attachment 8857292 [details]
Bug 1308337 - Part 1: Instrument named sync stages and broadcast collected telemetry

https://reviewboard.mozilla.org/r/129248/#review146656

> This could leak the context, although I guess the scope of the callback is strictly less than the scope of context already.  Is this reasoning correct?  Or are you confident the context is the application context here, or should you work harder to get the application context?

I think that reasoning is correct, yes. I also expect this to be an application context - see https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/fxa/sync/FxAccountSyncService.java#19

> nit: you might conclude that concurrency/locking isn't necessary.

I think it does make sense to mark these as volatile to ensure visibility, since we're likely to be reading and writing these values from different threads - although, not concurrently.
(Assignee)

Updated

2 years ago
See Also: → bug 1368579
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Comment 69

2 years ago
mozreview-review-reply
Comment on attachment 8857293 [details]
Bug 1308337 - Part 8: Receive sync telemetry, construct and upload sync pings

https://reviewboard.mozilla.org/r/129250/#review132098

> nit: "be uploaded".
> 
> I'm confused: you talk about "where to upload", which I interpret to mean "HTTP endpoint that you will POST to"; and then you talk about "bundling" multiple pings.  Why does that change _where_ they get uploaded?  If they really do hit different endpoints, state that you introduce a member for the endpoint or whatever.
> 
> (Perhaps this will become clear as I read further.)

I've clarified the commit message a little bit. From your other comments, I gather things became clearer.

> Are you confident that _only_ `IOException` is relevant?  I might catch more here.

Well, my alternative is to just catch Exception, but I fear that might be too broad. I expect we will have two classes of problems here: couldn't construct JSON string to be persisted, and couldn't persist it. The former we re-wrap in IOException to avoid leaking PII, and the latter will all stem from IOException.

> Isn't the most likely error that the ping bundle is large and storage space is small?  And doesn't this imply that we'll just continuously fail, potentially filling the remainder of storage with small pings until there's actually zero storage space and even those pings fail to persist?
> 
> If we never upload and clear, we'll never make progress.
> 
> Processes that can consume storage in this unbounded manner are dangerous.  The simplest thing might be to cull pings older than some cutoff; or remove all pings once the total size of the pings exceeds some bound.
> 
> I'd be happy to address this in follow-up or see an explanation of why this can't happen in the wild.

We do some culling already, but that only happens as part of the upload process itself. If the upload is never initiated for a store, its contents are never culled. I filed Bug 1368579 to keep track of this.

> Include exception in log.

I think it should be ok to log an exception here. We're already taking care to not throw exact exceptions which occur during json-construction to avoid leaking PII.

> If I understand correctly: once you've persisted, the bundle has a unique ID, which all uploads will include.  Therefore, multiple uploads are "safe": the receiver can de-duplicate.
> 
> However, as you collect the pings above, you don't want to maintain the sets of IDs of the constituent pings, _and_ you don't have a persistent unique bundle ID, so the receiver can't de-duplicate.
> 
> One alternate approach would be to make the bundle ID a deterministic function of the input bundle IDs, so you don't need to store a random bundle ID across invocations.

I take it that this is in relation to "skip an upload, since we failed to persist a sync ping bundle".

I'm not sure I understand how this will help if we can't guarantee that a single ping won't make it into more than one bundle. Consider attempting to upload a bundle even though we failed to persist it. We might fail, or we might succeed. If we succeed, we might still fail to fully process the success event locally - that is, we might fail to remove the constituent pings (provided we can somehow derive a set of ping IDs from a single bundle ID). If that happens, they're likely to be included in some future upload bundle, and the receiver won't be able to figure this out.

Generally, I don't see how we can make _any_ idempotence guarantees here in the face of either local or distributed IO errors. My current approach tries to walk the line of "don't outsmart ourselves", which comes at a manageable cost - you've pointed out the potential storage impact in another comment.

Situation seems to become much better from the client's POV if the constituent pings are uploaded along with their unique IDs. I should investigate if this is something we can try to take care of on the receiving end. Unless I'm missing something, it seems that any hardening efforts should be focused on this protocol-level deficiency, vs trying to make a more robust client.

I'm under the impression that other sync ping clients don't really care about this, and given how we're going to use this data - for general trend analysis - that just might be the correct approach.

> I see no guidance in the documentaiton for `creationDate`:
> 
> ``` 
> creationDate: <ISO date>, // the date the ping was generated
> ```
> 
> However, for `sessionStartDate`, I see more guidance:
> 
> ```
> sessionStartDate: <ISO date>, // hourly precision, ISO date in local time
> ```
> 
> Where are you pulling the UTC time zone from here?

Desktop reports creationDate in UTC (...and entirely omits `sessionStartDate`), so I figured it's a good idea to be consistent.

> `SUPPORTED_ABIS` is SDK 21+.  I'm concerned that the services code isn't getting linted (in the IDE) correctly.  If you're not getting lint warnings with these issues, please file a ticket so that we can try to improve the Gradle configuration.

I _am_ getting "deprecated" warnings when using features marked as Deprecated (which CPU_ABIs is), but _not_ getting warnings for using features which target above our min-supported API level. I'll file a ticket.
(In reply to :Grisha Kruglov from comment #69)
> Comment on attachment 8857293 [details]
> Bug 1308337 - Part 8: Receive sync telemetry, construct and upload sync pings
> 
> https://reviewboard.mozilla.org/r/129250/#review132098
> 
> > nit: "be uploaded".
> > 
> > I'm confused: you talk about "where to upload", which I interpret to mean "HTTP endpoint that you will POST to"; and then you talk about "bundling" multiple pings.  Why does that change _where_ they get uploaded?  If they really do hit different endpoints, state that you introduce a member for the endpoint or whatever.
> > 
> > (Perhaps this will become clear as I read further.)
> 
> I've clarified the commit message a little bit. From your other comments, I
> gather things became clearer.
> 
> > Are you confident that _only_ `IOException` is relevant?  I might catch more here.
> 
> Well, my alternative is to just catch Exception, but I fear that might be
> too broad. I expect we will have two classes of problems here: couldn't
> construct JSON string to be persisted, and couldn't persist it. The former
> we re-wrap in IOException to avoid leaking PII, and the latter will all stem
> from IOException.
> 
> > Isn't the most likely error that the ping bundle is large and storage space is small?  And doesn't this imply that we'll just continuously fail, potentially filling the remainder of storage with small pings until there's actually zero storage space and even those pings fail to persist?
> > 
> > If we never upload and clear, we'll never make progress.
> > 
> > Processes that can consume storage in this unbounded manner are dangerous.  The simplest thing might be to cull pings older than some cutoff; or remove all pings once the total size of the pings exceeds some bound.
> > 
> > I'd be happy to address this in follow-up or see an explanation of why this can't happen in the wild.
> 
> We do some culling already, but that only happens as part of the upload
> process itself. If the upload is never initiated for a store, its contents
> are never culled. I filed Bug 1368579 to keep track of this.
> 
> > Include exception in log.
> 
> I think it should be ok to log an exception here. We're already taking care
> to not throw exact exceptions which occur during json-construction to avoid
> leaking PII.
> 
> > If I understand correctly: once you've persisted, the bundle has a unique ID, which all uploads will include.  Therefore, multiple uploads are "safe": the receiver can de-duplicate.
> > 
> > However, as you collect the pings above, you don't want to maintain the sets of IDs of the constituent pings, _and_ you don't have a persistent unique bundle ID, so the receiver can't de-duplicate.
> > 
> > One alternate approach would be to make the bundle ID a deterministic function of the input bundle IDs, so you don't need to store a random bundle ID across invocations.
> 
> I take it that this is in relation to "skip an upload, since we failed to
> persist a sync ping bundle".
> 
> I'm not sure I understand how this will help if we can't guarantee that a
> single ping won't make it into more than one bundle. Consider attempting to
> upload a bundle even though we failed to persist it. We might fail, or we
> might succeed. If we succeed, we might still fail to fully process the
> success event locally - that is, we might fail to remove the constituent
> pings (provided we can somehow derive a set of ping IDs from a single bundle
> ID). If that happens, they're likely to be included in some future upload
> bundle, and the receiver won't be able to figure this out.
> 
> Generally, I don't see how we can make _any_ idempotence guarantees here in
> the face of either local or distributed IO errors. My current approach tries
> to walk the line of "don't outsmart ourselves", which comes at a manageable
> cost - you've pointed out the potential storage impact in another comment.
> 
> Situation seems to become much better from the client's POV if the
> constituent pings are uploaded along with their unique IDs.

I wasn't aware that the constituent pings _were not_ uploaded with their IDs, so this definitely needs to be called out.  I'm comfortable with your trade-offs, and think you should just fold in some of this discussion into your comments.

 I should
> investigate if this is something we can try to take care of on the receiving
> end. Unless I'm missing something, it seems that any hardening efforts
> should be focused on this protocol-level deficiency, vs trying to make a
> more robust client.

I absolutely concur.  I expect the pipeline to have to de-duplicate; we might as well aim to help the client de-duplicate!

> I'm under the impression that other sync ping clients don't really care
> about this, and given how we're going to use this data - for general trend
> analysis - that just might be the correct approach.
> 
> > I see no guidance in the documentaiton for `creationDate`:
> > 
> > ``` 
> > creationDate: <ISO date>, // the date the ping was generated
> > ```
> > 
> > However, for `sessionStartDate`, I see more guidance:
> > 
> > ```
> > sessionStartDate: <ISO date>, // hourly precision, ISO date in local time
> > ```
> > 
> > Where are you pulling the UTC time zone from here?
> 
> Desktop reports creationDate in UTC (...and entirely omits
> `sessionStartDate`), so I figured it's a good idea to be consistent. 

Can you file a doc bug, then?  If we're being consistent, we should document what the next client needs to be consistent with.  Thanks!

> 
> > `SUPPORTED_ABIS` is SDK 21+.  I'm concerned that the services code isn't getting linted (in the IDE) correctly.  If you're not getting lint warnings with these issues, please file a ticket so that we can try to improve the Gradle configuration.
> 
> I _am_ getting "deprecated" warnings when using features marked as
> Deprecated (which CPU_ABIs is), but _not_ getting warnings for using
> features which target above our min-supported API level. I'll file a ticket.

I wonder... are you using the `local` configuration?  That lies about the `minSdkVersion` in order to opt-in to a faster build process (http://searchfox.org/mozilla-central/rev/b318c7dca7392bd16c0b11929f55b1be133f0b31/mobile/android/app/build.gradle#94).  Now I realize that may have been unwise.  No need to file; it's on me to think about what the best path forward is.  You might test with `localOld` and ensure that you see an API level warning in that case.

Comment 71

2 years ago
mozreview-review
Comment on attachment 8863964 [details]
Bug 1308337 - Pre: More granular tracking of record flow between repositories

https://reviewboard.mozilla.org/r/135692/#review148096
Attachment #8863964 - Flags: review?(nalexander) → review+

Comment 72

2 years ago
mozreview-review
Comment on attachment 8864703 [details]
Bug 1308337 - Pre: Don't throw away store and fetch exceptions as they're encountered

https://reviewboard.mozilla.org/r/136344/#review148098

::: commit-message-8ceb6:3
(Diff revisions 3 - 4)
> -Bug 1308337 - Pre 3: Don't throw away store and fetch exceptions as they're encountered r=nalexander
> +Bug 1308337 - Pre: Don't throw away store and fetch exceptions as they're encountered r=nalexander
>  
> -We will need them later for telemetry reporting.
> +We will need them later for telemetry reporting. For now we're just keeping the last exception which

I think you said that this agrees with Desktop -- include that in the ticket if so.
Attachment #8864703 - Flags: review?(nalexander) → review+

Comment 73

2 years ago
mozreview-review
Comment on attachment 8872498 [details]
Bug 1308337 - Part 2: Instrument Clients engine

https://reviewboard.mozilla.org/r/144030/#review148138

I'm confident this does what you desire, but I'm not confident it's the optimal expression.  Right now, `GlobalSession` has a `Collector` which the `GlobalSession` uses to produce many `StageCollector` instances.  `GlobalSession` hands a `StageCollector` to each `Stage`... which then fills in details (like `startTime` and `endTime`).  But the `GlobalSession` knows when the `Stage` started and when it finished.  Is this the right separation of concerns?

r- 'cuz you don't seem to track any of the relevant record flow errors, etc, here.  Am I missing something?
Attachment #8872498 - Flags: review?(nalexander) → review-

Comment 74

2 years ago
mozreview-review
Comment on attachment 8872499 [details]
Bug 1308337 - Part 3: Handle sync restarts during telemetry collection

https://reviewboard.mozilla.org/r/144032/#review148140

::: commit-message-45e9a:5
(Diff revision 1)
> +Bug 1308337 - Part 3: Handle sync restarts during telemetry collection r=nalexander
> +
> +The approach here is to simply mark current TelemetryCollector as having restarted.
> +The downside of this approach is that two technically separate syncs are combined into one
> +telemetry ping. However, the the two syncs are logically connected to each other, and combining

nit: the the.

::: commit-message-45e9a:6
(Diff revision 1)
> +Bug 1308337 - Part 3: Handle sync restarts during telemetry collection r=nalexander
> +
> +The approach here is to simply mark current TelemetryCollector as having restarted.
> +The downside of this approach is that two technically separate syncs are combined into one
> +telemetry ping. However, the the two syncs are logically connected to each other, and combining
> +their telemetry will make it easier to figure out why a restart have occurred, as well as what

nit: s/have occurred/has occurred/ (or just "occurred").

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/GlobalSession.java:348
(Diff revision 1)
>    /**
>     * Stop this sync and start again.
>     * @throws AlreadySyncingException
>     */
>    protected void restart() throws AlreadySyncingException {
> +    telemetryCollector.setRestarted();

This is a very unusual flow -- very edge case.  You probably won't witness this in the wild.
Attachment #8872499 - Flags: review?(nalexander) → review+

Comment 75

2 years ago
mozreview-review
Comment on attachment 8872500 [details]
Bug 1308337 - Part 4: Instrument FetchMetaGlobal stage

https://reviewboard.mozilla.org/r/144034/#review148142

When I read this, I have the same impression as for "clients": this book-keeping is better done in the `GlobalSession`.

r+ since this one isn't missing things, and I'm sure you'll counter with why you pushed the details into each stage.
Attachment #8872500 - Flags: review?(nalexander) → review+

Comment 76

2 years ago
mozreview-review
Comment on attachment 8872501 [details]
Bug 1308337 - Part 5: Instrument FetchInfoCollectionsStage

https://reviewboard.mozilla.org/r/144036/#review148144

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/FetchInfoCollectionsStage.java:53
(Diff revision 1)
>    @Override
>    public void execute() throws NoSuchStageException {
>      try {
>        session.fetchInfoCollections(new StageInfoCollectionsDelegate());
>      } catch (URISyntaxException e) {
> +      telemetryStageCollector.finished = SystemClock.elapsedRealtime();

I know it's rare, but why doesn't this one get `setLastException(e)`?  If you do that, you can see how all the flows look the same... and should be lifted to `GlobalSession`.
Attachment #8872501 - Flags: review?(nalexander) → review+

Comment 77

2 years ago
mozreview-review
Comment on attachment 8872502 [details]
Bug 1308337 - Part 6: Instrument FetchInfoConfiguration stage

https://reviewboard.mozilla.org/r/144038/#review148146

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/stage/FetchInfoConfigurationStage.java:57
(Diff revision 1)
>                  return;
>              }
>  
>              // End-point might not be available (404) if server is running an older version.
>              // We will use default config values in this case.
> +            telemetryStageCollector.error = new TelemetryCollector

Is it an _error_ if we can recover?  It's _interesting_, for sure... but how do you interpret this error in the processing pipeline?
Attachment #8872502 - Flags: review?(nalexander) → review+

Comment 78

2 years ago
mozreview-review
Comment on attachment 8872503 [details]
Bug 1308337 - Part 7: Instrument Crypto Keys stage

https://reviewboard.mozilla.org/r/144040/#review148148

Stamp!
Attachment #8872503 - Flags: review?(nalexander) → review+

Comment 79

2 years ago
mozreview-review
Comment on attachment 8857293 [details]
Bug 1308337 - Part 8: Receive sync telemetry, construct and upload sync pings

https://reviewboard.mozilla.org/r/129250/#review148150

If you're happy, I'm happy.

::: commit-message-960af:8
(Diff revisions 6 - 8)
>  This patch includes some "pre" work, which should have been a separate patch (my apologies!):
>  - telemetry ping is (needlessly) coupled with information about where it should
> -  be upload. It wasn't a problem before, since core pings are "upload right away",
> -  and are never bundled together. However, for sync ping, we need to bundle a bunch
> -  of "syncs" and "events" (down the road) into one single "sync ping", and as such
> +  be uploaded. It wasn't a problem before, since core pings are "upload right away",
> +  and are never bundled together during an upload. However, for sync ping, we need to
> +  bundle a bunch of "syncs" and "events" (down the road) into one single "sync ping", and as such
>    we need separate representation for a "ping that is not meant to be uploaded directly".

nit: "a separate representation" ...

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:95
(Diff revisions 6 - 8)
>  
>          if (!intent.hasExtra(TelemetryContract.KEY_TYPE)) {
>              throw new IllegalStateException("Received a background telemetry broadcast without type.");
>          }
>  
> +        // We want to know if any of the below code is faulty in non-obvious ways, and as such there

nit: I think it helps to spell it out:  "We want to crash and get a crash report rather than ..."

Of course, we have to trust we get crash reports when Sync is running and Firefox for Android isn't -- but we often get "startup crashes" reported that are actually Sync-triggered-at-startup crashes.

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:305
(Diff revisions 6 - 8)
>              return TelemetrySyncPingBundleBuilder.UPLOAD_REASON_FIRST;
>          }
>  
>          // Wall clock changed significantly; upload because we can't be sure of our timing anymore.
> -        if (lastUploadAttempt >= now) {
> +        // Allow for some wiggle room to account for clocks jumping around insignificantly.
> +        final long DRIFT_BUFFER = 60 * 1000L;

tiny nit: `DRIFT_BUFFER_IN_MS` -- I'm crusading to include units in names, to prevent errors.
Attachment #8857293 - Flags: review?(nalexander) → review+

Comment 80

2 years ago
mozreview-review
Comment on attachment 8857292 [details]
Bug 1308337 - Part 1: Instrument named sync stages and broadcast collected telemetry

https://reviewboard.mozilla.org/r/129248/#review148132

I'm happy if you're happy.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/GlobalSession.java:307
(Diff revisions 6 - 7)
>      try {
> -      nextStage.execute(this, telemetryCollector.collectorFor(currentState.getRepositoryName()));
> +      nextStage.execute(this, stageCollector);
>      } catch (Exception ex) {
>        Logger.warn(LOG_TAG, "Caught exception " + ex + " running stage " + next);
> +      // We're not setting stageCollector's error since there's a chance the stage already set it
> +      // and we'll loose a root cause error by overriding it here. Call to `abort` will end up calling

nit: lose, not loose.
Attachment #8857292 - Flags: review?(nalexander) → review+
grisha: close to good from me: just like to work out the Sync clients engine stuff; I don't understand how this connects to the earlier work, which I realize might still be in place.  Tests will show!

The only thing I think you didn't address explicitly was the device bundle bits.  Is it the case that the clients engine _always_ downloads all the records from the server, so we'll always witness every current device in every sync ping?  (I honestly can't remember what we do on Android.)

I think we should have rnewman gives this an sr (super-review), since he has a different eye for systemic problems than I do.

I also thought of some data I would like us to collect: I'd like us to understand how the clients collection and the FxA devices collection compare.  We'll never migrate _away_ from clients if we don't have a concrete measure tracking how suitable the devices collection is as a replacement.  Is this tracked on Desktop?  Can we help by tracking it on Android?

Good work!
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Comment 95

2 years ago
(In reply to Nick Alexander :nalexander from comment #81)
> grisha: close to good from me: just like to work out the Sync clients engine
> stuff; I don't understand how this connects to the earlier work, which I
> realize might still be in place.  Tests will show!

I'll give your comments on lifting things into GlobalSession a thought. I've pushed instrumentation into individual stages because it seemed like that would give me most flexibility and the most straightforward implementation. Now that it's all in place, it's a good time to reflect on that decision :-)

> The only thing I think you didn't address explicitly was the device bundle
> bits.  Is it the case that the clients engine _always_ downloads all the
> records from the server, so we'll always witness every current device in
> every sync ping?  (I honestly can't remember what we do on Android.)

That is my understanding of the clients engine - we always download all records, and as such every sync ping should include the whole constellation as it's known at the time. Whether or not that's a _good_ thing is a different question (how do we systematically make sense of this data, so it's not just a waste of user's bandwidth?), but at least we're being consistent here across clients.

That is also why I didn't bother instrumenting the flow of records for that stage, since one could just look at the list of devices recorded. Perhaps it's worth counting outgoing clients (it's either our own record, or clients for which we're recording commands). Commands will get treated separately as part of the Sync Events follow up work. The main goal for the Clients engine was to keep track of any errors that occur. I'll think about this a bit more, however.

> I think we should have rnewman gives this an sr (super-review), since he has
> a different eye for systemic problems than I do.

Yup, I'll flag Richard.

> I also thought of some data I would like us to collect: I'd like us to
> understand how the clients collection and the FxA devices collection
> compare.  We'll never migrate _away_ from clients if we don't have a
> concrete measure tracking how suitable the devices collection is as a
> replacement.  Is this tracked on Desktop?  Can we help by tracking it on
> Android?

That certainly goes well together with our push towards FxA devices list! I'll file a follow-up to investigate.

> Good work!

Thank you for the reviews, Nick! I'll still be pushing up some revisions and a bit more tests.
(Assignee)

Comment 96

2 years ago
Hi Richard!

Would you mind giving this a look-over (not sure if there's a super-review flag)? I'm happy to talk through the changes if you think I can save you any time at all with this.
Flags: needinfo?(rnewman)
> Would you mind giving this a look-over (not sure if there's a super-review
> flag)? I'm happy to talk through the changes if you think I can save you any
> time at all with this.

I'm looking now; will let you know if I need a grisha jetlag guide :D
Flags: needinfo?(rnewman)
Attachment #8857293 - Flags: superreview?(rnewman)

Comment 98

2 years ago
mozreview-review
Comment on attachment 8857291 [details]
Bug 1308337 - Pre: clean up wrong annotation

https://reviewboard.mozilla.org/r/129244/#review148356

::: mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryBackgroundReceiver.java:247
(Diff revision 9)
> +                        Log.e(LOG_TAG, "Unable to write bundled sync ping to disk. Skipping upload.", e);
> +                        bundledSyncPingPersisted = false;
> +                    }
> +
> +                    if (bundledSyncPingPersisted) {
> +                        // It is now safe to delete persisted telemetry which we just bundled up.

Worth noting that this whole approach is only safe if `storePing` never successfully writes to disk if it throws!

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/sync/FxAccountSyncAdapter.java:195
(Diff revision 9)
> +    }
> +
> +    @Override
> +    public void handleAborted(GlobalSession globalSession, String reason) {
> +      super.handleAborted(globalSession, reason);
> +      this.telemetryCollector.setError(TelemetryCollector.KEY_ERROR_INTERNAL, "backoff");

Strictly speaking there are other reasons to abort than a backoff. This is a bug waiting to happen…

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/android/AndroidBrowserRepositorySession.java:468
(Diff revision 9)
>              trace("Both local and remote records have been modified.");
>              if (record.lastModified > existingRecord.lastModified) {
>                trace("Remote is newer, and deleted. Deleting local.");
> +              // Note that while this counts as "reconciliation", we're probably over-counting.
> +              // Currently, locallyModified above is _always_ true if a record exists locally,
> +              // and so we'll consider any deletions of already present records as reconciliations.

Is this documented somewhere that people consuming this data will read it?

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/delegates/RepositorySessionStoreDelegate.java:20
(Diff revision 9)
>   */
>  public interface RepositorySessionStoreDelegate {
>    void onRecordStoreFailed(Exception ex, String recordGuid);
>  
> +  // Meant for signaling that a record has been reconciled.
> +  // Only makes sense in context of local repositories.

This is a signal that this abstraction is wrong.

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/synchronizer/RecordsChannel.java:271
(Diff revision 9)
>    }
>  
>    @Override
> +  public void onRecordStoreReconciled(String guid) {
> +    Logger.trace(LOG_TAG, "Reconciled record with guid " + guid);
> +    storeReconciledCount.incrementAndGet();

While I take most of the blame for the framework of over-engineering around this code, this seems like it might be a step too far.

The other delegate methods are async and control-flow oriented: they actually _do something_. Here we're bumping a counter.

In order to bump that counter we go through two levels of delegate methods, a `Runnable` in an `Executor`, and touch an `AtomicInteger`. That's… a lot of work, and it's hard to understand.

Furthermore, taking the GUID as an argument to this method implies that calling `onRecordStoreReconciled("foo")` twice will only increment the counter once. That's not the case.

Contrast to the way this works on iOS: a bundle of counters are passed in to a sync and back out when we're done. Very simple, very fast, and easier to get right — there's less chance you'll end up calling a delegate method twice!

Why did you not choose to have a simple object, owned by the session, and following the start/end lifecycle?
Implementation-wise, I think this is OK.

I have two broader concerns.

The first is that I suspect this kind of counter-based rolled-up data collection is an evolutionary dead-end. Yes, it'll help to spot -- very much delayed! -- bugs that cause lots of record rejections, and it's some measure of how fast users are creating data, but because so much context is elided it'll be hard to cut through node reassignments, device detachments, crashes/process deaths, etc. etc.

That is: it doesn't model the state transitions we care about. E.g., if every Android client starts doing a from-scratch sync of logins once per day, would we be able to tell? I think we'd just see a small bump in reconciled records, with no clue as to why. Lost in the noise.

If Alex and crew are happy with this payload, then great; perhaps I'm leaping ahead to trying to look for deeper issues and behaviors in user (and code) activities. But I thought it worth mentioning.

The second and related concern is that this implementation is quite brittle (but mostly lacking the simplicity that a truly brittle solution like iOS's has!). Nick's point about event processing speaks to this. Your ability to shift to constructing a narrative from this approach to collection, or indeed extend the data to reflect simple causality or context, or produce events, is quite limited. Substantial rework will be needed if this moves beyond a simple set of counters, and I anticipate bugs as this code evolves, as is so often the case with mutable state scattered around the codebase.

I worry that untangling this later will be sufficiently intimidating that it'll impede later growth of analytics, which is a net loss.
(Assignee)

Comment 100

2 years ago
(In reply to Richard Newman [:rnewman] from comment #98)
> mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/
> android/AndroidBrowserRepositorySession.java:468
> (Diff revision 9)
> >              trace("Both local and remote records have been modified.");
> >              if (record.lastModified > existingRecord.lastModified) {
> >                trace("Remote is newer, and deleted. Deleting local.");
> > +              // Note that while this counts as "reconciliation", we're probably over-counting.
> > +              // Currently, locallyModified above is _always_ true if a record exists locally,
> > +              // and so we'll consider any deletions of already present records as reconciliations.
> 
> Is this documented somewhere that people consuming this data will read it?

I will need to include such notes in the updates sync-ping docs.

> 
> :::
> mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/
> delegates/RepositorySessionStoreDelegate.java:20
> (Diff revision 9)
> >   */
> >  public interface RepositorySessionStoreDelegate {
> >    void onRecordStoreFailed(Exception ex, String recordGuid);
> >  
> > +  // Meant for signaling that a record has been reconciled.
> > +  // Only makes sense in context of local repositories.
> 
> This is a signal that this abstraction is wrong.

Indeed, but I think there's a very convenient "message bus" aspect to this which reads between the lines. Some of the delegate methods are already treating this abstraction layer as such - e.g. calls to "failed" methods are treated differently based on context, and are simply bumping counters for local storage and fetch failures, IIRC.

> Furthermore, taking the GUID as an argument to this method implies that
> calling `onRecordStoreReconciled("foo")` twice will only increment the
> counter once. That's not the case.

Fair point, but it's worth noting that none of the other delegate methods enforce this type of idempotancy either.

> Contrast to the way this works on iOS: a bundle of counters are passed in to
> a sync and back out when we're done. Very simple, very fast, and easier to
> get right — there's less chance you'll end up calling a delegate method
> twice!
> 
> Why did you not choose to have a simple object, owned by the session, and
> following the start/end lifecycle?

I think I did choose a simple object which is passed around and populated (except it's owned by the sync adapter, to instrument issues outside of the session), and this particular counter is a somewhat unfortunate exception.

One alternative is pushing the collector object(s) all the way down into the repositories' data access plumbing. I don't think that maintaining a strict adherence to our already somewhat ad-hoc abstractions is worth that effort and the unpleasantries it'll produce.

A different approach altogether - and probably a better one, now that I think of it - would have been to introduce a telemetry message bus of sorts, further sprinkle telemetry calls throughout the internals of stages, and have the receiver process flow of events into something resembling a "sync ping". That would almost entirely uncouple telemetry collection from the sync flow, and avoid these types of changes.
(Assignee)

Comment 101

2 years ago
(In reply to Richard Newman [:rnewman] from comment #99)
> That is: it doesn't model the state transitions we care about. E.g., if
> every Android client starts doing a from-scratch sync of logins once per
> day, would we be able to tell? I think we'd just see a small bump in
> reconciled records, with no clue as to why. Lost in the noise.

That's an excellent discussion point for a broader discussion on moving sync telemetry forward.

> If Alex and crew are happy with this payload, then great; perhaps I'm
> leaping ahead to trying to look for deeper issues and behaviors in user (and
> code) activities. But I thought it worth mentioning.

I entirely agree that what you get out of the current iteration of the sync ping is somewhat limited in scope. I do think there's enough value in it to justify this effort and set a base line, and it does give us a platform to build upon and narrow down on the variety of subtleties - or rather, expand out into state-driven telemetry.

> The second and related concern is that this implementation is quite brittle
> (but mostly lacking the simplicity that a truly brittle solution like iOS's
> has!). Nick's point about event processing speaks to this. Your ability to
> shift to constructing a narrative from this approach to collection, or
> indeed extend the data to reflect simple causality or context, or produce
> events, is quite limited. Substantial rework will be needed if this moves
> beyond a simple set of counters, and I anticipate bugs as this code evolves,
> as is so often the case with mutable state scattered around the codebase.
> 
> I worry that untangling this later will be sufficiently intimidating that
> it'll impede later growth of analytics, which is a net loss.

The current implementation is mostly just sitting on top of the flow we already had in place. Istm that expanding it to build a more context-based narrative around individual repository-backed stages will look very much like instrumentation of the non-repository stages (see other patches). At the end of the day, we have have a number of collectors into which we're stuffing counters, flags, objects, whatever else. I don't see how this is somehow particularly difficult to expand upon. It might involve a certain amount of mechanical labour to push collectors deep enough - but then there's always an alternative of messaging _out_ of the depths of sync and into a telemetry listener, which might be a more flexible path forward.

Can you elaborate as to why you think this might cause problems down the line?
(Assignee)

Updated

2 years ago
See Also: → bug 1369186
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 111

2 years ago
mozreview-review
Comment on attachment 8872498 [details]
Bug 1308337 - Part 2: Instrument Clients engine

https://reviewboard.mozilla.org/r/144030/#review149260

Carry forward r+: if you're happy with the approach, I can be too.
Attachment #8872498 - Flags: review?(nalexander) → review+
(Assignee)

Updated

2 years ago
Blocks: 1369844
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 125

2 years ago
mozreview-review
Comment on attachment 8857293 [details]
Bug 1308337 - Part 8: Receive sync telemetry, construct and upload sync pings

https://reviewboard.mozilla.org/r/129250/#review149332
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 141

2 years ago
mozreview-review
Comment on attachment 8873977 [details]
Bug 1308337 - Pre: clean up wrong annotation

https://reviewboard.mozilla.org/r/145416/#review149346
Attachment #8873977 - Flags: review?(nalexander) → review+

Comment 142

2 years ago
mozreview-review
Comment on attachment 8873978 [details]
Bug 1308337 - Pre: More granular tracking of record flow between repositories

https://reviewboard.mozilla.org/r/145418/#review149348
Attachment #8873978 - Flags: review?(nalexander) → review+

Comment 143

2 years ago
mozreview-review
Comment on attachment 8873979 [details]
Bug 1308337 - Pre: Don't throw away store and fetch exceptions as they're encountered

https://reviewboard.mozilla.org/r/145420/#review149350
Attachment #8873979 - Flags: review?(nalexander) → review+

Comment 144

2 years ago
mozreview-review
Comment on attachment 8873980 [details]
Bug 1308337 - Pre: Read hashedFxaUid from the token server

https://reviewboard.mozilla.org/r/145422/#review149352
Attachment #8873980 - Flags: review?(nalexander) → review+

Comment 145

2 years ago
mozreview-review
Comment on attachment 8873981 [details]
Bug 1308337 - Part 1: Instrument named sync stages and broadcast collected telemetry

https://reviewboard.mozilla.org/r/145424/#review149354
Attachment #8873981 - Flags: review?(nalexander) → review+

Comment 146

2 years ago
mozreview-review
Comment on attachment 8873982 [details]
Bug 1308337 - Part 2: Instrument Clients engine

https://reviewboard.mozilla.org/r/145426/#review149356
Attachment #8873982 - Flags: review?(nalexander) → review+

Comment 147

2 years ago
mozreview-review
Comment on attachment 8873983 [details]
Bug 1308337 - Part 3: Handle sync restarts during telemetry collection

https://reviewboard.mozilla.org/r/145428/#review149358
Attachment #8873983 - Flags: review?(nalexander) → review+

Comment 148

2 years ago
mozreview-review
Comment on attachment 8873984 [details]
Bug 1308337 - Part 4: Instrument FetchMetaGlobal stage

https://reviewboard.mozilla.org/r/145430/#review149360
Attachment #8873984 - Flags: review?(nalexander) → review+

Comment 149

2 years ago
mozreview-review
Comment on attachment 8873985 [details]
Bug 1308337 - Part 5: Instrument FetchInfoCollectionsStage

https://reviewboard.mozilla.org/r/145432/#review149362
Attachment #8873985 - Flags: review?(nalexander) → review+

Comment 150

2 years ago
mozreview-review
Comment on attachment 8873986 [details]
Bug 1308337 - Part 6: Instrument FetchInfoConfiguration stage

https://reviewboard.mozilla.org/r/145434/#review149364
Attachment #8873986 - Flags: review?(nalexander) → review+

Comment 151

2 years ago
mozreview-review
Comment on attachment 8873987 [details]
Bug 1308337 - Part 7: Instrument Crypto Keys stage

https://reviewboard.mozilla.org/r/145436/#review149366
Attachment #8873987 - Flags: review?(nalexander) → review+

Comment 152

2 years ago
mozreview-review
Comment on attachment 8873988 [details]
Bug 1308337 - Part 8: Receive sync telemetry, construct and upload sync pings

https://reviewboard.mozilla.org/r/145438/#review149368
Attachment #8873988 - Flags: review?(nalexander) → review+

Comment 153

2 years ago
mozreview-review
Comment on attachment 8873989 [details]
Bug 1308337 - Post: Remove old background telemetry code

https://reviewboard.mozilla.org/r/145440/#review149370
Attachment #8873989 - Flags: review?(nalexander) → review+

Comment 154

2 years ago
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/88d65a029778
Pre: clean up wrong annotation r=nalexander
https://hg.mozilla.org/integration/autoland/rev/b129a46b714e
Pre: More granular tracking of record flow between repositories r=nalexander
https://hg.mozilla.org/integration/autoland/rev/7fb2a2562a8b
Pre: Don't throw away store and fetch exceptions as they're encountered r=nalexander
https://hg.mozilla.org/integration/autoland/rev/38894ad99464
Pre: Read hashedFxaUid from the token server r=nalexander
https://hg.mozilla.org/integration/autoland/rev/02d589c299a4
Part 1: Instrument named sync stages and broadcast collected telemetry r=nalexander
https://hg.mozilla.org/integration/autoland/rev/846114208d7b
Part 2: Instrument Clients engine r=nalexander
https://hg.mozilla.org/integration/autoland/rev/2597d0701648
Part 3: Handle sync restarts during telemetry collection r=nalexander
https://hg.mozilla.org/integration/autoland/rev/99d1101ae492
Part 4: Instrument FetchMetaGlobal stage r=nalexander
https://hg.mozilla.org/integration/autoland/rev/04eedff08896
Part 5: Instrument FetchInfoCollectionsStage r=nalexander
https://hg.mozilla.org/integration/autoland/rev/6e63f290b631
Part 6: Instrument FetchInfoConfiguration stage r=nalexander
https://hg.mozilla.org/integration/autoland/rev/4acec427f8b4
Part 7: Instrument Crypto Keys stage r=nalexander
https://hg.mozilla.org/integration/autoland/rev/bae470a1d397
Part 8: Receive sync telemetry, construct and upload sync pings r=nalexander
https://hg.mozilla.org/integration/autoland/rev/72a455b224e6
Post: Remove old background telemetry code r=nalexander
(Assignee)

Updated

2 years ago
Depends on: 1369891
(Assignee)

Updated

2 years ago
Depends on: 1369892
Depends on: 1370221
Depends on: 1370111
(Assignee)

Updated

2 years ago
Depends on: 1370656
(Assignee)

Updated

2 years ago
Depends on: 1370668
(Assignee)

Updated

2 years ago
Depends on: 1373439
(Assignee)

Updated

2 years ago
Depends on: 1374758
(Assignee)

Updated

a year ago
Depends on: 1386027
(Assignee)

Updated

a year ago
Depends on: 1389233
(Assignee)

Updated

a year ago
Depends on: 1382666
Attachment #8857293 - Flags: superreview?(rnewman)

Updated

a year ago
Product: Android Background Services → Firefox for Android
You need to log in before you can comment on or make changes to this bug.