Closed Bug 1370111 Opened 2 years ago Closed 2 years ago

Crash in java.lang.IllegalStateException: Telemetry missing ''started'' timestamp at org.mozilla.gecko.sync.telemetry.TelemetryCollector.build(TelemetryCollector.java)

Categories

(Firefox for Android :: Android Sync, defect, P1, critical)

55 Branch
Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
Firefox 55
Tracking Status
firefox54 --- unaffected
firefox55 blocking fixed

People

(Reporter: marcia, Assigned: Grisha)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-2ecd301c-edaa-47b8-8287-7e4d70170604.
=============================================================

New crash which showed up in Build ID 20170604100226: http://bit.ly/2rqn6F8. So far 20 crashes/30 installs

Part of the signature in the second line mentions Aurora: data@app@org.mozilla.fennec_aurora-2@base.apk@classes.dex@0xec1fc5
This regression has been introduced by patch [1] to fix bug 1308337.
:Grisha, could you investigate ?

[1] https://hg.mozilla.org/mozilla-central/rev/02d589c299a4
Blocks: 1308337
Flags: needinfo?(gkruglov)
Assignee: nobody → gkruglov
Status: NEW → ASSIGNED
Component: General → Android Sync
Flags: needinfo?(gkruglov)
Product: Firefox for Android → Android Background Services
Priority: -- → P1
Comment on attachment 8874636 [details]
Bug 1370111 - Post: More explicit global error handling for Sync Telemetry

https://reviewboard.mozilla.org/r/145974/#review149888

::: mobile/android/services/src/main/java/org/mozilla/gecko/sync/telemetry/TelemetryCollector.java:52
(Diff revision 2)
> -    @VisibleForTesting protected ExtendedJSONObject error;
> -    private String hashedUID;
> -    private String hashedDeviceID;
> +    // It's possible that these fields are read/written from different threads.
> +    // Volatile is used to ensure memory visibility.
> +    @VisibleForTesting protected volatile ExtendedJSONObject error;
> +    private volatile String hashedUID;
> +    private volatile String hashedDeviceID;
>      private final ArrayList<Bundle> devices = new ArrayList<>();

I wonder if visibility of contents of `devices` might be an issue, and if I should use a `ConcurrentLinkedQueue` or something similar. There's no concurrent access, but I _think_ it will be read (`build()`) in a thread that's different from the one in which it's populated.
Comment on attachment 8874635 [details]
Bug 1370111 - Set 'started' telemetry timestamp before attempting to talk to the Token Server

https://reviewboard.mozilla.org/r/145972/#review149930

This looks sensible.  You won't see the same problem with "finished" because it's set only once, immediately before sending the local broadcast.  Correct?
Attachment #8874635 - Flags: review?(nalexander) → review+
Comment on attachment 8874636 [details]
Bug 1370111 - Post: More explicit global error handling for Sync Telemetry

https://reviewboard.mozilla.org/r/145974/#review149932

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/sync/FxAccountSyncAdapter.java:177
(Diff revision 2)
>      @Override
>      public void handleError(GlobalSession globalSession, Exception ex, String reason) {
>        super.handleError(globalSession, ex, reason);
> -      this.telemetryCollector.setError(TelemetryCollector.KEY_ERROR_INTERNAL, reason);
> +      // If an error hasn't been set downstream, record what we know at this point.
> +      if (!telemetryCollector.hasError()) {
> +        telemetryCollector.setError(TelemetryCollector.KEY_ERROR_INTERNAL, reason, ex);

You could also make `setError` do this, looking ahead to a future where you keep the _first_ error rather than the _last_ error.
Attachment #8874636 - Flags: review?(nalexander) → review+
Comment on attachment 8874637 [details]
Bug 1370111 - Post: Use a helper function for grabbing monotonic time

https://reviewboard.mozilla.org/r/145976/#review149934

1) It doesn't make sense to upload `SystemClock.elapsedRealtime()`, ever.  You don't do that right now, but your naming doesn't guard against it.  Make it `.start()` and `.finish()`, push the clock management into the `TelemetryCollector`, and save the confusion later, when folks use millis since the epoch.
2) Let's not use `now()` -- rnewman and I definitely expect `now()` to be `System.currentTimeMillis()`, not `SystemClock.elapsedRealtime()`.
Attachment #8874637 - Flags: review?(nalexander) → review-
This crash is ranked #2 in FA topcrashers and it's a startup crash.
Keywords: topcrash
I experienced this crash this morning when I started to use my cell.
grisha: push the first two parts when they're green; we can talk about the post eventually.
Since this is a high volume top crash and startup crash, I'm marking this bug as a blocker for 55 nightly to move to beta.
Attachment #8874637 - Attachment is obsolete: true
Blocks: 1370221
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #15)
> Since this is a high volume top crash and startup crash, I'm marking this
> bug as a blocker for 55 nightly to move to beta.

Also relevant are Bug 1370221, Bug 1370656, and Bug 1370668, which all have patches ready to land (waiting for the tree to open).
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/84818c89ef9c
Set 'started' telemetry timestamp before attempting to talk to the Token Server r=nalexander
https://hg.mozilla.org/integration/autoland/rev/12af2c27c2b2
Post: More explicit global error handling for Sync Telemetry r=nalexander
OK, I figure those patches should land today sometime when the sheriffs get everything squared away with m-c. Thanks.
This crash seems to occur during a background sync when the browser is NOT active.  What is really odd is that it seems to be telemetry related, yet it occurs if telemetry is disabled.
https://hg.mozilla.org/mozilla-central/rev/84818c89ef9c
https://hg.mozilla.org/mozilla-central/rev/12af2c27c2b2
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 55
(In reply to Bill Gianopoulos [:WG9s] from comment #21)
> This crash seems to occur during a background sync when the browser is NOT
> active.  What is really odd is that it seems to be telemetry related, yet it
> occurs if telemetry is disabled.

Oh, dear -- I didn't check during review that this code is observing the Telemetry preferences.  grisha, did we just miss this?
Flags: needinfo?(gkruglov)
(In reply to Nick Alexander :nalexander from comment #23)
> (In reply to Bill Gianopoulos [:WG9s] from comment #21)
> > This crash seems to occur during a background sync when the browser is NOT
> > active.  What is really odd is that it seems to be telemetry related, yet it
> > occurs if telemetry is disabled.
> 
> Oh, dear -- I didn't check during review that this code is observing the
> Telemetry preferences.  grisha, did we just miss this?
I wonder if it really only fails if telemetry is disabled and that is why there is no a Telemetry started timestamp.
(In reply to Bill Gianopoulos [:WG9s] from comment #24)
> (In reply to Nick Alexander :nalexander from comment #23)
> > (In reply to Bill Gianopoulos [:WG9s] from comment #21)
> > > This crash seems to occur during a background sync when the browser is NOT
> > > active.  What is really odd is that it seems to be telemetry related, yet it
> > > occurs if telemetry is disabled.
> > 
> > Oh, dear -- I didn't check during review that this code is observing the
> > Telemetry preferences.  grisha, did we just miss this?
> I wonder if it really only fails if telemetry is disabled and that is why
> there is no a Telemetry started timestamp.

Reading a bit more, I think it's already covered.  What's happening is that Sync is always collecting data and shuttling it to the Telemetry uploader (in the same Android process).  The Telemetry uploader observes the Telemetry preferences, so I think we're OK -- but I should have verified this and understood the implications, including bugs like this.

What's happening is that bugs in the "collecting data" and "shuttling" phases still impact the system, even when the Telemetry preference is off.  This isn't ideal, but it's not a bad way to architect the system, so I don't think it needs to change -- we just need to iron out the bugs.

In this case, this bug is in the "shuttling" phase, so it bites regardless of the Telemetry preferences.
Well there are 2 reasons you might want to disable telemetry data.  One is the privacy thing and the other is a performance thing.  What is the point of wasting resources collecting a mess of data unnecessarily if nothing is go9ing to be done with it because telemetry reporting is disabled?
(In reply to Bill Gianopoulos [:WG9s] from comment #26)
> Well there are 2 reasons you might want to disable telemetry data.  One is
> the privacy thing and the other is a performance thing.  What is the point
> of wasting resources collecting a mess of data unnecessarily if nothing is
> go9ing to be done with it because telemetry reporting is disabled?

Trade-offs: complexity around managing the pref from multiple places, etc.  I agree, not ideal -- but not a show-stopper either.  We could file follow-up.
(In reply to Nick Alexander :nalexander from comment #27)
> (In reply to Bill Gianopoulos [:WG9s] from comment #26)
> > Well there are 2 reasons you might want to disable telemetry data.  One is
> > the privacy thing and the other is a performance thing.  What is the point
> > of wasting resources collecting a mess of data unnecessarily if nothing is
> > go9ing to be done with it because telemetry reporting is disabled?
> 
> Trade-offs: complexity around managing the pref from multiple places, etc. 
> I agree, not ideal -- but not a show-stopper either.  We could file
> follow-upro
Yes the whole reason i turned this off was performance because I had a browser with 6 tabs open and had 3 processes using most of the cpu seemingly dealing with telemetry data.  I turned this off to try to help performance and it seems it makes no difference because it lamely collects all the data and then does nothing with it in the telemetry disabled case.
(In reply to Bill Gianopoulos [:WG9s] from comment #28)
> (In reply to Nick Alexander :nalexander from comment #27)
> > (In reply to Bill Gianopoulos [:WG9s] from comment #26)
> > > Well there are 2 reasons you might want to disable telemetry data.  One is
> > > the privacy thing and the other is a performance thing.  What is the point
> > > of wasting resources collecting a mess of data unnecessarily if nothing is
> > > go9ing to be done with it because telemetry reporting is disabled?
> > 
> > Trade-offs: complexity around managing the pref from multiple places, etc. 
> > I agree, not ideal -- but not a show-stopper either.  We could file
> > follow-upro
> Yes the whole reason i turned this off was performance because I had a
> browser with 6 tabs open and had 3 processes using most of the cpu seemingly
> dealing with telemetry data.  I turned this off to try to help performance
> and it seems it makes no difference because it lamely collects all the data
> and then does nothing with it in the telemetry disabled case.

I think what you are seeing is a function of Sync starting, Sync crashing entire process, Sync re-starting, Sync crashing entire process, etc -- lots of stuff is happening with the re-start that will peg a CPU as Android frantically tries to get a successful sync to happen.

Please, file the follow-up ticket -- I'm trying to agree with you while also explaining why we took the initial action we took!
Correctly functioning sync telemetry should have a negligible performance impact.

However, what is somewhat concerning to me is that we are storing telemetry data regardless of the user preference. If I'm reading the code correctly, this also means that whenever user opts-in to telemetry, we will upload telemetry that was collected while user was opted out. This isn't specific to sync ping - I think our core ping "suffers" from the same issue.
Flags: needinfo?(gkruglov)
(In reply to :Grisha Kruglov from comment #30)
> Correctly functioning sync telemetry should have a negligible performance
> impact.
> 
> However, what is somewhat concerning to me is that we are storing telemetry
> data regardless of the user preference. If I'm reading the code correctly,
> this also means that whenever user opts-in to telemetry, we will upload
> telemetry that was collected while user was opted out. This isn't specific
> to sync ping - I think our core ping "suffers" from the same issue.

Yes lots of security privacy performance issues here with the current implementation.  Obviously needs to be in a new bug, however.  This one is done!
(In reply to Bill Gianopoulos [:WG9s] from comment #31)
> Yes lots of security privacy performance issues here with the current
> implementation.  Obviously needs to be in a new bug, however.  This one is
> done!

See Bug 1371400.
Product: Android Background Services → Firefox for Android
You need to log in before you can comment on or make changes to this bug.