Closed Bug 1142079 Opened 5 years ago Closed 5 years ago

crash in @0x0 | mozilla::Telemetry::Accumulate(mozilla::Telemetry::ID, unsigned int)

Categories

(Toolkit :: Telemetry, defect, critical)

All
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox36 --- wontfix
firefox37 + fixed
firefox38 --- fixed
firefox39 --- fixed

People

(Reporter: u421692, Assigned: avih)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-15f0d235-874e-4d25-b98d-52ef82150311.
=============================================================
Reproduced it on a Samsung Galaxy S5(Android 4.4.2).
Currently ranked #32 in Top Crashers for FennecAndroid 37.0b2; first crash appearance: 2012-02-12
bsmedberg - I hate seeing any negative impact from a system like Telemetry. Can you prioritize this and please find an owner?
Component: General → Telemetry
Flags: needinfo?(benjamin)
Product: Firefox for Android → Toolkit
Flags: needinfo?(benjamin) → needinfo?(vdjeric)
I think Avi's addition of the FX_REFRESH_DRIVER_FRAME_DELAY_MS histogram to Firefox 36 is aggravating some race condition.

David, do you have any insights into what's happening here?
Flags: needinfo?(vdjeric) → needinfo?(dmajor)
See Also: → 1141565
(In reply to Mihai Pop from comment #0)
> ...
> Currently ranked #32 in Top Crashers for FennecAndroid 37.0b2; first crash
> appearance: 2012-02-12

Can we deduce the date at which the offending patch landed on m-c?

Vladan, any reason you think FX_REFRESH_DRIVER_FRAME_DELAY_MS could be a cause? I can't find it on the report page from comment 0.
(In reply to Avi Halachmi (:avih) from comment #3)
> Vladan, any reason you think FX_REFRESH_DRIVER_FRAME_DELAY_MS could be a
> cause? I can't find it on the report page from comment 0.

A lot of the other crash reports have crash stacks where the RefreshDriver is calling Telemetry::Accumulate

https://crash-stats.mozilla.com/report/list?product=FennecAndroid&signature=%400x0+|+mozilla%3A%3ATelemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29#tab-reports

Also, this signature spiked in 36 and the frame delay histogram landed in 36. Finally, the frame delay histogram accumulates the most samples per session, so I think it makes sense that it would expose an existing race condition.

Please need-info me in the future, I'm CC'ed on way too many bugs
(In reply to Mihai Pop from comment #0)
> Reproduced it on a Samsung Galaxy S5(Android 4.4.2).

Can you reproduce it reliably?

If yes, I can do few try pushes where each eliminates a different a different probe.

The refresh driver has 3 telemetry accumulations: two FX_REFRESH_DRIVER_FRAME_DELAY_MS (one of which only reachable on Windows 7 and up so it can't be it) and one REFRESH_DRIVER_TICK which mchang added about the same time. Both probes have similar counts.

While it sounds like the race condition should be fixed, a quicker step would be to disable the offending probe[s] on affected platforms.

Do we know if this affects only/mostly android? comment 0 and comment 4 both point to android stacks.

If yes, we can disable it only on android.
Vladan pointed me over here based on bug 1141565, but this doesn't look like a race condition...unless at least one of the histograms in question is accumulated on multiple threads.  Is that true?
Flags: needinfo?(avihpit)
(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #6)
> Vladan pointed me over here based on bug 1141565, but this doesn't look like
> a race condition...unless at least one of the histograms in question is
> accumulated on multiple threads.  Is that true?

I think it could happen. AFAIK there's a refresh driver on the parent thread and one for a content thread. They both use the exact same name since they measure the exact same thing, just on different threads.

Vladan said the telemetry infrastructure should know to distinguish between them and treat them differently.

That being said, AFAIK it's still possible that at some stage there will be more than one content thread, at which case the telemetry infrastructure should still know how to not race when the save probe is logged from two different threads.
Flags: needinfo?(avihpit)
(In reply to Avi Halachmi (:avih) from comment #7)
> ...
> should still know how to not race when the save probe is logged from two
> different threads.

s/save/same/
(In reply to Avi Halachmi (:avih) from comment #5)
> (In reply to Mihai Pop from comment #0)
> > Reproduced it on a Samsung Galaxy S5(Android 4.4.2).
> 
> Can you reproduce it reliably?

No, I don't have STR for the crash, I wasn't able to reproduce it, only once.
(In reply to Avi Halachmi (:avih) from comment #7)
> I think it could happen. AFAIK there's a refresh driver on the parent thread
> and one for a content thread. They both use the exact same name since they
> measure the exact same thing, just on different threads.
> 
> Vladan said the telemetry infrastructure should know to distinguish between
> them and treat them differently.

On E10S, the parent & child painting threads are in different processes, and each process has its own Telemetry.
E10S isn't enabled for Fennec.
I hesitate to blame this on the refresh driver change, since from a random spot-check I saw tons of other Accumulate callers too.

I find it interesting that we're *executing* at address 0 rather than reading. It looks like we've called through a busted vtable. My debuggers aren't good with android files; can anyone figure out exactly what assembly code happened before the call?
Flags: needinfo?(dmajor)
from irc, I'm ok deleting the :REFRESH_DRIVER_TICK telemetry probe in the refresh driver.
Just a note, the last 37 Beta goes to build this Thursday, Mar 19 since this release cycle is a week shorter than usual. If you are able land a fix we can get help from QA to verify it.
Are you okay with taking care of this in time for the final 37 Beta, Mason?
Flags: needinfo?(mchang)
(In reply to Mason Chang [:mchang] from comment #12)
> from irc, I'm ok deleting the :REFRESH_DRIVER_TICK telemetry probe in the
> refresh driver.

Woops, I'm ok deleting the :REFRESH_DRIVER_TICK telemetry probe for android only. From irc, Avih said he can take look into this.
Flags: needinfo?(mchang)
Note that one #ifndef is already inside an #ifdef XP_WIN section so it's a no-op, but I put it there anyway both for consistency and also in case someone copies the telemetry probe code to elsewhere within that file where it's not limited to Windows.

Also, I used #ifndef ANDROID which covers both Fennec and B2G (where MOZ_WIDGET_ANDROID is for Fennec only). We don't know which of the Android platforms are affected, so better play it safe. It's fine on any of these platforms also without those probes, and we can reconsider later if we want to turn it on again with finer granularity.
Attachment #8579266 - Flags: review?(nfroyd)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=90034b3cb11b

All green on all platforms except 3 tests which should be unrelated.
Comment on attachment 8579266 [details] [diff] [review]
Disable telemetry from the refresh driver on Android

Review of attachment 8579266 [details] [diff] [review]:
-----------------------------------------------------------------

It's too bad we can't have negative cpp_guards in Histograms.json to make sure folks can't use that histogram on Android...
Attachment #8579266 - Flags: review?(nfroyd) → review+
Thanks.

https://hg.mozilla.org/integration/mozilla-inbound/rev/e2c78b9d2254

(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #18)
> It's too bad we can't have negative cpp_guards in Histograms.json to make
> sure folks can't use that histogram on Android...

This histogram is not special. The special thing is that the accumulation frequency is high since the refresh driver can iterate up to 60 times/sec, so this increases the chances of triggering a race condition if the telemetry code has race bugs.
Comment on attachment 8579266 [details] [diff] [review]
Disable telemetry from the refresh driver on Android

Approval Request Comment
[Feature/regressing bug #]:
Bug 1107733, bug 1100920 added telemetry probes which probably hit a telemetry bug elsewhere. That other bug has not been identified yet other than the fact that it probably exist.

[User impact if declined]:
Top 32 crash on Android (exact platforms scope is unclear).

[Describe test coverage new/current, TreeHerder]:
Green try push on all platforms except 3 tests failures which should be unrelated.

[Risks and why]: 
Very low - it's 2 telemetry collection probes which the patch bypasses on Android. This data is not used further at the code except when uploading it to the telemetry servers as an aggregation.

[String/UUID change made/needed]:
None.
Attachment #8579266 - Flags: approval-mozilla-beta?
Attachment #8579266 - Flags: approval-mozilla-aurora?
Assignee: nobody → avihpit
(In reply to Liz Henry (:lizzard) from comment #13)
> Just a note, the last 37 Beta goes to build this Thursday, Mar 19 since this
> release cycle is a week shorter than usual. If you are able land a fix we
> can get help from QA to verify it.

We're getting a bit tight on schedule... if anyone could help in making it happen, would be appreciated.

I talked to ryanvm earlier and he said he'll land it soon such that it can get into the uplift review queue, but I guess there were issues preventing this, and now he's away.

Lawrence?
(In reply to Liz Henry (:lizzard) from comment #13)
> Just a note, the last 37 Beta goes to build this Thursday, Mar 19 since this
> release cycle is a week shorter than usual. If you are able land a fix we
> can get help from QA to verify it.

Also, according to comment 9 I don't think we have an STR. How would QA test it if we can't reproduce it?
https://hg.mozilla.org/mozilla-central/rev/e2c78b9d2254
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Comment on attachment 8579266 [details] [diff] [review]
Disable telemetry from the refresh driver on Android

Simple fix for an Android crash. No impact on desktop. Beta+ Aurora+
Attachment #8579266 - Flags: approval-mozilla-beta?
Attachment #8579266 - Flags: approval-mozilla-beta+
Attachment #8579266 - Flags: approval-mozilla-aurora?
Attachment #8579266 - Flags: approval-mozilla-aurora+
Thanks.
You need to log in before you can comment on or make changes to this bug.