Last Comment Bug 779687 - Telemetry around SIGKILL
: Telemetry around SIGKILL
Status: RESOLVED FIXED
:
Product: Firefox for Android
Classification: Client Software
Component: General (show other bugs)
: unspecified
: All All
: -- normal (vote)
: Firefox 17
Assigned To: Gian-Carlo Pascutto [:gcp]
:
Mentors:
: 779362 (view as bug list)
Depends on: 785960
Blocks: 779362
  Show dependency treegraph
 
Reported: 2012-08-01 16:56 PDT by Alex Keybl [:akeybl]
Modified: 2012-09-04 08:42 PDT (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed


Attachments
Patch 1. Record OOM exits. Send telemetry. (14.21 KB, patch)
2012-08-17 09:15 PDT, Gian-Carlo Pascutto [:gcp]
blassey.bugs: review+
akeybl: approval‑mozilla‑aurora+
Details | Diff | Review
Patch 2. Add ARMv7 detection and signaling (9.44 KB, patch)
2012-08-20 14:13 PDT, Gian-Carlo Pascutto [:gcp]
dougt: review+
akeybl: approval‑mozilla‑aurora+
Details | Diff | Review

Description Alex Keybl [:akeybl] 2012-08-01 16:56:11 PDT
We'd like to get Telemetry around SIGKILL asap in FN16 so that we can start putting together data about the correlation between

* processor architecture (ARMv6 vs ARMv7)
* # of OOM crashes
* device RAM

Taras and Brad let us know that the # of OOM crashes is the remaining puzzle piece here.
Comment 1 Gian-Carlo Pascutto [:gcp] 2012-08-15 05:25:11 PDT
Does Telemetry already the exact device model? I'm asking because there's nothing relating to that in toolkit/components/telemetry/TelemetryHistograms.h and it's not visible on the public telemetry interface. But comment 0 kinda implies it.
Comment 2 Nathan Froyd [:froydnj] 2012-08-15 08:09:25 PDT
(In reply to Gian-Carlo Pascutto (:gcp) from comment #1)
> Does Telemetry already the exact device model?

Well, we can get information on the CPU type:

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#366

So we need ARMv7 info and something about being SIGKILL'd due to OOM, right?
Comment 3 Gian-Carlo Pascutto [:gcp] 2012-08-15 08:19:32 PDT
I guess "arch", "version", "device", "manufacturer", "hardware" may be enough to know ARMv6 vs ARMv7, but adding a "hasARMv7" flag next to "hasARMv6" might be nicer.

I'll start with the OOM part as that's the most pressing.
Comment 4 Gian-Carlo Pascutto [:gcp] 2012-08-17 09:15:28 PDT
Created attachment 652789 [details] [diff] [review]
Patch 1. Record OOM exits. Send telemetry.

As far as I can tell, you can't intercept the SIGKILL itself, and in any case you can get that during "normal" operation and OOM in other ways, so I approached this by enumerating all cases and detecting them.

1) Fennec gets backgrounded. Users opens other applications. Fennec gets eventually killed to make room. This is the normal Android behavior, and we shouldn't report it.  

We'll get a onSaveInstanceState, at which point we will note that we were backgrounded. If upon restarting we note we're in RESTORE_OOM, but were in the background when being killed, we do not record an OOM telemetry event.

2) Fennec is in the foreground, but there is so much memory pressure that Android kills it while still being foregrounded. This is normally user-noticeable, so we want to count it as an OOM.

Same logic as previous case except that upon restart we will note that the savedInstaceState Bundle says we were in the foreground, so we *do* send telemetry.

3) Fennec runs out of memory while doing processing itself, in Java code. 

We'll throw a java.lang.OOMException, which our global exception handler intercepts. We check the exception and store a flag if OOM, before handing off to the crash reporter. Upon restart, we note the saved OOM flag and set telemetry accordingly.

4) Fennec runs out of memory while doing processing itself, in Gecko.

There is no warning whatsoever for this. We set a flag at startup noting an unclean shutdown. We clear the flag for each backgrounding, and reset it for each onRestart/onResume etc. If we have a crash that hits in the Crash Reporter, we clear the flag as well. Note that OOM kills don't hit the crash reporter. So basically, we note when Fennec disappeared without being stopped/paused and it was not due to a crash.

This is a bit more complicated than I hoped for, but it seems to handle all the cases correctly. 

Note that there are race conditions where we could get killed when the onPause handler has already finished but the Runnable it made is still pending, and whatever new is being loaded into memory causes a Gecko OOM. I see no way to avoid that one in the current Android design, as you get a StrictMode violation when you make onPause wait to hit disk (which is weird given how the docs say you're *supposed* to write stuff to storage in that function...but I digress). In any case these shouldn't be a big problem for us as we're only out for statistics.
Comment 5 Gian-Carlo Pascutto [:gcp] 2012-08-17 10:14:51 PDT
taras and froydnj pointed out this should switch to using FLAG histograms.
Comment 6 Gian-Carlo Pascutto [:gcp] 2012-08-20 14:13:07 PDT
Created attachment 653516 [details] [diff] [review]
Patch 2. Add ARMv7 detection and signaling

I didn't see any good way of telling ARMv6 from ARMv7 devices, unless we can somehow do so via the buildid, or if we're going to enumerate all phones out there. 

This adds actual ARMv7 detection to our system-info stuff, so no need to guess. (The MSVC part of this wasn't tested - I wouldn't know where! - , though I think I got it right).
Comment 7 Doug Turner (:dougt) 2012-08-20 15:13:34 PDT
Comment on attachment 653516 [details] [diff] [review]
Patch 2. Add ARMv7 detection and signaling

Review of attachment 653516 [details] [diff] [review]:
-----------------------------------------------------------------

push to try before landing on m-i.

::: xpcom/glue/arm.cpp
@@ +66,5 @@
> +{
> +#      if defined(MOZILLA_MAY_SUPPORT_ARMV7)
> +  __try
> +  {
> +    // ARMv7 DMB (Data Memory Barrier) for stores (DMB.ST)

add more of a comment because I didn't believe DMB didn't exist before v7.  How's:

// ARMv7 DMB (Data Memory Barrier) for stores (DMB.ST)
// The Data Memory Barrier existed before ARMv7 as a 
// cp15 operation, but ARMv7 introduced a dedicated
// instruction, DMB.
Comment 9 Gian-Carlo Pascutto [:gcp] 2012-08-21 08:34:35 PDT
Comment on attachment 652789 [details] [diff] [review]
Patch 1. Record OOM exits. Send telemetry.

[Approval Request Comment]
User impact if declined: No user impact, but harder to decide where to release ARMv6 builds on.
Testing completed (on m-c, etc.): Landed on m-i recently.
Risk to taking this patch (and alternatives if risky): Low, no user visible changes, most instrumentation.
Comment 10 Gian-Carlo Pascutto [:gcp] 2012-08-21 08:35:25 PDT
Comment on attachment 653516 [details] [diff] [review]
Patch 2. Add ARMv7 detection and signaling

[Approval Request Comment]
User impact if declined: No user impact, but harder to decide where to release ARMv6 builds on.
Testing completed (on m-c, etc.): Landed on m-i recently.
Risk to taking this patch (and alternatives if risky): Low, no user visible changes, mostly instrumentation.
Comment 12 Alex Keybl [:akeybl] 2012-08-22 12:28:40 PDT
Comment on attachment 652789 [details] [diff] [review]
Patch 1. Record OOM exits. Send telemetry.

[Triage Comment]
Let's land on Aurora at the same time so that we can get more data prior to the Beta uplift.
Comment 14 Alex Keybl [:akeybl] 2012-09-04 08:42:57 PDT
*** Bug 779362 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.