The default bug view has changed. See this FAQ.

Telemetry around SIGKILL

RESOLVED FIXED in Firefox 16

Status

()

Firefox for Android
General
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: akeybl, Assigned: gcp)

Tracking

unspecified
Firefox 17
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox16+ fixed)

Details

Attachments

(2 attachments)

(Reporter)

Description

5 years ago
We'd like to get Telemetry around SIGKILL asap in FN16 so that we can start putting together data about the correlation between

* processor architecture (ARMv6 vs ARMv7)
* # of OOM crashes
* device RAM

Taras and Brad let us know that the # of OOM crashes is the remaining puzzle piece here.
(Reporter)

Updated

5 years ago
Blocks: 779362
(Reporter)

Updated

5 years ago
tracking-firefox16: --- → +
Assignee: nobody → gpascutto
(Assignee)

Comment 1

5 years ago
Does Telemetry already the exact device model? I'm asking because there's nothing relating to that in toolkit/components/telemetry/TelemetryHistograms.h and it's not visible on the public telemetry interface. But comment 0 kinda implies it.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #1)
> Does Telemetry already the exact device model?

Well, we can get information on the CPU type:

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#366

So we need ARMv7 info and something about being SIGKILL'd due to OOM, right?
(Assignee)

Comment 3

5 years ago
I guess "arch", "version", "device", "manufacturer", "hardware" may be enough to know ARMv6 vs ARMv7, but adding a "hasARMv7" flag next to "hasARMv6" might be nicer.

I'll start with the OOM part as that's the most pressing.
(Assignee)

Comment 4

5 years ago
Created attachment 652789 [details] [diff] [review]
Patch 1. Record OOM exits. Send telemetry.

As far as I can tell, you can't intercept the SIGKILL itself, and in any case you can get that during "normal" operation and OOM in other ways, so I approached this by enumerating all cases and detecting them.

1) Fennec gets backgrounded. Users opens other applications. Fennec gets eventually killed to make room. This is the normal Android behavior, and we shouldn't report it.  

We'll get a onSaveInstanceState, at which point we will note that we were backgrounded. If upon restarting we note we're in RESTORE_OOM, but were in the background when being killed, we do not record an OOM telemetry event.

2) Fennec is in the foreground, but there is so much memory pressure that Android kills it while still being foregrounded. This is normally user-noticeable, so we want to count it as an OOM.

Same logic as previous case except that upon restart we will note that the savedInstaceState Bundle says we were in the foreground, so we *do* send telemetry.

3) Fennec runs out of memory while doing processing itself, in Java code. 

We'll throw a java.lang.OOMException, which our global exception handler intercepts. We check the exception and store a flag if OOM, before handing off to the crash reporter. Upon restart, we note the saved OOM flag and set telemetry accordingly.

4) Fennec runs out of memory while doing processing itself, in Gecko.

There is no warning whatsoever for this. We set a flag at startup noting an unclean shutdown. We clear the flag for each backgrounding, and reset it for each onRestart/onResume etc. If we have a crash that hits in the Crash Reporter, we clear the flag as well. Note that OOM kills don't hit the crash reporter. So basically, we note when Fennec disappeared without being stopped/paused and it was not due to a crash.

This is a bit more complicated than I hoped for, but it seems to handle all the cases correctly. 

Note that there are race conditions where we could get killed when the onPause handler has already finished but the Runnable it made is still pending, and whatever new is being loaded into memory causes a Gecko OOM. I see no way to avoid that one in the current Android design, as you get a StrictMode violation when you make onPause wait to hit disk (which is weird given how the docs say you're *supposed* to write stuff to storage in that function...but I digress). In any case these shouldn't be a big problem for us as we're only out for statistics.
Attachment #652789 - Flags: review?(blassey.bugs)
(Assignee)

Comment 5

5 years ago
taras and froydnj pointed out this should switch to using FLAG histograms.
Attachment #652789 - Flags: review?(blassey.bugs) → review+
(Assignee)

Comment 6

5 years ago
Created attachment 653516 [details] [diff] [review]
Patch 2. Add ARMv7 detection and signaling

I didn't see any good way of telling ARMv6 from ARMv7 devices, unless we can somehow do so via the buildid, or if we're going to enumerate all phones out there. 

This adds actual ARMv7 detection to our system-info stuff, so no need to guess. (The MSVC part of this wasn't tested - I wouldn't know where! - , though I think I got it right).
Attachment #653516 - Flags: review?(doug.turner)

Comment 7

5 years ago
Comment on attachment 653516 [details] [diff] [review]
Patch 2. Add ARMv7 detection and signaling

Review of attachment 653516 [details] [diff] [review]:
-----------------------------------------------------------------

push to try before landing on m-i.

::: xpcom/glue/arm.cpp
@@ +66,5 @@
> +{
> +#      if defined(MOZILLA_MAY_SUPPORT_ARMV7)
> +  __try
> +  {
> +    // ARMv7 DMB (Data Memory Barrier) for stores (DMB.ST)

add more of a comment because I didn't believe DMB didn't exist before v7.  How's:

// ARMv7 DMB (Data Memory Barrier) for stores (DMB.ST)
// The Data Memory Barrier existed before ARMv7 as a 
// cp15 operation, but ARMv7 introduced a dedicated
// instruction, DMB.
Attachment #653516 - Flags: review?(doug.turner) → review+
(Assignee)

Comment 8

5 years ago
https://tbpl.mozilla.org/?tree=Try&rev=5cadbbdb96d3
https://tbpl.mozilla.org/?tree=Try&rev=014d3b2c2c15

https://hg.mozilla.org/integration/mozilla-inbound/rev/03dd9481be20
https://hg.mozilla.org/integration/mozilla-inbound/rev/788a128c8ea5
(Assignee)

Comment 9

5 years ago
Comment on attachment 652789 [details] [diff] [review]
Patch 1. Record OOM exits. Send telemetry.

[Approval Request Comment]
User impact if declined: No user impact, but harder to decide where to release ARMv6 builds on.
Testing completed (on m-c, etc.): Landed on m-i recently.
Risk to taking this patch (and alternatives if risky): Low, no user visible changes, most instrumentation.
Attachment #652789 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 10

5 years ago
Comment on attachment 653516 [details] [diff] [review]
Patch 2. Add ARMv7 detection and signaling

[Approval Request Comment]
User impact if declined: No user impact, but harder to decide where to release ARMv6 builds on.
Testing completed (on m-c, etc.): Landed on m-i recently.
Risk to taking this patch (and alternatives if risky): Low, no user visible changes, mostly instrumentation.
Attachment #653516 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/03dd9481be20
https://hg.mozilla.org/mozilla-central/rev/788a128c8ea5
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 17
(Reporter)

Comment 12

5 years ago
Comment on attachment 652789 [details] [diff] [review]
Patch 1. Record OOM exits. Send telemetry.

[Triage Comment]
Let's land on Aurora at the same time so that we can get more data prior to the Beta uplift.
Attachment #652789 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(Reporter)

Updated

5 years ago
Attachment #653516 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(Assignee)

Comment 13

5 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/a5c4075368af
https://hg.mozilla.org/releases/mozilla-aurora/rev/5520545ed688

Updated

5 years ago
status-firefox16: --- → fixed
Depends on: 785960
(Reporter)

Updated

5 years ago
Duplicate of this bug: 779362
You need to log in before you can comment on or make changes to this bug.