Closed Bug 863777 Opened 7 years ago Closed 7 years ago

Teach ANR reporter to use the profiler to get a native stack

Categories

(Firefox for Android :: General, defect)

ARM
Android
defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 24

People

(Reporter: jchen, Assigned: jchen)

References

Details

(Keywords: privacy-review-needed)

Attachments

(7 files, 8 obsolete files)

1.43 KB, patch
kats
: review+
Details | Diff | Splinter Review
5.74 KB, patch
kats
: review+
Details | Diff | Splinter Review
4.44 KB, patch
blassey
: review+
Details | Diff | Splinter Review
768 bytes, patch
BenWa
: review+
Details | Diff | Splinter Review
1.60 KB, patch
BenWa
: review+
Details | Diff | Splinter Review
2.96 KB, patch
jchen
: review+
Details | Diff | Splinter Review
9.08 KB, patch
jchen
: review+
Details | Diff | Splinter Review
Offshoot from bug 860879. It'd be very useful to have the Gecko stack in ANR reports, and the profiler can provide that.
Here is a set of WIP patches that lets ANR reporter get a profile sample.
To get a profile sample, we provide three new JNI functions.

requestNativeStack() starts the profiler and let it take a sample asynchronously (while ANR reporter is processing other data)

getNativeStack() gets the profiler sample and wait if it's unavailable

releaseNativeStack() stops the profiler; it's separate from getNativeStack() because releaseNativeStack() can lock up the current thread waiting for the profiler to stop, so we want to do it last, after ANR data processing is finished
This patch adds the new native stack function calls to ANRReporter
Attached patch Fix text profile output (v1) (obsolete) — Splinter Review
Fix the function that generates profile data in text format.
I'd prefer if you moved the native functions from GeckoAppShell to ANRReporter.java, and add ANRReporter.java to the CLASSES_WITH_JNI variable in the Makefile.
Attachment #739741 - Attachment description: Bug 863777 - Add types to jni-generator.py; r= → Add types to jni-generator.py (v1)
Attachment #739741 - Flags: review?(bugmail.mozilla)
This patch adds the barebone JNI definitions for getting a native stack through the profiler
Attachment #739743 - Attachment is obsolete: true
Attachment #743411 - Flags: review?(bugmail.mozilla)
This patch implements getting a native stack using the profiler. All of these calls happen on a Java thread separate from the Gecko and main threads.

requestNativeStack starts sampling in the background. A long sample time is used so that only one sample is collected.

getNativeStack waits if necessary and gets the profiler output.

releaseNativeStack ends sampling and stops the profiler. The reason it's separate from getNativeStack is because profiler_stop waits for the sampling timing thread to stop, and the timing thread is sleeping for a long time because of the long sampling time. We only call releaseNativeStack after all processing of profiler data is done, at which time we can wait for as long as we want.
Attachment #743414 - Flags: review?(bgirard)
Add an 'androidNativeStack' section to the ANR report containing the profiler output
Attachment #739745 - Attachment is obsolete: true
Attachment #743416 - Flags: review?(blassey.bugs)
Attached patch Fix text profile output (v2) (obsolete) — Splinter Review
We use the profiler output in plain-text form; this patch adds some missing/incomplete tags to the function that generates plain-text output.
Attachment #739746 - Attachment is obsolete: true
Attachment #743418 - Flags: review?(bgirard)
Attachment #739741 - Flags: review?(bugmail.mozilla) → review+
Attachment #743411 - Flags: review?(bugmail.mozilla) → review+
If this patch works reliably, can we get rid of the 4-second early-bailout in GeckoAppShell.sendEventToGeckoSync?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)
> If this patch works reliably, can we get rid of the 4-second early-bailout
> in GeckoAppShell.sendEventToGeckoSync?

Currently, it still outputs a peudostack, until a couple more profiler unwinding bugs are fixed; it should be pretty reliable though. Getting rid of the early-bailout would definitely help us collect data. On the other hand, I guess if it was an effective workaround, then user experience will suffer a bit without it.
I'd still rather remove it and deal with the ANRs we get. If we run into problems we can't fix easily maybe we can flip it back on. I'd definitely like to get stacks of what the gecko thread is doing when these ANRs trigger, as it would shed more light on things like bug 834243.
Comment on attachment 743414 [details] [diff] [review]
Implement native stack JNI calls using the profiler API (v2)

Review of attachment 743414 [details] [diff] [review]:
-----------------------------------------------------------------

I talked with jchen I wanted to understand how this all works. I repeat it here:
1) When we startup we register for the ANR intent and spawn of a new thread to watch it.
2) At a later point the main UI (java) thread is blocked which wakes up the new ANR thread.
3) That thread now tries to collect a profile.

The calls used in this patch are not thread safe (profiler_start, profiler_get_profile).

We have a few options:
(1) We can risk crashing. I'd imagine the risk would be higher if they are using the profiler. I think the risk here is zero for the majority of our users. I'd still want to audit the code for calls that assume we're on the main thread. (free)
(2) We can finish the stack collection APIs in the profiler. (3-4 days but useful for many applications and debugging)
(3) We can make these API thread safe and perhaps allow each thread to have a profiler instance via TLS or remove all restrictions. (1-2 weeks)

I think for this purpose we should pick between 1 and 2. Brad would you accept a low risk data race knowingly or should we just finish the work in (2).

::: widget/android/AndroidJNI.cpp
@@ +885,5 @@
> +    if (profiler_is_active()) {
> +        // Don't proceed if profiler is already running
> +        return JNI_FALSE;
> +    }
> +    const char *NATIVE_STACK_FEATURES[] = {"js", "leaf", "threads"};

The js option and some custom profiler labels have URL so we will need a privacy review for them since this is the first time we're collecting these without user action. We could also disable those features.

@@ +886,5 @@
> +        // Don't proceed if profiler is already running
> +        return JNI_FALSE;
> +    }
> +    const char *NATIVE_STACK_FEATURES[] = {"js", "leaf", "threads"};
> +    profiler_register_thread("ANRReporter");

I don't think this thread is worth watching when profiling. If not we shouldn't register it for profiling.

@@ +897,5 @@
>  NS_EXPORT jstring JNICALL
>  Java_org_mozilla_gecko_ANRReporter_getNativeStack(JNIEnv* jenv, jclass)
>  {
> +    if (!profiler_is_active()) {
> +        // Maybe profiler support is disabled?

Also we can build with the profiler disabled.
Comment on attachment 743418 [details] [diff] [review]
Fix text profile output (v2)

Review of attachment 743418 [details] [diff] [review]:
-----------------------------------------------------------------

The old text format is completely deprecated and I plan on removing this code soon. Many profiling features are not supported and I'm not supporting the parsing for them in cleopatra.

Instead of fixing the format can you fix profiler_get_profile to return a stringification of the JSON profile data. This is done internally in mozilla_sampler_shutdown. You just need to copy the code and make sure our automated tests for profiler_get_profile don't regress and if they do they should be updated. The devtools tests are the only users of this API.
Attachment #743418 - Flags: review?(bgirard) → review-
Needinfo on Comment 13.
Flags: needinfo?(blassey.bugs)
(In reply to Benoit Girard (:BenWa) from comment #13)
> Comment on attachment 743414 [details] [diff] [review]
> Implement native stack JNI calls using the profiler API (v2)
> 
> Review of attachment 743414 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I talked with jchen I wanted to understand how this all works. I repeat it
> here:
> 1) When we startup we register for the ANR intent and spawn of a new thread
> to watch it.
> 2) At a later point the main UI (java) thread is blocked which wakes up the
> new ANR thread.
> 3) That thread now tries to collect a profile.
> 
> The calls used in this patch are not thread safe (profiler_start,
> profiler_get_profile).
> 
> We have a few options:
> (1) We can risk crashing. I'd imagine the risk would be higher if they are
> using the profiler. I think the risk here is zero for the majority of our
> users. I'd still want to audit the code for calls that assume we're on the
> main thread. (free)
> (2) We can finish the stack collection APIs in the profiler. (3-4 days but
> useful for many applications and debugging)
> (3) We can make these API thread safe and perhaps allow each thread to have
> a profiler instance via TLS or remove all restrictions. (1-2 weeks)
> 
> I think for this purpose we should pick between 1 and 2. Brad would you
> accept a low risk data race knowingly or should we just finish the work in
> (2).

#1 is entirely acceptable to me, but I wonder what the additional benefits of #2 are.
Flags: needinfo?(blassey.bugs)
#2 means solving bug 777828. Essentially capturing stacks is useful for debugging, profiler features and possibly telemetry.
Attachment #743416 - Flags: review?(blassey.bugs) → review+
Comment on attachment 743414 [details] [diff] [review]
Implement native stack JNI calls using the profiler API (v2)

Review of attachment 743414 [details] [diff] [review]:
-----------------------------------------------------------------

Almost there. Let's remove the register_thread and get a privacy review for this.
Attachment #743414 - Flags: review?(bgirard) → review-
Found out the reason I had to register the ANR thread was because profiler start/stop accessed the current thread's pseudo-stack. I think this patch fixes the problem.
Attachment #744811 - Flags: review?(bgirard)
This patch let profiler_get_profile return the JSON profile instead of fixing the text profile output. I will run a try build and post any fixes to tests as separate patches.
Attachment #743418 - Attachment is obsolete: true
Attachment #744814 - Flags: review?(bgirard)
This patch gets rid of registering the ANR thread. I'll also put up a privacy-review-needed notice.
Attachment #743414 - Attachment is obsolete: true
Attachment #744815 - Flags: review?(bgirard)
Comment on attachment 744811 [details] [diff] [review]
Letting profiler start/stop from a non-registered thread (v1)

Review of attachment 744811 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/platform.cpp
@@ +480,5 @@
> +        ThreadProfile* thread_profile = info->Profile();
> +        if (!thread_profile) {
> +          continue;
> +        }
> +        thread_profile->GetPseudoStack()->disableJSSampling();

js::EnableRuntimeProfilingStack(mRuntime, false); isn't thread safe.
The original code appears to not turn off js profiling on secondary thread but this makes the problem far worse. The profiler hooks into the js operation callback to disable js sampling for secondary threads but that only fixes the original problem.

If we don't turn on JS profiling we could just an assert checking that we're on the main thread.
Attachment #744811 - Flags: review?(bgirard) → review-
Comment on attachment 744814 [details] [diff] [review]
Make profiler_get_profile return the JSON profile (v2)

Review of attachment 744814 [details] [diff] [review]:
-----------------------------------------------------------------

Cool, just make sure that no tests are regressed.
Attachment #744814 - Flags: review?(bgirard) → review+
Comment on attachment 744815 [details] [diff] [review]
Implement native stack JNI calls using the profiler API (v3)

Review of attachment 744815 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/android/AndroidJNI.cpp
@@ +887,5 @@
> +        return JNI_FALSE;
> +    }
> +    const char *NATIVE_STACK_FEATURES[] = {"js", "leaf", "threads"};
> +    // Buffer one sample and let the profiler wait a long time
> +    profiler_start(100, 10000, NATIVE_STACK_FEATURES,

Can we add a giant warning saying that we know this isn't safe but decided to take the risk?
Attachment #744815 - Flags: review?(bgirard) → review+
This new patch just moves the pseudo-stack check inside the disableJS block, which is run for JS-enabled profiles and not run for ANR reporter profiles.
Attachment #744811 - Attachment is obsolete: true
Attachment #745195 - Flags: review?(bgirard)
Added a comment and disabled JS (see attachment 745195 [details] [diff] [review]). Carried over r+.
Attachment #744815 - Attachment is obsolete: true
Attachment #745198 - Flags: review+
Attachment #745195 - Flags: review?(bgirard) → review+
Adding a privacy-review-needed flag and CC'ing Alina, who did the previous privacy review.

By including profiling data in the ANR report, we may be including certain URLs the user is accessing. AFAIK this used to be JavaScript and image URLs, but since we are disabling JS profiling, the URLs collected will only include image URLs.

The data is only acquired when an ANR (hang) happens in Fennec, and will only be enabled on Nightly and Aurora builds. The data will be sent through telemetry and will be kept in a non-public, access-controlled database run by Metrics. The user is informed about ANR report collection through the telemetry opt-out notice during first-run.

FWIW, the previous privacy review for ANR reporter included sending Android logs, which may also include some URLs as part of the JavaScript console output.
No longer blocks: 860879
I would really like to see these patches get landed, and it looks like the only thing it's waiting for is privacy review. Alina, any ETA on when that might happen?
Flags: needinfo?(ahua)
I'm chiming in because Alina asked for help.

Kartikaya, could you attach the text or screenshot of the opt-out during first run?

In general, I agree that debugging requires having a full stack dump, which can include way more things than just URLs (for example, it could contain usernames, passwords, recently typed things, and other interesting strings)

From my POV, the important part of the disclosure is that it
1) Says there could be personally identifiable information in the data sent back to Mozilla
2) Says why we need the data
3) Says how we protect the data (who has access, how it's stored, how *long* it's stored, any derived data shared publicly will *not* have personally identifiable information)
4) Says that we transmit this data over SSL, as we are cleaning up all of our in product URLs to be HTTPS

It sounds like we are already sending back JS console output as part of ANR. In my opinion, adding native stack traces doesn't increase the privacy risk due to the type of data being sent. It does mean that the data that Mozilla collects is a juicier target due to increased volume.
Btw, in case it wasn't obvious, I am advocating for deleting these stack dumps after a period of time (3 months?), since they probably aren't useful when they're old anyway.
(In reply to Monica Chew [:mmc] (please use ?needinfo) from comment #29)
> I'm chiming in because Alina asked for help.
> 
> Kartikaya, could you attach the text or screenshot of the opt-out during
> first run?

The previous opt-out, which added the word "stability", was "Nightly sends information about performance, stability, hardware, usage and customizations back to Mozilla to help improve Nightly."
(http://hg.mozilla.org/mozilla-central/annotate/b8ba9eac4d2a/mobile/android/locales/en-US/chrome/browser.properties#l70)

However, a recent change removed "stability" (inadvertently I think). The current message is "Shares performance, usage, hardware and customization data about your browser with Mozilla to help us make Nightly better".
(http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/locales/en-US/android_strings.dtd#151)

> In general, I agree that debugging requires having a full stack dump, which
> can include way more things than just URLs (for example, it could contain
> usernames, passwords, recently typed things, and other interesting strings)

Note that in this case only specific information, not the full stack dump, is included. We are specifically including function names, addresses, library names, and image URLs. Other more sensitive information such as usernames, passwords, etc. are guaranteed to not be included.

AFAIK, the only potentially identifiable information included is image URLs.

> From my POV, the important part of the disclosure is that it
> 1) Says there could be personally identifiable information in the data sent
> back to Mozilla
> 2) Says why we need the data
> 3) Says how we protect the data (who has access, how it's stored, how *long*
> it's stored, any derived data shared publicly will *not* have personally
> identifiable information)
> 4) Says that we transmit this data over SSL, as we are cleaning up all of
> our in product URLs to be HTTPS
> 
> It sounds like we are already sending back JS console output as part of ANR.
> In my opinion, adding native stack traces doesn't increase the privacy risk
> due to the type of data being sent. It does mean that the data that Mozilla
> collects is a juicier target due to increased volume.

I agree we should disclose the things you mentioned, in the privacy policy, for example.
Hi Monica, please see my response to your questions in Comment 31
Flags: needinfo?(ahua) → needinfo?(mmc)
Hi Jim,

Your comment sounds reasonable to me, in fact you are disclosing less than I anticipated. I'll leave it up to an actual privacy policy person (alina) to make comments on the disclosure part.

Thanks,
Monica
Flags: needinfo?(mmc)
This patch adds a privacy feature to the profiler, which disables dynamic labels and markers. TableTicker exposes the privacy mode setting through PseudoStack, so that we can keep exporting PseudoStack.h and not TableTicker.h
Attachment #762287 - Flags: review?(bgirard)
Comment on attachment 762287 [details] [diff] [review]
Add and turn on privacy mode in profiler for ANR reports (v1)

Review of attachment 762287 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!
Attachment #762287 - Flags: review?(bgirard) → review+
(In reply to Benoit Girard (:BenWa) from comment #35)
> Comment on attachment 762287 [details] [diff] [review]
> Add and turn on privacy mode in profiler for ANR reports (v1)
> 
> Review of attachment 762287 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Nice!

Actually looks like you forgot to implement the empty static inline for non tier 1 platforms (ports) that don't support the profiler. This patch will break BSD.
(In reply to Benoit Girard (:BenWa) from comment #36)
> (In reply to Benoit Girard (:BenWa) from comment #35)
> > Comment on attachment 762287 [details] [diff] [review]
> > Add and turn on privacy mode in profiler for ANR reports (v1)
> > 
> > Review of attachment 762287 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Nice!
> 
> Actually looks like you forgot to implement the empty static inline for non
> tier 1 platforms (ports) that don't support the profiler. This patch will
> break BSD.

Added stub in GeckoProfiler.h

Carrying over r+ since the change is trivial.
Attachment #762309 - Flags: review+
Looks fine!
Attachment #762287 - Attachment is obsolete: true
Depends on: 999071
You need to log in before you can comment on or make changes to this bug.