Add "canary-in-the-mine" instrumentation to detect when the browser process spends too long outside its main-thread event loop(s)

RESOLVED FIXED

Status

()

Core
General
RESOLVED FIXED
8 years ago
7 years ago

People

(Reporter: cjones, Assigned: blassey)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 2 obsolete attachments)

For fennec, we want animated zoom to run at 60fps.  Bug 598873 would make things easier, but even that isn't a complete solution on fennec.  To achieve that framerate, we can't spend more than 10-15ms per event-loop task.  We can't reliably achieve this with SW compositing on the main thread, but when we have GL acceleration, we need to focus on that target.

Easy instrumentation is to create an RAII class that sets up a SIGALRM handler for 10ms in the future when starting to process a task, and clears the handler when the task finishes.  If the alarm expires while the task is running, we can NS_RUNTIMEABORT() to get backtraces, or do other fancy things.

This simple plan is confounded by us having three different event loops running on the main thread, with multiple implementations, and by nested loops.  But it should be easy to get something simple working.
This is related to but complementary to function timer instrumentation, which doesn't give us backtraces.
This is a great idea.
Just thinking out loud. We could add a new Talos test for this once it has landed (or enable it for all Talos tests?). I don't know if this is feasible since we can't really predict what the OS or other processes would throw at us, so it could be a random orange nightmare.

At any rate, we need to figure out a way to automate responsiveness testing so we can watch for regressions.
So this is for the chrome process only, not content processes, right?
(In reply to comment #3)
> Just thinking out loud. We could add a new Talos test for this once it has
> landed (or enable it for all Talos tests?). I don't know if this is feasible
> since we can't really predict what the OS or other processes would throw at us,
> so it could be a random orange nightmare.
> 
> At any rate, we need to figure out a way to automate responsiveness testing so
> we can watch for regressions.

Yes; I've heard something about shaver having written a Tbeachball test that's supposed to measure this.  I had in mind running builds instrumented with something like this as we would run them under valgrind (although, yes, that should be automated too ;)).

(In reply to comment #4)
> So this is for the chrome process only, not content processes, right?

Right, I use "browser process" and "chrome process" interchangeably.
(Assignee)

Updated

8 years ago
Assignee: nobody → blassey.bugs
Created attachment 484083 [details] [diff] [review]
patch
Attachment #484083 - Flags: review?(jones.chris.g)
Comment on attachment 484083 [details] [diff] [review]
patch

Big problems with this patch
 - always enabled, opt and debug (!)
 - backtrace_symbols_fd is non-standard, need to feature test
 - ualarm is a newer standard, also needs a feature test

>diff --git a/xpcom/threads/nsThread.cpp b/xpcom/threads/nsThread.cpp
>--- a/xpcom/threads/nsThread.cpp
>+++ b/xpcom/threads/nsThread.cpp
>@@ -500,9 +506,41 @@ nsThread::HasPendingEvents(PRBool *resul
>   return NS_OK;
> }
> 
>+#if defined(XP_UNIX) && !defined(ANDROID)
>+

Using this on an android device was the first thing I had in mind.  We
can define a canary for android, but not use backtrace_symbols_fd
(bionic has ualarm(), right?).  Just being able to break on offenders
in gdb is enough for shotgun-profiling-type use.

Also, please use

#if defined(BLAH) && ...
# define MOZ_CANARY

so this condition can live in one place.

>+static void canary_alarm_handler (int signum)
>+{
>+  void *array[30];
>+  
>+  const char msg[29] = "event took too long to run:\n";
>+  write(2, msg, sizeof(msg));

Please add a comment here about using write() for async-signal safety,
I did a double-take on first glance ;).  s/2/STDERR_FILENO/.

>+  backtrace_symbols_fd(array, backtrace(array, 30), 2);

This is really cool, good idea :).  We could use this info in a
postprocessing phase.

>+  raise(SIGCONT);

Why is this here?  Just returning from the signal handler should
continue execution.

>+}
>+
>+class canary {

Style is "class Canary".

>+public:
>+  canary() {
>+    if (NS_IsMainThread() && XRE_GetProcessType() == GeckoProcessType_Default) {

Hmmmm ... when should we allow this.  We definitely only want this |if
(PR_GetEnv("MOZ_KILL_CANARIES"))| or something.  It's only really fair
to use this instrumentation in opt builds, too, since debug ones can
do all sorts of random stuff.  It's not-so-great to add an env var
check to ProcessNextEvent(), but I would hope it's not going to add
noticeable overhead.  We can try just the |if (getenv())| guard, for
opt and debug, and restrict to a build flag if talos notices.

>+  ~canary() {
>+    if (NS_IsMainThread() && XRE_GetProcessType() == GeckoProcessType_Default) {
>+      ualarm(0, 0);
>+    }

XXX or FIXME here plz about this not working for nested loops (OK by
me, for v0).

>+  }
>+};
>+#endif
>+
> NS_IMETHODIMP
> nsThread::ProcessNextEvent(PRBool mayWait, PRBool *result)
> {
>+#if defined(XP_UNIX) && !defined(ANDROID)

#ifdef MOZ_CANARY per above.

We'll want to move this code somewhere else if/when we want to
instrument the other event loops, but I agree with deferring until
that time.

r- for the big problems above.
Attachment #484083 - Flags: review?(jones.chris.g) → review-
I played around with this patch a bit on desktop-gtk.  Cool stuff!  Three big offenders that popped up consistently are loading XUL-y/XBL-y/CSS-y stuff, initializing NSS, and XSync()ing after ShmPutImage.  But, in the steady state we spend our time in the glib event loop, which triggers a false positive every time it sits in poll() for more than 15ms.  We're going to need to instrument the glib loop, at least, to get useful results.  And that itself requires handling nested canaries (hah) because of the atrocious design of our event loops.
Created attachment 485654 [details] [diff] [review]
patch
Attachment #484083 - Attachment is obsolete: true
Attachment #485654 - Flags: review?(jones.chris.g)
Comment on attachment 485654 [details] [diff] [review]
patch

We don't want this instrumentation to run by default.  Per comment 7, let's only enable it if the env var is set.

Other nitty stuff looks OK.
Attachment #485654 - Flags: review?(jones.chris.g) → review-
Comment on attachment 485654 [details] [diff] [review]
patch

(In reply to comment #10)
> Comment on attachment 485654 [details] [diff] [review]
> patch
> 
> We don't want this instrumentation to run by default.  Per comment 7, let's
> only enable it if the env var is set.
> 
> Other nitty stuff looks OK.

unless there's an error in the logic I'm not seeing, sOutputFD will be set to 0 if the env var is not set, and this code will not be run.

Or are you saying we should only build this code if a macro is set at build time?
Attachment #485654 - Flags: review- → review?(jones.chris.g)
Comment on attachment 485654 [details] [diff] [review]
patch

ahh, there is an error in the logic in that the alarm will be set once
Attachment #485654 - Flags: review?(jones.chris.g)
Created attachment 485962 [details] [diff] [review]
patch
Attachment #485654 - Attachment is obsolete: true
Attachment #485962 - Flags: review?(jones.chris.g)
Comment on attachment 485962 [details] [diff] [review]
patch

>+  Canary() {
>+    if (sOutputFD != 0 && NS_IsMainThread() && 
>+        XRE_GetProcessType() == GeckoProcessType_Default) {
>+      if (sOutputFD == -1) {
>+        const int flags = O_WRONLY | O_APPEND | O_CREAT | O_NONBLOCK;
>+        const mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
>+        char* env_var_flag = getenv("MOZ_KILL_CANARIES");
>+        sOutputFD = env_var_flag ? (env_var_flag[0] ?
>+                                    open(env_var_flag, flags, mode) :
>+                                    STDERR_FILENO) : 0;
>+        if (sOutputFD == 0)
>+          return;
>+      }
>+      signal(SIGALRM, canary_alarm_handler);
>+      ualarm(15000, 0);
>+      
>+    }

Extra newline here.
Attachment #485962 - Flags: review?(jones.chris.g) → review+
(Assignee)

Updated

8 years ago
Attachment #485962 - Flags: approval2.0?
Comment on attachment 485962 [details] [diff] [review]
patch

a=beltzner, can land during the b7 freeze due to value of getting this information.

A blog post should accompany this, so that we can get people using it!
Attachment #485962 - Flags: approval2.0? → approval2.0+
Can we maybe dump the js stack too?
(In reply to comment #16)
> Can we maybe dump the js stack too?

please file a follow up
I'll wait until we need it then.
Brad, if we're going to point people at this v0 implementation, can we please move

+#ifdef MOZ_CANARY
+  Canary canary;
+#endif

to just after

   ++mRunningEvent;

in nsThread::ProcessNextEvent?  That still ignores the native event loop, but prevents it from contributing false positives (except with nesting).

(In reply to comment #16)
> Can we maybe dump the js stack too?

Technically we shouldn't do that, because DumpJSStack() isn't async-signal safe.  But we're already using backtrace_symbols_fd() which isn't safe either, and this isn't mission-critical code, so we could probably add that.  It'll occasionally lead to deadlocks and so forth though.
pushed http://hg.mozilla.org/mozilla-central/rev/b866072123fa
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED

Updated

8 years ago
Depends on: 607944

Updated

8 years ago
Depends on: 617299

Updated

8 years ago
No longer depends on: 617299
You need to log in before you can comment on or make changes to this bug.