Investigate mobile responsiveness problems caused by not returning to the event loop

RESOLVED FIXED

Status

()

defect
P1
normal
RESOLVED FIXED
11 years ago
4 months ago

People

(Reporter: mrbkap, Unassigned)

Tracking

(4 keywords)

Trunk
x86
Linux
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9.1 +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(5 attachments, 2 obsolete attachments)

Vlad wrote an instrumentation patch to see why Fennec running on VMs has serious responsiveness issues while loading "heavy" pages such as CNN.com. Part of his patch works by basically doing setTimout(0) and watching to see how long it takes before the setTimeouts fire. On a VM, he saw a lot of setTimeouts firing at regular intervals, followed by a massive slowdown (on the order of one second).

In a test run, he saw the following event being handled:

{BM(44)BM(10)ES(94)BM(103)BM(7)BM(54)BM(18)BM(37)BM(33)BM(11)BM(17)BM(10)BM(21)BM(62)BM(20)BM(16)BM(20)BM(7)BM(40)BM(17)BM(92)RP4(676)-E(0xb1a53de0,676)}

To translate: we're calling BuildModel a ton of times during the same event without ever returning to the event loop. The sum of all of the BuildModel calls is quite a lot and could explain the loss of responsiveness.

I realized that what this probably means is that we're executing a script with many calls to document.write(), each of which calls into nsParser::BuildModel. In this case, the parser won't ever interrupt (and even when it's done parsing, we won't get all the way back out to the event loop). We could verify this by adding a printf around nsParser::ResumeParse to make sure that we're not sticking around in ResumeParse forever but actually entering and exiting the parser code.
Blocks: 441364
Flags: blocking1.9.1?
Keywords: mobile, perf, topperf
The instrumentation indicates that that is not what is happening. We were instrumenting how much time was spent while the scriptloader was calling EvaluateScript, it's the ES(94) number in the log. First of all it looks like we're only spending 94milliseconds executng script, out of 676ms spent in ResumeParse. Also, almost all calls to BuildModel happens after EvaluateScript finishes executing.

There is definitely a chance that we instrumeted things wrongly though, so don't take my word for it.
(In reply to comment #1)
> There is definitely a chance that we instrumeted things wrongly though, so
> don't take my word for it.

Nope, you're correct. I added a bunch of instrumentation here, and the path through ResumeParse that I missed was the inline script case. When we see an inline script, we stop tokenizing and BuildModel. In CNN's case, there are about 6 inline scripts in a row, some of which do document.write() (causing more rounds through BuildModel) and all of these rounds around BuildModel cause us to reset the "last time we were in the event loop" counter. Ugh.

I think Ben was going to try to patch this.
If I understand correctly, we code like the following to alert before 1000ms have elapsed?

<html>
  <head>
    <script>
      function wait(ms) {
        var start = new Date().getTime();
        while (new Date().getTime() - start < ms)
          /* empty */ ;
      }      
      var start = new Date().getTime();
      setTimeout(function() { alert(new Date().getTime() - start) }, 0);
      for (var i = 0; i < 20; i++)
        document.write("<scr"+"ipt>wait(50)</scr"+"ipt>");
    </script>
  </head>
  <body></body>
</html>
(In reply to comment #3)
> If I understand correctly, we code like the following to alert before 1000ms
> have elapsed?

s/we code/we want code/
Posted patch mochitest case (obsolete) — Splinter Review
Fix to come soon.
Wait, we can't fix that case. The browser can't interrupt the for loop and each of those scripts executes synchronously. Instead, replace the for loop with 20 inline scripts in the source.
Posted patch WIP proof-of-concept patch (obsolete) — Splinter Review
No longer calling nsContentSink::WillProcessTokens (which resets mDelayTimerStart) in CNavDTD::BuildModel.  This change partially implements mrbkap's suggestion, but it goes a bit too far, since it causes the timer never to be reset.  In order to reset the timer less frequently (rather than never), I need to find other places to call WillProcessTokens (e.g., nsParser::OnDataAvailable).

Changed NS_MAX_TOKENS_DEFLECTED_IN_LOW_FREQ_MODE from 200 to 0.  This is obviously too drastic.  A value of 50 seems to produce good results on Vlad's XUL app and in the mobile browser.  I was pandering to my somewhat unrealistic test case, which wastes time not by parsing tokens but by sitting in a javascript loop for 50ms at a time.  So I need to (1) choose a more realistic value for the maximum number of tokens to deflect before sampling the clock and (2) rewrite my test so that it wastes time in a realistic way (i.e., by parsing a lot of tokens).

Have to run home now, but I wanted to put this out there so others can play with it (maybe on the n810 itself?).
Attachment #344709 - Attachment is obsolete: true
Here's the log I generated (I'll attach the patch in a moment) to prove that we're seeing a lot of inline scripts (note: we only flush tokens for <script>).
So, the timer that was controlled by nsIHTMLContentSink::WillProcessTokens really wants to be timing the amount of time that we've spent since last running the event loop. Resetting the time from the DTD's BuildModel function was totally wrong for this, since we re-enter the DTD all the time and (for inline scripts) end up simply iterating again.

I also noticed, that when we implemented incremental XML parsing, we added yet *another* notification (nsIContentSink::WillTokenize) that is only used by the XML content sink for the purpose of calling WillProcessTokens.

I got rid of that in favor of a brand new notification, "WillParse" which is only called when we enter the parser from the event loop (the three functions where this happens are OnDataAvaiable, OnStopRequest and ContinueInterruptedParsing). I haven't tested this yet, but I do know it compiles and displays web pages correctly.
Attachment #345188 - Flags: superreview?(jst)
Attachment #345188 - Flags: review?(jst)
Oh, and the patch is gigantic because we have something like 7 or 8 implementations of nsI{,HTML}ContentSink in our tree.
Attachment #345188 - Flags: superreview?(jst)
Attachment #345188 - Flags: superreview+
Attachment #345188 - Flags: review?(jst)
Attachment #345188 - Flags: review+
Can we set this selectively for mobile?
Attachment #345027 - Attachment is obsolete: true
Setting this differently for mobile makes a lot of sense...

The 200 value for desktop should probably be revisited too.
http://hg.mozilla.org/mozilla-central/rev/9d236c675d0b ... this isn't really fixed though, should I file a spinoff to track the remaining work needed?
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
yeah, please file another bug, since there is some work left to do here
This landed before branching.
Flags: blocking1.9.1? → blocking1.9.1+
Keywords: fixed1.9.1
Priority: -- → P1
Blocks: 481566
Attachment #345387 - Flags: superreview+
Attachment #345387 - Flags: review+
Attachment #345387 - Flags: approval1.9.1+
The test included with my previous patch was a useful diagnostic, but actually enabling it seems like a bad idea, since its dependency on timeouts probably would cause intermittent mochitest failures.
Attachment #365826 - Flags: review+
Component: Content → DOM
QA Contact: content → general
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.