Open Bug 694519 Opened 13 years ago Updated 2 years ago

DOM mutation/web sockets combination can lag browser UI

Categories

(Core :: DOM: Core & HTML, defect, P5)

9 Branch
x86_64
Windows 7
defect

People

(Reporter: djc, Unassigned)

Details

(Whiteboard: [Snappy:P3])

Attachments

(2 files, 2 obsolete files)

Attached file Crude python web sockets server (obsolete) —
      No description provided.
Attached file HTML web sockets client (obsolete) —
These files are a simplified example of an application I wrote at work. First start the server (tested with Python 2.7; it should hopefully also work with 2.6 or 2.5, possibly with minor changes, and I'm happy to help). Then open the HTML page (the server listens on localhost:8000 by default).

On my computer, with THROTTLE = 0.0 (see the top of the Python code), the browser UI starts being unresponsive, more so over time (i.e. there seems to be a growing backlog of messages to process). This is most noticeable to me in the tab bar: closing the tab containing the test page is sluggish, as is opening a new tab, and the tab bar seems to lag behind the content area. Clicking a link on the test page to go to another page also gets exceedingly slow. Note also that the timers being set up (to clear the background marking of fresh values) tend not to fire at all (at very high message rates) or bunch up and fire as a group all at once.
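
For reference, the interesting part of the server is basically a tight send loop along these lines (a trimmed-down sketch, not the attachment verbatim; the framing helper is minimal and the payload format here is made up for illustration):

  import struct
  import time

  THROTTLE = 0.0  # seconds to sleep between messages; 0.0 floods the browser

  def frame(payload):
      # Minimal unmasked server-to-client text frame: FIN bit + text opcode,
      # short payloads only. Just enough for this sketch.
      data = payload.encode('utf-8')
      assert len(data) < 126
      return struct.pack('BB', 0x81, len(data)) + data

  def send_loop(sock):
      # sock is a connected socket that has already completed the handshake
      n = 0
      while True:
          sock.send(frame('row %d: %f' % (n % 100, time.time())))
          n += 1
          if THROTTLE:
              time.sleep(THROTTLE)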

It all seems to work rather better in Chrome. And I wasn't sure which component this bug belongs in... but I talked about it with bz on IRC a while ago, so I hope he might fix that for me.
Attached file Added link, fixed IP.
Attachment #567038 - Attachment is obsolete: true
I get
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 525, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 477, in run
    self.__target(*self.__args, **self.__kwargs)
  File "ws.py", line 119, in handle
    sock.send(self.proto.handshake(headers, lines))
AttributeError: 'Handle' object has no attribute 'proto'
But in general, if I read the server code correctly, it just sends lots of messages to the browser, and especially if the server is running on the same machine as the browser, that leads to new events in the event loop all the time. When those events are processed, they end up changing the DOM, which causes layout changes, etc.

But I need to profile this anyway, once I get a working server ;)

Chrome has a separate process for its UI, so it can handle this situation better.
I wonder if the page itself is laggy in Chrome?
This server should not exhibit the traceback shown above.
Attachment #567037 - Attachment is obsolete: true
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 525, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 477, in run
    self.__target(*self.__args, **self.__kwargs)
  File "server.py", line 119, in handle
    sock.send(self.proto.handshake(headers, lines))
  File "server.py", line 27, in handshake
    key = headers['Sec-WebSocket-Key']
KeyError: 'Sec-WebSocket-Key'
Are you visiting the port on which the server runs? The server *only* contains the web socket server; you have to load test.html into the browser as a local file.
Oh, oops, yes. I somehow thought the server could handle both HTTP and WS.
> I wonder if the page itself is laggy in Chrome?

djc, any data on that?

What sort of data is coming in the messages?  Is this setting up and clearing thousands of timeouts at a time?
You could try Chrome for yourself? :) I tried Chrome for a bit and hit the link, and it was quick, but maybe I didn't let it run long enough before clicking. But I think our actual app also flips to the next page much quicker.

Yeah, it's setting up a lot of timeouts. I'd guess the data is a lot like the messages the test script sends; see line 123 in the server script. The messages would tend to be a little bursty in my real-world app, and probably less uniformly distributed over the table rows.
> You could try Chrome for yourself?

I could once I take the time to figure out how to actually run your server, etc.  Step-by-step instructions to reproduce would go a long way here....
So are we reading from the network faster than we can process on the main thread?
(In reply to Boris Zbarsky (:bz) from comment #12)
> I could once I take the time to figure out how to actually run your server,
> etc.  Step-by-step instructions to reproduce would go a long way here....

Yeah, I know this is not exactly minimal. But I lack enough knowledge of internals to say why it's slow or makes the UI sluggish, sorry...
I have knowledge of internals and a profiler.  What I don't have is good step-by-step instructions on what to do with the two attachments to this bug.  Neither does Olli, who's already spent several hours trying to make use of them.  I'm happy to look into the actual issue, but I don't have several hours to spend sorting out what to do with those attachments... Again, step-by-step instructions would be much appreciated.
On Linux:
python server.py

and load the HTML file in the browser.
(I loaded it from the local hard drive.)
Thanks.  Looking.
OK, so on my machine (Mac) one core was just pegged by the python server.

The other core was mostly us.

For our process, 15% of the time is kernel time; lots of select() and recvfrom and read and write.

13% of our time is the socket transport thread polling, posting events to the main thread via WebSocketChannel::ProcessInput, reading from the websocket async streams, etc, etc.

2-3% on the CC thread.

Back on the main thread, 23% of the time is painting and 10% is reflow flushed via WillPaint notifications.  This will get better when we paint off the refresh driver, I bet.

The remaining 30% of the time or so is the actual websocket event processing on the main thread.  A bit of this is the native event processing that the mac event loop does on every plevent, but about 25% is calling out into JS.  This breaks down as:

  13%  SetInnerHTML calls (2/3 of this is removing the old content and its
       frames, actually; painting off the refresh driver would help here too,
       since those frames would not exist).
   7%  setTimeout and clearTimeout; half is xpconnect overhead and the other
       half is the actual object management and so forth.
   3%  JS jitcode.
 0.5%  getElementsByTagName.
 0.5%  setAttribute.

So actionable items:

1) Land bug 598482.
2) Quickstub setTimeout and clearTimeout (why aren't they already?)
3) Figure out how to deal with the event loop swampage better.

Ideally, in an e10s world, the socket transport thread would just directly post events to the content process....

What I see in Chrome is that the UI is responsive (no surprise) but the renderer process is completely hosed: I can't select text in the webpage, the cells update at 1/10 the rate they update at in Gecko, getting slower and slower until they freeze altogether, and after clicking the link in the page there is no response for 20 seconds or so even when the cells are updating.

Oh, and we're definitely getting the events off the network faster than we can process them; I just killed the python server about 40 seconds ago and we're still updating table cells now.  Chrome seems to handle it like this: after killing the python server the content process is still frozen for a minute or two and then there is a single update that repaints.
And to make it clear, the only reason the _browser_ UI doesn't freeze in Chrome is because it's in a separate process.
Thanks for taking such a thorough look.

Sorry, the instructions to set this up are in comment 2; perhaps I should have made them stand out more. It's weird that you see slow updates in Chrome; on my Windows 7 box, Chrome seemed to be better at keeping up with the message stream. At the very least it was doing much better at resetting the cell backgrounds. Is it possible that there's an OS interaction there?

Some kind of trick to interrupt the event loop for UI interaction once in a while seems like it would definitely be useful (mostly until e10s comes along, of course, although I suppose it would still have some value after that?).
> Ideally, in an e10s world, the socket transport thread would just directly 
> post events to the content process....

How would this help exactly?  It would reduce the overhead of msg processing a little, but it's not going to solve flow control.  And I suspect the UI on the main process won't freeze from the socket transport's events, which get passed along to the IPDL thread w/o much processing generally AFAICT.   The big queue of not-yet-processed msgs is probably on the child end.

So one thing I just noticed in Chrome's websockets test suite is that they have a test that makes sure websockets are "suspended" while event processing takes place: specifically, one test pops up an alert and makes sure no later ws msgs get delivered to JS until the alert is closed.  We seem not to do the right thing there, AFAICT.  But that's probably a separate bug, right?

Do we need to add in some notion of flow control for WS msgs?  We already queue messages manually on the child in the IPDL Recv function: we could implement suspend/resume on the parent, and have the child send a msg to suspend when the queue gets too big.
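
Roughly the shape I have in mind, boiled down to a plain bounded producer/consumer queue (an illustrative Python sketch with made-up water marks, not actual Necko/IPDL code):

  import threading

  HIGH_WATER = 1000  # child asks the parent to suspend above this many queued msgs
  LOW_WATER = 100    # and to resume once it has drained below this many

  class FlowControlledQueue(object):
      def __init__(self):
          self.cond = threading.Condition()
          self.items = []
          self.suspended = False

      def put(self, msg):
          # "parent"/network side: blocks while delivery is suspended
          with self.cond:
              while self.suspended:
                  self.cond.wait()
              self.items.append(msg)
              if len(self.items) > HIGH_WATER:
                  self.suspended = True
              self.cond.notify_all()

      def get(self):
          # "child"/main-thread side: drains the queue and resumes the producer
          with self.cond:
              while not self.items:
                  self.cond.wait()
              msg = self.items.pop(0)
              if self.suspended and len(self.items) < LOW_WATER:
                  self.suspended = False
                  self.cond.notify_all()
              return msg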
BTW I mention calling IPDL directly from the socket transport thread in bug 648433.  I'm guessing it would be a fair amount of work to get everything right.
It's entirely possible that there are OS differences here; event loop stuff is one of the not-entirely-cross-platform bits, at least in Gecko.

In our case, the event loop JS runs from _is_ the event loop for UI interaction (which is in JS itself), so throttling would have to be done at the level of websockets posting events to that event loop, not by throttling the event loop itself.
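
Concretely, the kind of thing I mean, as an illustrative Python sketch (not Gecko code; post_event and deliver are injected stand-ins for event-loop dispatch and delivery to the page):

  import threading

  class CoalescingDispatcher(object):
      # At most one pending main-loop event at a time, no matter how fast
      # the socket thread is receiving messages.
      def __init__(self, post_event, deliver):
          self.post_event = post_event  # schedules a callable on the main loop
          self.deliver = deliver        # e.g. fires the page's onmessage handler
          self.lock = threading.Lock()
          self.pending = []
          self.scheduled = False

      def on_socket_message(self, msg):
          # Called on the socket transport thread.
          with self.lock:
              self.pending.append(msg)
              if not self.scheduled:
                  self.scheduled = True
                  self.post_event(self.run)

      def run(self):
          # Called on the main thread: drain everything that has arrived.
          with self.lock:
              batch, self.pending = self.pending, []
              self.scheduled = False
          for msg in batch:
              self.deliver(msg)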

> How would this help exactly? 

Again, in the e10s world, it would keep all the events involved out of the main UI event loop.  So it could still hose the event loop in the content process and on the socket thread, but you could close the tab easily, for example.

> And I suspect the UI on the main process won't freeze from the socket transport's
> events, which get passed along to the IPDL thread w/o much processing generally
> AFAICT.

Well, that would be the hope.  But if we avoid the main UI event loop altogether, then it's not even a theoretical possibility.

Again, in this testcase just the posting of events from the socket thread was 13-30% of total time, depending on how much of the poll/select time gets blamed on that.

> But that's probably a separate bug, right?

Yep.

> Do we need to add in some notion of flow control for WS msgs? 

Maybe.  It's not clear how it would help unless the message flow spike is transient....  Does the ws protocol have any way to make the server back off?  Is it allowed to drop messages once buffers fill up?

> I'm guessing it would be a fair amount of work to get everything right.

Yeah; the direct IPDL message from the socket thread would be ideal, but not necessary in the e10s world.
No, there's no flow control in the WS protocol (and you're not allowed to drop messages), so all suspending would do is push the problem back to the server (we'd stop reading TCP traffic), which could in theory detect the issue (but in practice probably wouldn't).  It would at least keep the browser from OOMing or whatever other Bad Things happen as we pile up msgs in the queue.
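
For what it's worth, "detect the issue" on the server side would amount to noticing that the socket has stopped being writable because the client quit reading and the kernel buffers filled up; hypothetical Python, nothing the attached server actually does:

  import select

  def send_or_back_off(sock, data, stall_timeout=5.0):
      # If the browser suspends reading, outgoing data backs up in the kernel
      # buffers and select() stops reporting the socket as writable.
      _, writable, _ = select.select([], [sock], [], stall_timeout)
      if not writable:
          raise RuntimeError('client is not draining; back off or disconnect')
      sock.send(data)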
re: off-main-thread delivery.  Given that e10s is coming soon, I'm inclined to not try to implement that, as it's a lot of work, would complicate the existing code, and then not get used much or at all once e10s is used for desktop.
(In reply to Jason Duell (:jduell) from comment #25)
> re: off-main-thread delivery.  Given that e10s is coming soon, I'm inclined
> to not try to implement that, as it's a lot of work, would complicate the
> existing code, and then not get used much or at all once e10s is used for
> desktop.

Jason, can you reprioritize this now that e10s has been postponed?
Whiteboard: [Snappy]
>>  re: off-main-thread delivery
>
> Jason, can you reprioritize this now that e10s has been postponed?

I'm not sure why we'd be reprioritizing it from the discussion here so far.

There are two kinds of off-main-thread necko delivery that have been proposed:

1) under e10s, deliver necko msgs to the child process directly from the socket thread.  Requires IPDL to be multithreaded (so mostly IPC work, not necko). That's the kind we've mentioned in this bug, but it's moot for non-e10s.  So priority is low (except insofar as it might affect mobile HTTP perf, not this bug).

2) Deliver OnDataAvailable and/or WS data msgs directly to a client that is not on the main thread.  The use case has always been the HTML5 parser.  This doesn't make sense for WebSockets, which are on the main thread, but could possibly be an optimization for websockets used by web workers.  It wouldn't help the application here.
(In reply to Jason Duell (:jduell) from comment #27)
> I'm not sure why we'd be reprioritizing it from the discussion here so far.

What I was asking is whether we can fix this given that e10s isn't happening. The suggestion in comment 24 sounds like it would alleviate the problem nicely.
Whiteboard: [Snappy] → [Snappy:P3]
https://bugzilla.mozilla.org/show_bug.cgi?id=1472046

Moving all DOM bugs that haven't been updated in more than 3 years and have no one currently assigned to P5.

If you have questions, please contact :mdaly.
Priority: -- → P5
Component: DOM → DOM: Core & HTML
Severity: normal → S3