Open Bug 1389812 Opened 3 years ago Updated 2 months ago

orbit.chat may freeze or crash the browser

Categories

(Core :: WebRTC: Networking, defect, P2)

55 Branch
defect


UNCONFIRMED

People

(Reporter: mozilla, Unassigned)

Details

(Keywords: crash, crashreportid, hang)

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0
Build ID: 20170803173024

Steps to reproduce:

1. Navigate to http://orbit.chat/

2. Enter a nickname and press return (there is no need to register or log in)

3. Leave the application running


Actual results:

CPU usage of the web content process starts climbing steadily.  Eventually (in my case, after 15-30 minutes) the browser UI stops being redrawn.  Tooltips are still displayed and the mouse cursor still changes appearance when hovering over text fields or links in other tabs, but both the chrome and the content look completely frozen.

The console gets a steady flood of sandbox violations (syscall 49), which may or may not be related.  It is also flooded with an increasing number of [GFX1-] failures:

Sandbox: seccomp sandbox violation: pid 5022, tid 5050, syscall 49, args 16 140223767856976 12 8 0 3881477162.
Sandbox: seccomp sandbox violation: pid 5022, tid 5050, syscall 49, args 16 140223767856976 12 2048 0 3881477162.
Sandbox: seccomp sandbox violation: pid 5022, tid 5050, syscall 49, args 16 140223767856976 12 2097152 0 3881477162.
Sandbox: seccomp sandbox violation: pid 5022, tid 5050, syscall 49, args 16 140223767856976 12 524288 0 3881477162.
[GFX1-]: Failed 2 buffer db=0 dw=0 for 0, 0, 1280, 1040
[GFX1-]: Failed 2 buffer db=0 dw=0 for 0, 0, 1280, 1040
[GFX1-]: Failed 2 buffer db=0 dw=0 for 0, 0, 1280, 1040
[GFX1-]: Failed 2 buffer db=0 dw=0 for 0, 0, 1280, 1040
[GFX1-]: Failed 2 buffer db=0 dw=0 for 0, 0, 1280, 1040
[GFX1-]: Failed 2 buffer db=0 dw=0 for 0, 0, 1280, 1040

The issue has been reproduced on two different Linux systems running 64-bit Firefox 55.0b11 (Developer Edition, Intel+NVIDIA Optimus) and 55.0 (beta channel, AMD RX 480).  The console output snippet is from the latter.

One crash occurred while reproducing this bug: https://crash-stats.mozilla.com/report/index/55e5a100-20f5-4c49-adc2-d2f340170812


Expected results:

The browser should not freeze.  And it definitely should not crash.
Component: Untriaged → Graphics
Product: Firefox → Core
I also found this message repeated every once in a while:

[GFX1-]: ClientLayerManager::BeginTransaction with IPC channel down. GPU process may have died.
Complete terminal output.  The freezing seems to start around the time the "Sandbox: Unexpected EOF [..]" line is output.
I believe the symptoms in this bug are caused by fd exhaustion from WebRTC connections.

I'm observing a growing number of peer connections in about:webrtc.  Simultaneously, /proc/(main-process-pid)/fd/ shows a growing number of fds.  The browser freezes around the time the fd limit for my environment (1024) is hit.
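This observation is easy to check from outside the browser.  A minimal Python sketch (Linux-specific, assuming /proc is available; not part of the original report) that compares a process's open fd count against its soft RLIMIT_NOFILE:

```python
import os
import resource

def open_fd_count(pid: str = "self") -> int:
    # Each entry in /proc/<pid>/fd is one open file descriptor (Linux-only).
    return len(os.listdir(f"/proc/{pid}/fd"))

soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"{open_fd_count()} fds open, soft limit {soft_limit}")
```

Running this periodically with the browser's main-process pid (e.g. open_fd_count("5022")) would show the count climbing toward the limit as peer connections accumulate.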

I don't know what causes so many WebRTC connections to hang.  It might be an application bug or a WebRTC bug, but either way it shouldn't take down the browser.
Attached file A test case
I extracted a small test case to reproduce the issue.  The file has a "bomb" button; when you press it, 600*2 RTCPeerConnections (enough to go over the default fd limit of 1024 on Fedora) are created in a loop.  This usually freezes or crashes my browser within a few seconds of clicking.
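The test case itself needs a browser, but the underlying mechanism can be demonstrated outside one.  A hedged Python sketch (not the attachment's code) that lowers the soft fd limit and then opens UDP sockets in a loop, each standing in for the socket an RTCPeerConnection would allocate, until the limit is hit:

```python
import resource
import socket

# Artificially lower the soft fd limit so exhaustion happens quickly.
# (64 is an arbitrary demo value, far below the usual default of 1024.)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

sockets = []
hit_limit = False
try:
    for _ in range(128):
        # Each RTCPeerConnection needs at least one UDP socket like this one.
        sockets.append(socket.socket(socket.AF_INET, socket.SOCK_DGRAM))
except OSError:
    # EMFILE ("Too many open files") -- the point at which the browser froze.
    hit_limit = True
finally:
    for s in sockets:
        s.close()
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

print(f"fd limit hit after {len(sockets)} sockets: {hit_limit}")
```

The browser has no such try/except around its fd consumers, which is why the failure shows up as frozen graphics rather than a clean error.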
Byron, Nils, you might want to have a look at this one.
Rank: 15
Component: Graphics → WebRTC: Networking
Priority: -- → P1
Flags: needinfo?(drno)
Flags: needinfo?(docfaraday)
Reports from IRC say that 52.2.0 (Debian) and 53.0.2 (64-bit, Fedora) are also affected.
Sounds strongly like an application bug; resource exhaustion is hard to stop -- relax the limit on fd's, and it will likely hit another limit (or chew memory and/or CPU until limits don't matter).

600+ peerconnections is almost certainly an application bug.  Each peerconnection needs UDP socket(s), which means fd's - there's little way to avoid it.  "dead" peerconnections shouldn't be using FDs, and likely aren't.
Yeah, you could probably exhaust the system FD limit by setting up > 1K websockets, or XHR, or whatever.
Flags: needinfo?(docfaraday)
(In reply to Randell Jesup [:jesup] from comment #7)
> Sounds strongly like an application bug; resource exhaustion is hard to stop
> -- relax the limit on fd's, and it will likely hit another limit (or chew
> memory and/or CPU until limits don't matter).
> 
> 600+ peerconnections is almost certainly an application bug.  Each
> peerconnection needs UDP socket(s), which means fd's - there's little way to
> avoid it.  "dead" peerconnections shouldn't be using FDs, and likely aren't.

What privsep daemons do is allocate, up front, all the fds that are critical to the daemon's operation (this includes all IPC channels).  The rest are then used for serving incoming requests.  A few may be reserved for transient use.

If they run out of fds, they stop accepting new connections until fds become available again.  Existing connections continue to be served without a hiccup, and the daemon won't crash.
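That model can be sketched in a few lines.  A hypothetical Python guard (names and the reserve size are mine, not Firefox or any daemon's actual code) that keeps a reserve of fds and refuses new connections when headroom runs out:

```python
import os
import resource

FD_RESERVE = 16  # fds kept free for critical and transient use

def fd_headroom() -> int:
    # Soft limit minus currently open fds (Linux-specific /proc lookup).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft - len(os.listdir("/proc/self/fd"))

def can_accept_new_connection() -> bool:
    # Refuse new work when headroom drops below the reserve; existing
    # connections keep their fds and continue to be served.
    return fd_headroom() > FD_RESERVE
```

An accept loop would simply skip (or defer) accept() while this returns False, degrading gracefully instead of crashing.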

Is such a model infeasible for Firefox?  I see that it currently uses lots of transient fds (shm for passing raster data?  Bug 1245241), which would explain the graphics freeze I observed.  I don't know if there's another user of fds that would explain the crashes.  It might be the same graphics code dying, because syncing with the compositor after 256 handles won't help if you don't have that many free handles to begin with.  That would mean bug 1245241 is somewhat mitigated but not solved for good.

So if it's not feasible to open a limited number of these handles up front and keep them for the rest of the session, then perhaps the number of shms between syncs could be adjusted dynamically based on fd pressure, with a fixed number of fds reserved for this use to give a reasonable lower bound.

Then if an application runs out of fds creating rtc connections or whatever, these specific calls could fail and throw an error.

Some thoughts:

0) The fds in this bug are all opened in the main process.  I don't suppose moving them into content processes is acceptable?

1) It is easy to blow past 200 fds in "normal use" with just a handful of tabs that aren't misbehaving like orbit.chat or the test case.

2) I don't suppose there is a way for an application to know how many connections it can create before crashing the browser.

3) Perhaps typical fd use with a fair number of tabs plus a few applications that all create only a "reasonable" number of rtc connections could blow past the fd limit.  Would it still be an application bug?  User error?

4) Most of the time, if firefox crashes due to fd exhaustion, it fails to create a crash report (maybe bug 1249995).  Does that mean we have no idea how many people might be affected by such crashes?  When a report is created, do we see the fd count anywhere, or is there something else to give away the fact that it's a crash due to fd exhaustion?
(In reply to Henri Kemppainen from comment #9)
> (In reply to Randell Jesup [:jesup] from comment #7)
> > Sounds strongly like an application bug; resource exhaustion is hard to stop
> > -- relax the limit on fd's, and it will likely hit another limit (or chew
> > memory and/or CPU until limits don't matter).
> > 
> > 600+ peerconnections is almost certainly an application bug.  Each
> > peerconnection needs UDP socket(s), which means fd's - there's little way to
> > avoid it.  "dead" peerconnections shouldn't be using FDs, and likely aren't.
> 
> What privsep daemons do is they allocate all the fds that are critical to
> the daemon's operation up-front (this includes all IPC channels).  Then the
> rest are used for serving incoming requests.  A few may be reserved for
> transient use.
> 
> If they run out of fds, they stop accepting new connections until it's
> possible again.  Existing connections continue to be served without a hiccup
> and the daemon won't crash.

Ok... for a daemon (service) that makes sense, mostly; there are ways to avoid running into the limit, but they add considerable complexity to the application (mostly by using multiple processes).  Or you can raise the fd limit (with root privileges).

> Is such a model unfeasible for firefox? 

It's probably infeasible in general.  Firefox is not a server; it's a browser meant for interactive use.  If you're trying to be a BitTorrent application in Firefox (a la WebTorrent), you may need to limit the number of connections.  This is a fact of life on Linux, though why the limit remains is a good question -- but not for here.

> I see that it currently uses lots
> of transient fds (shm for passing raster data?  Bug 1245241).  This would
> explain the graphics freeze I observed.

There's a LOT of transient use of fds -- avoiding fd churn is not a design goal.  Any network load will use one or, more likely, a bunch of fds (one per server it needs to contact).  Most profile data probably uses persistent fds, but there may be churn there too.  Anything that requires shm needs fds (which is a pain); for OpenH264 (which loads into GMP in a separate process) or EME (similar), shms need to be passed back and forth between the GMP process(es) and the content processes.  OpenH264 is especially bad since it uses (still, I think) 3 shm allocations per frame; to keep the number of shms and the allocated space under control, it caches shms for reuse.  WebSockets use fds.  Compositing uses fds (especially for video, I think).  Lots of things do.

>  I don't know if there's another
> user of fds that would explain the crashes.  Might be the same graphics code
> dying because syncing with compositor after 256 handles won't help if you
> don't have that many free handles to begin with.  That would mean 1245241 is
> somewhat mitigated but not solved for good.
> 
> So if it's not feasible to open up a limited number of these handles
> up-front and live with them for the rest of the day, then perhaps the number
> of shms between syncs could be adjusted dynamically based on fd pressure.  A
> fixed number of fds is reserved for this use to give a reasonable lower
> bound.

Generally we don't try to cache shms much; OpenH264 and EME do.  (Also, if the GMP process needs a new shm and doesn't have one, it *can't* allocate one itself -- the sandbox is locked down.  It has to send an IPC message asking the non-GMP content process to allocate an shm and pass a handle back to the GMP process.)

> 
> Then if an application runs out of fds creating rtc connections or whatever,
> these specific calls could fail and throw an error.
> 
> Some thoughts:
> 
> 0) The fds in this bug all open in the main process.  I don't suppose moving
> them into content processes is acceptable?

Content processes aren't allowed to open fds (generally; they may be allowed to open them in limited directories).  We may move the network socket code (and associated fds) out of the main process, which would help somewhat.

> 1) It is easy to blow past 200 fds in "normal use" with just a handful of
> tabs that aren't misbehaving like orbit.chat or the test case.

Yup.  XHRs, long-polls, etc. use an fd for their lifetime, I believe.

> 2) I don't suppose there is a way for an application to know how many
> connections it can create before crashing the browser.

No, and that number isn't firm; it's influenced a LOT by the dynamics of the other pages that are loaded.

> 3) Perhaps typical fd use with a fair number of tabs plus a few applications
> that all create only a "reasonable" number of rtc connections could blow
> past the fd limit.  Would it still be an application bug?  User error?
> 
> 4) Most of the time, if firefox crashes due to fd exhaustion, it fails to
> create a crash report (maybe bug 1249995).  Does that mean we have no idea
> how many people might be affected by such crashes?  When a report is
> created, do we see the fd count anywhere, or is there something else to give
> away the fact that it's a crash due to fd exhaustion?

One problem may be that it's hard to save/send a crash report if you have no fds...  Perhaps we could keep a crash-report file open from startup, so we can just write to it.

That said, having the crash reporter record the number of open fds would be useful, in addition to finding a way to safely send a crash stack report.
It appears that orbit.chat uses the simple-peer JS library in a really bad way.  It looks like it's opening several new PeerConnections per second, generating offers, and then closing the connections after some time.

As Byron and Randell indicated, I don't think there is much we can do about bad code using up all the resources.  One thing I noticed is that the page closes the PeerConnections after some time, and if I'm not mistaken we don't free the FDs from ICE at that point, but only when the PeerConnection gets garbage collected.

I think we could look into freeing the FDs from ICE once the PeerConnection gets closed, since it is not allowed to be reused from that point on.  But that would only buy a little more time until the resource limit is hit.
Flags: needinfo?(drno)
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2
(In reply to Henri Kemppainen from comment #0)
> Sandbox: seccomp sandbox violation: pid 5022, tid 5050, syscall 49, args 16
> 140223767856976 12 8 0 3881477162.
> Sandbox: seccomp sandbox violation: pid 5022, tid 5050, syscall 49, args 16
> 140223767856976 12 2048 0 3881477162.
> Sandbox: seccomp sandbox violation: pid 5022, tid 5050, syscall 49, args 16
> 140223767856976 12 2097152 0 3881477162.
> Sandbox: seccomp sandbox violation: pid 5022, tid 5050, syscall 49, args 16
> 140223767856976 12 524288 0 3881477162.

Syscall 49 on x86_64 is bind().  It would be interesting to see what happens with MOZ_SANDBOX_CRASH_ON_ERROR=1 set in the environment.


(In reply to Randell Jesup [:jesup] from comment #10)
> There's a LOT of transient use of fd's - avoiding fd churn is not a design
> goal.

Adding to the laundry list: the sandbox file broker client creates a socketpair for each request, to carry the response, because the alternative is trying to demultiplex responses in code that needs to be thread safe, async signal safe, and reentrant.
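The per-request socketpair pattern looks roughly like this (a simplified Python sketch of the idea, not the file broker's actual protocol):

```python
import socket

# One socketpair per broker request: the reply arrives on a private
# channel, so the client needs no demultiplexing logic -- important in
# code that must be thread safe, async signal safe, and reentrant.
client_end, broker_end = socket.socketpair()

# The broker (normally in another process) writes its reply to its end...
broker_end.sendall(b"granted")

# ...and the client reads from its own end, then both fds are discarded.
reply = client_end.recv(64)
client_end.close()
broker_end.close()
print(reply)
```

Each request therefore costs two fds for its duration -- cheap in isolation, but another contributor to churn when fds are already scarce.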

> Content process aren't allowed to open fd's (generally, they may be allowed
> to open them to limited directories).

To nitpick this a little: it's the resource behind the fd that's important here.  Creating private resources, like with socketpair() or pipe() or memfd_create(), is usually fine.  Accessing external resources, like the filesystem and the network, is not.  We haven't completely cut off the network yet (bug 1362220 is the remaining problem, but see also bug 1129492), but we plan to.  Handing it actual Internet-domain sockets is dangerous, and in particular a UDP socket would let it sendto() arbitrary data to any port on any host anywhere.

I am totally sure that this is due to a bug; it has happened to me hundreds of times, and a friend helped me finally figure out what it was.  When I visited the page https://www.webcampornoenvivo.es/chat-porno-espanol/, the network tried to process the request but seemed to load the page again and again without any result, which blanked my screen and left the whole program unresponsive.  It was a loop that tried to load the page until an error code appeared and forced me to close Firefox.  Then we installed an app to block advertising and malware, and the page stopped failing; I never had that problem again, and in fact nothing similar has ever happened to me on other pages.

(In reply to Liam from comment #14)

> I am totally sure that this is due to a bug, it has happened to me hundreds of times [...]

I also encountered the same situation when accessing this website: https://officespace.vn/office/doji-tower/
