Closed Bug 1301863 Opened 8 years ago Closed 7 years ago

Crash caused by memory leak when using WebSocket in Worker

Categories

(Core :: DOM: Core & HTML, defect, P3)

Version: 48 Branch
Type: defect

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla57
Tracking Status
firefox-esr52 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- fixed

People

(Reporter: browser_defects, Assigned: jonco, NeedInfo)

References

Details

(Keywords: crash, Whiteboard: [MemShrink:P2])

Attachments

(3 files, 4 obsolete files)

Attached file ff_worker.zip
User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0
Build ID: 20160823121617

Steps to reproduce:

1. Download and install nodejs (https://nodejs.org/en/)
2. Download and unzip the attached prototype
3. Open shell and navigate to the folder extracted in point 2
4. Start the server by typing: node server.js
5. Open Firefox and navigate to http://localhost:8888
6. Observe the Firefox memory consumption in the task manager.
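
For context, a rough sketch of the kind of server/worker pair the steps above describe is shown below. This is a hypothetical reconstruction, not the contents of ff_worker.zip: the 'ws' npm dependency, the file layout, the payload size, and the worker code are assumptions; the 20 ms send interval matches what is described in the comments below.

// server.js -- hypothetical reconstruction, not the attached prototype.
// Assumes the 'ws' npm package (npm install ws); serves a trivial page on
// http://localhost:8888 that starts a Worker which opens a WebSocket.
const http = require('http');
const WebSocket = require('ws');

const page = "<!doctype html><script>new Worker('worker.js');</script>";
const workerJs =
  "const ws = new WebSocket('ws://localhost:8888');" +
  "ws.onmessage = function (e) { /* each e.data arrives as a Blob */ };";

const server = http.createServer(function (req, res) {
  if (req.url === '/worker.js') {
    res.setHeader('Content-Type', 'application/javascript');
    res.end(workerJs);
  } else {
    res.setHeader('Content-Type', 'text/html');
    res.end(page);
  }
});

const wss = new WebSocket.Server({ server: server });
wss.on('connection', function (socket) {
  // Push a payload every 20 ms with no flow control.
  const payload = Buffer.alloc(64 * 1024);  // 64 KB of zeroes; size is arbitrary
  const timer = setInterval(function () { socket.send(payload); }, 20);
  socket.on('close', function () { clearInterval(timer); });
});

server.listen(8888);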

Link to crash report: https://crash-stats.mozilla.com/report/index/c94acc22-8f80-4a7d-9356-8fad12160909


Actual results:

After running for 5-10 minutes, the browser either crashes because it reaches the maximum memory allowed for the process, or it terminates the worker, which shows up in the server console log as a closed connection.


Expected results:

Memory consumption should be stable. This is only seen in Firefox.
Attached file memory-report.json.gz
about:memory report
Severity: normal → critical
Component: Untriaged → DOM: Service Workers
Keywords: crash
Product: Firefox → Core
Whiteboard: [MemShrink]
I think the issue here is that the server sends a message every 20ms without regard to whether the client can keep up or not.  We probably don't provide any meaningful back-pressure here and just let it fill up the event queue.

We should perhaps track the number of outstanding web socket messages remaining to fire and if we exceed some threshold stop reading from the incoming socket.  Once the messages are consumed we can then begin reading the socket again.
Status: UNCONFIRMED → NEW
Component: DOM: Service Workers → Networking: WebSockets
Ever confirmed: true
Of course, if we stop reading from the socket, the TCP socket will then apply back-pressure to the server once the TCP window is filled. Your server.js setInterval() will then be filling some server-side buffer and will eventually OOM your server instead.

You should probably build some kind of throttling or flow-control mechanism into your client/server app.
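
For illustration, one application-level flow-control scheme would be to have the worker acknowledge each message only after it has actually processed it, and to have the server send the next payload only once that acknowledgement arrives, so unprocessed data can never pile up in the client's event queue. This is a sketch of the idea only; the 'ack' message, the 'ws' package, the port, and the payload size are all assumptions, not part of the attached prototype.

// Hypothetical ack-based flow control (sketch only; not what ff_worker.zip does).
const WebSocket = require('ws');                   // assumes the 'ws' npm package
const wss = new WebSocket.Server({ port: 8889 });  // port is arbitrary

wss.on('connection', function (socket) {
  const payload = Buffer.alloc(64 * 1024);         // arbitrary payload size
  const sendNext = function () { socket.send(payload); };
  // Only send another payload once the client reports it has processed the last one.
  socket.on('message', function (msg) {
    if (msg.toString() === 'ack') {
      sendNext();
    }
  });
  sendNext();  // kick off the exchange
});

// Worker side (browser), for reference:
//   const ws = new WebSocket('ws://localhost:8889');
//   ws.onmessage = function (e) {
//     // consume the Blob synchronously (FileReaderSync is worker-only) ...
//     new FileReaderSync().readAsArrayBuffer(e.data);
//     ws.send('ack');  // ... then ask for the next message
//   };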
It does sound like we want to do this in order to avoid OOM.
Unsure whether this should be done in WebSocketChannel or in dom/base/WebSocket.cpp.
Flags: needinfo?(michal.novotny)
Flags: needinfo?(amarchesini)
It doesn't matter if we increase the interval at which the server sends data to the Worker; it still leaks. It is slower, but the browser still does not release memory. When I set the interval to 1000 ms it still leaks about 100 MB per 10 minutes. The sample was kept as simple as possible. In the real application we consume the event and then set it to null, and it still leaks memory. If you can think of any workaround until the fix is made, that would be very helpful.
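
To make that concrete, the worker-side pattern being described, reading the payload and then dropping the reference, would look roughly like the following (hypothetical sketch; the real application's code is not attached):

// worker.js -- hypothetical sketch of "consume the event and then set it to null".
const ws = new WebSocket('ws://localhost:8888');

ws.onmessage = function (e) {
  // Read the Blob payload synchronously (FileReaderSync is available in workers).
  let data = new FileReaderSync().readAsArrayBuffer(e.data);
  // ... use data ...
  data = null;  // drop the only reference; the memory should now be collectable
};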
It seems a dup of bug 1280916. I'll investigate it today.
Flags: needinfo?(amarchesini)
(In reply to Valentin Gosu [:valentin] from comment #4)
> It does sound like we want to do this in order to avoid OOM.
> Unsure if this should be done in WebSocketChannel or in
> dom/base/WebSocket.cpp

In WebSocketChannel we can't know whether nsIWebSocketListener receives or queues the incoming frames. All we could do is implement suspend/resume methods that the listener would call if it were not able to consume the frames quickly enough. Anyway, as per comment 6, it seems the problem is somewhere else.
Flags: needinfo?(michal.novotny)
(In reply to Andrea Marchesini [:baku] from comment #6)
> It seems a dup of bug 1280916. I'll investigate it today.
Thanks!
Assignee: nobody → amarchesini
Whiteboard: [MemShrink] → [MemShrink][necko-active]
The issue here is exactly what I have seen in bug 1280916: WebSocket creates some JS objects (blobs here, but a similar issue happens with strings) but not all of them are GCed. I'm sure it is not the CC, because I interrupted the dispatching of WebSocket events here:

https://dxr.mozilla.org/mozilla-central/source/dom/base/WebSocket.cpp#2000

Everything is correctly GCed when I run a GC from about:memory. The |JS::GCForReason(aCx, GC_SHRINK, JS::gcreason::DOM_WORKER)| call is correctly executed by the worker thread.

Jon, any feedback about this?
Flags: needinfo?(jcoppeard)
Don't we just GC less often on worker threads?  Is this just a consequence of that design decision?  Can we trigger a more aggressive GC/CC at some point?
The mPeriodicGCTimerTarget runs often; this executes JS_MaybeGC(aCx).
Less often, we execute JS::GCForReason(aCx, GC_SHRINK, JS::gcreason::DOM_WORKER), but it still runs.
Yet we still leak. I don't think the issue is when we trigger GC/CC.
I want to test comment 5.
I wonder if something on the main thread is keeping it alive until the about:memory triggers a cycle collection there, etc.
Jon is on PTO; moving NI to sfink.
Flags: needinfo?(sphink)
Flags: needinfo?(jcoppeard)
I confirm that we correctly execute mozilla::dom::workers::WorkerPrivate::GarbageCollectInternal.
Quite often we end up calling JS_MaybeGC(aCx), and every few seconds we call JS::GCForReason(aCx, GC_SHRINK, JS::gcreason::DOM_WORKER).
Whiteboard: [MemShrink][necko-active] → [MemShrink:P2][necko-active]
NI me if there is something else I can help with.
Assignee: amarchesini → nobody
Component: Networking: WebSockets → DOM
Whiteboard: [MemShrink:P2][necko-active] → [MemShrink:P2]
Priority: -- → P3
Jon, can you give feedback on comment 9?
Flags: needinfo?(sphink) → needinfo?(jcoppeard)
I've tried reproducing this and cannot... it's sitting here running and I'm seeing lots of received packets but no build up of memory.

If I had to guess, I'd say the data is being kept alive by something outside JS, but I wouldn't be surprised if there were also a problem with the way we schedule GC for workers.

baku, is there anything else I need to do to repro this other than the original instructions?  Do you know if it's still a problem?
Flags: needinfo?(jcoppeard) → needinfo?(amarchesini)
Never mind, I can reproduce this with e10s disabled.

Memory usage grows to > 7 GB as reported by activity monitor.  Multiple worker runtimes are created but not all have GCs triggered on them.

about:memory shows most usage is memory-file-data:

11,455.77 MB (100.0%) -- explicit
├──11,027.59 MB (96.26%) -- dom
│  ├──11,027.52 MB (96.26%) -- memory-file-data
│  │  ├──11,027.50 MB (96.26%) ── string [176440]
│  │  └───────0.02 MB (00.00%) ── large/file(length=5430, sha1=49263695f6b0cdd72f45cf1b775e660fdc36c606) [2]
│  └───────0.08 MB (00.00%) ++ (2 tiny)
├─────258.06 MB (02.25%) ++ (27 tiny)
└─────170.11 MB (01.48%) ── heap-unclassified

Clicking "Minimize memory usage" in about:memory frees the memory down to a normal size again.
Flags: needinfo?(amarchesini)
Attached patch account-blob-reflectors (obsolete) — Splinter Review
The problem here is that a JS reflector object can keep a large DOM object alive but the JS GC by default doesn't know about this extra memory and so doesn't trigger a GC.

The patch adds a function which can be overloaded to provide this information and implements it for blobs.  This causes GCs to be triggered for this test case and fixes the memory leak.

We should probably do something like this for other large DOM objects as well.
Assignee: nobody → jcoppeard
Attachment #8884867 - Flags: review?(peterv)
Note that we are calling JS_updateMallocCounter in various places already. How does this interact with some of the existing callers related to blobs (for example http://searchfox.org/mozilla-central/rev/01d27fdd3946f7210da91b18fcccca01d7324fe2/dom/canvas/CanvasRenderingContextHelper.cpp#50)? Will we double-count?
Flags: needinfo?(jcoppeard)
Comment on attachment 8884867 [details] [diff] [review]
account-blob-reflectors

Marking r- while waiting for a response to the question in comment 20.
Attachment #8884867 - Flags: review?(peterv) → review-
Sorry for the long delay. I checked the other uses of JS_updateMallocCounter; the previous patch would double-count when a CanvasRenderingContextHelper was converted to a blob, so I removed that use.
Attachment #8884867 - Attachment is obsolete: true
Flags: needinfo?(jcoppeard)
Attachment #8906033 - Flags: review?(peterv)
Attached patch bug1301863-blob-reflectors-2 (obsolete) — Splinter Review
And further to the last patch, we should remove the update call from OffscreenCanvas when converting to a blob (since blobs are already handled) and when converting HTMLCanvasElement to a File (since File is a subclass of Blob and is also handled).
Attachment #8906642 - Flags: review?(peterv)
Comment on attachment 8906033 [details] [diff] [review]
bug1301863-account-blob-reflectors v2

Review of attachment 8906033 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/file/Blob.cpp
@@ +286,5 @@
>    mImpl->GetInternalStream(aStream, aRv);
>  }
>  
> +uint64_t
> +BindingJSObjectMallocBytes(Blob* blob)

blob -> aBlob

@@ +287,5 @@
>  }
>  
> +uint64_t
> +BindingJSObjectMallocBytes(Blob* blob)
> +{

MOZ_ASSERT(aBlob)

@@ +288,5 @@
>  
> +uint64_t
> +BindingJSObjectMallocBytes(Blob* blob)
> +{
> +  ErrorResult error;

Use IgnoredErrorResult and remove error.SuppressException().

@@ +289,5 @@
> +uint64_t
> +BindingJSObjectMallocBytes(Blob* blob)
> +{
> +  ErrorResult error;
> +  uint64_t size = blob->GetSize(error);

This is wrong. Blob size is different from the memory allocation. For instance, if this Blob is a real file on disk, GetSize() returns the file size, but we don't allocate that amount of memory.

Some BlobImpl objects have a custom memory report. Maybe we could expand that or implement an AllocationSize() method.

::: dom/file/Blob.h
@@ +146,5 @@
>  private:
>    nsCOMPtr<nsISupports> mParent;
>  };
>  
> +uint64_t BindingJSObjectMallocBytes(Blob* blob);

aBlob here as well.
Add a comment about why we need this function.
Attachment #8906033 - Flags: review?(peterv) → review-
See Also: → 1370836
See Also: → 1280916
Attachment #8906642 - Flags: review?(peterv)
Thanks for the review.  How does this look?  I added a GetAllocationSize() method and implemented it for MemoryBlobImpl, StringBlobImpl and MultipartBlobImpl.
Attachment #8906033 - Attachment is obsolete: true
Attachment #8906642 - Attachment is obsolete: true
Attachment #8906663 - Flags: review?(amarchesini)
Comment on attachment 8906663 [details] [diff] [review]
bug1301863-account-blob-reflectors v3

Review of attachment 8906663 [details] [diff] [review]:
-----------------------------------------------------------------

You should also do something for StreamBlobImpl. I wonder if you can use stringInputStream->SizeOfIncludingThis(MallocSizeOf) for that.

::: dom/file/BaseBlobImpl.h
@@ +96,5 @@
>    }
>  
>    virtual void GetType(nsAString& aType) override;
>  
> +  size_t GetAllocationSize() override

size_t GetAllocationSize() const override

@@ +193,5 @@
>  } // namespace dom
>  } // namespace mozilla
>  
>  #endif // mozilla_dom_BaseBlobImpl_h
> +

remove this extra line

::: dom/file/Blob.cpp
@@ +214,5 @@
>    return blob.forget();
>  }
>  
> +size_t
> +Blob::GetAllocationSize()

const?

::: dom/file/Blob.h
@@ +129,5 @@
>                                 const Optional<int64_t>& aEnd,
>                                 const nsAString& aContentType,
>                                 ErrorResult& aRv);
>  
> +  size_t GetAllocationSize();

size_t GetAllocationSize() const;

::: dom/file/BlobImpl.h
@@ +50,5 @@
>    virtual uint64_t GetSize(ErrorResult& aRv) = 0;
>  
>    virtual void GetType(nsAString& aType) = 0;
>  
> +  virtual size_t GetAllocationSize() = 0;

virtual size_t GetAllocationSize() const = 0;
Attachment #8906663 - Flags: review?(amarchesini) → review+
Updated patch with comments applied.
Attachment #8906698 - Flags: review+
Attachment #8906663 - Attachment is obsolete: true
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6187a252c146
Tell the JS engine how much memory blob reflectors hold alive r=baku
https://hg.mozilla.org/mozilla-central/rev/6187a252c146
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Hello,

I just retested our sample app. Now instead of crashing after 10 minutes, it reaches 90% memory in 20 minutes, stays there for 2 hours and then crashes.

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64;rv:62.0) Gecko/20100101 Firefox/62.0
Build ID: 20180920131237

Same attachment: ff_worker.zip
Same steps to reproduce.

You can see the video recording here: https://youtu.be/cU-1HemJm6M


Do you think it's the same bug as bug 1280916, or is it something else?
Flags: needinfo?(jcoppeard)

I've reproduced this and I think the continuing problems are the same issue as bug 1501238, where we allow a lot of memory to build up before triggering a collection, but exacerbated because it's happening for several workers at the same time.

Component: DOM → DOM: Core & HTML