Closed Bug 1667581 Opened 4 years ago Closed 3 years ago

Downloading a large file on mega.nz sometimes spawns a very large number of threads, causing Firefox to become unresponsive.

Categories

(Core :: Networking, defect, P2)

Firefox 81
defect

Tracking


VERIFIED FIXED
85 Branch
Tracking Status
firefox-esr78 86+ fixed
firefox82 --- wontfix
firefox83 --- wontfix
firefox84 --- wontfix
firefox85 --- verified

People

(Reporter: valyel, Assigned: valentin)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged])

Attachments

(7 files, 1 obsolete file)

Attached image resmon screenshot (obsolete)

User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0

Steps to reproduce:

Downloaded a large file (~2GB) from mega.nz. After mega.nz finishes buffering the data and starts saving it, Firefox begins spawning a large number of threads and becomes unresponsive. This happens somewhat frequently (~10% of the time).

Actual results:

See attached file.

Firefox continues spawning threads and becomes unresponsive. Eventually the spawning stops, and after some time the thread count declines. Takes 2-5 minutes.

Expected results:

Store the file and finish.

Attached image Better image

resmon screenshot

Attachment #9178033 - Attachment is obsolete: true
Flags: needinfo?(valyel)

Hi,

I cannot reproduce on Windows 10 Pro with Firefox release 81.0 (64-bit), beta 82.0b4, or nightly 83.0a1 (2020-09-29) (64-bit). I downloaded a sample file from https://www.thinkbroadband.com/download and saved the 1 GB file (the biggest available; I couldn't find another sample site with 2 GB files). I then made a copy of the file and put both copies in one zip archive to reach 2 GB. I created an account on https://mega.nz/, uploaded the 2 GB file, and then downloaded it, but it finished successfully.

I am setting a component for this in order to get the dev team involved. Please attach your about:support information.

Best,
Clara

Component: Untriaged → File Handling
Attached image download.png

Also, please confirm if this happened both using "standard download" and "download as zip" options.

Best,
Clara

Attached file about-support.json

I believe this happens with a standard download. I download files hosted by someone else, which are usually zipped.

This is not consistent. I have not been able to recreate it since creating the bug report, though have had this happen ~5 times in the past.

Size of the file may not be relevant. I have only seen this occur with larger files though.

Just reproduced this again. File was 1.32GB. Thread count begins to increase once the "saving" is done (the progress bar for the download reaches 100%). This time went up to 14000 threads, before declining. Had a YouTube video playing in a different tab.

This sounds pretty bad...

Does this happen on any site other than mega? Mega does some... strange... things with blobs and download verification.

Can you try to use the devtools profiler or debugger, or using about:memory's "measure" button (which reports JS/DOM workers) to see if these are website worker threads or something? It's hard to know what to do to narrow this down without clear steps to reproduce and/or the identities of the threads...

Flags: needinfo?(valyel)
Attached file profiling data

Here is the profiler output. Using anything else is unfortunately difficult, as the scheduler is losing its mind trying to manage several thousand threads.

I can't say with certainty whether this happens with other sites, but as far as I can remember, it does not.

Let me know if you want more information on this.

Flags: needinfo?(valyel)

(In reply to valyel from comment #8)

Let me know if you want more information on this.

Unfortunately the devtools profiler info does not appear to be readable. This is a tooling issue on the Firefox side, for which I can only apologize.

I'm not convinced our standard profiling tools (and attempting to profile all those thousands of threads...) aren't going to just hang like the debugger did. So I've written up a custom bit of JS that should hopefully tell us a bit more.

  1. open about:config or about:preferences (has to be a "privileged" about page for the below script to work), and open the devtools web console in them

  2. reproduce the issue

  3. run the following script in the web console. It'll create a JS object whose property names are thread names, and whose property values are the number of such threads that are running.

threadMap = await (async function() {
  let { threads } = await ChromeUtils.requestProcInfo();
  let m = {};
  for (let thread of threads) {
    let normalizedName = (thread.name || "").replace(/#\d+$/, "");
    if (!m.hasOwnProperty(normalizedName)) {
      m[normalizedName] = 0;
    }
    m[normalizedName]++;
  }
  return m;
})();
  4. then run (separately, in the same console):
copy(JSON.stringify(threadMap, null, 2))

This should copy a JSON version of that object to the clipboard, which you can paste here.

You can test the script out without step 2 (i.e. reproducing the download issue) to get an idea of what kind of data it gathers when things are "normal".
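For reference, only the `ChromeUtils.requestProcInfo()` call in the script above is privileged; the counting logic itself is plain JavaScript and can be exercised anywhere. Here is a self-contained sketch of that logic; the sample thread list below is invented for illustration:

```javascript
// Count threads by normalized name, the same way the privileged script does.
// In Firefox the real data would come from ChromeUtils.requestProcInfo();
// here `sample` is made-up data standing in for it.
function countThreadsByName(threads) {
  const counts = {};
  for (const thread of threads) {
    // Strip a trailing "#<digits>" suffix (e.g. "StreamTrans #42" -> "StreamTrans ").
    const name = (thread.name || "").replace(/#\d+$/, "");
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

const sample = [
  { name: "StreamTrans #1" },
  { name: "StreamTrans #2" },
  { name: "DNS Resolver #9" },
  { name: "" }, // pre-Windows-10 threads report no name
];
console.log(JSON.stringify(countThreadsByName(sample), null, 2));
```

Note that unnamed threads all collapse into the `""` key, which is exactly what a total-count-only result would look like.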

Flags: needinfo?(valyel)
Summary: Downloading a large file sometimes spawns a very large number of threads, causing Firefox to become unresponsive. → Downloading a large file on mega.nz sometimes spawns a very large number of threads, causing Firefox to become unresponsive.

It doesn't seem like any of the threads have names assigned: every thread name is replaced with the empty string, so you just get the total count of threads. I reproduced to be sure, and the JSON output is { "": 18863 }.

Let me know if I'm missing something.

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to valyel from comment #10)

It doesn't seem like any of the threads have names assigned: every thread name is replaced with the empty string, so you just get the total count of threads. I reproduced to be sure, and the JSON output is { "": 18863 }.

I'm pretty confused by this... I obviously tested the script before sending it to you, though I originally tested on macOS.

If you look at the output of (await ChromeUtils.requestProcInfo()).threads, do all of those threads have an empty string as their name?

Yoric, is that expected on release, or Windows 8, or something? (I just tested with mozregression --launch 81 on Windows 10, and I see thread names there...)

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(dteller)

Versions of Windows prior to 10 do not support thread names (that's a limitation of the OS). So that's not surprising.

Flags: needinfo?(dteller)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #12)

Versions of Windows prior to 10 do not support thread names (that's a limitation of the OS). So that's not surprising.

So what can we do here to get any meaningful debug information if we cannot reproduce? Surely there must be some way to get this information when we're creating tens of thousands of threads? It looks like there's MOZ_LOG information for threads running tasks, but I'm sure there'll be lots of those either way, so it'll be too spammy to work out what's going on, and those logs don't seem to include any thread info anyway. I don't see any logging for thread creation. :-\

Reporter: I don't suppose you have another machine on which you're seeing this? If not, can you reproduce the same thing on a nightly build from https://nightly.mozilla.org/ ? If so, I can try and get you a custom build with instrumentation to get the relevant data... but the first question is whether this reproduces on nightly (which will be using a different profile).

Flags: needinfo?(dteller)

I don't have another machine where I see this. I can give this a try on Win10 and Linux machines, though I assume it will be fine.

Reproduced on a clean nightly build. Same exact behavior. An instrumented build sounds good. I haven't seen the thread count climb past 50k, if that makes your job any easier.

Flags: needinfo?(valyel)

(In reply to :Gijs (he/him) from comment #13)

Surely there must be some way to get this information when we're creating tens of thousands of threads?

As far as I can tell looking through the code, there doesn't even seem to be any MOZ_LOG with the thread name. If you had access to the machine, you could breakpoint on thread creation, but I realize that it's not possible here.

Flags: needinfo?(dteller)
Attached image Thread creation logging

Apologies for the delay here.

I just pushed to our tryserver infrastructure to get you a build. The patch I pushed is straightforward ( https://hg.mozilla.org/try/rev/36a4981b32fc8b68eae983777500ce4f07116a67 ), just in terms of you trusting the resulting build :-)

In about an hour, at https://treeherder.mozilla.org/#/jobs?repo=try&revision=82d88a3506db62ee7b03ac96c8d51cfaae911b93&selectedTaskRun=BucY1GpcQYq-AYHrTXxXIQ.0 , the "B" job that's selected and (as of writing) in progress, should turn green and be finished. Once that happens, at the bottom, in the "Artifacts" tab, there will be a "target.zip" link. This is a zipfile of the build, which you should be able to extract wherever you like and run from there. I'd advise not overwriting your existing Firefox installs with it to avoid issues. Extracting it elsewhere will also mean it'll use its own profile (though the usual rules apply around running multiple builds at once - you probably want to close any remaining Firefox / Nightly instances before opening this new one).

Edit: it was finished more quickly than I thought; direct link to the zipfile: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/BucY1GpcQYq-AYHrTXxXIQ/runs/0/artifacts/public/build/target.zip

This build will log thread creation to the browser console (note: not the same as the regular developer tools console; open it with Ctrl+Shift+J). Hopefully this should get us some more info... You can clear the console before reproducing, I guess. It should also be fine to not have the console open before the problem happens - opening it afterwards should still show you the most recent few hundred logs...

Flags: needinfo?(valyel)
Created thread StreamTrans #56950
Created thread StreamTrans #56951
Created thread StreamTrans #56952
Created thread StreamTrans #56953
Created thread StreamTrans #56954
Created thread StreamTrans #56955
Created thread StreamTrans #56956
Created thread StreamTrans #56957
Created thread StreamTrans #56958
Created thread StreamTrans #56959
Created thread StreamTrans #56960
Created thread StreamTrans #56961
Created thread StreamTrans #56962
Created thread StreamTrans #56963
Created thread StreamTrans #56964
Created thread StreamTrans #56965
Created thread StreamTrans #56966
Created thread StreamTrans #56967
Created thread StreamTrans #56968
Created thread StreamTrans #56969
Created thread StreamTrans #56970
Created thread StreamTrans #56971
Created thread StreamTrans #56972
Created thread StreamTrans #56973
Created thread StreamTrans #56974
Created thread StreamTrans #56975
Created thread StreamTrans #56976
Created thread StreamTrans #56977
Created thread StreamTrans #56978
Created thread StreamTrans #56979
Created thread StreamTrans #56980
Created thread StreamTrans #56981
Created thread StreamTrans #56982
Created thread StreamTrans #56983
Created thread StreamTrans #56984
Created thread StreamTrans #56985
Created thread StreamTrans #56986
Created thread StreamTrans #56987
Created thread StreamTrans #56988
Created thread StreamTrans #56989
Created thread StreamTrans #56990
Created thread StreamTrans #56991
Created thread StreamTrans #56992
Created thread StreamTrans #56993
Created thread StreamTrans #56994
Created thread StreamTrans #56995
Created thread DNS Resolver #9
Created thread DNS Resolver #10
Created thread BgIOThreadPool #4
Created thread StreamTrans #56996
Created thread StreamTrans #56997
Created thread IndexedDB #32
Created thread IndexedDB #33
Created thread DNS Resolver #11
Created thread SSL Cert #21
Created thread StreamTrans #14
Created thread StreamTrans #15
Created thread StreamTrans #56998
Created thread Untrusted Modules
Created thread Jump List
Created thread IndexedDB #34
Created thread StreamTrans #56999
Created thread StreamTrans #57000
Created thread StreamTrans #57001
Created thread StreamTrans #57002
Created thread SSL Cert #22
Created thread StreamTrans #57003
Created thread StreamTrans #57004
Created thread StreamTrans #57005
Created thread StreamTrans #57006
Created thread BackgroundThreadPool #5
Created thread BackgroundThreadPool #6
Created thread Jump List
Created thread StreamTrans #57007
Created thread StreamTrans #57008
Created thread StreamTrans #57009
Created thread StreamTrans #57010
Created thread IndexedDB #35
Created thread StreamTrans #57011
Created thread SSL Cert #23
Created thread Jump List
Created thread Classifier Update

This is what I get; it's the end of the output. I removed about 10k lines before this, all spawning more StreamTrans threads. However, I am not able to catch the beginning of this, as the console buffer is around 50k lines too short. :)

Flags: needinfo?(valyel)

Thanks very much! That at least points at something.

Valentin, any idea how we're ending up creating tens of thousands of stream transport threads, or how we'd figure out why? https://searchfox.org/mozilla-central/rev/c2e3ac11be4837718c2604e58085fbc8252b69dd/netwerk/base/nsStreamTransportService.cpp#251,254-256 would suggest the limit is supposed to be 25...

Florian, I don't suppose we have thread creation markers in the gecko profiler, which would get us stacks (unsure if that's going to be helpful here)?

Component: File Handling → Networking
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(florian)
Product: Firefox → Core

(In reply to :Gijs (back Oct. 12; he/him) from comment #18)

Florian, I don't suppose we have thread creation markers in the gecko profiler, which would get us stacks (unsure if that's going to be helpful here)?

Gerald, any idea of how difficult that would be? Would this be another marker that would need to live in the base profiler? I guess the marker would be in the track of the thread that requested the creation of another thread. I wonder if we should also collect a stack of when a thread is registered with the profiler; maybe that stack could be captured in the gecko profiler? Not sure where we would display it; maybe as a marker at the beginning of the new thread.

Flags: needinfo?(florian) → needinfo?(gsquelart)

Quick thoughts: (Evening here, happy to help tomorrow if you have more questions.)

As a first quick thing, you could add a marker when creating a thread in xpcom, possibly in nsThread::Init.
These markers would appear in the thread-creating thread (i.e., not the new one), but at least they will capture the cause of the thread creation.
If you then run the profiler with "Bypass selections above and record all registered threads" (in about:profiling), hopefully you can find the markers you're interested in.

Maybe you could record that marker into the new thread, possibly at the end of nsThreadManager::NewNamedThread (not sure how to get that from nsIThread), by using something like PROFILER_MARKER_UNTYPED("Thread creation", OTHER, {MarkerThreadId( <get tid from nsIThread> ), MarkerStack::Capture()}).

If needed, the next step would be to profiler_capture_backtrace() when creating a thread, and pass it to the new thread through ThreadInitData and record a marker (with MarkerStack::TakeBacktrace(std::move( <captured stack> ))) in nsThread::ThreadFunc. This should appear at the start of each thread.

No need to involve the Base Profiler, unless we wanted to instrument some raw thread creations outside of xul, which I don't think is needed for "StreamTrans" in this bug.

Good luck!

Flags: needinfo?(gsquelart)
See Also: → 1670737

(In reply to Gerald Squelart [:gerald] (he/him) from comment #20)
Thanks! I opened bug 1670786 to work on these suggestions.

Having realized from looking at bug 1670737 that blame for some of the FreeOMT calls points at bug 1584898, which was itself a fix for issues with huge downloads on mega.nz, I'm pretty sure this is due to that change. The thread count should not balloon like that, though. I'd ask either :baku or :ehsan to take a look, but neither is active in Firefox at the moment, AIUI, so I am not sure what best to do. I think batching the free calls a bit more (doing 16 or 32 or whatever is a sensible number in each task) would help without reintroducing the same problem that bug 1584898 fixed.

Kris, is there currently a "right" way to do this work? I see Bas landed a TaskController, but that seems oriented towards the main thread and so I'm not sure if it's appropriate here.

Flags: needinfo?(kwright)
Regressed by: 1584898
Has Regression Range: --- → yes
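For illustration, the batching idea floated above can be sketched in plain JavaScript (the real code is C++ in nsSegmentedBuffer; all names below are hypothetical): instead of one runnable per freed segment, dispatch one runnable per fixed-size batch.

```javascript
// Hypothetical sketch: dispatch one task per batch of BATCH_SIZE items
// instead of one task per item, so the thread pool sees far fewer runnables.
const BATCH_SIZE = 16;

// `dispatch` stands in for dispatching a runnable to a thread pool;
// `freeItem` stands in for freeing one memory segment.
function dispatchBatched(items, dispatch, freeItem, batchSize = BATCH_SIZE) {
  let runnables = 0;
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    dispatch(() => batch.forEach(freeItem)); // one runnable frees a whole batch
    runnables++;
  }
  return runnables; // ceil(items.length / batchSize) instead of items.length
}
```

With a batch size of 16, freeing 100 segments costs 7 runnables rather than 100, which keeps pressure off the pool's dispatch path.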

(In reply to :Gijs (he/him) from comment #18)

Valentin, any idea how we're ending up creating tens of thousands of stream transport threads, or how we'd figure out why?

It seems you've got the thread issue figured out. Great find. I wonder how often this caused a hard to diagnose perf problem.

Flags: needinfo?(valentin.gosu)

(In reply to :Gijs (he/him) from comment #22)

Having realized from looking at bug 1670737 that blame for some of the FreeOMT calls points at bug 1584898, which was itself a fix for issues with huge downloads on mega.nz, I'm pretty sure this is due to that change. The thread count should not balloon like that, though. I'd ask either :baku or :ehsan to take a look, but neither is active in Firefox at the moment, AIUI, so I am not sure what best to do. I think batching the free calls a bit more (doing 16 or 32 or whatever is a sensible number in each task) would help without reintroducing the same problem that bug 1584898 fixed.

Kris, is there currently a "right" way to do this work? I see Bas landed a TaskController, but that seems oriented towards the main thread and so I'm not sure if it's appropriate here.

I looked at these calls and tried to move them to the background thread pool, but since that's still an nsThreadPool, it has the same strange behavior where it can spawn a whole bunch of threads during large jobs. Until I can refactor nsThreadPool dispatch, I can only think of two ways to really fix this: batch the FreeOMT calls together into larger tasks, as you say, so that they are less likely to create a new thread and we avoid bogging down the event queue; or use a background TaskQueue to force the tasks to execute serially, which would be roughly equivalent to running all of the calls on a single standalone thread. I'm not sure what the perf risks of the latter over the former would be, or how much batching the calls would improve the threading issue.

Flags: needinfo?(kwright)
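The serial alternative described above, and the shape of the fix that eventually landed ("ensure there's only one FreeOMT pending at any time"), can be sketched in plain JavaScript with hypothetical names: queue incoming work locally and keep at most one runnable in flight.

```javascript
// Hypothetical sketch of the "at most one pending runnable" pattern:
// work is queued locally, and a new task is dispatched only when none is
// already pending, so a burst of frees never floods the thread pool.
class SerialFreer {
  constructor(dispatch) {
    this.dispatch = dispatch;   // stands in for a thread-pool Dispatch()
    this.pending = [];          // segments waiting to be freed
    this.taskInFlight = false;
    this.dispatchCount = 0;     // for observation only
  }
  free(segment) {
    this.pending.push(segment);
    if (!this.taskInFlight) {
      this.taskInFlight = true;
      this.dispatchCount++;
      this.dispatch(() => this.drain());
    }
  }
  drain() {
    // Free everything queued so far in one go, then allow a new dispatch.
    const batch = this.pending.splice(0);
    batch.forEach(() => { /* free the segment here */ });
    this.taskInFlight = false;
  }
}
```

However many frees arrive in a burst, the pool only ever sees one runnable at a time, which is roughly equivalent to running all the frees on a single standalone thread.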
Assignee: nobody → valentin.gosu
Severity: -- → S3
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Priority: -- → P2
Whiteboard: [necko-triaged]
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/0bcd9a5ae49f
Use a BackgroundTaskQueue to free up memory segments from nsSegmentedBuffer off main thread r=baku,sg
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 84 Branch
Regressions: 1673405
Regressions: 1675203

Please note that the patch got backed out by Valentin via bug 1675203:
https://hg.mozilla.org/integration/autoland/rev/e0fa99439c48

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 84 Branch → ---
Attachment #9182378 - Attachment description: Bug 1667581 - Use a BackgroundTaskQueue to free up memory segments from nsSegmentedBuffer off main thread r=baku,ehsan! → Bug 1667581 - Batch calls to nsSegmentedBuffer::FreeOMT when dispatching to a background thread r=baku,sg
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/e0c3ec47f2f4
Batch calls to nsSegmentedBuffer::FreeOMT when dispatching to a background thread  r=baku,sg,KrisWright
https://hg.mozilla.org/integration/autoland/rev/f9553be62ad1
Ensure there's only one nsSegmentedBuffer::FreeOMT pending at any time r=sg
Blocks: 1679390
Status: REOPENED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch

Since the statuses are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

Comment on attachment 9188422 [details]
Bug 1667581 - Ensure there's only one nsSegmentedBuffer::FreeOMT pending at any time r=sg

Beta/Release Uplift Approval Request

  • User impact if declined: In certain cases that use nsSegmentedBuffers we may be spawning and destroying a large number of threads causing the computer to lock up.
    This isn't a very common scenario, but when it does happen the user experience sucks.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: see comment 0
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): The previous attempt to fix this got backed out as it caused the memory to be released too late - sometimes causing a spike in memory use.
    The current fix should not have that problem - it is essentially the same as the old code, still using the stream transport thread pool, but avoids dispatching more than one runnable so we don't hit bug 1670737.
    We should make sure that this adequately fixes the problem without causing other perf regressions before uplifting to beta & esr.
  • String changes made/needed:

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: perf regression
  • User impact if declined: same reasoning as for beta.
  • Fix Landed on Version: 85
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): same as for beta
  • String or UUID changes made by this patch:
Attachment #9188422 - Flags: approval-mozilla-esr78?
Attachment #9188422 - Flags: approval-mozilla-beta?
Attachment #9182378 - Flags: approval-mozilla-esr78?
Flags: qe-verify+

I think this should ride the trains given the age of the bug and regression risk.

Attachment #9188422 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
QA Whiteboard: [qa-triaged]

Tried to reproduce this issue on Windows 8.1 using multiple large files on Firefox 81.0, but was unsuccessful. I did receive an error from mega.nz when reaching 22% or 42% (always different) that says: Write error: [Exception... "Out of Memory", and suggests downloading their app or using Chrome, but this also happens with the latest Nightly (2020-12-08) build.

Valyel, could you please try with a build that contains the fix and see if you can still reproduce the initial issue?
https://www.mozilla.org/en-US/firefox/85.0a1/releasenotes/

QA Whiteboard: [qa-triaged]
Flags: qe-verify+ → needinfo?(valyel)

I am now unable to reproduce this on build 85.0a1.

Flags: needinfo?(valyel)

Thank you very much for your verification, closing this issue as Verified.

Status: RESOLVED → VERIFIED

We're going to let this bake a cycle on Release before revisiting for ESR.

Comment on attachment 9182378 [details]
Bug 1667581 - Batch calls to nsSegmentedBuffer::FreeOMT when dispatching to a background thread r=baku,sg

Fixes a possible deadlock in some circumstances. Been on 85 since it was on Nightly with no known regressions. Approved for 78.8esr.

Attachment #9182378 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+
Attachment #9188422 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+

I know it's been a long time, but could I ask you to do one more verification on the 78.8.0esr build?
You can download the .zip build from the link below, no need to install it.
https://archive.mozilla.org/pub/firefox/candidates/78.8.0esr-candidates/build2/win64/en-US/firefox-78.8.0esr.zip

Flags: needinfo?(valyel)

Sorry to say, but the machine that I was having this issue on has been replaced. Unfortunately, I don't have a way to recover it at this point.
As such, I won't be able to verify. Sorry about that!

Flags: needinfo?(valyel)

(In reply to valyel from comment #42)

Sorry to say, but the machine that I was having this issue on has been replaced. Unfortunately, I don't have a way to recover it at this point.
As such, I won't be able to verify. Sorry about that!

No worries, you already helped us plenty with previous verification.

See Also: → 1652898
See Also: → 1503704