Open Bug 1525623 Opened 5 years ago Updated 11 months ago

Firefox crashes during long-run automated test of a web application

Categories

(Core :: JavaScript: GC, defect, P5)

64 Branch
defect

Tracking

Status: UNCONFIRMED

People

(Reporter: tomasz.1.kazimierczak, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: memory-leak, parity-chrome, Whiteboard: [MemShrink:P3])

Attachments

(5 files)

Attached file firefox_issue.zip

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/63.0.3239.84 Chrome/63.0.3239.84 Safari/537.36

Steps to reproduce:

Open an application which relies on a library that performs intensive DOM manipulations (React + react-router based applications fit perfectly). Run an automated test which navigates around the page quickly. Example: open reddit.com and cycle between the "Popular", "All" and "Original Content" categories using the buttons at the top of the page every 2 seconds in a loop, without waiting for the content to fully load. Memory usage grows rapidly and the memory is released painfully slowly, even if the browser is then left idle for some time afterwards. Eventually the browser at the very least becomes completely unresponsive and, especially on a weaker machine, crashes.
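For reference, here is a minimal sketch of the kind of navigation loop described above, written in Python with Selenium and geckodriver. The URLs, the 2-second interval and the loop structure are illustrative assumptions based on the description, not the actual test code:

    # Minimal sketch of the reproduction loop (Python + Selenium/geckodriver).
    # URLs and timing are assumptions for illustration only.
    import time
    from selenium import webdriver

    driver = webdriver.Firefox()
    urls = [
        "https://www.reddit.com/r/popular/",
        "https://www.reddit.com/r/all/",
        "https://www.reddit.com/original/",
    ]
    try:
        while True:
            for url in urls:
                # Navigate immediately; React content keeps loading via XHR,
                # so we deliberately don't wait for it to finish.
                driver.get(url)
                time.sleep(2)  # ~2 seconds between navigations
    finally:
        driver.quit()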

Actual results:

My team is performing a stability test of our React-based web application (I used reddit as an example here; I can't share the original application, which, by the way, isn't as "heavy" as reddit). Several Selenium-based test suites are run against it in a loop, with the expectation that the UI remains stable over many hours of usage. A typical test case starts by opening the application's main page and then navigating to another part of the application; in that case we typically don't validate the main page content or wait until all backend calls have completed, but rather perform a simplistic check that e.g. the main containers have been loaded. The other tests are typical frontend tests which wait for certain page elements to appear. In other words, navigation is rather fast. The tests were executed on two machines:

- a laptop running Ubuntu 17.04 with approximately 25 GB of virtual memory (14 GB physical RAM + 11 GB swap)
- a Red Hat VM with ~11.5 GB available (7.4 GB physical + 4 GB swap)

In both cases the Firefox process which handles the webpage eats up all available memory within several hours (~2 h on the RHEL box, ~7 h on the Ubuntu box).

In order to eliminate the possibility that the application itself suffers from a memory leak, the tests were repeated on Google Chrome. The Chrome process handling the application was stable and its memory usage didn't exceed 2.5 GB (and was not growing over time). It's also likely not a geckodriver issue: the tests (at least the basic ones, which didn't require Selenium) were repeated using an independent script recorder (Actiona) with the same result, i.e. no issues with Chrome, memory exhausted with Firefox.

During the investigation several configuration settings were applied:

browser.cache.memory.enable: False
browser.cache.memory.capacity: 0
javascript.options.asmjs: False
javascript.options.baselinejit: False
javascript.options.compact_on_user_inactive_delay: 10000
javascript.options.ion: False
javascript.options.mem.high_water_mark: 32
javascript.options.mem.max: 50000
javascript.options.wasm_baselinejit: False
javascript.options.wasm_ionjit: False
javascript.options.mem.gc_incremental_slice_ms: 50
javascript.options.mem.gc_high_frequency_high_limit_mb: 15
javascript.options.mem.gc_high_frequency_heap_growth_max: 15
javascript.options.mem.gc_allocation_threshold_mb: 15

I don't know which of them is actually relevant (some of them aren't too well documented...), but they seem to mitigate the issue somewhat. On the Ubuntu box the browser at least no longer eats up the whole memory (although it still hits the 16 GB level) and seems to survive through the night (test results showed it was occasionally unresponsive). They don't help on the weaker RHEL box; the only improvement is that the browser survives for longer (~3 h) after becoming unresponsive, instead of crashing quickly.
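For context, here is a minimal sketch of how the same prefs could be applied when launching Firefox for a test run through Selenium/geckodriver, instead of editing about:config by hand. The pref names and values are copied from the list above; the launch code itself is an illustrative assumption, not the setup actually used:

    # Sketch: start Firefox with the prefs listed above pre-set on the profile
    # (Python + Selenium). The pref set is copied from the bug description;
    # the wiring is illustrative only.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    prefs = {
        "browser.cache.memory.enable": False,
        "browser.cache.memory.capacity": 0,
        "javascript.options.asmjs": False,
        "javascript.options.baselinejit": False,
        "javascript.options.compact_on_user_inactive_delay": 10000,
        "javascript.options.ion": False,
        "javascript.options.mem.high_water_mark": 32,
        "javascript.options.mem.max": 50000,
        "javascript.options.wasm_baselinejit": False,
        "javascript.options.wasm_ionjit": False,
        "javascript.options.mem.gc_incremental_slice_ms": 50,
        "javascript.options.mem.gc_high_frequency_high_limit_mb": 15,
        "javascript.options.mem.gc_high_frequency_heap_growth_max": 15,
        "javascript.options.mem.gc_allocation_threshold_mb": 15,
    }

    options = Options()
    for name, value in prefs.items():
        options.set_preference(name, value)  # applied to the fresh profile geckodriver creates

    driver = webdriver.Firefox(options=options)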

Remarks:
- I didn't manage to reproduce the issue on Firefox 57 using the "stronger" Ubuntu machine (the behavior, and memory usage, seemed similar to what was observed on FF 64 and 60 after the tweaks were applied); not tested on RHEL.
- Periodically running a test which opens about:memory and triggers explicit CC/GC allows the browser to survive over 12 h, but I consider it a rather ugly hack (a sketch follows after these remarks).
- The problem doesn't seem to affect normal human usage of the browser (assuming the user is not working continuously and cannot perform actions as quickly as automated test software), however it is certainly an issue for long-running automated tests.
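Regarding the about:memory workaround mentioned in the remarks above, here is a rough sketch of what that hack can look like (Python + Selenium). The "GC" and "CC" button labels match how about:memory renders them in recent Firefox builds; the interval and the wiring are assumptions, not the exact test code:

    # Sketch of the "ugly hack": periodically load about:memory and click its
    # GC/CC buttons to force collection. Button labels and the 10-minute
    # interval are assumptions for illustration.
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def force_gc_cc(driver: webdriver.Firefox) -> None:
        driver.get("about:memory")
        driver.find_element(By.XPATH, "//button[normalize-space()='GC']").click()
        driver.find_element(By.XPATH, "//button[normalize-space()='CC']").click()

    # Usage: run alongside the long test, reusing its driver session, e.g.
    #   while test_is_running:
    #       time.sleep(600)
    #       force_gc_cc(driver)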

Attached are sample results of the top command showing the resource usage of the process which handled the webpage: Firefox 57 (results would be very similar on FF 60 or 64 with the tweaks described above applied), FF 64 on the "weak" RHEL machine (ending with a crash), FF 64 on the "strong" Ubuntu box (ending with a crash), and the same 12 h test on Chrome for comparison.

Expected results:

The browser should be able to release memory faster, given that there is no memory leak in the application it is handling.

This report is difficult to fix without a testcase.
Can you attach a memory report from about:memory generated before the browser crashes?

Flags: needinfo?(tomasz.1.kazimierczak)

Attached several memory reports. The test was executed on my laptop with the application deployed locally (thus the application became unresponsive faster). Reports were taken when:
- the test started (memory-report-start)
- the Firefox process handling the application under test reached 4 GB of memory usage (interestingly, the offending process 6444 seems to be gone from the report; because I ran the test locally, at this point the browser tab holding the application was already somewhat unresponsive)
- all interaction with the application had stopped (memory-report-test-stopped-5gb, same issue as above)
- some minutes after interaction stopped (memory-report-test-stopped-still-growing-over6gb, same issue as above; memory consumption was still growing)
Data for the problematic process is missing not only from the memory reports but also from the GC/CC logs: logs for that process were stored as incomplete* and are empty (I tried to take the log twice).

This is a difficult situation because there is no way for me to test this, and the memory report doesn't show where the memory is used (the report for that pid is missing).

I will move this to JavaScript: GC as a start, but I doubt that this bug is actionable.

Component: Untriaged → JavaScript: GC
Product: Firefox → Core

Thanks for the bug report.

  1. You used reddit as an example. Can the problem be reproduced using reddit or some other site, maybe even a mock-up that doesn't give away too many details about your project? (Please reproduce the bug using one of these.)

  2. Regarding the various about:config values you set: what was the purpose of these, to work around the problem?

Thanks.

Flags: needinfo?(tomasz.1.kazimierczak)
Whiteboard: [MemShrink]

Blocking GCScheduling, although I'm thinking we might need a separate meta bug for examples of FF holding onto memory for too long.

It would be really interesting to know whether it was the JIT prefs or the GC prefs that improved the situation.

Blocks: GCScheduling
Priority: -- → P5

That top output is wild:

                        VSS     RSS
7788 user      20   0 25,575g 0,013t  84236 R  87,5 88,9 546:26.56 Web Content

Apparently we've seen this with web workers before. tomasz, are you using web workers? Another data point that might be helpful is the output from about:performance.

Blocks: 1533449
No longer blocks: GCScheduling

Thanks for all the replies, and sorry for not responding for a while; due to a shift in my work priorities I didn't have time to look into the case. Regarding the questions asked:

[:pbone] question no. 2: Yes, the config values were set solely as an attempt to work around the problem. (As for question 1, I'll do my best to upload something meaningful once I find some time to repeat the test.)

[:erahm] No, my app doesn't use web workers.

Whiteboard: [MemShrink] → [MemShrink:P3]
Severity: normal → S3
Flags: needinfo?(tomasz.1.kazimierczak)
