Closed Bug 1765299 Opened 5 months ago Closed 5 months ago

Android child process fd limits are too low | Perma [Tier 2] dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | Test timed out. -

Categories

(Toolkit :: Startup and Profile System, defect, P1)

RESOLVED FIXED
101 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox99 --- unaffected
firefox100 --- unaffected
firefox101 - fixed

People

(Reporter: intermittent-bug-filer, Assigned: jld)

References

(Regression)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

Filed by: smolnar [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=374941149&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Iksw5t3yQ8KYd2fQPBTtwA/runs/0/artifacts/public/logs/live_backing.log


INFO -  TEST-PASS | dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | successfullyParsed is true
[task 2022-04-19T08:52:06.733Z] 08:42:30     INFO -  Buffered messages finished
[task 2022-04-19T08:52:06.733Z] 08:42:30  WARNING -  TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | Test timed out. -
[task 2022-04-19T08:52:06.733Z] 08:42:42  WARNING -  TEST-UNEXPECTED-FAIL | SimpleTest | this test already called finish!
[task 2022-04-19T08:52:06.733Z] 08:42:42  WARNING -  TEST-UNEXPECTED-ERROR | dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | called finish() multiple times
[task 2022-04-19T08:52:06.733Z] 08:42:42     INFO -  TEST-INFO took 327446ms
[task 2022-04-19T08:52:06.733Z] 08:43:04  WARNING -  TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | Test timed out. -
[task 2022-04-19T08:52:06.733Z] 08:43:04  WARNING -  TEST-UNEXPECTED-FAIL | SimpleTest | this test already called finish!
[task 2022-04-19T08:52:06.733Z] 08:43:04  WARNING -  TEST-UNEXPECTED-ERROR | dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | called finish() multiple times
[task 2022-04-19T08:52:06.733Z] 08:43:04     INFO -  TEST-INFO
[task 2022-04-19T08:52:06.733Z] 08:43:39  WARNING -  TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | Test timed out. -
[task 2022-04-19T08:52:06.733Z] 08:43:39  WARNING -  TEST-UNEXPECTED-FAIL | SimpleTest | this test already called finish!
[task 2022-04-19T08:52:06.733Z] 08:43:39  WARNING -  TEST-UNEXPECTED-ERROR | dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | called finish() multiple times
[task 2022-04-19T08:52:06.733Z] 08:43:39     INFO -  TEST-INFO
[task 2022-04-19T08:52:06.733Z] 08:44:02  WARNING -  TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | Test timed out. -
[task 2022-04-19T08:52:06.733Z] 08:44:02  WARNING -  TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | 4 test timeouts, giving up. -
[task 2022-04-19T08:52:06.733Z] 08:44:02  WARNING -  TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | Skipping 235 remaining tests. -
[task 2022-04-19T08:52:06.733Z] 08:44:02  WARNING -  TEST-UNEXPECTED-FAIL | SimpleTest | this test already called finish!
[task 2022-04-19T08:52:06.733Z] 08:44:02  WARNING -  TEST-UNEXPECTED-ERROR | (SimpleTest/TestRunner.js) | called finish() multiple times
[task 2022-04-19T08:52:06.733Z] 08:44:02     INFO -  TEST-INFO
[task 2022-04-19T08:52:06.733Z] 08:51:31     INFO -  wait for org.mozilla.geckoview.test_runner complete; top activity=org.mozilla.geckoview.test_runner
[task 2022-04-19T08:52:06.733Z] 08:51:31     INFO -  org.mozilla.geckoview.test_runner unexpectedly found running. Killing...
[task 2022-04-19T08:52:06.733Z] 08:51:43  WARNING -  TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) (finished) | application timed out after 370 seconds with no output
[task 2022-04-19T08:52:06.733Z] 08:51:43     INFO -  runtestsremote.py | Application ran for: 0:14:49.535887
[task 2022-04-19T08:52:06.733Z] 08:51:44     INFO -  Stopping web server
[task 2022-04-19T08:52:06.733Z] 08:51:44     INFO -  Server shut down.
[task 2022-04-19T08:52:06.733Z] 08:51:44     INFO -  Web server killed.
[task 2022-04-19T08:52:06.733Z] 08:51:44     INFO -  Stopping web socket server
[task 2022-04-19T08:52:06.733Z] 08:51:44     INFO -  Stopping ssltunnel
[task 2022-04-19T08:52:06.733Z] 08:51:44     INFO -  leakcheck | refcount logging is off, so leaks can't be detected!
[task 2022-04-19T08:52:06.733Z] 08:51:44     INFO -  runtests.py | Running tests: end.
[task 2022-04-19T08:52:06.733Z] 08:51:48     INFO -  Buffered messages finished
[task 2022-04-19T08:52:06.733Z] 08:51:53     INFO -  0 INFO TEST-START | Shutdown

Nika, the failure seems to have started from here.
Can you please take a look?

Flags: needinfo?(nika)

Seems like we're hitting FD limits when trying to create new shared memory regions, which is one of the situations I was worried about with that patch. The logcat output shows errors like:

04-19 08:37:09.155  7372  7396 E Gecko   : ShmemAndroid::Create():open: Too many open files (24)
04-19 08:37:09.157  7372  7396 E Gecko   : ShmemAndroid::Create():open: Too many open files (24)

Unfortunately, it'll be a bit before I can figure out a good way to mitigate this. It might be best to back out bug 1757802 until we find some other approach.
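For readers unfamiliar with the error above: the "(24)" in the logcat line is the errno value, and on Linux (and therefore Android) errno 24 is EMFILE, the per-process "too many open files" condition. The following is only an illustrative Python sketch, not Gecko code; the helper name `is_fd_exhaustion` is hypothetical.

```python
import errno
import os


def is_fd_exhaustion(exc):
    """Return True if exc is an OSError caused by hitting the per-process fd limit."""
    return isinstance(exc, OSError) and exc.errno == errno.EMFILE


if __name__ == "__main__":
    # Print the numeric errno and the message the C library attaches to it.
    print(errno.EMFILE, os.strerror(errno.EMFILE))
```

EMFILE is distinct from ENFILE, which reports that the system-wide file table is full; only EMFILE points at the per-process limit.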

Flags: needinfo?(nika)
Regressed by: 1757802

Set release status flags based on info from the regressing bug 1757802

I pushed a patch to Try to see what the per-process resource limits are, and got this result (the two numbers are the current, i.e. soft, limit and the hard limit):

04-20 22:18:30.252  7215  7238 E Gecko   : rlimits: 1024 4096
04-20 22:18:30.252  7215  7238 E Gecko   : ShmemAndroid::Create():open: Too many open files (24)

So we're currently limited to 1024 fds (the soft limit), but we could raise it as high as 4096 (the hard limit). Normally, we do raise the limit to (at least) 4096 if possible, but I think what's going on here is that we only do that in the parent process. Normally the child processes are direct descendants of the parent, so they inherit the change, but that's not the case on Android. As the de facto owner of things related to RLIMIT_NOFILE, I'll see if I can come up with a quick fix.
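The soft/hard distinction described above can be sketched with the POSIX getrlimit/setrlimit calls. This is a minimal Python illustration using the standard `resource` module, not the actual Gecko patch (which is not shown in this bug); the names `TARGET_FDS`, `pick_soft_limit`, and `raise_fd_limit` are hypothetical, and the 4096 target is taken from the comment above.

```python
import resource

TARGET_FDS = 4096  # the "at least 4096" goal mentioned in this bug


def pick_soft_limit(soft, hard, target=TARGET_FDS):
    """Choose the soft limit to request: never lower it, never exceed the hard limit."""
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    cap = target if hard == resource.RLIM_INFINITY else min(target, hard)
    return max(soft, cap)


def raise_fd_limit(target=TARGET_FDS):
    """Try to raise this process's RLIMIT_NOFILE soft limit; return the value in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = pick_soft_limit(soft, hard, target)
    if wanted != soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            pass  # the OS refused the change; keep the existing limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("rlimits before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("soft limit now:", raise_fd_limit())
```

With the values logged above (soft 1024, hard 4096), `pick_soft_limit` would request 4096. The change only affects the calling process and any children it creates afterwards, which is why a process that is not a descendant of the one that raised the limit keeps the default.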

Incidentally, 1024 is pretty small, given that Necko will use up to 1000 on its own, and then there are other subsystems like IndexedDB (and maybe the cache?) that can have significant fd usage; this is why we had to increase it on desktop Linux.
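To judge how close a process is to its limit, one can count its open descriptors and compare against the soft limit. The sketch below is an assumption-laden illustration in Python, not something used in this bug: it relies on the `/proc/self/fd` directory that Linux provides (falling back to `/dev/fd`), and the helper names `count_open_fds` and `fd_headroom` are hypothetical.

```python
import os
import resource


def count_open_fds():
    """Count this process's open fds by listing its fd directory.

    The listing itself briefly uses one descriptor, which is included in
    the count, so the result is consistently one higher than the steady state.
    """
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


def fd_headroom():
    """Return how many more fds can be opened, or None if the soft limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - count_open_fds()


if __name__ == "__main__":
    print("open fds:", count_open_fds(), "headroom:", fd_headroom())
```

Under a 1024 soft limit, a subsystem that holds up to 1000 descriptors would leave roughly 24 for everything else in that process.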

Assignee: nobody → jld
Has Regression Range: --- → yes
Severity: S4 → S3
Component: Canvas: WebGL → Startup and Profile System
Priority: P5 → P1
Product: Core → Toolkit
Summary: Perma [Tier 2] dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | Test timed out. - → Android child process fd limits are too low | Perma [Tier 2] dom/canvas/test/webgl-conf/generated/test_2_conformance__extensions__webgl-compressed-texture-astc.html | Test timed out. -

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #5)

> Incidentally, 1024 is pretty small, given that Necko will use up to 1000 on its own, and then there are other subsystems like IndexedDB (and maybe the cache?) that can have significant fd usage; this is why we had to increase it on desktop Linux.

But this applies only to child processes, so Necko's usage probably isn't relevant. Even so, it's intended that all processes have at least 4k fds available (if the OS config allows it, which it does in this case), so that ought to be fixed. Raising the limit in child processes also fixes the test failure (on Try).

I'm mostly convinced that there isn't an actual leak here — the new Shmem is basically a fancy wrapper for RefPtr<mozilla::ipc::SharedMemory>, so refcount logging ought to pick up any leaks, at least if they live past shutdown.

We still want this change, but the regressing bug has also been backed out.

Not tracking for 101 anymore since the regressing change was backed out.

Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/65b54546353e
Set fd resource limits correctly for child processes on Android. r=glandium
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch