Intermittent jsreftest timeouts in the JS shell
Categories
(Core :: JavaScript Engine, task, P1)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox101 | --- | fixed |
People
(Reporter: jandem, Assigned: jandem)
References
(Blocks 1 open bug)
Details
We're getting a lot of intermittent failures for SM shell jobs that appear to be test timeouts while running jsreftests. Filing this bug to track and analyze the problem.
Comment 1•3 years ago
Some of the intermittent-failure bugs I added as dependencies were filed on June 16-17 2021, followed by more of them starting around July 9-15. Looking at the commit logs, I don't see obvious candidates that landed on autoland around that time.
Aryx, do you have any thoughts here? Is it possible to check for CI-related changes/updates around this time?
Comment 2•3 years ago
Some thoughts:
- Many of these tests are very simple: just a few statements, no loops, no JIT stuff.
- This would either be an OS-level starvation/scheduling issue, a bug in the test harness, or a bug in the JS shell or JS engine (a deadlock or infinite loop?).
- I'll try to do an analysis of platform, test job (which SM task), test files etc next week.
- jit-tests might also be affected (we have timeout bugs on file), but because we have far fewer tests there, this would likely show up less.
Comment 3•3 years ago
I can reproduce this locally but it's very unreliable and takes a long time to trigger. It seems to be a deadlock with our helper thread code so that narrows it down a bit.
Comment 4•3 years ago
Is there anything related/suspicious in the pushlog for 2021-06-15 - 2021-06-18?
Comment 5•3 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #4)
Is there anything related/suspicious in the pushlog for 2021-06-15 - 2021-06-18?
Yeah, it seems related to the thread pool overhaul; bug 1714141, for example, landed around that time. It makes a lot of sense given what we know, but I don't understand the root cause yet.
Comment 6•3 years ago
I tracked this down to a bug in glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=25847
When a single thread gets notified via pthread_cond_signal, there's a very short window where, if the kernel deschedules the waiting thread, the condition variable can get into an invalid state, resulting in the signal being dropped and no thread waking up. This causes hangs in other programs and language runtimes too; the glibc bug report has a lot of discussion and details on this issue, but unfortunately an official fix has not landed even after two years. Ubuntu is cherry-picking a proposed patch.
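To make the failure mode concrete, here is a minimal, self-contained sketch (not SpiderMonkey code) of the standard condition-variable hand-off that the glibc bug can break. Both threads below follow the correct POSIX pattern, yet if the kernel deschedules the waiter in the window described above, the pthread_cond_signal can be lost and the helper thread blocks forever because no further signal ever arrives.

```c
/* Illustrative sketch of the pattern affected by glibc bug 25847.
 * Build with: cc -pthread lost_wakeup.c */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool task_ready = false;

/* Helper thread: sleep until a task is published. */
static void* helper_thread(void* arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!task_ready) {
        /* Correct predicate loop; still hangs if the single wakeup is dropped
           and nothing ever signals again. */
        pthread_cond_wait(&cond, &lock);
    }
    task_ready = false;
    pthread_mutex_unlock(&lock);
    /* ... run the task ... */
    return NULL;
}

/* Main thread: publish a task and wake exactly one waiter. */
static void publish_task(void) {
    pthread_mutex_lock(&lock);
    task_ready = true;
    pthread_cond_signal(&cond); /* the signal that can get lost */
    pthread_mutex_unlock(&lock);
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, helper_thread, NULL);
    publish_task();
    pthread_join(t, NULL);
    return 0;
}
```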
In Firefox we use the XPCOM thread pool instead of the one in SpiderMonkey, so this hopefully doesn't affect Firefox.
Now I have to think about the right workaround for this...
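One common mitigation, shown here only as an illustrative sketch and not necessarily the workaround that eventually landed in bug 1766844, is to replace the unbounded wait with pthread_cond_timedwait, so that a dropped signal costs at most a short delay instead of an indefinite hang:

```c
/* Sketch of a timed-wait mitigation; hypothetical, not SpiderMonkey's actual fix. */
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool task_ready = false;

static void wait_for_task(void) {
    pthread_mutex_lock(&lock);
    while (!task_ready) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1; /* re-check the predicate at least once per second */
        /* ETIMEDOUT just means "loop around and re-check"; a lost wakeup now
           delays the thread by about a second instead of hanging it forever. */
        pthread_cond_timedwait(&cond, &lock, &deadline);
    }
    task_ready = false;
    pthread_mutex_unlock(&lock);
}
```

The trade-off is a small amount of spurious wakeup overhead per waiting thread in exchange for bounding the damage of a lost signal.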
Comment 7•3 years ago
No new bugs have been filed since bug 1766844 landed on April 28, so the workaround seems to be holding up \o/
Fingers crossed the glibc bug will be fixed at some point.