Closed Bug 1765970 Opened 3 years ago Closed 3 years ago

Intermittent jsreftest timeouts in the JS shell

Categories

(Core :: JavaScript Engine, task, P1)

Tracking

RESOLVED FIXED
101 Branch
Tracking Status
firefox101 --- fixed

People

(Reporter: jandem, Assigned: jandem)

References

(Blocks 1 open bug)

Details

We're getting a lot of intermittent failures for SM shell jobs that appear to be test timeouts while running jsreftests. Filing this bug to track and analyze them.

Depends on: 1763453
Depends on: 1765318
Depends on: 1762773
Depends on: 1763689
Depends on: 1764579
Depends on: 1763578
Depends on: 1760787
Depends on: 1764235
Depends on: 1764908
Depends on: 1760767
Depends on: 1764225
Depends on: 1764305
Depends on: 1761457
Depends on: 1764581
Depends on: 1761641
Depends on: 1761056
Depends on: 1761672
Depends on: 1764234
Depends on: 1761027
Depends on: 1762159
Depends on: 1762639
Depends on: 1763380
Depends on: 1760968
Depends on: 1762319
Depends on: 1761883
Depends on: 1764215
Depends on: 1761666
Depends on: 1761854
Depends on: 1762678
Depends on: 1761869
Depends on: 1760919
Depends on: 1761103
Depends on: 1750150
Depends on: 1764130
Depends on: 1749843
Depends on: 1749485
Depends on: 1765920
Depends on: 1765858
Depends on: 1765834
Depends on: 1765557
Depends on: 1765448
Depends on: 1734699
Depends on: 1753627
Depends on: 1753684
Depends on: 1753832
Depends on: 1734952
Depends on: 1735977
Depends on: 1735978
Depends on: 1736525
Depends on: 1736608
Depends on: 1736951
Depends on: 1737843
Depends on: 1737283
Depends on: 1737074
Depends on: 1737952
Depends on: 1739433
Depends on: 1739593
Depends on: 1740778
Depends on: 1741323
Depends on: 1741858
Depends on: 1742322
Depends on: 1742690
Depends on: 1743492
Depends on: 1743507
Depends on: 1743601
Depends on: 1743746
Depends on: 1743974
Depends on: 1744137
Depends on: 1744898
Depends on: 1746610
Depends on: 1746810
Depends on: 1747794
Depends on: 1748365
Depends on: 1748767
Depends on: 1748796
Depends on: 1750887
Depends on: 1751794
Depends on: 1751801
Depends on: 1751983
Depends on: 1752877
Depends on: 1752910
Depends on: 1752938
Depends on: 1753262
Depends on: 1755013
Depends on: 1756265
Depends on: 1757276
Depends on: 1757432
Depends on: 1757484
Depends on: 1757652
Depends on: 1758019
Depends on: 1758244
Depends on: 1758508
Depends on: 1758783
Depends on: 1758796
Depends on: 1759683
Depends on: 1759767
Depends on: 1759799
Depends on: 1759839
Depends on: 1760525
Depends on: 1731685
Depends on: 1730989
Depends on: 1730964
Depends on: 1730963
Depends on: 1730928
Depends on: 1730792
Depends on: 1730092
Depends on: 1729375
Depends on: 1729191
Depends on: 1727305
Depends on: 1726967
Depends on: 1725872
Depends on: 1725598
Depends on: 1725597
Depends on: 1724912
Depends on: 1724818
Depends on: 1724757
Depends on: 1724416
Depends on: 1724409
Depends on: 1723816
Depends on: 1723768
Depends on: 1722194
Depends on: 1722315
Depends on: 1722350
Depends on: 1722743
Depends on: 1722932
Depends on: 1723179
Depends on: 1723470
Depends on: 1722179
Depends on: 1721501
Depends on: 1721151
Depends on: 1721141
Depends on: 1720842
Depends on: 1720797
Depends on: 1719877
Depends on: 1716965
Depends on: 1716936
Depends on: 1716935
Depends on: 1716852

Some of the intermittent-failure bugs I added as dependencies were filed on June 16-17, 2021, followed by more starting July 9-15. Looking at the commit logs, I don't see any obvious candidates that landed on autoland around that time.

Aryx, do you have any thoughts here? Is it possible to check for CI-related changes/updates around this time?

Flags: needinfo?(aryx.bugmail)
Depends on: 1765993

Some thoughts:

  • Many of these tests are very simple: just a few statements, no loops, no JIT stuff.
  • That points to either an OS-level starvation/scheduling issue, a bug in the test harness, or a bug in the JS shell or JS engine (deadlock? infinite loop?).
  • I'll try to do an analysis by platform, test job (which SM task), test files, etc. next week.
  • jit-tests might also be affected (we have time-out bugs on file), but because we have far fewer tests there, this would likely show up less often.

I can reproduce this locally, but it's very unreliable and takes a long time to trigger. It seems to be a deadlock in our helper thread code, so that narrows it down a bit.

Flags: needinfo?(aryx.bugmail)

Is there anything related/suspicious in the pushlog from 2021-06-15 to 2021-06-18?

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #4)

Is there anything related/suspicious in the pushlog from 2021-06-15 to 2021-06-18?

Yeah, it seems related to the thread pool overhaul; bug 1714141, for example, landed around that time. That makes a lot of sense given what we know, but I don't understand the root cause yet.

Severity: normal → N/A
Priority: -- → P1

I tracked this down to a bug in glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=25847

When a single thread gets notified via pthread_cond_signal, there's a very short window where, if the kernel deschedules the waiting thread, the condition variable can get into an invalid state. The signal is then dropped and no thread wakes up. This causes hangs in other programs and language runtimes too. The glibc bug report has a lot of discussion and details on this issue, but unfortunately an official fix has not landed yet after two years. Ubuntu is cherry-picking a proposed patch.

In Firefox we use the XPCOM thread pool instead of the one in SpiderMonkey, so this hopefully doesn't affect Firefox.

Now I have to think about the right workaround for this...
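
For illustration only, here is a minimal sketch of two mitigations that are sometimes suggested for a lost condition-variable wakeup: waking all waiters instead of one, and using a bounded wait that re-checks the predicate. This is not the SpiderMonkey helper thread code and not necessarily what eventually landed; the class TinyPool, the function runDemo, and the 50 ms timeout are all hypothetical names and values chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical toy thread pool; an illustration, not SpiderMonkey's code.
class TinyPool {
 public:
  explicit TinyPool(std::size_t numThreads) {
    assert(numThreads > 0);
    for (std::size_t i = 0; i < numThreads; i++) {
      threads_.emplace_back([this] { workerLoop(); });
    }
  }

  // Lets the workers drain the queue, then joins them.
  ~TinyPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      shuttingDown_ = true;
    }
    wakeup_.notify_all();
    for (auto& t : threads_) {
      t.join();
    }
  }

  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(std::move(task));
    }
    // Assumed mitigation 1: wake every waiter rather than a single one,
    // so the single-waiter signal path is not used at all.
    wakeup_.notify_all();
  }

 private:
  void workerLoop() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      // Assumed mitigation 2: a bounded wait. Even if a wakeup is lost,
      // the worker re-checks the queue after at most 50 ms.
      wakeup_.wait_for(lock, std::chrono::milliseconds(50),
                       [this] { return shuttingDown_ || !tasks_.empty(); });
      if (!tasks_.empty()) {
        std::function<void()> task = std::move(tasks_.front());
        tasks_.pop_front();
        lock.unlock();  // run the task without holding the lock
        task();
        lock.lock();
        continue;
      }
      if (shuttingDown_) {
        return;  // queue is empty and shutdown was requested
      }
    }
  }

  std::mutex mutex_;
  std::condition_variable wakeup_;
  std::deque<std::function<void()>> tasks_;
  bool shuttingDown_ = false;
  std::vector<std::thread> threads_;  // declared last: workers use the members above
};

// Submits numTasks counter increments and returns how many actually ran.
int runDemo(std::size_t numThreads, int numTasks) {
  std::atomic<int> done{0};
  {
    TinyPool pool(numThreads);
    for (int i = 0; i < numTasks; i++) {
      pool.submit([&done] { done.fetch_add(1); });
    }
  }  // the destructor drains the queue and joins the workers
  return done.load();
}
```

Calling runDemo(4, 1000) should return 1000 once every task has run. The trade-off is extra wakeups: notify_all causes spurious contention on the mutex, and the timed wait adds periodic polling, so a sketch like this favors never hanging over minimal overhead.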

Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Depends on: 1766827
Depends on: 1766729
Depends on: 1766661
Depends on: 1766422
Depends on: 1766844

No new bugs have been filed since bug 1766844 landed on April 28, so the workaround seems to be holding up \o/

Fingers crossed the glibc bug will be fixed at some point.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch