Closed Bug 1906191 Opened 4 months ago Closed 2 months ago

Not all tests run on rusttests tasks on Windows

Categories

(Firefox Build System :: General, defect, P3)


Tracking

(firefox-esr115 unaffected, firefox-esr128 fixed, firefox128 wontfix, firefox129 wontfix, firefox130 wontfix, firefox131 fixed)

RESOLVED FIXED
131 Branch

People

(Reporter: glandium, Assigned: glandium)

References

(Regression)

Details

(Keywords: regression)

Attachments

(8 files)

For some reason, this is not reproducible locally (only on CI), and doesn't seem to be happening before rustc 1.80.
The most disturbing thing is that this happens silently.
The hardest thing to understand is that cargo test is invoked for the tests, yet they end up neither built nor run.

The Bugbug bot thinks this bug should belong to the 'Firefox Build System::Task Configuration' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: General → Task Configuration
Component: Task Configuration → General

(In reply to Mike Hommey [:glandium] from comment #0)

For some reason, this is not reproducible locally (only on CI), and doesn't seem to be happening before rustc 1.80.
The most disturbing thing is that this happens silently.
The hardest thing to understand is that cargo test is invoked for the tests, yet they end up neither built nor run.

Correction: they are (partially) built, but don't run. Or at least, the output log is truncated, and errors that are supposed to happen when those tests run are not caught.

Where I'm at: there's some weird interaction between ProcessHandlerMixin in mach, make's output-sync, and cargo.

The severity field is not set for this bug.
:ahochheiden, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(ahochheiden)
Severity: -- → S3
Flags: needinfo?(ahochheiden)
Priority: -- → P3
Keywords: regression
Regressed by: 1845125

Set release status flags based on info from the regressing bug 1845125

Kagami, can you still reproduce bug 1845125 if you revert its patch locally? (asking to know whether I can ask you to test an alternative fix for that bug)

Flags: needinfo?(krosylight)

Yes, without the patch it still stalls at 85%:

:machBuildFaster>  1:11.65 > Task :geckoview:generateJNIWrappersForGeneratedWithGeckoBinariesDebug
:machBuildFaster>  1:11.65 Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf-8
:machBuildFaster>  1:11.65 Processing annotations...
:machBuildFaster>  1:11.65 Annotation processing complete in 644ms
:machBuildFaster>  1:11.65 Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.
:machBuildFaster>  1:11.65 You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
:machBuildFaster>  1:11.65 For more on this, please refer to https://docs.gradle.org/8.7/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.
:machBuildFaster>  1:11.65 BUILD SUCCESSFUL in 1m 7s
:machBuildFaster>  1:11.65 52 actionable tasks: 10 executed, 42 up-to-date
<===========--> 85% EXECUTING [5m 42s]
> :machBuildFaster
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
Flags: needinfo?(krosylight)

Can you try this patch (with the one from bug 1845125 backed out)?

diff --git a/testing/mozbase/mozprocess/mozprocess/processhandler.py b/testing/mozbase/mozprocess/mozprocess/processhandler.py
index 705ff5a2106c4..a1be3d5a984be 100644
--- a/testing/mozbase/mozprocess/mozprocess/processhandler.py
+++ b/testing/mozbase/mozprocess/mozprocess/processhandler.py
@@ -1074,12 +1074,8 @@ class ProcessReader(object):
         return thread
 
     def _read_stream(self, stream, queue, callback):
-        while True:
-            line = stream.readline()
-            if not line:
-                break
+        for line in iter(stream.readline, ""):
             queue.put((line, callback))
-        stream.close()
 
     def start(self, proc):
         queue = Queue()
@@ -1146,10 +1142,6 @@ class ProcessReader(object):
                 self.timeout_callback()
             except Exception:
                 traceback.print_exc()
-        if stdout_reader:
-            stdout_reader.join()
-        if stderr_reader:
-            stderr_reader.join()
         if not timed_out:
             try:
                 self.finished_callback()
Flags: needinfo?(krosylight)

Unfortunately that didn't help; it still halts at 85% (with Python 3.11.9 and 3.12.4).

Flags: needinfo?(krosylight)

First and foremost, this undoes the patch for bug 1845125, which causes
other problems (e.g. output being dropped past a certain point).

The rest of the patch addresses bug 1845125 at the root: ultimately, this
is a similar problem to bug 1863675.

wait is blocked on joining the reader thread, but the reader thread is
blocked on readline because the process is finished but hasn't been
waited on yet. Borrowing the strategy from bug 1863675 simplifies things
overall.
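
As a purely illustrative sketch of the hang pattern described above (this is not the mozprocess code; the command and timeout are placeholders): a reader thread blocks in readline() until the pipe's write end is closed, so unconditionally joining it before reaping the process can stall, whereas reaping first and using a bounded join does not.

import subprocess
import threading

def read_stream(stream, lines):
    # readline() only returns EOF once every holder of the pipe's write
    # end has closed it, so this loop can outlive the child process.
    for line in iter(stream.readline, b""):
        lines.append(line)

proc = subprocess.Popen(["cargo", "test"], stdout=subprocess.PIPE)
output = []
reader = threading.Thread(target=read_stream, args=(proc.stdout, output))
reader.start()

# Joining the reader before reaping the process is the problematic
# ordering; reaping first and giving the reader a bounded join avoids
# getting stuck behind a blocked readline().
proc.wait()
reader.join(timeout=10)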

Set release status flags based on info from the regressing bug 1845125

That value (winprocess.STILL_ACTIVE) might be returned to callers of other
methods that return Process.returncode, and might mislead them, because
they would be expecting None in that situation.
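
For context, a hedged, Windows-only sketch of why that value is misleading, using ctypes directly rather than mozprocess's winprocess bindings: GetExitCodeProcess reports STILL_ACTIVE (259) for a process that hasn't exited, and storing that in returncode would look like a real exit code to callers.

from ctypes import byref, windll, wintypes

STILL_ACTIVE = 259  # value of STILL_ACTIVE in winbase.h

def poll_exit_code(process_handle):
    # Returns None while the process is running, its exit code otherwise.
    code = wintypes.DWORD()
    windll.kernel32.GetExitCodeProcess(process_handle, byref(code))
    if code.value == STILL_ACTIVE:
        # Don't store 259 in returncode; callers expect None here.
        return None
    return code.value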

If the current thread is the IO completion port manager thread, then
obviously it's alive. If the current thread is any other thread and we
were using IO completion ports, there could be pending events in the
process events queue regardless of the state of the IO completion port
manager thread: the manager thread terminates right after putting an
event on the queue, so there was a race condition where the queue would
not be used when it should have been.

There is a case, though, where there might not be an event on the queue:
when the thread fails with an unexpected exception. In that case, we put
something on the queue that will make _custom_wait fail.
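
A small sketch of that behavior with made-up names (not the actual mozprocess internals): the exit event outlives the thread that posted it, so waiters should consult the queue rather than the thread's liveness, and an unexpected failure posts a sentinel so they fail instead of hanging.

import queue
import threading

process_events = queue.Queue()

def completion_manager(collect_exit_status):
    try:
        # Post the exit event, then let the thread terminate.
        process_events.put(("exited", collect_exit_status()))
    except Exception as exc:
        # Unexpected failure: post something that makes waiters fail
        # loudly instead of blocking forever on an empty queue.
        process_events.put(("error", exc))

manager = threading.Thread(target=completion_manager, args=(lambda: 0,))
manager.start()
manager.join()

# The manager thread is no longer alive, but its event is still queued
# and must be consumed by whoever is waiting.
assert not manager.is_alive()
kind, payload = process_events.get_nowait()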

The only reason custom_wait checks whether the current thread is the IO
completion port manager is that there is a case where it might call
Process.kill, which does an automatic wait. However, for the purposes of
the IO completion port manager, the only interesting thing is for the
process to be killed, not to wait in the thread itself for that to
finish.

So we introduce a method that kills without waiting. It could be argued
that kill shouldn't wait in the first place, as subprocess.Popen's kill
doesn't, but that's a bigger API change.
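
A sketch of that split with hypothetical helper names (subprocess.Popen stands in for the mozprocess Process class): the completion port manager only needs the first function; whoever is already waiting takes care of reaping.

import subprocess

def kill_nowait(proc):
    # Request termination without waiting for the process to be reaped,
    # the way subprocess.Popen.kill() behaves.
    proc.kill()

def kill_and_wait(proc, timeout=None):
    # The kill-then-wait behavior described above, kept for callers that
    # do want to block until the process is gone.
    kill_nowait(proc)
    return proc.wait(timeout=timeout)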

wait(None) is supposed to wait indefinitely for the process to terminate.
But we currently cap it at three and a half minutes, which seems more
like a workaround than intended behavior.

When we always have a timeout, we don't distinguish between the case
where wait was called with an explicit timeout, and the case where it's
meant to wait until the process actually dies.

And because not getting an event on the queue within the timeout throws
an Empty exception, that exception is caught and handled as an error,
even though it is normal behavior.

The expectation should be that either an event arrives on the process
event queue within the timeout requested for the wait, and we go on
normally, or the queue is still empty at the end of the timeout and
throws an exception, in which case wait should return normally,
indicating that the process is still running.

Similarly, in the case without an IO completion port manager, we
shouldn't kill a process just because we waited the amount of time
requested. wait(n) is not "kill after n seconds".
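
A minimal sketch of that expectation with assumed names (not the landed code): queue.Empty on a timed get is the normal "still running" outcome, and a timeout of None blocks until an event arrives.

import queue

def wait_for_exit(process_events, timeout=None):
    # timeout=None blocks indefinitely, matching what wait(None) should do.
    try:
        kind, status = process_events.get(timeout=timeout)
    except queue.Empty:
        # Normal outcome of a timed wait: the process is still running.
        # Report that; it is not a reason to kill the process.
        return None
    return status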

Cleanup closes various handles, with consequences for the IO completion
port manager, the stream reader, etc. That means that after a wait() from
which the process emerged still running (which is a valid state), further
operations could fail to work properly.

So only clean up when we know the process has died, whether a natural or
a forced death.
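
A sketch of that guard, assuming a wait that returns None while the process is still running (handler and its methods are stand-ins, not the actual API):

def wait_and_cleanup(handler, timeout=None):
    status = handler.wait(timeout)
    if status is None:
        # Timed out with the process still alive: keep the handles, the
        # completion port and the stream readers intact.
        return None
    # Natural or forced death: now it is safe to close everything.
    handler.cleanup()
    return status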

When multiple threads call wait, they may all end up in _custom_wait,
waiting for an event on the process event queue. But there's only ever
going to be one of those events emitted by the IO completion port
manager.

Of course, because we previously always timed out doing that, things
would eventually work out, albeit after multiple minutes.

So we re-emit the event so that other threads calling wait don't wait
forever.

And because multiple threads might now return from wait concurrently,
they may end up calling _cleanup concurrently too, which can fail because
of race conditions, so we add some locking.
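
A self-contained sketch of those two fixes with illustrative names (not the actual mozprocess classes): the single exit event is re-posted for other blocked waiters, and cleanup is made idempotent behind a lock.

import queue
import threading

class ExitWaiter:
    def __init__(self):
        self.events = queue.Queue()
        self._cleanup_lock = threading.Lock()
        self._cleaned_up = False

    def wait(self, timeout=None):
        try:
            status = self.events.get(timeout=timeout)
        except queue.Empty:
            return None  # still running
        # Re-emit the event so any other thread blocked in wait() also
        # wakes up instead of waiting forever for a second event.
        self.events.put(status)
        self._cleanup()
        return status

    def _cleanup(self):
        # Several threads can reach this concurrently; serialize and make
        # it a no-op the second time.
        with self._cleanup_lock:
            if self._cleaned_up:
                return
            self._cleaned_up = True
            # close handles, stop stream readers, etc.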

Process.kill already waits, which means that if there's a timeout and
the process doesn't die within it, we can currently end up waiting for
twice the timeout.
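
Roughly the caller-side shape of the problem, with hypothetical names: a kill that already blocks for up to the timeout, followed by another full wait, gives a worst case of twice the timeout.

def stop_process(proc, timeout):
    # kill() here is assumed to wait up to `timeout` on its own...
    proc.kill(timeout=timeout)
    # ...so this extra wait can block for up to another `timeout`,
    # doubling the worst-case delay for a process that won't die.
    return proc.wait(timeout=timeout)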

Kagami, can you test the full patch stack?

Flags: needinfo?(krosylight)

Oh great, bug 1845125 is still happening... what did I miss this time?

Flags: needinfo?(krosylight)

(In reply to Mike Hommey [:glandium] from comment #20)

Oh great, bug 1845125 is still happening... what did I miss this time?

Should I hold off on reviewing this for now?

Flags: needinfo?(mh+mozilla)

marionette tests are sensitive to mozprocess changes and seem to fail with this stack applied.
Also, it looks like the mozprocess unit tests are broken, at least on Windows:
https://treeherder.mozilla.org/jobs?repo=try&revision=059386a378657a038cd740190af72dd84a31db42

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #21)

(In reply to Mike Hommey [:glandium] from comment #20)

Oh great, bug 1845125 is still happening... what did I miss this time?

Should I hold off on reviewing this for now?

I thought I had covered it all with the last update, but it turns out there seems to still be some race condition... I think you can review everything but https://bugzilla.mozilla.org/attachment.cgi?id=9417425, though. I triggered marionette tests on https://phabricator.services.mozilla.com/D218910 in case something breaks with the changes up to that one.

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #23)

but I triggered marionette tests on https://phabricator.services.mozilla.com/D218910 in case something breaks with the changes up to that one.

Things are looking good up to there.

Fun fact: apparently, the patches up to https://phabricator.services.mozilla.com/D218910 are enough to fix the immediate problem for which I filed this bug... although https://phabricator.services.mozilla.com/D218389 would be a good thing to fix too.

(In reply to Geoff Brown [:gbrown] (last day Aug 23) from comment #22)

marionette tests are sensitive to mozprocess changes and seem to fail with this stack applied.

I'm baffled. The marionette failures come from the very last patch, apparently, the one that removes a redundant wait.

Attachment #9418447 - Attachment description: Bug 1906191 - Remove redundant wait. → Bug 1906191 - Avoid doubling the timeout from ProcessHandlerMixin.kill.

The entire stack is green now, hopefully.

Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/e3aae8a1d938 Don't set Process.returncode when it's winprocess.STILL_ACTIVE. r=releng-reviewers,jcristau
https://hg.mozilla.org/integration/autoland/rev/8346d9048769 Don't check whether the IO completion port manager thread is alive. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/70f290732bd8 Don't call Process.kill from the IO completion port manager. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/1ead82b42c89 Properly handle wait timeout. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/7d0cb14bcdea Don't cleanup unless the process has died. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/35264cb15789 Avoid race condition when multiple threads are calling wait. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/64952b7e24e1 Change how ProcessHandler handles output_timeout. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/999882657e11 Avoid doubling the timeout from ProcessHandlerMixin.kill. r=releng-reviewers,ahal,gbrown
Regressions: 1915354
Blocks: 1915540
No longer blocks: 1915540
Regressions: 1915540
No longer regressions: 1915540
See Also: → 1916100
Regressions: 1916125
Regressions: 1915540
