Closed Bug 1906191 Opened 4 months ago Closed 2 months ago

Not all tests run on rusttests tasks on Windows

Categories

(Firefox Build System :: General, defect, P3)


Tracking

(firefox-esr115 unaffected, firefox-esr128 fixed, firefox128 wontfix, firefox129 wontfix, firefox130 wontfix, firefox131 fixed)

RESOLVED FIXED
131 Branch

People

(Reporter: glandium, Assigned: glandium)

References

(Regression)

Details

(Keywords: regression)

Attachments

(8 files)

For some reason, this is not reproducible locally (only on CI), and doesn't seem to be happening before rustc 1.80.
The most disturbing thing is that this happens silently.
The hardest thing to understand is that cargo test is invoked for the tests, yet they end up neither built nor run.

The Bugbug bot thinks this bug should belong to the 'Firefox Build System::Task Configuration' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: General → Task Configuration
Component: Task Configuration → General

(In reply to Mike Hommey [:glandium] from comment #0)

For some reason, this is not reproducible locally (only on CI), and doesn't seem to be happening before rustc 1.80.
The most disturbing thing is that this happens silently.
The hardest thing to understand is that cargo test is invoked for the tests, yet they end up neither built nor run.

Correction: they are (partially) built, but don't run. Or at least, the output log is truncated, and errors that are supposed to happen when those tests run are not caught.

Where I'm at: there's some weird interaction between ProcessHandlerMixin in mach, make's output-sync, and cargo.

The severity field is not set for this bug.
:ahochheiden, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(ahochheiden)
Severity: -- → S3
Flags: needinfo?(ahochheiden)
Priority: -- → P3
Keywords: regression
Regressed by: 1845125

Set release status flags based on info from the regressing bug 1845125

Kagami, can you still reproduce bug 1845125 if you revert its patch locally? (asking to know whether I can ask you to test an alternative fix for that bug)

Flags: needinfo?(krosylight)

Yes, without the patch it still stalls at 85%:

:machBuildFaster>  1:11.65 > Task :geckoview:generateJNIWrappersForGeneratedWithGeckoBinariesDebug
:machBuildFaster>  1:11.65 Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf-8
:machBuildFaster>  1:11.65 Processing annotations...
:machBuildFaster>  1:11.65 Annotation processing complete in 644ms
:machBuildFaster>  1:11.65 Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.
:machBuildFaster>  1:11.65 You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
:machBuildFaster>  1:11.65 For more on this, please refer to https://docs.gradle.org/8.7/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.
:machBuildFaster>  1:11.65 BUILD SUCCESSFUL in 1m 7s
:machBuildFaster>  1:11.65 52 actionable tasks: 10 executed, 42 up-to-date
<===========--> 85% EXECUTING [5m 42s]
> :machBuildFaster
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
Flags: needinfo?(krosylight)

Can you try this patch (with the one from bug 1845125 backed out)?

diff --git a/testing/mozbase/mozprocess/mozprocess/processhandler.py b/testing/mozbase/mozprocess/mozprocess/processhandler.py
index 705ff5a2106c4..a1be3d5a984be 100644
--- a/testing/mozbase/mozprocess/mozprocess/processhandler.py
+++ b/testing/mozbase/mozprocess/mozprocess/processhandler.py
@@ -1074,12 +1074,8 @@ class ProcessReader(object):
         return thread
 
     def _read_stream(self, stream, queue, callback):
-        while True:
-            line = stream.readline()
-            if not line:
-                break
+        for line in iter(stream.readline, ""):
             queue.put((line, callback))
-        stream.close()
 
     def start(self, proc):
         queue = Queue()
@@ -1146,10 +1142,6 @@ class ProcessReader(object):
                 self.timeout_callback()
             except Exception:
                 traceback.print_exc()
-        if stdout_reader:
-            stdout_reader.join()
-        if stderr_reader:
-            stderr_reader.join()
         if not timed_out:
             try:
                 self.finished_callback()
Flags: needinfo?(krosylight)

Unfortunately that didn't help; it still halts at 85% (with Python 3.11.9 and 3.12.4).

Flags: needinfo?(krosylight)

First and foremost, this undoes the patch for bug 1845125, which causes
other problems (e.g. output being dropped past a certain point).

The rest of the patch addresses bug 1845125 at the root: ultimately, this
is a similar problem to bug 1863675.

wait is blocked on joining the reader thread, but the reader thread is
blocked on readline because the process is finished but hasn't been
waited on yet. Borrowing the strategy from bug 1863675 simplifies things
overall.
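
As a purely illustrative sketch of the hang pattern described above (this is not the mozprocess code; the command and timeout are placeholders): a reader thread blocks in readline() until the pipe's write end is closed, so unconditionally joining it before reaping the process can stall, whereas reaping first and using a bounded join does not.

import subprocess
import threading

def read_stream(stream, lines):
    # readline() only returns EOF once every holder of the pipe's write
    # end has closed it, so this loop can outlive the child process.
    for line in iter(stream.readline, b""):
        lines.append(line)

proc = subprocess.Popen(["cargo", "test"], stdout=subprocess.PIPE)
output = []
reader = threading.Thread(target=read_stream, args=(proc.stdout, output))
reader.start()

# Joining the reader before reaping the process is the problematic
# ordering; reaping first and giving the reader a bounded join avoids
# getting stuck behind a blocked readline().
proc.wait()
reader.join(timeout=10)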

Set release status flags based on info from the regressing bug 1845125

That value (winprocess.STILL_ACTIVE) might be returned to callers of other
methods that return Process.returncode, and might mislead them, because
they would be expecting None in that situation.
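
For context, a hedged, Windows-only sketch of why that value is misleading, using ctypes directly rather than mozprocess's winprocess bindings: GetExitCodeProcess reports STILL_ACTIVE (259) for a process that hasn't exited, and storing that in returncode would look like a real exit code to callers.

from ctypes import byref, windll, wintypes

STILL_ACTIVE = 259  # value of STILL_ACTIVE in winbase.h

def poll_exit_code(process_handle):
    # Returns None while the process is running, its exit code otherwise.
    code = wintypes.DWORD()
    windll.kernel32.GetExitCodeProcess(process_handle, byref(code))
    if code.value == STILL_ACTIVE:
        # Don't store 259 in returncode; callers expect None here.
        return None
    return code.value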

If the current thread is the IO completion port manager thread, then
obviously it's alive. If the current thread is any other thread and we
were using IO completion ports, there could be pending events in the
process events queue regardless of the state of the IO completion port
manager thread: the manager thread terminates right after putting an
event on the queue, so there was a race condition where the queue would
not be used when it should have been.

There is a case, though, where there might not be an event on the queue:
when the thread fails with an unexpected exception. In that case, we put
something on the queue that will make _custom_wait fail.
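
A small sketch of that behavior with made-up names (not the actual mozprocess internals): the exit event outlives the thread that posted it, so waiters should consult the queue rather than the thread's liveness, and an unexpected failure posts a sentinel so they fail instead of hanging.

import queue
import threading

process_events = queue.Queue()

def completion_manager(collect_exit_status):
    try:
        # Post the exit event, then let the thread terminate.
        process_events.put(("exited", collect_exit_status()))
    except Exception as exc:
        # Unexpected failure: post something that makes waiters fail
        # loudly instead of blocking forever on an empty queue.
        process_events.put(("error", exc))

manager = threading.Thread(target=completion_manager, args=(lambda: 0,))
manager.start()
manager.join()

# The manager thread is no longer alive, but its event is still queued
# and must be consumed by whoever is waiting.
assert not manager.is_alive()
kind, payload = process_events.get_nowait()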

The only reason custom_wait checks whether the current thread is the IO
completion port manager is that there is a case where it might call
Process.kill, which does an automatic wait. However, for the purposes of
the IO completion port manager, the only interesting thing is for the
process to be killed, not to wait in the thread itself for that to
finish.

So we introduce a method that kills without waiting. It could be argued
that kill shouldn't wait in the first place, as subprocess.Popen's kill
doesn't, but that's a bigger API change.
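
A sketch of that split with hypothetical helper names (subprocess.Popen stands in for the mozprocess Process class): the completion port manager only needs the first function; whoever is already waiting takes care of reaping.

import subprocess

def kill_nowait(proc):
    # Request termination without waiting for the process to be reaped,
    # the way subprocess.Popen.kill() behaves.
    proc.kill()

def kill_and_wait(proc, timeout=None):
    # The kill-then-wait behavior described above, kept for callers that
    # do want to block until the process is gone.
    kill_nowait(proc)
    return proc.wait(timeout=timeout)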

wait(None) is supposed to wait indefinitely for the process to terminate.
But we currently cap it at three and a half minutes, which seems more
like a workaround than intended behavior.

When we always have a timeout, we don't distinguish between the case
where wait was called with an explicit timeout, and the case where it's
meant to wait until the process actually dies.

And because not getting an event on the queue within the timeout throws
an Empty exception, that exception is caught and handled as an error,
even though it is normal behavior.

The expectation should be that either an event arrives on the process
event queue within the timeout requested for the wait, and we go on
normally, or the queue is still empty at the end of the timeout and
throws an exception, in which case wait should return normally,
indicating that the process is still running.

Similarly, in the case without an IO completion port manager, we
shouldn't kill a process just because we waited the amount of time
requested. wait(n) is not "kill after n seconds".
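
A minimal sketch of that expectation with assumed names (not the landed code): queue.Empty on a timed get is the normal "still running" outcome, and a timeout of None blocks until an event arrives.

import queue

def wait_for_exit(process_events, timeout=None):
    # timeout=None blocks indefinitely, matching what wait(None) should do.
    try:
        kind, status = process_events.get(timeout=timeout)
    except queue.Empty:
        # Normal outcome of a timed wait: the process is still running.
        # Report that; it is not a reason to kill the process.
        return None
    return status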

Cleanup closes various handles, with consequences for the IO completion
port manager, the stream reader, etc. That means that after a wait() from
which the process emerged still running (which is a valid state), further
operations could fail to work properly.

So only clean up when we know the process has died, whether a natural or
a forced death.
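
A sketch of that guard, assuming a wait that returns None while the process is still running (handler and its methods are stand-ins, not the actual API):

def wait_and_cleanup(handler, timeout=None):
    status = handler.wait(timeout)
    if status is None:
        # Timed out with the process still alive: keep the handles, the
        # completion port and the stream readers intact.
        return None
    # Natural or forced death: now it is safe to close everything.
    handler.cleanup()
    return status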

When multiple threads call wait, they may all end up in _custom_wait,
waiting for an event on the process event queue. But there's only ever
going to be one of those events emitted by the IO completion port
manager.

Of course, because we previously always timed out doing that, things
would eventually work out, albeit after multiple minutes.

So we re-emit the event so that other threads calling wait don't wait
forever.

And because multiple threads might now return from wait concurrently,
they may end up calling _cleanup concurrently too, which can fail because
of race conditions, so we add some locking.
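
A self-contained sketch of those two fixes with illustrative names (not the actual mozprocess classes): the single exit event is re-posted for other blocked waiters, and cleanup is made idempotent behind a lock.

import queue
import threading

class ExitWaiter:
    def __init__(self):
        self.events = queue.Queue()
        self._cleanup_lock = threading.Lock()
        self._cleaned_up = False

    def wait(self, timeout=None):
        try:
            status = self.events.get(timeout=timeout)
        except queue.Empty:
            return None  # still running
        # Re-emit the event so any other thread blocked in wait() also
        # wakes up instead of waiting forever for a second event.
        self.events.put(status)
        self._cleanup()
        return status

    def _cleanup(self):
        # Several threads can reach this concurrently; serialize and make
        # it a no-op the second time.
        with self._cleanup_lock:
            if self._cleaned_up:
                return
            self._cleaned_up = True
            # close handles, stop stream readers, etc.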

Process.kill already waits, which means that if there's a timeout and
the process doesn't die within it, we can currently end up waiting for
twice the timeout.
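
Roughly the caller-side shape of the problem, with hypothetical names: a kill that already blocks for up to the timeout, followed by another full wait, gives a worst case of twice the timeout.

def stop_process(proc, timeout):
    # kill() here is assumed to wait up to `timeout` on its own...
    proc.kill(timeout=timeout)
    # ...so this extra wait can block for up to another `timeout`,
    # doubling the worst-case delay for a process that won't die.
    return proc.wait(timeout=timeout)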

Kagami, can you test the full patch stack?

Flags: needinfo?(krosylight)

Oh great, bug 1845125 is still happening... what did I miss this time?

Flags: needinfo?(krosylight)

(In reply to Mike Hommey [:glandium] from comment #20)

Oh great, bug 1845125 is still happening... what did I miss this time?

Should I hold off on reviewing this for now?

Flags: needinfo?(mh+mozilla)

marionette tests are sensitive to mozprocess changes and seem to fail with this stack applied.
Also, it looks like the mozprocess unit tests are broken, at least on Windows:
https://treeherder.mozilla.org/jobs?repo=try&revision=059386a378657a038cd740190af72dd84a31db42

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #21)

(In reply to Mike Hommey [:glandium] from comment #20)

Oh great, bug 1845125 is still happening... what did I miss this time?

Should I hold off on reviewing this for now?

I thought I had covered it all with the last update, but it turns out there seems to still be some race condition... I think you can review everything but https://bugzilla.mozilla.org/attachment.cgi?id=9417425, though. I triggered marionette tests on https://phabricator.services.mozilla.com/D218910 in case something breaks with the changes up to that one.

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #23)

but I triggered marionette tests on https://phabricator.services.mozilla.com/D218910 in case something breaks with the changes up to that one.

Things are looking good up to there.

Fun fact: apparently, the patches up to https://phabricator.services.mozilla.com/D218910 are enough to fix the immediate problem for which I filed this bug... although https://phabricator.services.mozilla.com/D218389 would be a good thing to fix too.

(In reply to Geoff Brown [:gbrown] (last day Aug 23) from comment #22)

marionette tests are sensitive to mozprocess changes and seem to fail with this stack applied.

I'm baffled. The marionette failures come from the very last patch, apparently, the one that removes a redundant wait.

Attachment #9418447 - Attachment description: Bug 1906191 - Remove redundant wait. → Bug 1906191 - Avoid doubling the timeout from ProcessHandlerMixin.kill.

The entire stack is green now, hopefully.

Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/e3aae8a1d938 Don't set Process.returncode when it's winprocess.STILL_ACTIVE. r=releng-reviewers,jcristau
https://hg.mozilla.org/integration/autoland/rev/8346d9048769 Don't check whether the IO completion port manager thread is alive. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/70f290732bd8 Don't call Process.kill from the IO completion port manager. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/1ead82b42c89 Properly handle wait timeout. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/7d0cb14bcdea Don't cleanup unless the process has died. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/35264cb15789 Avoid race condition when multiple threads are calling wait. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/64952b7e24e1 Change how ProcessHandler handles output_timeout. r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/999882657e11 Avoid doubling the timeout from ProcessHandlerMixin.kill. r=releng-reviewers,ahal,gbrown
Regressions: 1915354
Blocks: 1915540
No longer blocks: 1915540
Regressions: 1915540
No longer regressions: 1915540
See Also: → 1916100
Regressions: 1916125
Regressions: 1915540
