Not all tests run on rusttests tasks on Windows
Categories
(Firefox Build System :: General, defect, P3)
Tracking
(firefox-esr115 unaffected, firefox-esr128 fixed, firefox128 wontfix, firefox129 wontfix, firefox130 wontfix, firefox131 fixed)
People
(Reporter: glandium, Assigned: glandium)
References
(Regression)
Details
(Keywords: regression)
Attachments
(8 files)
8 × 48 bytes, text/x-phabricator-request (Details | Review)
For some reason, this is not reproducible locally (only on CI), and doesn't seem to be happening before rustc 1.80.
The most disturbing thing is that this happens silently.
The hardest thing to understand is that cargo test is invoked for the tests, yet they end up neither built nor run.
Comment 1•4 months ago
The Bugbug bot thinks this bug should belong to the 'Firefox Build System::Task Configuration' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.
Assignee
Updated•4 months ago
Assignee
Comment 2•4 months ago
(In reply to Mike Hommey [:glandium] from comment #0)
For some reason, this is not reproducible locally (only on CI), and doesn't seem to be happening before rustc 1.80.
The most disturbing thing is that this happens silently.
The hardest thing to understand is that cargo test is invoked for the tests, yet they end up neither built nor run.
Correction: they are (partially) built, but don't run. Or at least, the output log is truncated, and errors that are supposed to happen when those tests run are not caught.
Assignee
Comment 3•4 months ago
Where I'm at: there's some weird interaction between ProcessHandlerMixin in mach, make's output-sync, and cargo.
Comment 4•3 months ago
The severity field is not set for this bug.
:ahochheiden, could you have a look please?
For more information, please visit BugBot documentation.
Updated•3 months ago
Assignee
Updated•3 months ago
Comment 5•3 months ago
Set release status flags based on info from the regressing bug 1845125
Assignee
Comment 6•3 months ago
Kagami, can you still reproduce bug 1845125 if you revert its patch locally? (asking to know whether I can ask you to test an alternative fix for that bug)
Comment 7•3 months ago
Yes, without the patch it still stalls at 85%:
:machBuildFaster> 1:11.65 > Task :geckoview:generateJNIWrappersForGeneratedWithGeckoBinariesDebug
:machBuildFaster> 1:11.65 Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf-8
:machBuildFaster> 1:11.65 Processing annotations...
:machBuildFaster> 1:11.65 Annotation processing complete in 644ms
:machBuildFaster> 1:11.65 Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.
:machBuildFaster> 1:11.65 You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
:machBuildFaster> 1:11.65 For more on this, please refer to https://docs.gradle.org/8.7/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.
:machBuildFaster> 1:11.65 BUILD SUCCESSFUL in 1m 7s
:machBuildFaster> 1:11.65 52 actionable tasks: 10 executed, 42 up-to-date
<===========--> 85% EXECUTING [5m 42s]
> :machBuildFaster
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
> IDLE
Assignee
Comment 8•3 months ago
Can you try this patch (with the one from bug 1845125 backed out)?
diff --git a/testing/mozbase/mozprocess/mozprocess/processhandler.py b/testing/mozbase/mozprocess/mozprocess/processhandler.py
index 705ff5a2106c4..a1be3d5a984be 100644
--- a/testing/mozbase/mozprocess/mozprocess/processhandler.py
+++ b/testing/mozbase/mozprocess/mozprocess/processhandler.py
@@ -1074,12 +1074,8 @@ class ProcessReader(object):
         return thread
 
     def _read_stream(self, stream, queue, callback):
-        while True:
-            line = stream.readline()
-            if not line:
-                break
+        for line in iter(stream.readline, ""):
             queue.put((line, callback))
-        stream.close()
 
     def start(self, proc):
         queue = Queue()
@@ -1146,10 +1142,6 @@ class ProcessReader(object):
             self.timeout_callback()
         except Exception:
             traceback.print_exc()
-        if stdout_reader:
-            stdout_reader.join()
-        if stderr_reader:
-            stderr_reader.join()
         if not timed_out:
             try:
                 self.finished_callback()
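For reference, the replacement loop relies on the two-argument iter() sentinel form, which keeps calling readline() until it returns the empty string, equivalent to the while/break loop it replaces. A standalone sketch, not mozprocess code:

```python
import io

# Two-argument iter(): call stream.readline repeatedly until it
# returns the sentinel "" (EOF), yielding each line in between.
stream = io.StringIO("first\nsecond\n")
lines = list(iter(stream.readline, ""))
assert lines == ["first\n", "second\n"]
```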
Comment 9•3 months ago
Unfortunately that didn't help; it still halts at 85% (with Python 3.11.9 and 3.12.4).
Updated•3 months ago
Assignee
Comment 10•3 months ago
First and foremost, this undoes the patch for bug 1845125, which causes
other problems (e.g. output being dropped past a certain point).
The rest of the patch addresses bug 1845125 at the root: ultimately, this
is a similar problem to bug 1863675.
wait is blocked on joining the reader thread, but the reader thread is
blocked on readline because the process has finished but hasn't been
waited on yet. Borrowing the strategy from bug 1863675 simplifies things
overall.
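The ordering described above can be sketched with plain subprocess/threading (illustrative code, not the mozprocess API): waiting on the process before joining the reader guarantees the reader's blocking readline() eventually sees EOF, so the join cannot deadlock.

```python
import subprocess
import sys
import threading

# Spawn a child whose stdout we read from a dedicated thread.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    text=True,
)
captured = []

def read_stdout():
    # readline() only returns "" (EOF) once the child's stdout closes,
    # which happens when the process exits.
    for line in iter(proc.stdout.readline, ""):
        captured.append(line)

reader = threading.Thread(target=read_stdout)
reader.start()
proc.wait()    # reap the process first, so EOF is guaranteed...
reader.join()  # ...then joining the reader cannot block forever
assert captured == ["hello\n"]
```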
Comment 11•3 months ago
Set release status flags based on info from the regressing bug 1845125
Assignee
Comment 12•3 months ago
That value might be returned to callers of other methods that return
Process.returncode, and might mislead them because they would be
expecting None in that situation.
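As a reminder of the convention at stake (standard subprocess semantics, not mozprocess-specific code): returncode is None exactly while the process is not known to have exited, so callers rely on None meaning "still running".

```python
import subprocess
import sys

# None means "not known to have exited yet"; returning any other
# value in that situation would mislead callers checking for it.
p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(1)"])
assert p.poll() is None   # still running: returncode is None
p.wait()
assert p.returncode == 0  # exited normally
```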
Assignee
Comment 13•3 months ago
If the current thread is the IO completion port manager thread, then
obviously it's alive. If the current thread is any other thread, and we
were using IO completion ports, there could be pending events in the
process events queue regardless of the state of the IO completion port
manager thread: after putting an event on the queue, the thread would
terminate, so there was a race condition where the queue would not be
used when it should have been.
There is a case, though, where there might not be an event on the queue:
if the thread fails with an unexpected exception. In that case, we put
something on the queue that will make _custom_wait fail.
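The "always leave something on the queue" pattern can be sketched like this (illustrative names, not the actual mozprocess code):

```python
import queue
import threading

events = queue.Queue()

def manager(fail):
    # A manager-style thread: whatever happens, it leaves exactly one
    # item on the queue so waiters never block forever.
    try:
        if fail:
            raise RuntimeError("unexpected failure")
        events.put(("finished", 0))
    except Exception as exc:
        # Waiters see this and fail loudly instead of hanging.
        events.put(("error", exc))

t = threading.Thread(target=manager, args=(True,))
t.start()
t.join()
kind, payload = events.get(timeout=1)
assert kind == "error"
```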
Assignee
Comment 14•3 months ago
The only reason custom_wait checks whether the current thread is
the IO completion port manager is that there is a case where it might
call Process.kill, which does an automatic wait. However, for the
purposes of the IO completion port manager, the only interesting thing
is for the process to be killed, not to wait in the thread itself for
that to finish.
So we introduce a method that kills without waiting. It could be argued
that kill shouldn't wait in the first place, like subprocess.Popen's
kill doesn't, but that's a bigger API change.
Assignee
Comment 15•3 months ago
wait(None) is supposed to wait indefinitely for the process to
terminate. But we currently cap it at three and a half minutes, which
seems more like a workaround than intended behavior.
Because we always have a timeout, we don't distinguish between the case
where wait was called with an explicit timeout and the case where it's
meant to wait until the process actually dies.
And because not getting an event on the queue within the timeout throws
an Empty exception, that exception is caught and handled as an error,
while it's supposed to be normal behavior.
The expectation should be that either there's an event on the process
event queue within the timeout requested for the wait, and we go on
normally, or the queue is still empty at the end of the timeout,
throwing the exception, in which case wait should return normally,
indicating that the process is still running.
Similarly, in the no-IO-completion-port-manager case, we shouldn't kill
a process just because we waited the amount requested. wait(n) is not
"kill after n seconds".
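The intended semantics can be sketched as follows (illustrative helper, not the actual mozprocess code): Empty on timeout is the normal "still running" outcome, and timeout=None really blocks indefinitely.

```python
import queue

def wait_for_exit_event(events, timeout=None):
    """Return the exit-status event, or None if the process is still
    running when the timeout expires. timeout=None blocks forever."""
    try:
        return events.get(timeout=timeout)
    except queue.Empty:
        return None  # normal outcome, not an error: still running

events = queue.Queue()
assert wait_for_exit_event(events, timeout=0.1) is None  # still running
events.put(0)
assert wait_for_exit_event(events, timeout=0.1) == 0     # process exited
```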
Assignee
Comment 16•3 months ago
Cleanup closes various handles, with consequences for the IO completion
port manager, the stream reader, etc. That means further operations
after a wait() after which the process was still running (which is a
valid state) could fail to work properly.
So only clean up when we know the process has died either a natural or
a forced death.
Assignee
Comment 17•3 months ago
When multiple threads call wait, they may all end up in _custom_wait,
waiting for an event on the process event queue. But there's only ever
going to be one of those events, emitted by the IO completion port
manager.
Of course, because we previously always timed out doing that, it would
eventually work out, albeit after multiple minutes.
So we re-emit the event so that other threads calling wait don't wait
forever.
And because multiple threads might now return from wait concurrently,
they may end up calling _cleanup concurrently too, which can fail
because of race conditions, so add some locking.
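The re-emit-plus-locking scheme can be sketched like this (illustrative names, not the actual mozprocess code): each waiter consumes the single exit event, puts it back for the next waiter, and cleanup is serialized behind a lock so it runs once.

```python
import queue
import threading

events = queue.Queue()
cleanup_lock = threading.Lock()
cleanups = []
results = []

def wait_for_exit():
    status = events.get()  # block on the single exit event
    events.put(status)     # re-emit so other waiters wake up too
    with cleanup_lock:     # serialize cleanup between waiters
        if not cleanups:
            cleanups.append(status)
    results.append(status)

threads = [threading.Thread(target=wait_for_exit) for _ in range(3)]
for t in threads:
    t.start()
events.put(0)  # the one event the manager ever emits
for t in threads:
    t.join()
assert results == [0, 0, 0]  # every waiter returned
assert cleanups == [0]       # cleanup ran exactly once
```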
Assignee
Comment 18•3 months ago
Process.kill already waits, which means that if there's a timeout and
the process doesn't die within it, we can currently wait twice the
timeout.
Assignee
Comment 19•3 months ago
Kagami, can you test the full patch stack?
Assignee
Comment 20•3 months ago
Oh great, bug 1845125 is still happening... what did I miss this time?
Comment 21•2 months ago
(In reply to Mike Hommey [:glandium] from comment #20)
Oh great, bug 1845125 is still happening... what did I miss this time?
Should I hold off on reviewing this for now?
Comment 22•2 months ago
marionette tests are sensitive to mozprocess changes and seem to fail with this stack applied.
Also, it looks like the mozprocess unit tests are broken, at least on Windows:
https://treeherder.mozilla.org/jobs?repo=try&revision=059386a378657a038cd740190af72dd84a31db42
Assignee
Comment 23•2 months ago
(In reply to bhearsum@mozilla.com (:bhearsum) from comment #21)
(In reply to Mike Hommey [:glandium] from comment #20)
Oh great, bug 1845125 is still happening... what did I miss this time?
Should I hold off on reviewing this for now?
I thought I had covered it all with the last update, but it turns out there still seems to be some race condition... I think you can review everything but https://bugzilla.mozilla.org/attachment.cgi?id=9417425, though. I triggered marionette tests on https://phabricator.services.mozilla.com/D218910 in case something breaks with the changes up to that one.
Assignee
Comment 24•2 months ago
(In reply to Mike Hommey [:glandium] from comment #23)
but I triggered marionette tests on https://phabricator.services.mozilla.com/D218910 in case something breaks with the changes up to that one.
Things are looking good up to there.
Assignee
Comment 25•2 months ago
Fun fact: apparently the patches up to https://phabricator.services.mozilla.com/D218910 are enough to fix the immediate problem for which I filed this bug... although https://phabricator.services.mozilla.com/D218389 would be a good thing to fix too.
Assignee
Comment 26•2 months ago
(In reply to Geoff Brown [:gbrown] (last day Aug 23) from comment #22)
marionette tests are sensitive to mozprocess changes and seem to fail with this stack applied.
I'm baffled. The marionette failures come from the very last patch, apparently, the one that removes a redundant wait.
Updated•2 months ago
Assignee
Comment 27•2 months ago
The entire stack is green now, hopefully.
Updated•2 months ago
Comment 28•2 months ago
Comment 29•2 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e3aae8a1d938
https://hg.mozilla.org/mozilla-central/rev/8346d9048769
https://hg.mozilla.org/mozilla-central/rev/70f290732bd8
https://hg.mozilla.org/mozilla-central/rev/1ead82b42c89
https://hg.mozilla.org/mozilla-central/rev/7d0cb14bcdea
https://hg.mozilla.org/mozilla-central/rev/35264cb15789
https://hg.mozilla.org/mozilla-central/rev/64952b7e24e1
https://hg.mozilla.org/mozilla-central/rev/999882657e11
Assignee
Updated•2 months ago
Updated•2 months ago
Comment 30•2 months ago
uplift
Description