Open Bug 1438009 Opened 7 years ago Updated 2 years ago

[mozprocess] Prevent returning an exit code of 259 (STILL_ACTIVE) when calling GetExitCodeProcess

Categories

(Testing :: Mozbase, defect, P2)

Version 3
All
Windows
defect

Tracking

(Not tracked)

People

(Reporter: whimboo, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

While working on enabling the mozprocess unit tests on Windows I noticed on bug 892902 that the `GetExitCodeProcess()` API call returns `STILL_ACTIVE` in case the application is running, and that its value of 259 will be assumed as the process exit code. Implementing this change on bug 892902 was not successful because of Marionette unit test failures on Windows 7 only: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dbcbabda181b The patch as necessary can be found as attachment 8950254 [details]. I will work on that bug once bug 892902 got landed.
Depends on: 892902
It would be good to have a unique returncode reporting before I continue with bug 1434878. So lets get this bug fixed first.
Blocks: 1434878
The problem here is the following line which is causing us race conditions in case of a slow shutdown of Firefox: https://dxr.mozilla.org/mozilla-central/rev/3f474e48db7f7b0f1c2c090f812290a8d9a21650/testing/mozbase/mozprocess/mozprocess/processhandler.py#530 Here without my change the returncode is 259, and we bubble it up to the runner and more consumers. All of them assume that the process has exit, which is clearly not the case given that this value means `STILL_ACTIVE`. With my patch we now return None, and consumers assume the wrong state of the process. Digging more into this it looks like it only happens when there is an active _procmgrthread, which checks for the IO completion port, and notifies with `FINISHED` once the process is gone. Interestingly this doesn't happen and `FINISHED` is send too early: > waiting with IO completion port > ** item: {796L: 'FINISHED'} > received 'FINISHED' from _procmgrthread > *** returned exit code: None I will have to dig into the io completion port handling. Sadly my testcase only works when I do not put that many print statements into the Python code. So it's actually exhausting to figure out what's wrong. I'm really puzzled that this never caused major bustage! Maybe it's really related to long shutdown times, which we currently have a lot since Firefox 56 (see bug 1405290), and that is why we see such an increased number of intermittent test failures on Windows.
Assignee: nobody → hskupin
Severity: normal → major
Status: NEW → ASSIGNED
OS: Unspecified → Windows
Priority: -- → P1
Hardware: Unspecified → All
Blocks: 1433873
Depends on: 1433905
Marionette restart tests are still failing on Windows with the patch from comment 0 applied. Best would most likely be to fix Marionette first, or even let the patch for this bug slide into it.
No longer blocks: 1433873
Depends on: 1433873
Actually the proposed attachment from comment 0 should not be used! It would add a shim in winprocess, which manipulates the exit code as returned from `GetExitCodeProcess()`. The better solution here is to first call `WaitForSingleObject()`, and only when this returns `0`, we should retrieve the exit code via `GetExitCodeProcess()`. This prevents us from special-casing the STILL_ACTIVE exit code value.
Summary: [mozprocess] Returncode value 259 (STILL_ACTIVE) of GetExitCodeProcess should translate to "None" → [mozprocess] Prevent returning an exit code of 259 (STILL_ACTIVE) when calling GetExitCodeProcess
Blocks: 1498003

Currently I'm not working on this bug.

Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Priority: P1 → P2
Severity: major → S2
Severity: S2 → S3

Not fully sure but maybe bug 1493796 might help here already.

Depends on: 1493796
You need to log in before you can comment on or make changes to this bug.