Open Bug 1066728 Opened 5 years ago Updated 5 years ago

Investigate xpcshell retry/cleanup issues on windows

Categories

(Testing :: XPCShell Harness, defect)

x86
macOS
defect
Not set

Tracking

(Not tracked)

People

(Reporter: chmanchester, Unassigned)

I was looking at something for bug 1033126 and noticed some sketchy logs for Windows xpcshell runs. For instance: https://tbpl.mozilla.org/php/getParsedLog.php?id=47982166&tree=Mozilla-Inbound&full=1#error0

The code path for waiting after failing to remove a directory gets hit pretty frequently, and in some cases never actually completes. If you check out one of the retried tests, test_crashreporter_crash.js, "running test..." gets logged, but "Test failed or timed out, will retry." never does. Finally, an uncaught WindowsError from Python ends the run.

I know the parallelization sandboxes needed some workarounds on Windows, but maybe we can do better here. Structured logging wants test_start/test_end pairs to make sense, so this is becoming a medium-sized headache.
I have no idea how to fix the Windows issue, and I remember talking to :ted about it back then. Here are two links I just found on the issue:

http://superuser.com/questions/260375/why-would-system-continue-locking-executable-file-handles-after-the-app-has-exit
http://stackoverflow.com/questions/12463927/windows-process-exit-and-file-socket-handles

TLDR: There's a delay between the process exiting and its files being unlocked.
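
To make the delayed-unlock behaviour concrete, here is a minimal sketch of a bounded retry loop around directory removal. It is not the harness's actual cleanup code: the function name remove_dir_with_retry and the max_attempts, delay, rmtree and sleep parameters are all hypothetical, chosen only for illustration.

```python
import shutil
import time


def remove_dir_with_retry(path, max_attempts=20, delay=0.1,
                          rmtree=shutil.rmtree, sleep=time.sleep):
    """Try to remove `path`, retrying while the OS still holds file locks.

    Returns the number of attempts used. Re-raises the last OSError if
    the directory still cannot be removed after `max_attempts` tries.
    `rmtree` and `sleep` are injectable (an assumption of this sketch)
    so the loop can be exercised without a real locked directory.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            rmtree(path)
            return attempt
        except OSError:
            # WindowsError is an OSError, so this covers the locked-file case.
            if attempt == max_attempts:
                raise
            sleep(delay)
```

Capping the attempts means a directory that never unlocks surfaces as an ordinary exception the caller can catch and log, rather than as a wait that never finishes.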
I did some poking around on try, and while all the retries aren't great, they seem to resolve themselves within about 15 attempts (this run prints the retry counts: https://tbpl.mozilla.org/?tree=Try&rev=6e3f95e0e417).

More concerning, perhaps, is the end of the log complaining about child processes that never complete. The test queue management in Python seems to suggest this shouldn't ever happen, and handling it will necessitate further workarounds.
It looks like these are cases where Automation().killAndGetStackNoScreenshot never returns on Windows. If that's right, we'll just have to log test_end before this section.

Ted, does this sound plausible (maybe this is a known issue even)?
Flags: needinfo?(ted)
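
A rough sketch of the "log test_end first" ordering, assuming the kill step is passed in as a plain callable. RecordingLogger and handle_timeout are hypothetical stand-ins for illustration, not the harness's or the structured logger's real API.

```python
class RecordingLogger(object):
    """Hypothetical stand-in for a structured logger; it only records events."""

    def __init__(self):
        self.events = []

    def test_start(self, test_id):
        self.events.append(("test_start", test_id))

    def test_end(self, test_id, status):
        self.events.append(("test_end", test_id, status))


def handle_timeout(logger, test_id, kill_and_get_stack):
    """Close the test's log record before calling cleanup that may hang."""
    logger.test_end(test_id, "TIMEOUT")
    # If this call never returns, the start/end pair is already complete.
    kill_and_get_stack()
```

With this ordering the log stays well-formed even if the kill call blocks forever, at the cost of test_end being emitted before any stack from the killed process is available.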
I suppose that's a thing that could happen. I'm not aware of any known issues there, although I would like to replace that code with bug 890026 if I ever get around to finishing that.
Flags: needinfo?(ted)