Closed Bug 1362070 Opened 7 years ago Closed 7 years ago

Need to upgrade mozprocess in mozharness to handle nested job groups in Windows 10

Categories

(Release Engineering :: Applications: MozharnessCore, enhancement)

enhancement
Not set
normal

Tracking

(firefox56 fixed)

RESOLVED FIXED
Tracking Status
firefox56 --- fixed

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file)

The version of mozprocess used by mozharness is an old static copy, that was reintroduced into the gecko tree mozharness was moved in-tree.

Since introducing Windows 10 testing, I noticed that the version of mozprocess used isn't handling nested job groups, despite this being a feature of mozprocess that was added some time ago in bug 1139722.

I suspect this is causing test failures on Windows 10, because without nested job groups, we cannot safely terminate processes as part of test teardown, and therefore we cannot guarantee that sequential tests do not interfere with each other.

My suggestion would be that in-tree mozharness simply points to the mozprocess that already exists in tree (/testing/mozbase/mozprocess/mozprocess) - at the moment there is a static old copy at /testing/mozharness/mozprocess.

To reiterate: bug 1139722 is one critical feature that we require for Windows 10, but undoubtedly other fixes would be useful - and let's not patch the stale version and end up having two forked versions, let's try to just maintain a single version.
Hey Coop, would this be something your team could help with, or is this something for A-Team?

Thanks!
Flags: needinfo?(coop)
To say, Buildbot based mozharness still :( only grabs mozharness as a standalone directory (via archiver) so will not have access to any other in-tree files for a given test suite, unless we copy them into the mozharness directory *or* explicitly pull it as part of mozharness, *before* mozharness requires it.
(In reply to Pete Moore [:pmoore][:pete] from comment #0)
> I suspect this is causing test failures on Windows 10, because without
> nested job groups, we cannot safely terminate processes as part of test
> teardown, and therefore we cannot guarantee that sequential tests do not
> interfere with each other.

Are jobs actually failing because of this? Is it all tests, or just some tests?
 
(In reply to Justin Wood (:Callek) from comment #2)
> To say, Buildbot based mozharness still :( only grabs mozharness as a
> standalone directory (via archiver) so will not have access to any other
> in-tree files for a given test suite, unless we copy them into the
> mozharness directory *or* explicitly pull it as part of mozharness, *before*
> mozharness requires it.

Is this something that needs to be fixed *before* we setup Win10 in TC? Can it be fixed as part of that switch?

IMO there's little upside to fixing the buildbot version. Let's fix this in a TC world and let it ride the trains.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #3)
> (In reply to Justin Wood (:Callek) from comment #2)
> > To say, Buildbot based mozharness still :( only grabs mozharness as a
> > standalone directory (via archiver) so will not have access to any other
> > in-tree files for a given test suite, unless we copy them into the
> > mozharness directory *or* explicitly pull it as part of mozharness, *before*
> > mozharness requires it.
> 
> Is this something that needs to be fixed *before* we setup Win10 in TC? Can
> it be fixed as part of that switch?
> 
> IMO there's little upside to fixing the buildbot version. Let's fix this in
> a TC world and let it ride the trains.

If we leave mozprocess in the mozharness dir, that sounds great.
If we remove mozprocess from the mozharness dir without addressing this issue, then we break all of buildbot, including release promotion.  I think that's a non-starter.

The simplest fix is probably to update mozprocess in the mozharness directory, and address the duplicate copy later.
(In reply to Chris Cooper [:coop] from comment #3)
> (In reply to Pete Moore [:pmoore][:pete] from comment #0)
> > I suspect this is causing test failures on Windows 10, because without
> > nested job groups, we cannot safely terminate processes as part of test
> > teardown, and therefore we cannot guarantee that sequential tests do not
> > interfere with each other.
> 
> Are jobs actually failing because of this? Is it all tests, or just some
> tests?
>  

I suspect there could be several-difficult-to-diagnose tests failing because of this, I don't have solid data. The reason for this is, in Windows 8 and higher, the way to cleanly terminate a firefox process tree requires creating a job group that is necessarily nested (since the firefox process already inherits another job group).

The tricky thing is, if this introduces failures, it is likely to be in a subsequent test that tries to start firefox, but is not able to, since it was not cleanly shut down.

I think Chris Manchester made the original implementation of the fix in bug 1139722, but in discovering this, it highlighted that we have these two versions of mozprocess in-tree, which we should probably clean up, rather than just patching in the one fix, and then later it being harder to unify the two versions.

> (In reply to Justin Wood (:Callek) from comment #2)
> > To say, Buildbot based mozharness still :( only grabs mozharness as a
> > standalone directory (via archiver) so will not have access to any other
> > in-tree files for a given test suite, unless we copy them into the
> > mozharness directory *or* explicitly pull it as part of mozharness, *before*
> > mozharness requires it.
> 
> Is this something that needs to be fixed *before* we setup Win10 in TC? Can
> it be fixed as part of that switch?

Yes it can be done as part of the win10 rollout, since only affects tests running on win8 and higher, which is only win10 at this point.

> IMO there's little upside to fixing the buildbot version. Let's fix this in
> a TC world and let it ride the trains.

+1
I think I said this on IRC, but yeah, we should just copy the fix from testing/mozbase into the mozharness copy for now, and we can worry about de-duplicating once we stop running jobs on buildbot.
Ted, do we know what still uses /testing/mozbase/mozprocess? If buildbot tests all run under mozharness now, could it be that nothing is still using /testing/mozbase/mozprocess?

It isn't clear to me if the two versions of mozprocess have diverged (i.e. both been adapted), or whether the mozbase version is just ahead of the mozharness version. Because of this I'm not sure if it is better just to patch in the single fix that Chris made, or to replace the mozharness version with the mozbase version. What I'm thinking is, if e.g. we just patched this one fix in, and that was enough to get Win10 running and eventually buildbot turned off, it might be then that we delete the mozbase version as nothing uses it, and then we'd lose the additional fixes that went into it, that could mean we have subtle regressions in the testing that we might just not be aware of at the moment. At the same time, without having an in-depth understanding of all the changes that potentially went into both branches of the mozprocess code, I'm not sure if a straight replacement would also introduce regressions, i.e. losing fixes there were only applied only to the mozharness version.

Perhaps this mozprocess surgery might be better performed by someone with a bit more context on the history and the usage of the library than me - I feel a bit out-of-my-depth calling the shots on this one.

I had a look at hg history to try to see if it was a simple fast-forward or not, but due to mozharness being moved around in-tree, I think the linear history breaks, making it a bit trickier to understand if the mozharness version was patched at all since the fork occured.

Chris, I'm curious also what your opinion is, as you worked on bug 1139722 and might have some more context here to this and maybe other bug fixes that have gone into mozprocess?
Flags: needinfo?(ted)
Flags: needinfo?(cmanchester)
(In reply to Pete Moore [:pmoore][:pete] from comment #7)
> Ted, do we know what still uses /testing/mozbase/mozprocess? If buildbot
> tests all run under mozharness now, could it be that nothing is still using
> /testing/mozbase/mozprocess?

All of our test harnesses still use that copy, last I checked.

I picked an arbitrary Windows buildbot Mochitest job from treeherder:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&selectedJob=97591856

Here's the log:
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win32/1494312582/mozilla-inbound_win7_vm_test-mochitest-1-bm128-tests1-windows-build61.txt.gz

One of the first steps that mozharness executes is `create_virtualenv`, where it creates a Python virtualenv that's distinct from the one that's executing the mozharness script:
00:51:05     INFO - Running main action method: create_virtualenv
00:51:05     INFO - Creating virtualenv c:\slave\test\build\venv

It then proceeds to `pip install` the mozbase packages from the test package (which came from testing/mozbase):
00:51:46     INFO - Copy/paste: c:\slave\test\build\venv\Scripts\pip install --no-deps --timeout 120 -r c:\slave\test\build\tests\config\mozbase_requirements.txt --no-index --find-links http://pypi.pvt.build.mozilla.org.proxxy1.srv.releng.use1.mozilla.com/pub --find-links http://pypi.pub.build.mozilla.org.proxxy1.srv.releng.use1.mozilla.com/pub --find-links http://pypi.pvt.build.mozilla.org/pub --find-links http://pypi.pub.build.mozilla.org/pub

And we can see that that includes mozprocess:
00:51:51     INFO -  Unpacking c:\slave\test\build\tests\mozbase\mozprocess
00:51:51     INFO -    Running setup.py egg_info for package from file:///c%7C%5Cslave%5Ctest%5Cbuild%5Ctests%5Cmozbase%5Cmozprocess
Flags: needinfo?(ted)
mozprocess is shipped along with the common.tests.zip package. So if mozharness doesn't necessarily need it before unpacking this archive, we might be able to get rid of it.
core mozharness does need mozprocess, before common.tests.zip is downloaded (and some mozharness scripts don't download common.tests.zip): https://hg.mozilla.org/mozilla-central/file/default/testing/mozharness/mozharness/base/script.py#l56

We can keep going around in circles, or we can fix the duplicate problem by upgrading buildbot, or we can just land a duplicate copy of mozprocess in mozharness.  We've been pointing to this last solution for a while now.
I would be a little surprised if bug 1139722 impacted jobs other than Marionette based restart tests, but my impression is mozharness' mozprocess hasn't diverged in any incompatible way, and we shouldn't hesitate to update it if we think that's the best short term fix here.
Flags: needinfo?(cmanchester)
:pmoore,

Can you do a try-all push with an upgraded mozprocess (copied to mozharness). And own further work in here, if any?

I'm told this blocks the win10 work, and among your other w10 work I think this falls in line with you quite well.

I completely agree with aki's c#10 and concur it is the path we should proceed with.

(If we're concerned about divergence in this world where we have a copy, we can write a [quick] task that turns orange if the file's don't match in both places, triggered by a change to either one, as a followup)
(In reply to Justin Wood (:Callek) from comment #12)
> :pmoore,
> 
> Can you do a try-all push with an upgraded mozprocess (copied to
> mozharness). And own further work in here, if any?
> 
> I'm told this blocks the win10 work, and among your other w10 work I think
> this falls in line with you quite well.
> 
> I completely agree with aki's c#10 and concur it is the path we should
> proceed with.
> 
> (If we're concerned about divergence in this world where we have a copy, we
> can write a [quick] task that turns orange if the file's don't match in both
> places, triggered by a change to either one, as a followup)

This sounds much better suited to someone that understands mozprocess/mozharness more and the implications of changing either of those and the affects on things relying on it.
I should clarify, my request of pete is to update and run on try, looking for any actual failures.

Anything that does fail due to upgrading it is not on pete to fix, but identifying the failures as existing is. If no failures I'd then ask him to get a formal review on the patch and push it to the tree.
Flags: needinfo?(pmoore)
OK I'll run two try pushes:

1) as is
2) replacing mozprocess in mozharness, with the copy in-tree
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #15)
> OK I'll run two try pushes:
> 
> 1) as is
> 2) replacing mozprocess in mozharness, with the copy in-tree

1) https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f8b94ff9a25499ce7236a1ac14af923a41b6fc1
2) https://treeherder.mozilla.org/#/jobs?repo=try&revision=814f4e5acb898c14a8e8525c3f1ca32116ddc7ff
Over to you :-)
Assignee: nobody → bugspam.Callek
3:49 PM <Callek> KWierso: Aryx: ping -- can you confirm/deny any fallout from teh change in https://bugzilla.mozilla.org/show_bug.cgi?id=1362070#c16 -- I'm looking for any signs of a bustage.
3:49 PM <firebot> Bug 1362070 — NEW, bugspam.Callek@gmail.com — Need to upgrade mozprocess in mozharness to handle nested job groups in Windows 10
3:49 PM <Callek> (two try pushes linked, one without a change, one with the new mozprocess)
3:55 PM <Aryx> Callek: looks good. no idea about the Win 10 Wr because that's not on integration afaik
3:55 PM <Callek> Aryx: yea, not too worried about anything not on integration

Pete, given that above are you willing to submit this as a mozreview patch?  I'll own [or find owners for] any regressions from it landing ;-)
Flags: needinfo?(pmoore)
Thanks Callek! I'll attach a patch shortly.
Flags: needinfo?(pmoore)
This is a straight replacement:

pmoore@Petes-iMac:~/hg/mozilla-central $ rm -rf testing/mozharness/mozprocess
pmoore@Petes-iMac:~/hg/mozilla-central $ cp -pr testing/mozbase/mozprocess/mozprocess testing/mozharness/mozprocess
Assignee: bugspam.Callek → pmoore
Status: NEW → ASSIGNED
Attachment #8876065 - Flags: review?(bugspam.Callek)
Attachment #8876065 - Flags: review?(bugspam.Callek) → review+
Blocks: 1195299
There was a merge conflict with https://reviewboard.mozilla.org/r/118772/diff/6#index_header ("Bug 1048446 - [mozprocess] Allow passing in a function as a value to 'processOutputLine', r?gbrown") but I've spoken to ahal, and that patch exists already upstream, so I can obliterate it:


From http://logs.glob.uno/?c=taskcluster#c199543 :

<&pmoore> ahal: in bug 1362070 i'm upgrading mozprocess in mozharness to be the same as the mozbase version. i saw you landing a change in the mozharness fork of mozprocess in bug 1048446 but not in mozbase/mozprocess. would there be any harm if it landed it in mozbase/mozprocess too?
16:17 
<firebot> https://bugzil.la/1362070 — ASSIGNED, pmoore — Need to upgrade mozprocess in mozharness to handle nested job groups in Windows 10
16:17 https://bugzil.la/1048446 — FIXED, ahalberstadt — selftests for Mochitest
16:19 
<ahal> pmoore: yeah, that change exists already in the canonical version, feel free to clobber whatever is in the mozharness one
16:19 
<&pmoore> ahal: awesome, thanks!
16:19 
<ahal> pmoore: also, awesome! that is very overdue
16:20 I tried doing that a year ago in bug 1195299, I think I just never got around to pushing on it
16:20 
<firebot> https://bugzil.la/1195299 — NEW, nobody@mozilla.org — Remove duplicated mozbase modules from mozharness
16:20 
<&pmoore> ah nice!
16:21 
<ahal> pmoore: if you don't mind, do you think you could update mozinfo and mozfile while you're at it?
16:21 those two should be relatively trivial compared to mozprocess
16:21 
<&pmoore> ahal: this first step will just be a plain copy - i think the removal will happen after migrating fully to taskcluster
16:21 
<ahal> (mozprocess might even depend on them)
16:21 yeah, I agree a plain update makes sense first
16:21 
<&pmoore> ahal: sure. can i use you for r? ?
16:22 
<ahal> definitely
16:22 
<&pmoore> thanks!
16:22 
<ahal> yeah, once everything is using the mozharness.zip, it'll be easy to remove them
16:25 ah yeah, mozprocess depends on mozinfo, which depends on mozfile
16:25 probably why those three are in there in the first place



The upgrade of mozinfo and mozfile I'll be doing in a second bug, as this has already been r+'d and had a good try push. I've hung this bug of 1195299 and will also hang the mozinfo/mozfile upgrade off that bug too.
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/43d9a3838a42
Upgrade version of mozprocess inside mozharness to the latest version,r=Callek
https://hg.mozilla.org/mozilla-central/rev/43d9a3838a42
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: