Closed Bug 701755 Opened 13 years ago Closed 12 years ago

be over aggressive in killing off servers during the setup scripts for a test run

Categories

(Release Engineering :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: bear)

References

Details

(Keywords: intermittent-failure, Whiteboard: [mobile_unittests][android_tier_1])

Attachments

(1 file, 2 obsolete files)

Last quarter we landed code to ensure our test harnesses will cleanup properly on errors.  This works great, but there are scenarios where the device loses connectivity and we kill the entire job from buildbot/twisted error handling.  The end result is we still have leftover webservers sitting around.

We need to be super aggressive before each test run while we are installing the .apk and look for anything running on the ports that we care about and kill -9 those processes.
Assignee: nobody → bear
Blocks: 438871
Attachment #587118 - Flags: feedback?(hwine)
Comment on attachment 587118 [details] [diff] [review]
try and remove *all* related child processes for a tegra during cleanup

Looks as if the return value from checkStalled is:
 False - no processes were found to be killed
 True - some processes were found to be killed from 'ps' OR
        some "well known" processes were found and could not be killed
(i.e. it's not related to whether there are still processes running after function completes)

That seems an odd enough contract to be worth a comment.
Attachment #587118 - Flags: feedback?(hwine) → feedback+
(In reply to Hal Wine [:hwine] from comment #2)
> Comment on attachment 587118 [details] [diff] [review]
> try and remove *all* related child processes for a tegra during cleanup
> 
> Looks as if the return value from checkStalled is:
>  False - no processes were found to be killed
>  True - some processes were found to be killed from 'ps' OR
>         some "well known" processes were found and could not be killed
> (i.e. it's not related to whether there are still processes running after
> function completes)
> 
> That seems an odd enough contract to be worth a comment.

yea, I'm torn on that also - do I just assume that now with the more aggressive bits that it will always clean up and not return anything or make the True just be informational in the cleanup.py calling step
(In reply to Mike Taylor [:bear] from comment #3)
> (In reply to Hal Wine [:hwine] from comment #2)
> > Comment on attachment 587118 [details] [diff] [review]
> > try and remove *all* related child processes for a tegra during cleanup
> > 
> > Looks as if the return value from checkStalled is:
> >  False - no processes were found to be killed
> >  True - some processes were found to be killed from 'ps' OR
> >         some "well known" processes were found and could not be killed
> > (i.e. it's not related to whether there are still processes running after
> > function completes)
> > 
> > That seems an odd enough contract to be worth a comment.
> 
> yea, I'm torn on that also - do I just assume that now with the more
> aggressive bits that it will always clean up and not return anything or make
> the True just be informational in the cleanup.py calling step

Maybe something like

CLEAN = 1
DIRTY_BUT_KILLED = 2
DIRTY_NOT_KILLED = 3

and return one of those?
Comment on attachment 587118 [details] [diff] [review]
try and remove *all* related child processes for a tegra during cleanup

Hm, should you be checking the return value of the killPID call in checkStalled ?
I can see wanting to have more info about what's happening, either in output or in the return value or logs or w/e.
Attachment #587118 - Flags: review?(aki) → review+
changed result from checkStalled to reflect if stalled pids were found and handled or found and not handled
Attachment #587118 - Attachment is obsolete: true
Attachment #596680 - Flags: review?(aki)
Comment on attachment 596680 [details] [diff] [review]
try and remove *all* related child processes for a tegra during cleanup

>+if checkStalled(os.env['SUT_NAME']) > 1:
>+    setFlag(errorFile, "Remote Device Error: process from previous test run present")
>+    sys.exit(2)

I like how you put this in a method.
Why are we setting an error flag if we found old cruft but cleaned it up?
I would guess we'd want a warning at 2 and an error at 3.
Possibly an "all clear" message at 1 depending how verbose we want to be.

>+    for pid in pids:
>+        if not killPID(pid, sig=signal.SIGKILL, includeChildren=True):
>+            result := 3

I thought this was some sort of special python command, but on Python 2.6.7 this appears to be a typo:

>>> a = 2
>>> a := 3
  File "<stdin>", line 1
    a := 3
      ^
SyntaxError: invalid syntax

We're close though.
Attachment #596680 - Flags: review?(aki) → review-
changed response to checkStalled to show a warning if the result was > 1 and removed errant Delphi-style assignment that snuck in
Attachment #596680 - Attachment is obsolete: true
Attachment #596711 - Flags: review?(aki)
Comment on attachment 596711 [details] [diff] [review]
try and remove *all* related child processes for a tegra during cleanup

I almost want a final else: to catch any weirdness when errcode > 1 but is not 2 or 3.  But this will work without it.  r=me.
Attachment #596711 - Flags: review?(aki) → review+
Comment on attachment 596711 [details] [diff] [review]
try and remove *all* related child processes for a tegra during cleanup

committed changeset 2226:4ef60f593587
Attachment #596711 - Flags: checked-in+
overnight cronjob runs of this new code has removed all but 1 hung CP tegra \o/

marking as fixed
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [orange][mobile_unittests][android_tier_1] → [mobile_unittests][android_tier_1]
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: