Closed Bug 904839 Opened 11 years ago Closed 11 years ago

B2G emulator tests still timing out after 1200s, even with the mozprocess-timeout code

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgriffin, Assigned: jgriffin)

References

Details

Attachments

(2 files, 2 obsolete files)

B2G emulator unittests are still being killed by buildbot after 1200s with no output, despite the fact that we are running them with mozprocess and an output_timeout of 1000, which should cause them to be killed by mozharness instead, and thus allow a dump of the logcat to occur.

sample log:  https://tbpl.mozilla.org/php/getParsedLog.php?id=26160870&full=1&branch=mozilla-central
Blocks: 820739
Blocks: 892563
This adds some logging to the mozprocess section of run_command; hopefully this will help us figure out why the problem described in this bug's summary is occurring.
Attachment #793563 - Flags: review?(ahalberstadt)
Assignee: nobody → jgriffin
Comment on attachment 793563 [details] [diff] [review]
Add some mozprocess logging,

Review of attachment 793563 [details] [diff] [review]:
-----------------------------------------------------------------

A print statement in the on_timeout method might also be helpful.
Attachment #793563 - Flags: review?(ahalberstadt) → review+
I added your suggestion and landed:  https://hg.mozilla.org/build/mozharness/rev/a4e6f54010d3

I'll trigger a run on cedar to make sure nothing bad happens, then push to production.
Merged to production branch of mozharness.
Live as of now.
In all cases this logging shows that we are using mozprocess to run the command, but in some cases we're not catching the timeout.  I can't tell why.
In bug 917252 I noticed that the messages in the on_timeout handler can get interleaved with the process output somehow. Did you ctrl-f for "mozprocess timed out"? Because it could just be hidden among some random line of output above.
That doesn't appear to be the case here; see a recent log:  https://tbpl.mozilla.org/php/getParsedLog.php?id=28100440&full=1&branch=mozilla-inbound
Blocks: 916135
Blocks: 919353
WIP.  This seems to resolve this issue on try, but I need to do a full try run to make sure it doesn't break anything else.

Without this patch: https://tbpl.mozilla.org/?tree=Try&rev=2471fcda8961
With this patch: https://tbpl.mozilla.org/?tree=Try&rev=fc8f06309cfa
This patch seems to work well for B2G mochitests/reftests, but fails for linux tests and for B2G marionette. :(
v2..handle the case where os.read returns only one or more newlines.  try: https://tbpl.mozilla.org/?tree=Try&rev=6f94f56145c8
Attachment #815961 - Attachment is obsolete: true
(In reply to Jonathan Griffin (:jgriffin) from comment #12)
> Created attachment 816074 [details] [diff] [review]
> Use os.read instead of readline in readWithTimeout,
> 
> v2..handle the case where os.read returns only one or more newlines.  try:
> https://tbpl.mozilla.org/?tree=Try&rev=6f94f56145c8

This is much better; the only problem left is that we're somehow messing up Windows builds.  Will investigate a little more.
Attached patch mozprocess.diffSplinter Review
v3: locally fixes the Windows build problems
Attachment #816074 - Attachment is obsolete: true
Comment on attachment 816650 [details] [diff] [review]
mozprocess.diff

Review of attachment 816650 [details] [diff] [review]:
-----------------------------------------------------------------

This is looking good on try, so I'm putting up for review.
Attachment #816650 - Flags: review?(jhammel)
Comment on attachment 816650 [details] [diff] [review]
mozprocess.diff

Review of attachment 816650 [details] [diff] [review]:
-----------------------------------------------------------------

Nice.  Looks good.

::: testing/mozbase/mozprocess/mozprocess/processhandler.py
@@ +695,5 @@
>                  lineReadTimeout = outputTimeout
>  
> +            (lines, self.didTimeout) = self.readWithTimeout(logsource, lineReadTimeout)
> +            while lines != "" and not self.didTimeout:
> +                for line in lines.split('\n'):

why not splitlines? Is this from universal newlines? (do we enforce that?)

@@ +793,5 @@
> +                self.read_buffer += output
> +                if '\n' not in self.read_buffer:
> +                    time.sleep(0.01)
> +                    continue
> +                tmp = self.read_buffer.split('\n')

again, '\n' v splitlines
Attachment #816650 - Flags: review?(jhammel) → review+
split('\n') seemed more precise.  Here's a try run using splitlines() instead:

https://tbpl.mozilla.org/?tree=Try&rev=8b2d8b3c7af7
(In reply to Jonathan Griffin (:jgriffin) from comment #18)
> split('\n') seemed more precise.  Here's a try run using splitlines()
> instead:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=8b2d8b3c7af7

It turns out this doesn't quite work, at least for the second case.  In that case, we want to differentiate between 'test\nfoo\n' and 'test\nfoo', which is a little clumsier with splitlines().  So, I'll keep split('\n') in that case.
(In reply to Jonathan Griffin (:jgriffin) from comment #19)
> (In reply to Jonathan Griffin (:jgriffin) from comment #18)
> > split('\n') seemed more precise.  Here's a try run using splitlines()
> > instead:
> > 
> > https://tbpl.mozilla.org/?tree=Try&rev=8b2d8b3c7af7
> 
> It turns out this doesn't quite work, at least for the second case.  In that
> case, we want to differentiate between 'test\nfoo\n' and 'test\nfoo', which
> is a little clumsier with splitlines().  So, I'll keep split('\n') in that
> case.

And here's a green try run which shows this works ok:  https://hg.mozilla.org/try/rev/1c31fd8992c5
https://github.com/mozilla/mozbase/commit/576b5285c981b6ba938cb1f8716aa96cd7e3b046

We'll have to mirror to m-c before we can tell if this fixes this problem, so I'll leave this bug open until then.
(In reply to Jonathan Griffin (:jgriffin) from comment #21)
> https://github.com/mozilla/mozbase/commit/
> 576b5285c981b6ba938cb1f8716aa96cd7e3b046
> 
> We'll have to mirror to m-c before we can tell if this fixes this problem,
> so I'll leave this bug open until then.

Totally green!  After the versionbump patch is r+'d, I'll land that in github and do another try run with the result, since it will include a few more packages than the above.
(In reply to Jonathan Griffin (:jgriffin) from comment #22)
> (In reply to Jonathan Griffin (:jgriffin) from comment #21)
> > https://github.com/mozilla/mozbase/commit/
> > 576b5285c981b6ba938cb1f8716aa96cd7e3b046
> > 
> > We'll have to mirror to m-c before we can tell if this fixes this problem,
> > so I'll leave this bug open until then.
> 
> Totally green!  After the versionbump patch is r+'d, I'll land that in
> github and do another try run with the result, since it will include a few
> more packages than the above.

wrong bug, sorry !
This appears to be fixed by some recent changes to mozrunner and/or mozprocess.  I'll re-open if I see this again.

Getting stack traces appropriately is being handled under separate bugs, like bug 919353.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
this is causing talos to fail if I run with mozprocess 0.13.  I get a never ending stream of __metrics (a line printed at the end of the logfile for the warmup run).  I really need another fix that is in 0.13 and not in 0.11.
I added a \n to the last __metrics line printed out here: http://hg.mozilla.org/build/talos/file/tip/talos/getInfo.html#l66, and the problem is solved.  We really need to fix this in the mozprocess code as this is not handling generic process output.
(In reply to Joel Maher (:jmaher) from comment #26)
> I added a \n to the last __metrics line printed out here:
> http://hg.mozilla.org/build/talos/file/tip/talos/getInfo.html#l66, and the
> problem is solved.  We really need to fix this in the mozprocess code as
> this is not handling generic process output.

Agreed.  Is this problem you're seeing on Windows, or everywhere?
oh, bad bug reporting.  this was seen on all linux and osx platforms, which was easily seen here:
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L779

windows worked great.
Thanks, I'll try to write a test to reproduce this.
Depends on: 932382
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: