Closed Bug 959520 Opened 10 years ago Closed 10 years ago

Make socket.timeout error messages more useful

Categories

(Remote Protocol :: Marionette, defect)

x86
All
defect
Not set
normal

Tracking

(b2g-v1.3 fixed)

RESOLVED FIXED
mozilla30
Tracking Status
b2g-v1.3 --- fixed

People

(Reporter: mdas, Assigned: mdas)

References

Details

Attachments

(1 file, 3 obsolete files)

This is the most mystifying error for users, and it makes it seem like marionette is causing the failure when that is likely not the case. When we see this error in CI, we should have some logic to detect what has crashed or suggest areas to investigate.

When we see this error, we should check our platform, and if it is a b2g device, we should check if the b2g process is still running, if not, report that. If it is still running we should grep some of the logcat for errors and suggest those as investigation points.
This patch adds a new TestResult mixin that will automatically be added by the base runner if it detects that we're running against a device. The mixin will check the test results, and if socket.timeout or Broken Pipe show up in the test result, it will run a diagnosis to figure out what went wrong. I don't like having to shell out to get the logcat output, but I don't see the benefit of doing that work through python.

To test this out locally, I'd recommend the following changes:
- in client.py, in line 60,  "def connect(self, timeout=360.0):" change timeout to be 10
- test_log.py. At the end of the first test, add:
        self.marionette.set_context("chrome")
        self.marionette.navigate("chrome://browser/content/aboutCertError.xhtml")

Then run test_log.py against a device, and desktop. The device will have B2G related output, and desktop output should have no difference.

Also, I've changed the 'bla bla bla socket closed?' string since it's confusing.
Attachment #8363193 - Flags: review?(jgriffin)
Comment on attachment 8363193 [details] [diff] [review]
Determine B2G state and output recent errors

Review of attachment 8363193 [details] [diff] [review]:
-----------------------------------------------------------------

Cool, I think something like this is exactly what we need!

My primary concern is that I believe the system restarts the B2G process if it crashes, at least in some cases, and if that happens, depending on the timing of the Python exception, this code may report that there is an unknown cause of the error.  To avoid that, can we record the PID of the b2g process at some opportune times, and then infer that B2G has crashed if one of these errors occurs and the PID has changed?

::: testing/marionette/client/marionette/runner/mixins/b2g.py
@@ +50,5 @@
> +                b2g_output = device_manager.shellCheckOutput(['b2g-ps'])
> +                if 'b2g' in b2g_output:
> +                    # find recent errors
> +                    # In the event that the logcat is very large(ie: endurance tests), we use files and grep instead of using dm.getLogcat
> +                    b2g_running = subprocess.Popen(['adb', 'logcat'],

You can use logcat -d to dump the logcat and exit, which would be cleaner and less error prone than waiting for a second and then killing the logcat process.

@@ +55,5 @@
> +                                                    stdout=open('logcat_output.log', 'w'),
> +                                                    stderr=subprocess.STDOUT)
> +                    time.sleep(1)
> +                    b2g_running.kill()
> +                    relevant_output  = subprocess.Popen(['grep', '-i', 'error\|exception', 'logcat_output.log'],

It seems like this code would be cleaner using Python re; i.e. you could avoid using the secondary file.  Using mozprocess for the invocation of 'adb logcat -d' with a processOutputLine handler would probably be cleaner still (e.g. http://mxr.mozilla.org/mozilla-central/source/build/mobile/b2gautomation.py#26), but it's not totally necessary.
Attachment #8363193 - Flags: review?(jgriffin) → review-
(In reply to Jonathan Griffin (:jgriffin) from comment #2)
> Comment on attachment 8363193 [details] [diff] [review]
> Determine B2G state and output recent errors
> 
> Review of attachment 8363193 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Cool, I think something like this is exactly what we need!
> 
> My primary concern is that I believe the system restarts the B2G process if
> it crashes, at least in some cases, and if that happens, depending on the
> timing of the Python exception, this code may report that there is an
> unknown cause of the error.  To avoid that, can we record the PID of the b2g
> process at some opportune times, and then infer that B2G has crashed if one
> of these errors occurs and the PID has changed?
> 
Sure I'm fiddling around with that now.

> ::: testing/marionette/client/marionette/runner/mixins/b2g.py
> @@ +50,5 @@
> > +                b2g_output = device_manager.shellCheckOutput(['b2g-ps'])
> > +                if 'b2g' in b2g_output:
> > +                    # find recent errors
> > +                    # In the event that the logcat is very large(ie: endurance tests), we use files and grep instead of using dm.getLogcat
> > +                    b2g_running = subprocess.Popen(['adb', 'logcat'],
> 
> You can use logcat -d to dump the logcat and exit, which would be cleaner
> and less error prone than waiting for a second and then killing the logcat
> process.
> 
> @@ +55,5 @@
> > +                                                    stdout=open('logcat_output.log', 'w'),
> > +                                                    stderr=subprocess.STDOUT)
> > +                    time.sleep(1)
> > +                    b2g_running.kill()
> > +                    relevant_output  = subprocess.Popen(['grep', '-i', 'error\|exception', 'logcat_output.log'],
> 
> It seems like this code would be cleaner using Python re; i.e. you could
> avoid using the secondary file.  Using mozprocess for the invocation of 'adb
> logcat -d' with a processOutputLine handler would probably be cleaner still
> (e.g.
> http://mxr.mozilla.org/mozilla-central/source/build/mobile/b2gautomation.
> py#26), but it's not totally necessary.

Okay, we can use mozdevice for this. I've changed this and it looks simpler, hopefully the size of the logcat won't be an issue.
Updated according to comments
Attachment #8363193 - Attachment is obsolete: true
Attachment #8365177 - Flags: review?(jgriffin)
Comment on attachment 8365177 [details] [diff] [review]
Determine B2G state and output recent errors

Review of attachment 8365177 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/marionette/client/marionette/client.py
@@ +90,4 @@
>  
>          for packet in [data[i:i + self.max_packet_length] for i in
>                         range(0, len(data), self.max_packet_length)]:
> +            try: 

nit: trailing whitespace

::: testing/marionette/client/marionette/runner/base.py
@@ +340,5 @@
> +
> +    def run(self, test):
> +        dm_type = os.environ.get('DM_TRANS', 'adb')
> +        if dm_type == 'adb':
> +            self.b2g_pid = get_b2g_pid(get_dm()) 

nite: trailing whitespace

@@ +341,5 @@
> +    def run(self, test):
> +        dm_type = os.environ.get('DM_TRANS', 'adb')
> +        if dm_type == 'adb':
> +            self.b2g_pid = get_b2g_pid(get_dm()) 
> +        return super(B2GMarionetteTextTestRunner, self).run(test)

We'll want to save self.b2g_pid during gaiatest's start_b2g() method as well, in a separate patch, since that method restarts B2G within the context of a test.  Eventually, we may want to pull a few of the generic GaiaDevice methods into a B2GDevice mixin, along with get_b2g_pid().

::: testing/marionette/client/marionette/runner/mixins/b2g.py
@@ +62,5 @@
> +                    # find recent errors
> +                    message = ""
> +                    error_re = re.compile(r"""[\s\S]*(exception|error)[\s\S]*""", flags=re.IGNORECASE)
> +                    logcat = device_manager.getLogcat()
> +                    latest = [] 

nit: trailing whitespace

@@ +74,5 @@
> +                        iters -= 1
> +                    message += "\nMost recent errors/exceptions are:\n"
> +                    for line in reversed(latest):
> +                        message += "%s" % line
> +                    pid = get_b2g_pid(device_manager)

We call 'b2g-ps' on line 60 too; I think we could consolidate these and just call 'b2g-ps' once.
Attachment #8365177 - Flags: review?(jgriffin) → review+
Thanks, I've addressed the comments and filed Bug 964423 with respect to the Gaia restart issue.

Moving r+ forward
Attachment #8365177 - Attachment is obsolete: true
Attachment #8366156 - Flags: review+
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/4ca21fff1c94 - it would have been permaorange in marionette-webapi, from https://tbpl.mozilla.org/php/getParsedLog.php?id=33650158&tree=Mozilla-Inbound, except that buildbot treats "devicemanager.DMError" as a sign that a failure was from a busted tegra or panda and the job should be retried, so we just kept rerunning it and rerunning it all afternoon.
I've fixed this locally and will test on try as soon as the tree is open.

Once green, I have to wait for https://bugzilla.mozilla.org/show_bug.cgi?id=940554 to land, since it changes the 'socket.timeout' message we rely on here.
this patch passed on try:

https://tbpl.mozilla.org/?tree=Try&rev=cc15cec23fd4

The difference here is using the given adb path when running against an emulator, and the change from 'socket.timeout' to 'Connection timed out'
Attachment #8366156 - Attachment is obsolete: true
Attachment #8367309 - Flags: review?(jgriffin)
Blocks: 965256
Comment on attachment 8367309 [details] [diff] [review]
Determine B2G state and output recent errors

Review of attachment 8367309 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8367309 - Flags: review?(jgriffin) → review+
https://hg.mozilla.org/mozilla-central/rev/083e26c54750
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: