Closed Bug 1304063 Opened 4 years ago Closed 3 years ago

Autophone - investigate and resolve failure to emit TEST-PASS, TEST-UNEXPECTED-FAIL for local S1S2Tests

Categories

(Testing :: Autophone, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1381511

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(5 files, 1 obsolete file)

Ever since we moved to having Treeherder parse logs, it became apparent that there is an issue with emitting to the logs on the production Ubuntu servers that is not seen locally by me on my Fedora workstation.

Fedora:
https://treeherder.allizom.org/#/jobs?repo=autoland&revision=2b2ccdb21d028b7e5ad9e7011fcbf52ea52e2b35&filter-searchStr=autophone

https://autophone-dev.s3.amazonaws.com/pub/mobile/tinderbox-builds/autoland-android-api-15/1474280838/autophone-s1s2-s1s2-blank-remote.ini-1-nexus-6p-6-c8311e45-e367-4c9c-abae-10288150ae85-autophone.log

> 2016-09-19 05:21:14,362|16860|MainThread|root|DEBUG|S1S2TestJob|nexus-6p-6|20160919102718|remote-blank|analyze_logcat: base: 2016-09-19 08:21:09.542000, start: 916.0, throbber start: 1349.0, throbber stop: 3054.0, throbber time: 1705.0 
2016-09-19 05:21:14,575|16860|MainThread|root|INFO|S1S2TestJob|nexus-6p-6|20160919102718|remote-blank| TEST-PASS | http://remote/files/s1s2/blank.html | Ok.

Ubuntu:

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=a947ab8f930dba5786ca1a016ed3c68cb21f6c75&filter-searchStr=autophone

https://autophone.s3.amazonaws.com/pub/mobile/tinderbox-builds/autoland-android-api-15/1474372216/autophone-s1s2-s1s2-twitter-local.ini-1-nexus-4-5-fd30d55e-d9d3-405e-a500-b43d2ba4175a-autophone.log

016-09-20 06:09:08,297|6279|MainThread|root|INFO|S1S2TestJob|nexus-4-5|20160920115016|local-twitter|analyze_logcat: base_time: 09-20 13:09:07.399
2016-09-20 06:09:08,297|6279|MainThread|root|INFO|S1S2TestJob|nexus-4-5|20160920115016|local-twitter|analyze_logcat: new start_time: 09-20 13:09:07.850 gecko

I've tried to resolve this in various means. See bug 1300890.

Until this is resolved, the results are still being posted to phonedash but the blank s1s2tests will show orange if more than one measurement is missing. The Failure summary will be blank however.
minor lint type fixes.
Attachment #8834285 - Flags: review?(jmaher)
Attachment #8834286 - Flags: review?(jmaher)
I ran tests earlier with DEBUG logging enabled, but am running another test now with INFO logging. I have an external script which is using ADBAndroid.pkill using signal 11 (SIGSEGV) to kill org.mozilla.fennec on my local devices randomly. This generates crashes with dumps and stresses out the crash detecting and reporting. It is ongoing at the moment but may be finished by the time you see this.

https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&group_state=expanded&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&startdate=2017-01-03&fromchange=f3da0505833f946d1b1767986ebb8e3d7a63f5c1&tochange=56b0d9ecb97b83a94f8edef1a44f3b3926facd5d
Comment on attachment 8834285 [details] [diff] [review]
bug-1304063-linting-v1.patch

Review of attachment 8834285 [details] [diff] [review]:
-----------------------------------------------------------------

linting patches are fun to review
Attachment #8834285 - Flags: review?(jmaher) → review+
Comment on attachment 8834286 [details] [diff] [review]
bug-1304063-results-v1.patch

Review of attachment 8834286 [details] [diff] [review]:
-----------------------------------------------------------------

this is a lot of changes, but ignoring the status messages shift, it is a simpler patch to review.  Lets go with this for now- I suspect in the longer term future if we run on taskcluster we will be revisiting some of this.
Attachment #8834286 - Flags: review?(jmaher) → review+
Comment on attachment 8834286 [details] [diff] [review]
bug-1304063-results-v1.patch


> 
>     def add_failure(self, testpath, test_status, text, testresult_status):
>+        """Report a test failure.
>+
>+        :param testpath: A string identifying the test.
>+        :param test_status: A string identifying the type of test failure.
>+        :param text: A string decribing the failure.
>+        :param testresult_status: Test status to be reported to Treeherder.
>+            One of PhoneTest.{BUSTED,EXCEPTION,TESTFAILED,UNKNOWN,USERCANCEL}.
>+        """
>         self.message = text
>         self.update_status(message=text)
>         testpath = _normalize_testpath(testpath)
>-        if testresult_status:
>+        if self.status == TreeherderStatus.SUCCESS:
>+            # Only use the first testresult_status when reporting to Treeherder.
>             self.status = testresult_status

I'm not sure about this restriction on only using the first failure status. If we get a TESTFAILED first, then a BUSTED we won't see the BUSTED. I think it would be better if I removed the

>+        if self.status == TreeherderStatus.SUCCESS:

and just set the self.status = testresult_status unconditionally. Is that ok?
Flags: needinfo?(jmaher)
fyi, beginning with the build at Mon Feb 6, 19:49:03 I included this change in the test. As you can see, we are getting BUSTED for tpn. Note the Android 4.0 / samsung-gs3-3 was charging before and all of its results will be after this change.
I am fine with ignoring the first failure by marking it as SUCCESS
Flags: needinfo?(jmaher)
Attached file status patch (obsolete) —
Actually I meant this.
I had some network issue which caused the build downloads to temporarily fail perhaps related to my setting up a virtual machine in kvm on my workstation at the same time the test was running. This resulted in some builds not being tested and which were not cleaned up. In addition there was an exception due to errors during a pull:

https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&group_state=expanded&startdate=2017-01-03&fromchange=017a3a0d4a0016326821a54f5e8af909b7b4c075&tochange=28f5dd10784202d1fb9b31548df0909eba8c25e6

worker.py should do a better job of reporting download errors and clean up when it can not download/install a build.

I'll come back to this after I spend some more quality time with awsf.
Assignee: nobody → bob
Make it formal.
Attachment #8834407 - Attachment is obsolete: true
Attachment #8834785 - Flags: review?(jmaher)
1. Tweaks PhoneTest teardown_job to not do the debug logcat.get if debugging is not set.

2. Tweak PhoneWorkerSubProcess install_build to keep trying for the full range of retries and to attempt to recover via ping.

3. Record failure to get the build after the maximum retries as a failure rather than just setting the test's status to BUSTED. Also call teardown_job for each test when we can't get the build so that Treeherder is updated.

Test Run:
https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&fromchange=27f9cf49b9fe0a029def3a56adc2d5ff97ce963b&tochange=2edf2a4c017aa348db91dcc979d0634f9359ad72&group_state=expanded

first build was allowed to complete normally.

second build was triggered with 127.0.0.1 queue.taskcluster.net added to the /etc/hosts to simulate a network failure.

subsequent builds were run while the fennec killer script was running to randomly kill fennec processes on the devices.
Attachment #8834788 - Flags: review?(jmaher)
Attachment #8834785 - Flags: review?(jmaher) → review+
Comment on attachment 8834788 [details] [diff] [review]
bug-1304063-downloads-v1.patch

Review of attachment 8834788 [details] [diff] [review]:
-----------------------------------------------------------------

this looks like a good set of changes
Attachment #8834788 - Flags: review?(jmaher) → review+
Well, we've had two instances where the failure messages were not emitted in the log in production.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a75a77d1bb99e3238d422572fbf5e8fd102af8bb&filter-searchStr=autophone

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=171570725b697ca80261c6caf5acff88682b248c&filter-searchStr=autophone

So, whatever improvements we did make and there were several, we are still not getting the output on the Ubuntu 16.04 machines though we do on my Fedora 25 workstation.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 1347839
I've almost gotten the logging rewritten to only use Filehandlers on the main process and to send logging requests from the worker subprocesses to the main process using a SocketServer.StreamRequestHandler. There is an outstanding issue regarding carrying over the logs from one job to the next, but I expect to have that fixed soon.

Some example jobs for comparison:

Current logging using loglevel DEBUG:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=5178fedbc8f2decf28bf0173260e85d0f1dc5da7&filter-searchStr=autophone&exclusion_profile=false

Current logging using loglevel INFO:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=2e7c10a9b86e30691f67855f6c8f98d984508d7c&filter-searchStr=autophone&exclusion_profile=false

New logging using loglevel DEBUG:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=a812a96caba20dea7bc3815b71b4784afc6b864b&filter-searchStr=autophone&exclusion_profile=false

New logging using loglevel INFO:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=48c0fd9c9ec5d68061ea7b59358874ae8da72572&filter-searchStr=autophone&exclusion_profile=false

Note that one change is that the Autophone log, the logcat log and the Unit test log if there is one are now combined into a single Autophone.log. This will make it much easier to see all errors in one place in the logviewer and will mean all errors are parsed by the log parser and are available sheriffing.

The outstanding issue for the new logging is that some of the logs are not being uploaded and are being combined with logs from other tests. I have an idea for the cause and will test the fix soonest.
Status update: We almost have it. Example runs:

https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&fromchange=b61d6231f6dc9f44d5db7ac797c5ff2f6b2e89b0&group_state=expanded&tochange=aab0dfdae32f14246e3bed8824a5f7ce309276cd

https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=4488424e14ae3230fb773903343253944f4b4f5e&filter-searchStr=autophone

https://treeherder.allizom.org/#/jobs?repo=mozilla-central&filter-searchStr=autophone&exclusion_profile=false&fromchange=23fe0b76a018a5077a0f7234cff91c41e4b6af64

All of the logs are successfully uploaded now but it appears the extra long failure lines with the autophone specific leading data especially with DEBUG logging enabled is causing Treeherder to miss failures due to assertion counts in the Unit tests. One last tweak to emit the "Treeherder"/"Tinderbox" failure message on a separate line and another test of a durable queue with  large number of messages and I think we'll have it.
The tweak I made to put the failure on its own line only affected autophone-only failure messages and not the unit tests. If the issue is the prefix causing Treeherder to miss the assertion failure then, that tweak was insufficient so I've reverted and restarted the server.

Note that with the patch from Bug 1353509, we successfully handled over 16,000 pending build finished messages from the queue though it took close to 50 minutes over my T1 connection. We may want to take the parts of the message handling that has to communicate with hg and taskcluster to determine build properties off of the same thread that consumes the messages but that is for another day.

For the curious you can see the current test run on https://treeherder.allizom.org for autoland, mozilla-inbound, mozilla-central, mozilla-beta and mozilla-release. Each device has 72+ builds queued up including opt and debug. I'll test usercancels then disable pulse and let the tests run. I'll update with exact links after I've disabled pulse and the revision ranges are stable.
This patch changes the logging in Autophone to:

1. Use a logging.handlers.SocketHandler in the worker sub process to communicate with a SocketServer.ThreadingTCPServer to send all logging requests from the non-main processes to the main process where all logging file i/o is performed.

2. Eliminates the use log global logger variables with the exception of the autophone.py main process.

3. Uses the message queue and the log records to communicate to the main process synchronize and control log filehandler flushing, closing and initialization.

4. Improves handling of third partly library loggers which are no longer disabled.

5. Changed some logging messages from debug to info to get them into the normal production logs. We probably can do more but this is a start.

6. The logs are now consolidated into a single log. This allows Treeherder to parse all failure messages and will hopefully improve the user experience and help sheriffing.

7. The worker logs now propagate into the main process's log so that a full accounting for the main process and all worker processes is available on the host even when truncated worker logs are submitted to Treeherder.

8. Logs now contain steps which are parsed and linked by the log viewer.

I did attempt to use a separate Queue for the log control messages to the main process logging server, but had some difficulties and fell back to this approach which I had already demonstrated.

Current runs with loglevel INFO:

https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&fromchange=3e6477d932037d6026ac13bd8c988dc0a51935d4&tochange=3e44cbaec84b4412136e9b36a8403a88cc6b7742

https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&filter-searchStr=autophone&exclusion_profile=false&fromchange=a4b157d9d0e7e05384734605b5bdf989e801c8ea&tochange=0ad1250ede3a5f0d419bcd584945a6669fed9754

https://treeherder.allizom.org/#/jobs?repo=mozilla-central&filter-searchStr=autophone&exclusion_profile=false&fromchange=e7bf9443be2c4a5187c37440e35f3526148d7fa8&tochange=e7bf9443be2c4a5187c37440e35f3526148d7fa8

https://treeherder.allizom.org/#/jobs?repo=mozilla-beta&filter-searchStr=autophone&exclusion_profile=false&fromchange=93700f9bdc9cb4e278d331049dff765c75771632&tochange=93700f9bdc9cb4e278d331049dff765c75771632

https://treeherder.allizom.org/#/jobs?repo=mozilla-release&filter-searchStr=autophone&exclusion_profile=false&fromchange=15ee275c8c9cb283cface2c369a37c35ff5f7472&tochange=5ccc87d83d8546e1254294eafe7c57f126f56dec

If this doesn't fix the error message handling in production, I will switch to Fedora during mozsfo.
Attachment #8865020 - Flags: review?(jmaher)
Comment on attachment 8865020 [details] [diff] [review]
bug-1304063-sockethandler-v1.patch

Review of attachment 8865020 [details] [diff] [review]:
-----------------------------------------------------------------

we are always fiddling with the logger and levels of logging, a full pass on this didn't raise any red flags
Attachment #8865020 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/27372a95b4c8c213eadc23f385a1f29e6b5ab7be

deployed 2017-05-06 ~11:25.
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Depends on: 1362899
This *did not* fix the issue with the perf tests. I hate ubuntu.
Depends on: 1366731
Depends on: 1367584
Reopening. We will resolve this by changing to Fedora in production.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 1375621
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1381511
You need to log in before you can comment on or make changes to this bug.