Closed
Bug 1304063
Opened 9 years ago
Closed 8 years ago
Autophone - investigate and resolve failure to emit TEST-PASS, TEST-UNEXPECTED-FAIL for local S1S2Tests
Categories
(Testing Graveyard :: Autophone, defect)
Testing Graveyard
Autophone
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1381511
People
(Reporter: bc, Assigned: bc)
References
Details
Attachments
(5 files, 1 obsolete file)
|
14.63 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
|
44.29 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
|
813 bytes,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
|
8.38 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
|
177.67 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
Ever since we moved to having Treeherder parse logs, it became apparent that there is an issue with emitting to the logs on the production Ubuntu servers that is not seen locally by me on my Fedora workstation.
Fedora:
https://treeherder.allizom.org/#/jobs?repo=autoland&revision=2b2ccdb21d028b7e5ad9e7011fcbf52ea52e2b35&filter-searchStr=autophone
https://autophone-dev.s3.amazonaws.com/pub/mobile/tinderbox-builds/autoland-android-api-15/1474280838/autophone-s1s2-s1s2-blank-remote.ini-1-nexus-6p-6-c8311e45-e367-4c9c-abae-10288150ae85-autophone.log
> 2016-09-19 05:21:14,362|16860|MainThread|root|DEBUG|S1S2TestJob|nexus-6p-6|20160919102718|remote-blank|analyze_logcat: base: 2016-09-19 08:21:09.542000, start: 916.0, throbber start: 1349.0, throbber stop: 3054.0, throbber time: 1705.0
2016-09-19 05:21:14,575|16860|MainThread|root|INFO|S1S2TestJob|nexus-6p-6|20160919102718|remote-blank| TEST-PASS | http://remote/files/s1s2/blank.html | Ok.
Ubuntu:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=a947ab8f930dba5786ca1a016ed3c68cb21f6c75&filter-searchStr=autophone
https://autophone.s3.amazonaws.com/pub/mobile/tinderbox-builds/autoland-android-api-15/1474372216/autophone-s1s2-s1s2-twitter-local.ini-1-nexus-4-5-fd30d55e-d9d3-405e-a500-b43d2ba4175a-autophone.log
016-09-20 06:09:08,297|6279|MainThread|root|INFO|S1S2TestJob|nexus-4-5|20160920115016|local-twitter|analyze_logcat: base_time: 09-20 13:09:07.399
2016-09-20 06:09:08,297|6279|MainThread|root|INFO|S1S2TestJob|nexus-4-5|20160920115016|local-twitter|analyze_logcat: new start_time: 09-20 13:09:07.850 gecko
I've tried to resolve this in various means. See bug 1300890.
Until this is resolved, the results are still being posted to phonedash but the blank s1s2tests will show orange if more than one measurement is missing. The Failure summary will be blank however.
| Assignee | ||
Comment 1•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f5cbf0ee1014bd5a305866294e6325483b484499&filter-searchStr=autophone
Crash occurred but was not added to the Failure Summary tab.
| Assignee | ||
Comment 2•9 years ago
|
||
minor lint type fixes.
Attachment #8834285 -
Flags: review?(jmaher)
| Assignee | ||
Comment 3•9 years ago
|
||
Attachment #8834286 -
Flags: review?(jmaher)
| Assignee | ||
Comment 4•9 years ago
|
||
I ran tests earlier with DEBUG logging enabled, but am running another test now with INFO logging. I have an external script which is using ADBAndroid.pkill using signal 11 (SIGSEGV) to kill org.mozilla.fennec on my local devices randomly. This generates crashes with dumps and stresses out the crash detecting and reporting. It is ongoing at the moment but may be finished by the time you see this.
https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&group_state=expanded&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&startdate=2017-01-03&fromchange=f3da0505833f946d1b1767986ebb8e3d7a63f5c1&tochange=56b0d9ecb97b83a94f8edef1a44f3b3926facd5d
Comment 5•9 years ago
|
||
Comment on attachment 8834285 [details] [diff] [review]
bug-1304063-linting-v1.patch
Review of attachment 8834285 [details] [diff] [review]:
-----------------------------------------------------------------
linting patches are fun to review
Attachment #8834285 -
Flags: review?(jmaher) → review+
Comment 6•9 years ago
|
||
Comment on attachment 8834286 [details] [diff] [review]
bug-1304063-results-v1.patch
Review of attachment 8834286 [details] [diff] [review]:
-----------------------------------------------------------------
this is a lot of changes, but ignoring the status messages shift, it is a simpler patch to review. Lets go with this for now- I suspect in the longer term future if we run on taskcluster we will be revisiting some of this.
Attachment #8834286 -
Flags: review?(jmaher) → review+
| Assignee | ||
Comment 7•9 years ago
|
||
Comment on attachment 8834286 [details] [diff] [review]
bug-1304063-results-v1.patch
>
> def add_failure(self, testpath, test_status, text, testresult_status):
>+ """Report a test failure.
>+
>+ :param testpath: A string identifying the test.
>+ :param test_status: A string identifying the type of test failure.
>+ :param text: A string decribing the failure.
>+ :param testresult_status: Test status to be reported to Treeherder.
>+ One of PhoneTest.{BUSTED,EXCEPTION,TESTFAILED,UNKNOWN,USERCANCEL}.
>+ """
> self.message = text
> self.update_status(message=text)
> testpath = _normalize_testpath(testpath)
>- if testresult_status:
>+ if self.status == TreeherderStatus.SUCCESS:
>+ # Only use the first testresult_status when reporting to Treeherder.
> self.status = testresult_status
I'm not sure about this restriction on only using the first failure status. If we get a TESTFAILED first, then a BUSTED we won't see the BUSTED. I think it would be better if I removed the
>+ if self.status == TreeherderStatus.SUCCESS:
and just set the self.status = testresult_status unconditionally. Is that ok?
Flags: needinfo?(jmaher)
| Assignee | ||
Comment 8•9 years ago
|
||
fyi, beginning with the build at Mon Feb 6, 19:49:03 I included this change in the test. As you can see, we are getting BUSTED for tpn. Note the Android 4.0 / samsung-gs3-3 was charging before and all of its results will be after this change.
Comment 9•9 years ago
|
||
I am fine with ignoring the first failure by marking it as SUCCESS
Flags: needinfo?(jmaher)
| Assignee | ||
Comment 10•9 years ago
|
||
Actually I meant this.
| Assignee | ||
Comment 11•9 years ago
|
||
I had some network issue which caused the build downloads to temporarily fail perhaps related to my setting up a virtual machine in kvm on my workstation at the same time the test was running. This resulted in some builds not being tested and which were not cleaned up. In addition there was an exception due to errors during a pull:
https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&group_state=expanded&startdate=2017-01-03&fromchange=017a3a0d4a0016326821a54f5e8af909b7b4c075&tochange=28f5dd10784202d1fb9b31548df0909eba8c25e6
worker.py should do a better job of reporting download errors and clean up when it can not download/install a build.
I'll come back to this after I spend some more quality time with awsf.
Assignee: nobody → bob
| Assignee | ||
Comment 12•9 years ago
|
||
Make it formal.
Attachment #8834407 -
Attachment is obsolete: true
Attachment #8834785 -
Flags: review?(jmaher)
| Assignee | ||
Comment 13•9 years ago
|
||
1. Tweaks PhoneTest teardown_job to not do the debug logcat.get if debugging is not set.
2. Tweak PhoneWorkerSubProcess install_build to keep trying for the full range of retries and to attempt to recover via ping.
3. Record failure to get the build after the maximum retries as a failure rather than just setting the test's status to BUSTED. Also call teardown_job for each test when we can't get the build so that Treeherder is updated.
Test Run:
https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&fromchange=27f9cf49b9fe0a029def3a56adc2d5ff97ce963b&tochange=2edf2a4c017aa348db91dcc979d0634f9359ad72&group_state=expanded
first build was allowed to complete normally.
second build was triggered with 127.0.0.1 queue.taskcluster.net added to the /etc/hosts to simulate a network failure.
subsequent builds were run while the fennec killer script was running to randomly kill fennec processes on the devices.
Attachment #8834788 -
Flags: review?(jmaher)
Updated•9 years ago
|
Attachment #8834785 -
Flags: review?(jmaher) → review+
Comment 14•9 years ago
|
||
Comment on attachment 8834788 [details] [diff] [review]
bug-1304063-downloads-v1.patch
Review of attachment 8834788 [details] [diff] [review]:
-----------------------------------------------------------------
this looks like a good set of changes
Attachment #8834788 -
Flags: review?(jmaher) → review+
| Assignee | ||
Comment 15•9 years ago
|
||
https://github.com/mozilla/autophone/commit/7f0b3c7ee4fa9c963afc7fd9ac3e549fa268c4b0
https://github.com/mozilla/autophone/commit/b2ac46227a9512fe40a0a0538ebdf1f6930d6138
https://github.com/mozilla/autophone/commit/947c412f8184bba617c07b3315be8fd4c2b7a008
https://github.com/mozilla/autophone/commit/c97b8b4f7125b2241be70dda18355e8460ac8bec
deployed 2017-02-08 05:30
| Assignee | ||
Comment 16•9 years ago
|
||
Well, we've had two instances where the failure messages were not emitted in the log in production.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a75a77d1bb99e3238d422572fbf5e8fd102af8bb&filter-searchStr=autophone
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=171570725b697ca80261c6caf5acff88682b248c&filter-searchStr=autophone
So, whatever improvements we did make and there were several, we are still not getting the output on the Ubuntu 16.04 machines though we do on my Fedora 25 workstation.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
| Assignee | ||
Comment 17•9 years ago
|
||
I've almost gotten the logging rewritten to only use Filehandlers on the main process and to send logging requests from the worker subprocesses to the main process using a SocketServer.StreamRequestHandler. There is an outstanding issue regarding carrying over the logs from one job to the next, but I expect to have that fixed soon.
Some example jobs for comparison:
Current logging using loglevel DEBUG:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=5178fedbc8f2decf28bf0173260e85d0f1dc5da7&filter-searchStr=autophone&exclusion_profile=false
Current logging using loglevel INFO:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=2e7c10a9b86e30691f67855f6c8f98d984508d7c&filter-searchStr=autophone&exclusion_profile=false
New logging using loglevel DEBUG:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=a812a96caba20dea7bc3815b71b4784afc6b864b&filter-searchStr=autophone&exclusion_profile=false
New logging using loglevel INFO:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=48c0fd9c9ec5d68061ea7b59358874ae8da72572&filter-searchStr=autophone&exclusion_profile=false
Note that one change is that the Autophone log, the logcat log and the Unit test log if there is one are now combined into a single Autophone.log. This will make it much easier to see all errors in one place in the logviewer and will mean all errors are parsed by the log parser and are available sheriffing.
The outstanding issue for the new logging is that some of the logs are not being uploaded and are being combined with logs from other tests. I have an idea for the cause and will test the fix soonest.
| Assignee | ||
Comment 18•9 years ago
|
||
Status update: We almost have it. Example runs:
https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&fromchange=b61d6231f6dc9f44d5db7ac797c5ff2f6b2e89b0&group_state=expanded&tochange=aab0dfdae32f14246e3bed8824a5f7ce309276cd
https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=4488424e14ae3230fb773903343253944f4b4f5e&filter-searchStr=autophone
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&filter-searchStr=autophone&exclusion_profile=false&fromchange=23fe0b76a018a5077a0f7234cff91c41e4b6af64
All of the logs are successfully uploaded now but it appears the extra long failure lines with the autophone specific leading data especially with DEBUG logging enabled is causing Treeherder to miss failures due to assertion counts in the Unit tests. One last tweak to emit the "Treeherder"/"Tinderbox" failure message on a separate line and another test of a durable queue with large number of messages and I think we'll have it.
| Assignee | ||
Comment 19•9 years ago
|
||
The tweak I made to put the failure on its own line only affected autophone-only failure messages and not the unit tests. If the issue is the prefix causing Treeherder to miss the assertion failure then, that tweak was insufficient so I've reverted and restarted the server.
Note that with the patch from Bug 1353509, we successfully handled over 16,000 pending build finished messages from the queue though it took close to 50 minutes over my T1 connection. We may want to take the parts of the message handling that has to communicate with hg and taskcluster to determine build properties off of the same thread that consumes the messages but that is for another day.
For the curious you can see the current test run on https://treeherder.allizom.org for autoland, mozilla-inbound, mozilla-central, mozilla-beta and mozilla-release. Each device has 72+ builds queued up including opt and debug. I'll test usercancels then disable pulse and let the tests run. I'll update with exact links after I've disabled pulse and the revision ranges are stable.
| Assignee | ||
Comment 20•9 years ago
|
||
This patch changes the logging in Autophone to:
1. Use a logging.handlers.SocketHandler in the worker sub process to communicate with a SocketServer.ThreadingTCPServer to send all logging requests from the non-main processes to the main process where all logging file i/o is performed.
2. Eliminates the use log global logger variables with the exception of the autophone.py main process.
3. Uses the message queue and the log records to communicate to the main process synchronize and control log filehandler flushing, closing and initialization.
4. Improves handling of third partly library loggers which are no longer disabled.
5. Changed some logging messages from debug to info to get them into the normal production logs. We probably can do more but this is a start.
6. The logs are now consolidated into a single log. This allows Treeherder to parse all failure messages and will hopefully improve the user experience and help sheriffing.
7. The worker logs now propagate into the main process's log so that a full accounting for the main process and all worker processes is available on the host even when truncated worker logs are submitted to Treeherder.
8. Logs now contain steps which are parsed and linked by the log viewer.
I did attempt to use a separate Queue for the log control messages to the main process logging server, but had some difficulties and fell back to this approach which I had already demonstrated.
Current runs with loglevel INFO:
https://treeherder.allizom.org/#/jobs?repo=autoland&filter-searchStr=autophone&exclusion_profile=false&fromchange=3e6477d932037d6026ac13bd8c988dc0a51935d4&tochange=3e44cbaec84b4412136e9b36a8403a88cc6b7742
https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&filter-searchStr=autophone&exclusion_profile=false&fromchange=a4b157d9d0e7e05384734605b5bdf989e801c8ea&tochange=0ad1250ede3a5f0d419bcd584945a6669fed9754
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&filter-searchStr=autophone&exclusion_profile=false&fromchange=e7bf9443be2c4a5187c37440e35f3526148d7fa8&tochange=e7bf9443be2c4a5187c37440e35f3526148d7fa8
https://treeherder.allizom.org/#/jobs?repo=mozilla-beta&filter-searchStr=autophone&exclusion_profile=false&fromchange=93700f9bdc9cb4e278d331049dff765c75771632&tochange=93700f9bdc9cb4e278d331049dff765c75771632
https://treeherder.allizom.org/#/jobs?repo=mozilla-release&filter-searchStr=autophone&exclusion_profile=false&fromchange=15ee275c8c9cb283cface2c369a37c35ff5f7472&tochange=5ccc87d83d8546e1254294eafe7c57f126f56dec
If this doesn't fix the error message handling in production, I will switch to Fedora during mozsfo.
Attachment #8865020 -
Flags: review?(jmaher)
| Assignee | ||
Comment 21•9 years ago
|
||
Comment 22•9 years ago
|
||
Comment on attachment 8865020 [details] [diff] [review]
bug-1304063-sockethandler-v1.patch
Review of attachment 8865020 [details] [diff] [review]:
-----------------------------------------------------------------
we are always fiddling with the logger and levels of logging, a full pass on this didn't raise any red flags
Attachment #8865020 -
Flags: review?(jmaher) → review+
| Assignee | ||
Comment 23•9 years ago
|
||
https://github.com/mozilla/autophone/commit/27372a95b4c8c213eadc23f385a1f29e6b5ab7be
deployed 2017-05-06 ~11:25.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
| Assignee | ||
Comment 24•9 years ago
|
||
This *did not* fix the issue with the perf tests. I hate ubuntu.
| Assignee | ||
Comment 25•8 years ago
|
||
Reopening. We will resolve this by changing to Fedora in production.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
| Assignee | ||
Updated•8 years ago
|
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → DUPLICATE
Updated•4 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•