Closed Bug 1300890 Opened 3 years ago Closed 3 years ago

Autophone - fix some pylint issues and simplify logging

Categories

(Testing :: Autophone, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(6 files)

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=38cb109c653c886d792b012bdbdb104259d2923b&filter-searchStr=autophone&exclusion_profile=false

Shows an example of a crash in a S1S2 test which does not emit a treeherder recognizable failure message though the browser crashed.

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=fafc28ee1d5d1537c790c8d03f1ee34fce0f0204&exclusion_profile=false&filter-searchStr=autophone&selectedJob=3141092

Shows an example of an S1S2 test failing to get a measurement which does not emit a treeherder recognizable failure message.
Cleans dmp files etc after processing and eliminate duplicate crash message.
Attachment #8789434 - Flags: review?(jmaher)
Comment on attachment 8789433 [details] [diff] [review]
bug-1300890-1-fsync-v1.patch

Review of attachment 8789433 [details] [diff] [review]:
-----------------------------------------------------------------

looks good overall!

::: autophonetreeherder.py
@@ +464,5 @@
>                  if t.test_logfile:
>                      try:
>                          t.test_logfilehandler.flush()
> +                        # Force the OS to flush its buffers to disk
> +                        os.fsync(t.test_logfilehandler.stream.fileno())

do we still need the flush() ?
Attachment #8789433 - Flags: review?(jmaher) → review+
Comment on attachment 8789434 [details] [diff] [review]
bug-1300890-2-crash-v1.patch

Review of attachment 8789434 [details] [diff] [review]:
-----------------------------------------------------------------

more polish!
Attachment #8789434 - Flags: review?(jmaher) → review+
(In reply to Joel Maher ( :jmaher) from comment #3)
> Comment on attachment 8789433 [details] [diff] [review]
> bug-1300890-1-fsync-v1.patch
> 
> Review of attachment 8789433 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> looks good overall!
> 
> ::: autophonetreeherder.py
> @@ +464,5 @@
> >                  if t.test_logfile:
> >                      try:
> >                          t.test_logfilehandler.flush()
> > +                        # Force the OS to flush its buffers to disk
> > +                        os.fsync(t.test_logfilehandler.stream.fileno())
> 
> do we still need the flush() ?

I think so. The filehandler flush flushes the python buffers to the OS buffers. the os.fsync flushes the OS buffers to disk. At least that is how I understand it at the moment.
https://github.com/mozilla/autophone/commit/a32920b359ccbcabd8dee2b8d1c06b5e5829e647
https://github.com/mozilla/autophone/commit/cf913b782187993c3afb432e377e8bff1a61f80f

deployed 2016-09-08 ~10:15 PDT

If we still have problems I'll file a follow up bug.
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Actually, just reopening this bug.

https://treeherder.allizom.org/#/jobs?repo=autoland&revision=b535fb46742b36ecc87a8d1fba4a7fad27714588&filter-searchStr=autophone

shows an INFO level log run where the PhoneTest starting job appears but not the PhoneTest stopping job. Note a previous DEBUG level log run did show the PhoneTest stopping job: 

https://autophone-dev.s3.amazonaws.com/v1/task/HdQKg1Y5QKOOOsWGk-Xusg/runs/0/artifacts/public/build/autophone-s1s2-s1s2-nytimes-remote.ini-1-samsung-gs3-3-4c69172e-16d6-4f71-bd0b-7d5d2522d605-autophone.log

Either I messed something up in a last minute change or there is some difference between my workstation and the server.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I have been trying to identify the issue and without going into too much detail, have found differences between the behavior on my Fedora 23 Python 2.7.11 and production Ubuntu 16 Python 2.7.12 with regard to the local vs remote tests. The remote test instances are initialized last in the autophone.py process and this appears to affect whether the PhoneTest.add_(pass|failure) self.loggerdeco.info messages showing TEST-PASS or TEST-UNEXPECTED-FAIL appear in the logs.

I have a partially working version deployed to production on a development branch. This shows the pass/fail messages for the remote tests but not the local ones. It *may* be the use of logger = logging.getLogger() in the global scope and its subsequent re-use is causing the problems. 

I am going to investigate:

* whether eliminating/reducing pylint issues will help reduce disagreement between Fedora and Ubuntu.

* whether cleaning up of the global logger and test instance logging will help.
This patch reduces the number of pylint warnings about the code especially with regard to redefined variables, raw strings containing escape sequences, and the use of logging arguments instead of string interpolations (logging-not-lazy). Not all pylint warnings/errors are eliminated. The command used was:

pylint --disable=fixme,unused-variable,line-too-long,invalid-name,missing-docstring,too-many-arguments,too-many-locals,too-many-instance-attributes,too-many-statements,too-many-public-symbols,too-many-branches,too-many-lines,too-many-return-statements,too-few-public-methods,too-many-public-methods *.py tests/*.py

The other aspects of the patch:

* eliminates the global logger in phontest.py

* initializes the PhoneTest instances in the worker processes with the logger from the worker process to eliminate references to to the original logger first used in the autophone process.

* eliminates PhoneTestResult and make its attributes part of PhoneTest. This helps simplify which logger is to be used when emitting pass/fail messages.

* moves the Logcat class into worker.py where it is more logically associated... One worker process <-> one device rather than many test instances <-> one device.

None of this helps with the blank tests losing the logging output from PhoneTest.add_{pass,failure} on Ubuntu.

I tried a number of other possible work arounds over the weekend though I have discarded them since they did not improve the code quality in of themselves and didn't fix the issue either.

The only thing I can think of to try next is to eliminate the use of the TimedRotatingFileHandler and the separate log file for the test's Autophone log and instead use a single log file which is reinitialized for each test.
Attachment #8792501 - Flags: review?(jmaher)
Attached file pylint-before.txt
Attached file pylint-after.txt
Comment on attachment 8792501 [details] [diff] [review]
bug-1300890-3-pylint-logging-v1.patch

Review of attachment 8792501 [details] [diff] [review]:
-----------------------------------------------------------------

wow, this is a lot of code!  second largest review I have ever done (I think) :)

::: builds.py
@@ +18,4 @@
>  import zipfile
>  
>  import slugid
> +import taskcluster.client

this is an updated version of a library?

@@ +1452,5 @@
>  
>      @property
>      def date(self):
>          if not self._date:
> +            formatstr, self._date = parse_datetime(self.id, tz=UTC)

earlier I see code referencing timezone pacific, is this because that is where the devices are located?

::: phonetest.py
@@ +925,5 @@
> +            if type(handler) == logging.FileHandler:
> +                logger.error('PhoneTest.setup_job: removing unexpected logger file handler: %s %s',
> +                             handler, handler.__dict__)
> +                logger.removeHandler(handler)
> +        handler = None

nice catch
Attachment #8792501 - Flags: review?(jmaher) → review+
(In reply to Joel Maher ( :jmaher) from comment #12)
> Comment on attachment 8792501 [details] [diff] [review]
> bug-1300890-3-pylint-logging-v1.patch
> 
> Review of attachment 8792501 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> wow, this is a lot of code!  second largest review I have ever done (I
> think) :)
> 
> ::: builds.py
> @@ +18,4 @@
> >  import zipfile
> >  
> >  import slugid
> > +import taskcluster.client
> 
> this is an updated version of a library?
> 

No. Just an attempt to silence pylint's warnings about Queue and Index. Since I didn't use taskcluster directly but only taskcluster.client, I went with importing that.

> @@ +1452,5 @@
> >  
> >      @property
> >      def date(self):
> >          if not self._date:
> > +            formatstr, self._date = parse_datetime(self.id, tz=UTC)
> 
> earlier I see code referencing timezone pacific, is this because that is
> where the devices are located?
> 

No. The pacific time zone stuff is for buildbot builds which use pacific while taskcluster builds use utc.
(In reply to Bob Clary [:bc:] from comment #9)
>
> The only thing I can think of to try next is to eliminate the use of the
> TimedRotatingFileHandler and the separate log file for the test's Autophone
> log and instead use a single log file which is reinitialized for each test.

I've tried this (and deployed to production temporarily since 10am PT) and it still doesn't fix this issue.
bummer :(  I think we need to go to plan Band solve this next week/quarter.
Ok. Let's go with this to simplify things. Once I'll land I'll rename this to a code improvement bug and will file a follow to figure out whatever stoopid mistake I am making with the blank logs.
Attachment #8792803 - Flags: review?(jmaher)
Comment on attachment 8792803 [details] [diff] [review]
bug-1300890-4-simple-filehandler-worker-log-v1.patch

Review of attachment 8792803 [details] [diff] [review]:
-----------------------------------------------------------------

::: phonetest.py
@@ -23,4 @@
>  from adb import ADBError
>  from logdecorator import LogDecorator
>  from phonestatus import PhoneStatus
> -from sensitivedatafilter import SensitiveDataFilter

why are we removing this? just to simplify logging?
Attachment #8792803 - Flags: review?(jmaher) → review+
(In reply to Joel Maher ( :jmaher) from comment #17)
> Comment on attachment 8792803 [details] [diff] [review]
> bug-1300890-4-simple-filehandler-worker-log-v1.patch
> 
> Review of attachment 8792803 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: phonetest.py
> @@ -23,4 @@
> >  from adb import ADBError
> >  from logdecorator import LogDecorator
> >  from phonestatus import PhoneStatus
> > -from sensitivedatafilter import SensitiveDataFilter
> 
> why are we removing this? just to simplify logging?

Previously we had

* a logging.handlers.TimedRotatingFileHandler for the worker which persisted the log on the server. This logger had a SensitiveDataFilter attached to prevent sensitive data from being logged.

* a logging.FileHandler for the test which was created during PhoneTest.setup_job and destroyed in PhoneTest.teardown_job. This FileHandler also had a SensitiveDataFilter attached (probably redundantly) for the same reason.

Since we used the same root logger in the worker and test, logging messages would appear in both the worker's log and in the test's log.

The "advantage" of this approach was that we could "supposedly" easily upload the log for the specific test run to Treeherder while at the same time keeping a full log of the worker's operation and of all of the test runs in the worker's log on the server.

The new approach uses only the worker's log file which is uploaded to Treeherder. We no longer have to create or destroy a FileHandler for the test log and no longer have to add a filter to the test's FileHandler. The single log is now truncated either on first use or on first use after the previous test's PhoneTest.teardown_job which closed it. We lose the ability to keep a log of everything for the worker on the server. We gain a bit in terms of reducing complexity and marginally in logging more of the worker's operations in the log to be uploaded to Treeherder (before PhoneTest.setup_job).

Since we are not creating the additional FileHandler in phonetest.py, we no longer need to add or import SensitiveDataFilter in phonetest.py.

I could probably keep the separate worker and test logs thus keeping the complete worker log and simplify things a little bit by only creating the test's FileHandler once during __init__ and taking advantage of the automatic truncation when mode='w'. It is a question whether creating the FileHandler in the main autophone process and then using it during the test's execution in the worker's process would cause any problems.

Since the original motiviation was to fix the issue with the missing TEST-PASS, TEST-UNEXPECTED-FAIL message and the fact that this patch *did not help*, we could just drop it and leave the situation as it currently stands.

Which version do you like / find more understandable?
the new code is easier to read, so lets land this patch!
https://github.com/mozilla/autophone/commit/c3a145eb6be124cc715931dedd07519d7a4814f4
https://github.com/mozilla/autophone/commit/b1a5f7f5976906251cf256d8c5f894b9e300f2da
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Summary: Autophone - Perf tests do not emit Treeherder recognizable failure log lines → Autophone - fix some pylint issues and simplify logging
See Also: → 1304063
You need to log in before you can comment on or make changes to this bug.