Closed Bug 1164082 Opened 9 years ago Closed 9 years ago

Autophone - intermittent failure to submit results to phonedash

Categories

(Testing Graveyard :: Autophone, defect)

defect
Not set
major

Tracking

(firefox41 affected)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

Details

(Keywords: regression)

Attachments

(2 files)

Attached file Traceback
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=678c2a90dd64&exclusion_profile=false&filter-searchStr=autophone

https://autophone.s3.amazonaws.com/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-api-9/1431437021/autophone-autophone-s1s2-s1s2-nytimes-remote.ini-1-nexus-s-3.log

tests/perftest.py:publish_results should retry on a timeout error.

This has been happening to one degree or another since bug 1161784 and the foul-up with updating Python. Each failure loses a measurement and requires a manual retrigger to recover.
Attached file PR 33
This uses a hard-coded maximum of 3 attempts and a 10-second delay. I'm not sure it is worth parametrizing this.
Attachment #8604947 - Flags: review?(gbrown)
Comment on attachment 8604947
PR 33

I think this is fine as-is; the hard-coded 3 and 10 don't offend me in this case.

However, I find myself wondering if it might be useful to retry a wider group of errors. Could we recover from other conditions (temporary network glitches encountered at this point in the code, producing error codes other than 60)? If you wanted to try, 3 x 10 seconds seems like too little time -- I would retry for maybe 5 or 10 minutes in total. So then that starts getting more complicated...probably not justified if you are not encountering other problems.
Attachment #8604947 - Flags: review?(gbrown) → review+
Yeah, 3x10 might not be enough. If I fail to submit the results, I have to spend at least 7-10 minutes re-testing the build, so maybe 5 minutes total is reasonable.

I'm not sure of the cause of these errors... whether this is due to my problems with the Python upgrade or if there is some issue with phonedash or the network to it. Looking at https://docs.python.org/2/library/errno.html#module-errno I'm conflicted on which errors we would want to consider transient.

I'll update the patch to try 10 times with a delay of 30 seconds; import errno and change the test to be errno.ETIMEDOUT so we don't have the magic 60.
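For reference, the retry policy described above (10 attempts, 30 seconds apart, retrying only on the timeout error number) could be sketched roughly as follows. This is not the code from PR 33: `submit_with_retry`, the `submit` callable, and the `sleep` parameter are hypothetical names used only for illustration, and the sketch assumes the failure surfaces as an exception carrying an `errno` attribute.

```python
import time

MAX_ATTEMPTS = 10       # attempt count proposed in this bug
RETRY_DELAY = 30        # seconds between attempts, as proposed in this bug
RETRYABLE_ERRNOS = (60,)  # the "magic 60" discussed here; an assumption, not a vetted list


def submit_with_retry(submit, max_attempts=MAX_ATTEMPTS, delay=RETRY_DELAY,
                      retryable=RETRYABLE_ERRNOS, sleep=time.sleep):
    """Call submit() until it succeeds or the attempts run out.

    submit is a hypothetical zero-argument callable that raises an
    EnvironmentError (IOError/OSError) when the submission fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except EnvironmentError as e:
            # Give up right away on errors we do not consider transient,
            # and re-raise the last error once the attempts are exhausted.
            if e.errno not in retryable or attempt == max_attempts:
                raise
            sleep(delay)
```

With these defaults the worst case waits 9 x 30 seconds, about 4.5 minutes, which is close to the 5 minute budget mentioned above. Passing `sleep` as a parameter is only there so the delay can be stubbed out when testing.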
https://github.com/mozilla/autophone/commit/5fa555ae3d9de1dfd4a76e81adee732a57c3376e

I didn't use errno after all: it turns out 60 is errno.ENOSTR and 110 is errno.ETIMEDOUT. I couldn't figure out a good list of values to check, so I guess it will have to be empirical.
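Numeric errno values are platform specific, so one way to check what a given number means on the host running the code is to look it up in the standard `errno` module. A small sketch (the helper name `errno_name` is made up for this example):

```python
import errno
import os


def errno_name(code):
    """Return the symbolic name for a numeric errno on this platform."""
    return errno.errorcode.get(code, "unknown")


if __name__ == "__main__":
    # 60 and 110 are the two values discussed in this bug.
    for code in (60, 110):
        print("%d -> %s (%s)" % (code, errno_name(code), os.strerror(code)))
    print("errno.ETIMEDOUT = %d" % errno.ETIMEDOUT)
```

Running this on both the machine that raised the error and the machine where the constant was looked up would show whether the two disagree about what 60 means.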
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1165414
Product: Testing → Testing Graveyard