Closed Bug 948912 Opened 11 years ago Closed 10 years ago

Mozmill fails to send large reports on OS X 10.6

Categories

(Testing Graveyard :: Mozmill, defect, P1)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: noni, Assigned: cosmin-malutan)

References

Details

(Whiteboard: [mozmill-2.0.4+])

Attachments

(2 files, 3 obsolete files)

There are no memory results for Mac OS X 10.6.8 since Nov 28th.

Reports: http://mozmill-daily.blargon7.com/#/endurance/charts?branch=All&platform=Mac&from=2013-11-17&to=2013-12-07
I see the following in the console log for a recent endurance testrun on 10.6:

01:28:20.147 Sending results to 'http://mozauto.iriscouch.com/mozmill-daily/' failed ([Errno 35] Resource temporarily unavailable).
01:28:20.147 *** Removing old installation at /var/folders/GO/GOxsDh5sGOS22wqEa1t6VU+++TI/-Tmp-/tmpn83DFy.binary/FirefoxNightly.app
01:28:20.148 *** Removing test repository '/var/folders/GO/GOxsDh5sGOS22wqEa1t6VU+++TI/-Tmp-/tmpfyaZeS.mozmill-tests'
01:28:20.400 Archiving artifacts
01:28:20.440 WARN: No artifacts found that match the file pattern "screenshots/". Configuration error?
01:28:20.486 Recording test results
01:28:22.114 No emails were triggered.
01:28:22.241 Finished: SUCCESS

The two issues here are the failure to submit the report, but also that the build succeeded despite this failure. It's also affecting other testruns from this node.
This appears to be affecting all testruns on all 10.6 nodes. How has this gone unnoticed?
Severity: normal → blocker
Priority: -- → P1
Summary: No memory results on endurance testruns for Mac OS X 10.6.8 → No results on any testruns for Mac OS X 10.6.8 are appearing in the dashboards
Whiteboard: [qa-automation-blocked]
Looking into this.
Assignee: nobody → andrei.eftimie
Status: NEW → ASSIGNED
I've successfully ran an endurance testrun on mm-osx-106-3 directly from the machine.
http://mozmill-crowd.blargon7.com/#/endurance/report/8ff37f5518a39f6af2983cfe44a87ee7

Lets see if I successfully trigger a testrun from jenkins
(In reply to Dave Hunt (:davehunt) from comment #2)
> This appears to be affecting all testruns on all 10.6 nodes. How has this
> gone unnoticed?

This question has to go to Cornel. I would also be interested in an answer.
Flags: needinfo?(cornel.ionce)
We got results on newer Mac OS X versions for the endurance tests so I thought that tests will not be performed anymore on 10.6.8.
Flags: needinfo?(cornel.ionce)
For anything you are not clear about you should really ask for further information, and not assume something. Please obey this in the future.
From the Console Log of such a failed build:
> 04:23:22 [mozilla-central_endurance] $ mozmill-env-$ENV_PLATFORM/run testrun_endurance --iterations=$ITERATIONS --entities=$ENTITIES --delay=$DELAY --repository=mozmill-tests --junit=report.xml --screenshot-path=screenshots --report=$REPORT_URL builds/
> 04:23:25 Checksumming Driver Descriptor Map (DDM : 0)…
> 04:23:25      Driver Descriptor Map (DDM : 0): verified   CRC32 $9EC54A8B
> 04:23:25 Checksumming Apple (Apple_partition_map : 1)…
> 04:23:25      Apple (Apple_partition_map : 1): verified   CRC32 $0CFD7826
> 04:23:25 Checksumming DiscRecording 6.0d1 (Apple_HFS : 2)…
> 04:23:36  DiscRecording 6.0d1 (Apple_HFS : 2): verified   CRC32 $CDC848A2
> 04:23:36 verified   CRC32 $A348F310
> 04:23:39 /dev/disk1           Apple_partition_scheme          
> 04:23:39 /dev/disk1s1         Apple_partition_map             
> 04:23:39 /dev/disk1s2         Apple_HFS                       /private/var/folders/GO/GOxsDh5sGOS22wqEa1t6VU+++TI/-Tmp-/tmpTo63mB
> 04:23:50 "disk1" unmounted.
> 04:23:50 "disk1" ejected.
> 04:23:52 2013-12-10 04:35:13.296 firefox-bin[29576:890b] invalid pixel format
> 04:23:52 2013-12-10 04:35:13.298 firefox-bin[29576:890b] invalid context
> 04:23:52 2013-12-10 04:35:13.299 firefox-bin[29576:890b] invalid pixel format
> [...]
> 05:46:47 Tue Dec 10 05:58:08 mm-osx-106-1.qa.scl3.mozilla.com firefox-bin[29747] <Error>: CGContextSetInterpolationQuality: invalid context 0x0
> 05:46:47 Tue Dec 10 05:58:08 mm-osx-106-1.qa.scl3.mozilla.com firefox-bin[29747] <Error>: CGContextDrawImage: invalid context 0x0
> 05:47:03 2013-12-10 05:58:24.981 firefox-bin[29765:920b] invalid pixel format
> 05:47:03 2013-12-10 05:58:24.983 firefox-bin[29765:920b] invalid context
> 05:47:03 2013-12-10 05:58:24.984 firefox-bin[29765:920b] invalid pixel format
> 05:47:03 2013-12-10 05:58:24.984 firefox-bin[29765:920b] invalid context
> 05:48:26 *** Cloning test repository to '/var/folders/GO/GOxsDh5sGOS22wqEa1t6VU+++TI/-Tmp-/tmpfyaZeS.mozmill-tests'
> 05:48:26 *** Platform: Mac OS X 10.6.8 64bit

Notice that we wait almost 1.5 hours to actually clone the repository and start the test.
I've triggered a testrun from Jenkins, and it is actually running tests the whole time.
Therefore comment 8 is probably safe to ignore.

The output is probably skewed in the log (I don't see how else we could *clone* the tests repository after we've already run the tests).
If I remember correctly this non-linear output is known.
Interestingly if the test finishes early (we've had some jsbridge problems today), we successfully send the report (see http://mozmill-daily.blargon7.com/#/functional/report/40742e0746fd767e6d2fd5865957e192 ).

Wondering if the document size could be the problem.
I checked yesterday the proxy settings and I restarted the node to be sure they are applied and the testruns still failed.
Any references for testruns? Also how does 10.6 in staging behave? Has anyone checked that yet?
On staging setuptools where not installed on mm-osx-106 so all testruns were aborted.
I fixed that and after I've got the same error as on production:

>08:02:41 TEST-START | testToolbar/testHomeButton.js | teardownModule
>08:02:41 TEST-END | testToolbar/testHomeButton.js | finished in 325ms
>08:02:44 RESULTS | Passed: 89
>08:02:44 RESULTS | Failed: 0
>08:02:44 RESULTS | Skipped: 17
>08:02:45 Sending results to 'http://mozauto.iriscouch.com/mozmill-staging/' failed ([Errno 35] Resource temporarily unavailable).
>08:02:45 *** Removing profile: /Users/mozauto/jenkins/workspace/mozilla-central_functional/data/profile
>08:02:45 *** Uninstalling build: /Users/mozauto/jenkins/workspace/mozilla-central_functional/data/binary/FirefoxNightly.app
Interesting. I might have missed to install it when I upgraded to Python 2.7. So this is always reproducible, Cosmin? In this case we should have a little testcase on that machine, extracted from the code in mozmill which sends the report. Then we can load a locally saved report (--report=file://report.json) for testing purposes.
Attached file send_report.py
(In reply to Henrik Skupin (:whimboo) from comment #15)
> In this case we should have a little testcase on that machine, extracted from the code in mozmill which
> sends the report. Then we can load a locally saved report
> (--report=file://report.json) for testing purposes.
Here is the code that sends the report.
First I ran a testrun with --report=file://report.json
Then when I run "send_report.py report.json http://mozmill-crowd.blargon7.com/db/" it reproduce the failure every time, so I would say this is a testcase.
Assignee: andrei.eftimie → cosmin.malutan
What is the error we get back from urllib2 given that we most likely hide the real one in Mozmill.
Is the last exception urllib2.URLError:
urllib2.URLError <urlopen error [Errno 35] Resource temporarily unavailable>
Asserting this is a urllib / python issue on OSX yields interesting results:
https://github.com/shazow/urllib3/issues/63
One hacky solution there was to monkeypatch socket.sendall

The issue tracked in python:
http://bugs.python.org/issue8493
A solution here would be to use a timeout on the socket connection that would actually make the call non-blocking. Not sure where we would need that and if that might be a solution for us.
What I wonder is why haven't we seen this before? What has changed on our systems that this problem started to appear? When have we sent the last report with an OS X 10.6 machine? All that should help us to figure out why this is happening now.
We're using Python 2.7.3 (April 2013) so that is not it.
(In reply to Andrei Eftimie from comment #21)
> We're using Python 2.7.3 (April 2013) so that is not it.

Not sure what this should mean. Can you please clarify?
(In reply to Henrik Skupin (:whimboo) from comment #22)
> (In reply to Andrei Eftimie from comment #21)
> > We're using Python 2.7.3 (April 2013) so that is not it.
> 
> Not sure what this should mean. Can you please clarify?

I remember we upgraded python versions some time ago, but details are fuzzy (and I might have misremembered), so I checked which Python we are using on the 10.6 boxes.
Right, but I want to know the dates since when we are not able to push to the dashboard. That I have to know to make a relation to the Python 2.7 upgrade.
On staging reports for OS X 10.6 have been sent until Dec 19th. Then it stopped. Via bug 950007 we got the mini for 10.6 in January. So the borrowed box from production showed the problem. Can you please check the 106-4 machine when Python 2.7 has been installed? I might give us an indication.

Beside this has someone tried to run Mozmill 1.5 on such a box and send results? I doubt so. Please do it given that it might even be a regression in Mozmill 2.0.
So sometimes we send a report, but then also get this failure message. This is strange:

> View the build in Jenkins:
> http://mm-ci-master.qa.scl3.mozilla.com:8080/job/mozilla-central_functional/19889/
>
> View the results in the Mozmill Dashboard:
> http://mozmill-daily.blargon7.com/#/functional/report/f892d42cad846376757d7854017d0ef8
I checked with mozmill 1.5.24 and we get the same error, the last update we had was in the time span when it stopped sending reports was the nodes updates(bug 929931), so I'm inclined to think this is caused by the os update.
No, I wouldn't do that given that this could cause a hang. Instead we should really catch the EAGAIN exception and try sending the report again. This should also work, right?
I tried to try-catch and send the report again but it did't helped as I ran in a infinite loop:
RuntimeError: maximum recursion depth exceeded

I will investigate this more, but I didn't made any other progress today.
I don't know what you have done here. But implementing this properly should never require a recursive call.
Attached patch patch_v1.0 (obsolete) — Splinter Review
(In reply to Henrik Skupin (:whimboo) from comment #30)
> No, I wouldn't do that given that this could cause a hang. Instead we should
> really catch the EAGAIN exception and try sending the report again. This
> should also work, right?
It took me a while to find this fix. so I catch the EAGAIN exception and try to send the package again. I did this by overwrithing the sendall method because the at our level we get the urllib2 error.

The fix I've got from https://github.com/shazow/urllib3/issues/63#issuecomment-15447770

If this looks better let me know so I can create an issue on github.
Attachment #8367354 - Flags: feedback?(hskupin)
Comment on attachment 8367354 [details] [diff] [review]
patch_v1.0

Review of attachment 8367354 [details] [diff] [review]:
-----------------------------------------------------------------

I think that is fine but please add a see url so people know where it has been taken from. Also update do our coding style and fix spelling issues.
Attachment #8367354 - Flags: feedback?(hskupin) → feedback+
Component: Infrastructure → Mozmill
Product: Mozilla QA → Testing
QA Contact: hskupin
Summary: No results on any testruns for Mac OS X 10.6.8 are appearing in the dashboards → Mozmill fails to send test results to couchdb on OS X 10.6
Whiteboard: [qa-automation-blocked]
Not a blocker but a major loss of functionality given that reports cannot be analyzed for failures. Sadly we can't wait with 2.0.4 for a fix here, unless we can land it today. For now setting as blocker for 2.0.5.
Severity: blocker → major
Whiteboard: [mozmill-2.0.5+]
We do not work with pull requests for Mozmill. So please add an attachment to this bug, and as mentioned fix the coding style and the spelling errors in the taken code.
Attached patch patch_v1.1 (obsolete) — Splinter Review
Hi, here is the patch, I had it in internal review, I hope I didn't miss anything.
Attachment #8367825 - Flags: review?(hskupin)
Comment on attachment 8367825 [details] [diff] [review]
patch_v1.1

Review of attachment 8367825 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozmill/mozmill/report.py
@@ +14,5 @@
>  
> +# Due to an issue in urllib2 on mac we get EAGAIN (Errno 35) exception when
> +# sending bigger reports https://github.com/shazow/urllib3/issues/63
> +# Fix from https://github.com/shazow/urllib3/issues/63#issuecomment-15447770
> +if sys.platform == "darwin":

Please enhance this check so we only modify this method for 10.6.

@@ +15,5 @@
> +# Due to an issue in urllib2 on mac we get EAGAIN (Errno 35) exception when
> +# sending bigger reports https://github.com/shazow/urllib3/issues/63
> +# Fix from https://github.com/shazow/urllib3/issues/63#issuecomment-15447770
> +if sys.platform == "darwin":
> +    # Monkey path socket.sendall to handle EAGAIN (Errno 35) on mac.

nit: patch and not path

@@ +31,5 @@
> +                    time.sleep(0.1)
> +                else:
> +                    raise e
> +
> +    socket._socketobject.sendall = socket_sendall

I would like to see the original implementation of socket_sendall. Can you please give me a link or a diff, so I can see the changes?
Attachment #8367825 - Flags: review?(hskupin) → review-
Attached patch patch_v1.2 (obsolete) — Splinter Review
I've done that, here is the source code of method is C
http://svn.python.org/view/python/trunk/Modules/socketmodule.c?view=markup#l2729
Attachment #8367825 - Attachment is obsolete: true
Attachment #8367870 - Flags: review?(hskupin)
Comment on attachment 8367870 [details] [diff] [review]
patch_v1.2

Review of attachment 8367870 [details] [diff] [review]:
-----------------------------------------------------------------

This is mostly done. Just two nits here to get fixed. See my inline comment. I assume this has been tested by running a lot of testruns on 10.6? If not please do that and send all to the crowd dashboard. Especially those reports which can become larger.

::: mozmill/mozmill/report.py
@@ +14,5 @@
>  
>  
> +# Due to an issue in urllib2 on mac we get EAGAIN (Errno 35) exception when
> +# sending bigger reports https://github.com/shazow/urllib3/issues/63
> +# Fix from https://github.com/shazow/urllib3/issues/63#issuecomment-15447770

It will be enough to list the link once. Kill the first one.

@@ +15,5 @@
>  
> +# Due to an issue in urllib2 on mac we get EAGAIN (Errno 35) exception when
> +# sending bigger reports https://github.com/shazow/urllib3/issues/63
> +# Fix from https://github.com/shazow/urllib3/issues/63#issuecomment-15447770
> +if sys.platform == "darwin" and platform.mac_ver()[0][:4] == "10.6":

I would feel better when we use mozinfo here like:

if mozinfo.isMac and '10.6' in mozinfo.version:
Attachment #8367870 - Flags: review?(hskupin) → review-
Lets move to mozmill 2.0.4 given that we don't have landed all the other patches yet.
Blocks: 960495
No longer blocks: 965216
Whiteboard: [mozmill-2.0.5+] → [mozmill-2.0.4+]
Comment on attachment 8367941 [details] [diff] [review]
patch_v2.0

Review of attachment 8367941 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozmill/mozmill/report.py
@@ +14,5 @@
>  
> +# Due to an issue in urllib2 on mac we get EAGAIN (Errno 35) exception when
> +# sending bigger reports. Fix from:
> +# https://github.com/shazow/urllib3/issues/63#issuecomment-15447770
> +if mozinfo.isMac and "10.6" in mozinfo.version:

In Python we use single quotes around strings. And in general you should follow the coding styles of the file. So to get this landed I will change this before merging the patch.
Attachment #8367941 - Flags: review?(hskupin) → review+
Comment on attachment 8367354 [details] [diff] [review]
patch_v1.0

Please make older patches obsolete to reduce confusion. Thanks.
Attachment #8367354 - Attachment is obsolete: true
https://github.com/mozilla/mozmill/commit/b55ed4cfa25df9672d8505330cdf925aa3427e3c (master)
https://github.com/mozilla/mozmill/commit/3967440b449439b181dd8a847febf26702a8fb6b (hotfix-2.0)

Thanks Cosmin! Good work, and glad to see we were able to include it for version 2.0.4.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Summary: Mozmill fails to send test results to couchdb on OS X 10.6 → Mozmill fails to send large reports on OS X 10.6
This is actually not fixed. I have seen this problem again today on staging with Mozmill 2.0.5:
http://mm-ci-staging.qa.scl3.mozilla.com:8080/job/mozilla-central_functional/777/console
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [mozmill-2.0.4+]
This did not fail again in the last week, so it might was a real network issue here. Previously it didn't ever sent the reports on the affected nodes.
I really  think that the underlying problem from this bug has been fixed.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Thanks for checking that Cosmin! Good to hear that it was a temporarily glitch only.
Whiteboard: [mozmill-2.0.4+]
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: