Closed Bug 1195288 Opened 9 years ago Closed 9 years ago

consider using python webserver for production talos

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(firefox44 fixed)

RESOLVED FIXED
mozilla44
Tracking Status
firefox44 --- fixed

People

(Reporter: jmaher, Assigned: jmaher)

References

Details

Attachments

(6 files, 1 obsolete file)

Many times in production when machines get reimaged we end up with talos failures because the apache webserver path is not set to what we copy the talos pagesets to.

I question the need for this, mostly since we have a python webserver built into talos for use with --develop.  In fact, if we always used this, it would simplify a lot of things with our production and testing locally vs in production.  Also it would simplify our machine configs in production as well as the mozharness scripts.

The biggest concern I can think of is that we would be introducing another process to the system and might add more noise to the tests.

I guess the steps would be run all talos tests, if it works smoothly, then retrigger 20 times and measure the noise vs something off of m-c (based on 20 times).

This offers some benefits for cleanup in the code- more specifically the commandline flags for webserver and related logic around starting the webserver.  Maybe even the replacement of localhost could be removed.
(In reply to Joel Maher (:jmaher) from comment #0)
> Many times in production when machines get reimaged we end up with talos
> failures because the apache webserver path is not set to what we copy the
> talos pagesets to.
> 
> I question the need for this, mostly since we have a python webserver built
> into talos for use with --develop.  In fact, if we always used this, it
> would simplify a lot of things with our production and testing locally vs in
> production.  Also it would simplify our machine configs in production as
> well as the mozharness scripts.
> 
> The biggest concern I can think of is that we would be introducing another
> process to the system and might add more noise to the tests.

+10000000

I would try to make sure that the web server is running in its own process, just to make sure that the harness doesn't interfere with serving web traffic (which could affect results). You can do this with concurrent.futures, which IIRC was just imported into the tree. Aside from that, I don't see any reason why this shouldn't just work and I doubt the numbers will change significantly.
simple hack here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=80d105fa0381

lets make sure it works solid, then look at actually doing this correctly with concurrent.futures, etc.
this patch does the trick.  I forgot that mozharness does the '--webserver,localhost' which was invalidating my code.  These changes help cleanup the --develop flag, etc.

We could do more cleanup:
* remove 15707 as the default port, although we need a way to have a port as default
* ensure we run the webserver in another process

parkouss, you indicated interest in hacking on this a bit- I trust your opinions.
Attached patch 1195288.patch (obsolete) — Splinter Review
Well I assigned myself this bug, I thought this was ok (feel free to change that if not!)

So there is a bit more cleanup on this patch, and it takes a free port (the port is no more hardcoded). We should remove the --webServer option as it is useless now - but since the harness use that, we can do that later.

Hmm, I don't really see why ensuring to run the webserver in a new process would help ? I would say that we are good if that work like this on try (please tell me if I'm missing something). But if we really need it, this could be done using plain old subprocess, or maybe multiprocessing (both standard libraries).

This needs to be tested on try, but locally this works fine.
Assignee: nobody → j.parkouss
Status: NEW → ASSIGNED
Attachment #8649216 - Flags: review?(jmaher)
Comment on attachment 8649216 [details] [diff] [review]
1195288.patch

Review of attachment 8649216 [details] [diff] [review]:
-----------------------------------------------------------------

I don't like the nested try loops, it should be fine, but that looks sloppy and depending on random interpretter versions/osflavors we might end up with slightly different behaviour.

Also I would like to ensure we have a port >10000, so it is non standard and potentially non conflicting with other random things which might be hardcoded.  I recall something in moznetwork or mozhttpd in the past that found an open port, it might have been removed though.

::: talos/config.py
@@ +371,5 @@
> +    import socket
> +    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +    sock.bind(('', 0))
> +    port = sock.getsockname()[1]
> +    sock.close()

does this logic work on windows, osx, linux?

::: talos/run_tests.py
@@ +110,5 @@
>          test['setup'] = utils.interpolate(test['setup'])
>          test['cleanup'] = utils.interpolate(test['cleanup'])
>  
>      # pass --no-remote to firefox launch, if --develop is specified
> +    # TODO: why such a difference ?

--no-remote allows us to launch firefox when there is another instance running.  This way developers of talos do not need to close their main instance.

@@ +204,2 @@
>  
> +            try:

do we need this inner try loop?  it seems we could remove this try call given the new one above, keep the except, and keep the newly added finally.
Attachment #8649216 - Flags: review?(jmaher) → review-
I will let wlach comment about a new process vs spinning the webserver up in the process.
(In reply to Joel Maher (:jmaher) from comment #5)
> Comment on attachment 8649216 [details] [diff] [review]
> 1195288.patch
> 
> Review of attachment 8649216 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't like the nested try loops, it should be fine, but that looks sloppy
> and depending on random interpretter versions/osflavors we might end up with
> slightly different behaviour.

Still, this is better than having three times httpd.close() I think. I was not writing any
big refactoring here. And we are using python 2.7 everywhere, there is no known bug on
nested try or finally for this version as far as I know.

> Also I would like to ensure we have a port >10000, so it is non standard and
> potentially non conflicting with other random things which might be
> hardcoded.  I recall something in moznetwork or mozhttpd in the past that
> found an open port, it might have been removed though.

The code I wrote basically ask the OS to choose a FREE port - that is, not in use.
So it can't conflict with any other opened previous port.

http://stackoverflow.com/questions/1365265/on-localhost-how-to-pick-a-free-port-number

> ::: talos/config.py
> @@ +371,5 @@
> > +    import socket
> > +    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> > +    sock.bind(('', 0))
> > +    port = sock.getsockname()[1]
> > +    sock.close()
> 
> does this logic work on windows, osx, linux?

I think yes. Try will say.

> ::: talos/run_tests.py
> @@ +110,5 @@
> >          test['setup'] = utils.interpolate(test['setup'])
> >          test['cleanup'] = utils.interpolate(test['cleanup'])
> >  
> >      # pass --no-remote to firefox launch, if --develop is specified
> > +    # TODO: why such a difference ?
> 
> --no-remote allows us to launch firefox when there is another instance
> running.  This way developers of talos do not need to close their main
> instance.

Ok thanks. I could update with a comment so everyone will understand.

> @@ +204,2 @@
> >  
> > +            try:
> 
> do we need this inner try loop?  it seems we could remove this try call
> given the new one above, keep the except, and keep the newly added finally.

We don't need it. We can keep the httpd.close call in three places, but basically
that's why finally are for, avoid code duplication en ensure that something is done.

A better solution might be to rewrite this function, but this was not intended here.
I might have miscommunicated my comment about the try block.  I like the outer try/finally- could we remove the inner try since we have an outer try?

regarding the free/open port, I was thinking of the old model where we configure, then run later- I think we can ignore my concerns about a free/open port.
Attached patch 1195288.patchSplinter Review
Ok, fixed the inner try. This is a bit strange because we use a variable that is used from inside the loop in the try, but I agree it is still better than two nested try.

I will open a bug for the logcat stuff as we discussed on irc.
Attachment #8649216 - Attachment is obsolete: true
Attachment #8649917 - Flags: review?(jmaher)
Comment on attachment 8649917 [details] [diff] [review]
1195288.patch

Review of attachment 8649917 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/run_tests.py
@@ +204,2 @@
>  
> +            mozfile.remove('logcat.log')

this line can be removed now or in a future patch.

@@ +231,4 @@
>          print_logcat()
>  
>      elapsed = timer.elapsed()
>      print "cycle time: " + elapsed

while cleaning up messages, I don't believe this cycle time is useful at all.
Attachment #8649917 - Flags: review?(jmaher) → review+
Thanks Joel. Try is green, I think we could land that.

I created bug 1196287 for logcat, and I suggest to create a new bug to clean log messages that we don't find useful. What do you think ?
lets land and file a new bug for pointless print/log messages :)
Landed in https://hg.mozilla.org/build/talos/rev/335a890ad040.

We should keep this bug open to fix mozharness, and completely remove the --webServer option after that.
Blocks: 1197483
this is live in production, A lot of test values have changed and I suspect it is related to this.
this failed on windows due to the length of the file (255 max characters).  The talos root is 47 characters and the apache root is 39 characters.

src root: C:\slave\test\build\talos_repo\talos\page_load_test
apache root: c:/slave/talos-data\talos/page_load_test

one example of a long filename is:
tp5n/amazon.com/www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B002Y27P3M/%5C%22http%3A/g-ecx.images-amazon.com/images/G/01/x-locale/personalization/yourstore/message-bullet._V192186523_.gif%5C%22.html

could we just rename page_load_test in talos to be: tests?
We had to back talos out for this change due to tp5 not installing the pageset correctly, we realized that we couldn't install the pageset correctly due to Windows filename limits of 255 maxchars.

This patch replaces everything in the talos repository of s/page_load_test/tests/.  To add to the confusion, tp5n.zip is downloaded and tp5o.manifest is in there with hardcoded page_load_test directory references.  So when we create the tp5o.manifest.develop file, I do the replacement there.

In addition to this, we need to change mozharness talos.py and testing/talos/talos.json to fix references and installation points.
Assignee: j.parkouss → jmaher
Attachment #8652402 - Flags: review?(j.parkouss)
Attachment #8652402 - Flags: feedback?(wlachance)
Attachment #8652403 - Flags: review?(wlachance)
Attachment #8652403 - Flags: feedback?(j.parkouss)
Comment on attachment 8652402 [details] [diff] [review]
convert talos page_load_test directory to tests (1.0)

Nice catch. :)

The patch looks fine - but maybe we could try to fix the root cause (later?) ? I wonder if it is it normal to have such long file names in the zip file.
Attachment #8652402 - Flags: review?(j.parkouss) → review+
the .zip file has wget of webpages, in this case we get some long filenames, especially when there are subdirs related to it.  Ideally we will come up with a tp6 sometime and that will have updated webpages and be more friendly to run locally :)
Comment on attachment 8652403 [details] [diff] [review]
changes to talos.py and talos.json for running the harness (1.0)

Review of attachment 8652403 [details] [diff] [review]:
-----------------------------------------------------------------

You can probably remove "python_webserver", "webroot" and "populate_webroot" inside (m-c)/testing/mozharness/configs/talos/*config.py files, but this can be addressed as a follow-up.

::: testing/mozharness/mozharness/mozilla/testing/talos.py
@@ -474,5 @@
>          }
>          self.vcs_checkout(**repo)
>          self.has_cloned_talos = True
>  
>          # the apache server needs the talos directory (talos/talos)

This comment should be removed I think.
Attachment #8652403 - Flags: feedback?(j.parkouss) → feedback+
landed the talos changes:
https://hg.mozilla.org/build/talos/rev/57baf0c8ef5d

been waiting all day for windows retriggers.
Comment on attachment 8652402 [details] [diff] [review]
convert talos page_load_test directory to tests (1.0)

Review of attachment 8652402 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/config.py
@@ +457,5 @@
>  
>      # write modified manifest lines
>      with open(manifestName + '.develop', 'w') as newHandle:
>          for line in manifestLines:
> +            newline = line.replace('localhost', config['webserver'])

Could you add some kind of comment here explaining what you're doing? I suspect at some point we'll want to remove this hack.
Attachment #8652402 - Flags: feedback?(wlachance) → feedback+
Comment on attachment 8652403 [details] [diff] [review]
changes to talos.py and talos.json for running the harness (1.0)

Review of attachment 8652403 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/mozharness/mozilla/testing/talos.py
@@ +476,5 @@
>          if c.get('use_talos_json'):
>              if self.query_pagesets_url():
>                  self.info("Downloading pageset...")
> +                src_talos_pageset = os.path.join(src_talos_webdir, 'tests')
> +                self._download_unzip(self.pagesets_url, src_talos_pageset)

Wow, was the rest of that unnecessary? Nice cleanup!
Attachment #8652403 - Flags: review?(wlachance) → review+
did a final push to try, if all is well I will land in the morning.
Blocks: 1198510
windows XP has G2 failures, specifically tps.  from the logs, I just see it hanging after (or during) the loading.

a try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cdadff8bc96

using some experimentations, I have come to the conclusion that we can support 45 (of the 50) pages, and it isn't dependent upon the pages (at least any of the last 5 pages in the list), once we have the 46th page, we hang, even if the 46th page is a duplicate of an existing page which works.

Why is this failing with the new webserver?  It most likely is related to:
* we are not serving content the same, therefore we are serving more/less and causing an error.  Odd that we don't see this in tp5o_scroll, xperf, or tp5o
* we are hitting memory limitations due to the python webserver being run in the same process.

:wlach, you had mentioned running the python webserver outside the process, that might be worthwhile to try out.

Another hack is removing the 5 pages, but when will we hit a limit and need to remove another page.

I think it is time to look at running the webserver in a unique process.
ok, using multiprocess didn't do anything different, ok for 45, failed for 46+.

Not sure of what else to try.  I would be interested to try a few things out if folks have ideas.
It's possible (maybe even likely) that the problem is in mozhttpd.

It might be instructional to compare mozhttpd against wptserve, which is based on similar ideas but might have more testing. I believe we have a copy of wptserve somewhere in the tree so you could also just cargo-cult using it instead of mozhttpd and see if it helps.

https://github.com/w3c/wptserve/blob/master/wptserve/server.py

http://hg.mozilla.org/mozilla-central/file/fea87cbeaa6b/testing/mozbase/mozhttpd/mozhttpd/mozhttpd.py

James, have you seen anything like :jmaher's problem with wptserve? See comment 26 above.
I don't recall seeing problems like that, but I run the server in its own process (using multiprocessing).

At one point there was a problem on OSX due to an inability to queue enough requests.

If you want to try wptserve it's in testing/web-platform/tests/tools/wptserve.
using wptserve is easy, unfortunately it doesn't allow the 45 pages to run whereas mozhttpd does.  Other tests run just fine with it.

I am leaning towards resource usage as the issue here.
(In reply to Joel Maher (:jmaher) from comment #30)
> using wptserve is easy, unfortunately it doesn't allow the 45 pages to run
> whereas mozhttpd does.  Other tests run just fine with it.
> 
> I am leaning towards resource usage as the issue here.

I guess the other possibility is that there's a genuine bug in the test or browser (i.e. a race condition) which is only triggered on WinXP when mozhttpd is used.
Based on the conversation in irc I suspect the problem is that the default python SocketServer/ThreadingMixin is too slow to accept connections, and Windows XP only has a rather small listen() buffer, so if more connections are requested than then they will be dropped (this is the same problem as the OSX one I alluded to above, but OSX allowed the buffer size to be increased more).

I think the way to fix this is to make accepting a connection faster. I imagine that using a threadpool would help since then the only work you need to do synchronously after accepting a connection is to add it to a queue rather than spawning a thread. I haven't looked at this in detail though.
So, we found the real issue.

This is related to mozharness, because it run the talos process with mozprocess.ProcessHandler (I think an old copy of it). Running the webserver with raw subprocess fixes the issue it seems.

Joel pushed to try so we can be sure of it:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=298ca8c1a63d

Note that his patch only remove the output_timeout arg from the mozharness run_command, forcing to use subprocess then:

https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1077
Well, I'm sorry, we made a false assumption...

G2 is still broken on winXP. :jmaher, can you confirm this is the same issue ?
We discussed a bit on irc with Joel, and as I said maybe we should look for a workaround since it is hard to fix that issue. Some ideas:

- consider reducing the number of loaded pages only for windows xp
- consider keeping apache only for windows xp
maybe we should consider not running tps on windows xp :)

In all seriousness, we should determine if we can do something based on the OS=='winxp', if not then we need to consider reducing the pageset or not running tps.

We detect the os version here:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/ttest.py?case=true&from=ttest.py#38

That is mainly used for counters, but I imagine we could pass the flags into tp to reduce the pageset?  this could be done by:
* adding platform_type to https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/ttest.py?case=true&from=ttest.py#143
** result in cli flag to tp that limits pages to 'x'
* modify pageloader to have --tplimitedpageset: https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/pageloader/components/tp-cmdline.js#84
* and then modify pageloader.js to reduce the pages when we load the URI: https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/pageloader/chrome/pageloader.js?offset=0#133

Instead of crazy modifications, we could then hack in ttest if platform_type == 'winxp', then develop=False.  The problem there is we need to then hack the manifest and path locations to support the changes required to run tp5 pageset in automation.


I would like to know from blassey how important this test is on windows XP specifically.  My understanding is winxp is <13% marketshare (and slowly declining).
Flags: needinfo?(blassey.bugs)
(In reply to Joel Maher (:jmaher) from comment #36)
> I would like to know from blassey how important this test is on windows XP
> specifically.  My understanding is winxp is <13% marketshare (and slowly
> declining).

I would like data for XP, it seems like we should be able to reduce the page set on XP.
Flags: needinfo?(blassey.bugs)
Parkous, can you take a stab at this?  I would be able to pick this up randomly this week or next- so we could move forward faster if you have time.
The push to try in comment 39 is basically what we did before for this bug, but rebased (and using the talos in-tree). A bit more cleanup for the mozharness side.

So I want to see how that goes (if all goes well, winxp g2 should be busted), then I will look to implement something based on comment 36.
Bug 1195288 - consider using python webserver for production talos. r=jmaher

Always use an in-process webserver, removing the need for apache - and
hopefuly providing better accuracy for numbers.

This means that we know have to copy the pagesets in the talos dir on
harness.

On windows, some pagesets paths were too long due to that, so the
solution is to replace "page_load_test" with "tests".
Attachment #8666396 - Flags: review?(jmaher)
We should get back with the patch in comment 41 with what we had - at least the try push in comment 39 seems to reflect that.

So we can go on by reducing the page set on win xp from there.
Attachment #8666396 - Flags: review?(jmaher) → review+
Comment on attachment 8666396 [details]
MozReview Request: Bug 1195288 - consider using python webserver for production talos. r=jmaher

https://reviewboard.mozilla.org/r/20541/#review18409

this looks great, and thanks for the try push.  Now we just need to modifications to limit the pageset!
Comment on attachment 8666396 [details]
MozReview Request: Bug 1195288 - consider using python webserver for production talos. r=jmaher

Bug 1195288 - consider using python webserver for production talos. r=jmaher

Always use an in-process webserver, removing the need for apache - and
hopefuly providing better accuracy for numbers.

This means that we know have to copy the pagesets in the talos dir on
harness.

On windows, some pagesets paths were too long due to that, so the
solution is to replace "page_load_test" with "tests".
Bug 1195288 - reduce tps page set number on windows xp. r=jmaher
Attachment #8666896 - Flags: review?(jmaher)
Comment on attachment 8666896 [details]
MozReview Request: Bug 1195288 - reduce tps page set number on windows xp. r=jmaher

https://reviewboard.mozilla.org/r/20617/#review18467

as we discussed on irc, this is great.  We need to do a lot of due diligence on the try server logs and verifying the numbers look good overall.
Attachment #8666896 - Flags: review?(jmaher) → review+
we should be good to land this,  I verified this is stable and that our only reduction in test pages is on windows xp for tps.  The question is, will hg work with all the renames?  I know there were issues importing the main patch.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: