Closed Bug 770718 Opened 10 years ago Closed 5 years ago

Add websockets test to Talos

Categories: Testing :: Talos, defect
Platform: x86_64 All
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: (Reporter: cmtalbert, Unassigned)
References: (Depends on 1 open bug)
Attachments: (4 files)

From bug 768705, we need a regression test for websocket performance. We already have a websocket server that we use for mochitest, and so we can use that as our server implementation. However, the exact metrics of what the test should be and what it should measure are quite fuzzy to me.

For those not reading 768705, the gist is that *something* changed in the core platform and made our implementation of the websockets protocol drastically slower. We need a test that would catch a regression like this.

Jason, Patrick, can you detail what kind of test would be useful here and what should be captured for measurement to determine such a regression in network performance?
Most basically, we want to cover performance of latency (connection times and small-message perf) and bandwidth (large-message perf).

test_websocket.html has lots of connections, so it tests connection latency.  I think test_websocket_basic might cover the large message case decently.  So perhaps just snapshots of those two tests?   On the minus side neither test is designed to be a clean, simple test of a specific aspect of performance, but they'd capture regressions well, so it's a start.  I'm not going to have many cycles to write up something from scratch.

Can you get me info on how our perf regression tests work?  That'd give me some idea of what would need to change to get the tests integrated.
Glad there are tests we could reuse to create what we need.  Some questions:

* What specifically are we going to measure here?  It sounds like throughput, connection/sec and MB/sec?
* What platforms will this test run on?  Mobile?  B2G?
* What will be a regression?  
* What other variables will add to this test?  Are there preferences we need to set?
* Will one test affect another test?  This is common in unittests and we have seen this in tp5 when we cycle through the pages.

In general, we run most talos tests as a pageloader test.  This means we load up a page and wait for it to load (or report back a number).  This is recorded, and we do this for 25 cycles, then move on to the next page.  This is all written in python, so it is very feasible to reuse the mochitest websocket server/tests.  The biggest caveat is that we run python2.4 on Windows.
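For concreteness, the cycle-and-summarize flow described above looks roughly like this (a sketch only; function names like `run_suite` and `load_page` are made up here, and the real implementation is the pageloader extension linked below):

```python
# Rough sketch of the Talos-style cycling described above: load each page
# for a fixed number of cycles, record the reported time per cycle, then
# summarize per page.  load_page is a stand-in for whatever actually
# drives the browser and collects the page's reported number.
import statistics

def run_suite(pages, load_page, cycles=25):
    results = {}
    for url in pages:
        times = [load_page(url) for _ in range(cycles)]
        results[url] = {
            "median": statistics.median(times),
            "mean": statistics.mean(times),
        }
    return results
```

In the real harness the per-cycle number comes back from the page itself (e.g. via tpRecordTime); here that is abstracted into `load_page`.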

here is more information about talos in general:
https://wiki.mozilla.org/Buildbot/Talos
(ccing Nick because we'll want to run this with varying TCP latency at least, so we'll want Stone Ridge--do we do any Talos runs with Stone Ridge now Nick?)

I think we want to measure
1) connection latency (reconnect in tight loop)
2) small msg latency (aggregate time for lots of small msgs ping-pong)
3) large msg latency

I'd like to see these in 3 different tests, so if we regress we have an idea of which of these is affected.
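As a rough illustration of the ping-pong pattern in test 2, here is a stdlib-only sketch that times N small-message round trips against a localhost echo server.  It uses plain TCP sockets rather than WebSockets (so no framing or Upgrade cost is included), and all names are illustrative, not part of any proposed patch:

```python
# Illustrative only: time N small-message round trips against a localhost
# TCP echo server -- the same ping-pong measurement shape as the proposed
# small-message test, minus the WebSocket protocol layer.
import socket
import threading
import time

def run_echo_server(sock):
    conn, _ = sock.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)  # blindly echo back what we received

def time_pings(num_pings=1000, payload=b"ping"):
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # ephemeral port
    server.listen(1)
    port = server.getsockname()[1]
    threading.Thread(target=run_echo_server, args=(server,), daemon=True).start()

    client = socket.create_connection(("127.0.0.1", port))
    start = time.monotonic()
    for _ in range(num_pings):
        client.sendall(payload)
        if client.recv(1024) != payload:
            raise RuntimeError("echo mismatch")
    elapsed = time.monotonic() - start
    client.close()
    server.close()
    return elapsed

if __name__ == "__main__":
    print("1000 pings took %.3fs" % time_pings())
```

Because the client waits for each echo before sending the next ping, the aggregate time is dominated by round-trip latency, which is exactly what test 2 is after.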

> * What platforms will this test run on?  Mobile?  B2G?

So long as we've got a harness for a platform, we should test on it.  AFAICT there's no reason we wouldn't want coverage as full as regular Talos, or couldn't get it.

We do want it to run on "broadband" and "mobile" style networks (with appropriate latency/bandwidth: I'm less concerned about packet loss), hence Stone Ridge, at least at some point.

> * What will be a regression?  

From your description, sounds like we'll just look for some significant % variation in the overall page run time.  But I'm not clear yet on how the test reports that it's finished--you said Talos waits "for it to load (or report back a number)."  We need to run past onload(), for sure, so how do we report a number? (I scanned the Talos wiki page but didn't see it)
There is some mystery around how tests can report numbers.  There are really two types:
1) startup tests which wait for an onload handler and then do something
2) page loader tests which use the pageloader extension (http://hg.mozilla.org/build/talos/file/18ae2ce2749b/talos/pageloader) to iterate through a set of pages.

We have tests of both types reporting numbers on their own.  Using the pageloader style tests, we require the test page to call tpRecordTime which is inserted into the page here: http://hg.mozilla.org/build/talos/file/18ae2ce2749b/talos/pageloader/chrome/pageloader.js#l381.  This allows for telling the browser we are finished and collecting numbers (ex.: http://hg.mozilla.org/build/talos/file/18ae2ce2749b/talos/page_load_test/a11y/dhtml.html#l42).

For the startup style tests, you would need to report your own numbers in a similar format and quit the browser yourself.

In general we steer towards the pageloader tests.  If there is additional data or some other requirement that would make using pageloader a poor choice, we can talk about going the startuptest route.
So I'm attaching 3 mochitests I wrote that time websocket latency (connections, small msgs) and bandwidth (large msgs).  Is there any chance someone can pick these up and convert them to something we can run on Talos? (I assume we'll need to tackle having Talos run a pywebsocket instance too).
is this still of interest?
Yes, it would be great to have these basic websocket timing tests in talos.

It's been a while since I ran these--IIRC there was a fair amount of variance in the results--that might drop if we ran a large number of tests, and/or warmed up the test with a dry run.
This patch adds the files and the configs for three tests:
* twsping
* twscxn
* twsbigmsgs

Right now these are designed to be like startup tests: they load the page once and output a single value; we repeat that 20 times and then report a mean.

I have no idea if that is what we want to do.  But we can change the details later.  I view this patch as 80% of the way towards a successful talos patch, then there is the onslaught of other bugs to add (trychooser, tbpl, graph server sql, buildbot, talos-compare)

Let's start reviewing this code and figuring out the proper way to summarize the data and run the tests.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #671576 - Flags: feedback?(jhammel)
Christina: can you help us figure out what we need to do here for number of cycles?
:jduell: can you explain what the 3 tests are and exactly what we are measuring?
1) connection latency (reconnect in tight loop).  This is mostly TCP connection time + HTTP request/response, with a small amount of Upgrade logic (but upgrade doesn't add any round trips, so it should basically just be network bound).

2) small msg latency (aggregate time for lots of small msgs ping-pong).  Again mostly network bound, but if we're doing something dumb/bad we could add to the latency.

3) large msg latency.  Again, network bound--this time by bandwidth, not latency, with chance that we could be adding software overhead.

Number of cycles will need to be high enough to minimize variance of the result.  It may depend a lot on how much network (and/or CPU) jitter is in the testbed.
I don't think we should add networking performance tests to an environment where the network isn't controlled. This, imo, should be done on stoneridge or else we'll be chasing false positives all day.
For desktop we are running tests against localhost, I believe that should be good.

Talking about mobile, we have a much different picture and that is something we should tackle later if it is of any interest.
(In reply to Joel Maher (:jmaher) from comment #14)
> For desktop we are running tests against localhost, I believe that should be
> good.
> 

desktop talos is not primarily a networking test. This, as jason suggests in comment 12, is.
My understanding of this bug is that we had a regression in the times it took websockets to do transaction X.  This was seen in the mochitest websocket tests but those tests just looked for accuracy of the data, not the time it took.  The mochitest websocket server and browser is all on the same machine: localhost.  We would be doing the same thing here, only measuring the times it takes for various transactions (pings, connections, large data).  

While it might be ideal to run this in a multi node network environment, we won't have the bandwidth to do that for our thousands of test slaves which could pick up this talos job.  Actually for Android it would be easier as we have a webserver on the network which all talos jobs do transactions against, the question is whether pywebsocket could handle 50 sessions at a time and if that network topology would be acceptable.

We can hack up how we run the tests, what we run before and after them, but to roll in additional hardware/networking requirements on the entire pool of test slaves will take a lot more time, planning and resources.
(In reply to Joel Maher (:jmaher) from comment #16)

> 
> We can hack up how we run the tests, what we run before and after them, but
> to roll in additional hardware/networking requirements on the entire pool of
> test slaves will take a lot more time, planning and resources.

right, that's why this shouldn't be done on talos. It should be part of stone ridge.
in the current stoneridge setup, would it make sense to run existing talos tests?  i.e. would it make sense to continue to get this reviewed and landed as a 'talos' test even though we wouldn't intend to run it in a traditional talos environment?
Given that the entire genesis of this test was a regression we noticed in websockets performance when talking to *localhost*, it seems like we should proceed with this effort so we have something we can run for every checkin to the product.

Stone ridge cannot handle that kind of on-change testing for all trees and try at this point.  Talos can.

The test should be run in both places. I agree the data out of stoneridge will be more actionable. But to protect against completely terrible performance regressions like we saw with mochitest, we need something in place to run per-change on the main buildbot automation infrastructure.

And a question for Joel and the Talos guys: why on earth are we even considering reporting a mean? Is this for the old graphserver upload? We are going to report all the raw data to datazilla and let datazilla do the calculations, right? Anything else would feel like walking backwards.
The only reason for a mean is that we need to report a single number for graph server.  This will be reporting to datazilla, but right now we don't have datazilla hooked up to tbpl and other per-changeset tooling.

I would like to know how many iterations we should run of this test (this affects our runtime and available machine time, which we are clamping down on) and what we should do with the numbers.

When running locally, I would get a range of numbers out of 20 cycles, we could do a mean, stddev or some kind of calculation on the X datapoints we care to collect.
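For illustration, the candidate summaries are all one-liners with the stdlib; the values here are the first ten cycles of the twsping run attached later in this bug:

```python
# Candidate per-test summaries for a set of collected cycle times.
import statistics

cycles = [0.31, 0.33, 0.37, 0.35, 0.31, 0.34, 0.33, 0.33, 0.34, 0.33]

print("mean   = %.3f" % statistics.mean(cycles))
print("stdev  = %.3f" % statistics.stdev(cycles))
print("median = %.3f" % statistics.median(cycles))
```

Median is worth considering alongside mean here, since it is less sensitive to a single outlier cycle (like the 0.23 in the full run).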
From chatting with Patrick, I think we both agree that it's fine to run this test in regular talos on localhost, provided that we realize that we're mainly doing so in order to catch big regressions (orders of magnitude in timings).  We don't want a fire drill if there's just a small regression caused by localhost networking noise.

> I would like to know how many iterations we should run of this test

I'm not sure--run it a bunch of times and see when the variance seems to settle down?

> why on earth are we even considering reporting a mean?

'Cause I wrote it that way? 8-)  It's true that we'd be interested in results where only a single data point took, say, 10 times as long as the rest.  I guess I figured we'd capture that somewhat with a mean.  Feel free to tweak the code to report data as you see fit.
We should be explicit in what we are reporting from a single run of the test.  For example if we do 10000 pings for the ping test, we might only need to run the test once.  I know that I get different results for subsequent runs of this.

My concern about turning this on for regular per checkin talos is that we will only detect major regressions from this, if we are testing this daily in stoneridge and the networking team is only excited about the stoneridge testing, then we should seriously consider not running these in talos.  There are a lot of resources required to run these tests per checkin, which is fine, but only if we really care about these numbers.
Sure, if/when we have these in stoneridge we can turn these off in Talos.  That's a ways off from happening.  Meanwhile I think it's useful to have some backstop testing for major perf regressions in websockets.
Before going through the motions to turn these tests on, I want to make sure that if these detect a regression we will back people out and not just collect data and fix stuff after the fact.
As long as we can be confident that any regression is genuine I'm fine with backing stuff out.  I've never been clear on the general standard for that in talos--I just had a patch last week that allegedly caused a 4% regression in a test, which clearly wasn't the result of anything my patch did, so I'm guessing there's some threshold we use as a rule of thumb.  Do we have any idea yet how noisy these websocket tests timings are?
twsping
0,0.31
1,0.33
2,0.37
3,0.35
4,0.31
5,0.34
6,0.33
7,0.33
8,0.34
9,0.33
10,0.32
11,0.23
12,0.33
13,0.31
14,0.32
15,0.31
16,0.34
17,0.30
18,0.36
19,0.30

twscxn
0,3.38
1,3.73
2,3.47
3,3.48
4,3.45
5,3.36
6,3.56
7,3.63
8,3.28
9,3.53
10,3.63
11,3.68
12,3.52
13,3.52
14,3.54
15,3.61
16,3.64
17,3.55
18,3.54
19,3.70

twsbigmsgs
0,84.70
1,87.62
2,87.40
3,94.68
4,85.41
5,90.90
6,87.05
7,87.85
8,85.45
9,84.06
10,88.14
11,87.27
12,86.49
13,89.98
14,84.52
15,87.33
16,89.60
17,88.28
18,87.05
19,88.33
So largest/smallest time per test:
* cxn = 13%
* bigmsg = 13%
* small message = 60%

Interesting that the small message test has so much larger variance, despite running 10K times (vs 2000/100 for the cxn/bigmsg tests).  If the test doesn't take too long to run, we could bump that number even higher?

So re: backout.  I'm still not sure of general talos backout policy, and whether we need to make any sort of exception for these tests.  Obviously jitter tends to settle into a mean at least, while true regressions cause the mean to increase.  I assume we can sort out the difference via graph server or something similar?  I'm not familiar with our process here.
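The spreads quoted above can be reproduced from the attached per-cycle times (values copied verbatim from the three runs; any small differences from the quoted 13/13/60 figures are just rounding):

```python
# Largest/smallest spread per test, computed from the attached cycle data.
def spread_pct(times):
    return (max(times) / min(times) - 1.0) * 100.0

twsping = [0.31, 0.33, 0.37, 0.35, 0.31, 0.34, 0.33, 0.33, 0.34, 0.33,
           0.32, 0.23, 0.33, 0.31, 0.32, 0.31, 0.34, 0.30, 0.36, 0.30]
twscxn = [3.38, 3.73, 3.47, 3.48, 3.45, 3.36, 3.56, 3.63, 3.28, 3.53,
          3.63, 3.68, 3.52, 3.52, 3.54, 3.61, 3.64, 3.55, 3.54, 3.70]
twsbigmsgs = [84.70, 87.62, 87.40, 94.68, 85.41, 90.90, 87.05, 87.85, 85.45,
              84.06, 88.14, 87.27, 86.49, 89.98, 84.52, 87.33, 89.60, 88.28,
              87.05, 88.33]

for name, times in [("twsping", twsping), ("twscxn", twscxn),
                    ("twsbigmsgs", twsbigmsgs)]:
    print("%s: %.1f%%" % (name, spread_pct(times)))
```

Worth noting that the twsping spread is driven almost entirely by the single 0.23 outlier; without it the run is about as tight as the other two.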
Let me run the small message test in a bit with 40 cycles.

It sounds like as long as we can figure out and identify what a regression looks like we would enforce a backout.

jhammel, if you get a chance to send some feedback on the patch in here, we can start moving forward.  Once we get through there it would be nice to start doing the 10 bug dance to get a new test up and live.
Attachment #671576 - Flags: review?(jhammel)
Comment on attachment 671576 [details] [diff] [review]
add the 3 suggested tests to talos (0.9)

It looks like we are working around pywebsocket not being on pypi.  Is that the case?  If so, the standard procedure is to:

1. ask politely for the software authors to release their package to pypi
2. use a package link of form e.g. https://code.google.com/p/pywebsocket/downloads/list as a dependency
2.a. (and for talos.zip we need to update create_talos_zip.py)
3. if they upload to pypi, remove our special dependency link
4. or we can always upload their software for them if they're not communicative

I can help with this if you want.

I'm also wondering if this is something mozhttpd would want

diff --git a/talos/startup_test/websocket/file_websocket_time_cxn_wsh.py b/talos/startup_test/websocket/file_websocket_time_cxn_wsh.py
new file mode 100644
--- /dev/null
+++ b/talos/startup_test/websocket/file_websocket_time_cxn_wsh.py
@@ -0,0 +1,7 @@
+from mod_pywebsocket import msgutil
+
+def web_socket_do_extra_handshake(request):
+  pass
+
+def web_socket_transfer_data(request):
+  pass
diff --git a/talos/startup_test/websocket/file_websocket_time_pings_wsh.py b/talos/startup_test/websocket/file_websocket_time_pings_wsh.py
new file mode 100644
--- /dev/null
+++ b/talos/startup_test/websocket/file_websocket_time_pings_wsh.py
@@ -0,0 +1,9 @@
+from mod_pywebsocket import msgutil
+
+def web_socket_do_extra_handshake(request):
+  pass
+
+def web_socket_transfer_data(request):
+  # blindly echo back what we received
+  while not request.client_terminated:
+    msgutil.send_message(request, msgutil.receive_message(request))

I'm not sure why we do this

+def interpolatePath(path, server=None, port=None):
+  if not server or not port:
+      return string.Template(path).safe_substitute(talos=here)
+  return string.Template(path).safe_substitute(talos=here, server=server, port=port)

Might be better to do:

+def interpolatePath(path, server='', port=''):
+  return string.Template(path).safe_substitute(talos=here, server=server, port=port)
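A quick illustration of why the suggested one-liner works: safe_substitute fills in what it's given and leaves unknown placeholders alone, so missing values never raise.  One behavioral difference from the original to be aware of: with defaults of '', an unsupplied $server/$port collapses to an empty string instead of surviving as a literal placeholder.  (Paths and values below are made up.)

```python
import string

here = '/opt/talos'  # stands in for the module-level `here` in the patch

def interpolatePath(path, server='', port=''):
    # safe_substitute never raises on unknown placeholders; it leaves them as-is
    return string.Template(path).safe_substitute(talos=here, server=server, port=port)

print(interpolatePath('${talos}/startup_test/websocket'))
print(interpolatePath('ws://${server}:${port}/echo', server='localhost', port='9999'))
print(interpolatePath('ws://${server}:${port}/echo'))  # -> 'ws://:/echo'
```

Whether an empty-string substitution is acceptable here (versus leaving `${server}` literal when it's absent, as the original two-branch version does) is the one thing worth confirming before taking the shorter form.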
Attachment #671576 - Flags: review?(jhammel)
Attachment #671576 - Flags: review-
Attachment #671576 - Flags: feedback?(jhammel)
for the websockets stuff, I don't see it on pypi, but it is from the google code site.  I am using what we use in the desktop unittests.  We need to use the same toolchain between the unittests and the talos tests.  

could we pull the bits from what we have checked into m-c?  I know we have some custom mozilla stuff in the pywebsockets code.  Also we would need to ensure we are using the version as specified here:
https://bugzilla.mozilla.org/show_bug.cgi?id=710345


For adding this to mozhttpd, I don't think it is something we should put in there.  Let's resolve the overall issue of the pywebsockets server, then tackle the tests and integration.
jhammel, your thoughts on my response to your review would be appreciated.
Flags: needinfo?(jhammel)
Looking at things more closely, the package name is actually
mod_pywebsocket and it is on pypi:

http://pypi.python.org/pypi/mod_pywebsocket

If we want a more recent version, I would suggest putting it up on
http://puppetagain.pub.build.mozilla.org/data/python/packages/ (see also
https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Python ) and
specifying that location with `dependency_links` in talos's setup.py.
Concurrently we should also ask the package owner to make a new
release.

We'll also want to make create_talos_zip.py download the files from
e.g. https://pywebsocket.googlecode.com/svn/trunk/src/ in the usual
way (boy I'll be glad when that monstrosity is gone).

In theory, we could peg to
https://pywebsocket.googlecode.com/svn/trunk/src/ (or what not)
directly in the setup.py, but this requires the `svn` binary to be on
the path:

Searching for mod-pywebsocket
Best match: mod-pywebsocket [unknown version]
Downloading
https://pywebsocket.googlecode.com/svn/trunk/src/#egg=mod_pywebsocket
Doing subversion checkout from
https://pywebsocket.googlecode.com/svn/trunk/src/ to
/tmp/easy_install-LjTu6V/src
sh: 1: svn: not found
Processing src
error: Couldn't find a setup script in /tmp/easy_install-LjTu6V/src

using (pseudo-code):

    setup(install_requires=['mod_pywebsocket'],
          dependency_links=['https://pywebsocket.googlecode.com/svn/trunk/src/#egg=mod_pywebsocket'],
          ...)

IMHO, it is better just to deal with the tarballs as they don't have
any hidden magic requirements like svn being installed.

As far as keeping the code in parity, I don't see any excellent
options there.  For create_talos_zip.py, we can use the code in the
m-c tree.  But there is no (good) way of using the code in m-c as part
of setup.py's dependency resolution, ignoring entirely that we didn't
port the setup.py from google code which would be a prerequisite.  We
could, if we ported the setup.py, use this as the basis for our
package.  However note that in pypi and in google code that standalone
was moved to the parent directory.  This was probably done because of
these imports:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/pywebsocket/standalone.py#123
This implies that mod_pywebsocket is meant to be run as a python
package since `from mod_pywebsocket import common` implies that
mod_pywebsocket should be on sys.path.  So we're already futzing with
things.

I would like to transition to a system where we don't futz with
things.  Asking the original author and consulting the README.mozilla,
no one knows why we moved standalone.py!
( http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/pywebsocket/README-MOZILLA#30
) . We should be running this in a virtualenv, like all python
packages we rely on, and preferably installing from a canonical net
location for local installation+testing and
http://puppetagain.pub.build.mozilla.org/data/python/packages/ for
automation purposes.

If having code parity between the version of mod_pywebsocket used in
m-c and talos is very important right now, I would recommend the
following:

- Use a tagged release of pywebsocket instead of the svn bleeding edge
  recommended in
  http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/pywebsocket/README-MOZILLA#14
  ; I would HIGHLY RECOMMEND using
  http://pypi.python.org/pypi/mod_pywebsocket/0.6b1 if possible as
  in that case we are done (except for the annoying parts of actually
  making create_talos_zip.py work correctly)

- we should be installing this in a friggin virtualenv instead of
  moving standalone.py around.

- If we need a later version, it should be uploaded to
  http://puppetagain.pub.build.mozilla.org/data/python/packages/ and
  talos's setup.py should point to this version

- and of course, update the instructions

I would really like to devote more time+effort into getting our m-c,
mozharness, etc python stories less archaic vs. continuing to leave
messes that we're just going to have to clean up.
Flags: needinfo?(jhammel)
closing out old bugs that haven't been a priority
Assignee: jmaher → nobody
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX