Closed Bug 1021119 Opened 10 years ago Closed 10 years ago

Investigate the load test errors for loop-server

Categories

(Hello (Loop) :: Server, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Unassigned)

Details

(Whiteboard: [qa+])

We can break these items out as we see fit, but for now, I want to record them here for further investigation by the Dev team.


For each load test I run, I am seeing two types of errors on the loads dashboard, and some errors in the loop_out.log.

Additionally, we are seeing a mix of 200s, 204s, and 404s in the nginx access.log file, but these are all expected.

Finally, hekad is not currently running in Stage, so we are not getting the better logging and aggregation that would be helpful for debugging.


From the dashboard, I am seeing 1 or 2 errors of this type during each load test:
string indices must be integers
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 19, in test_all
    call_id, session, caller_token, api_key = self.initiate_call(token)
  File "loadtest.py", line 51, in initiate_call
    return (res['callId'], res['sessionId'], res['sessionToken'],


From the dashboard, I am also seeing several errors of this type (for example, 23 for a 30min load test):

No JSON object could be decoded
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 19, in test_all
    call_id, session, caller_token, api_key = self.initiate_call(token)
  File "loadtest.py", line 50, in initiate_call
    res = self.session.post(self.server_url + '/calls/%s' % token).json()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
    return json.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")


On the Stage host (there is only one right now), for each load test, I am seeing one or more lines like this in /var/log/loop_out.log:
{ [Error: connect ETIMEDOUT] code: 'ETIMEDOUT', errno: 'ETIMEDOUT', syscall: 'connect' }
{ [Error: socket hang up] code: 'ECONNRESET' }
Whiteboard: [qa+]
I continue to get the "No JSON object could be decoded" errors with each load test, but on the most recent load test I also see these:
82 occurrences:

'hawk-session-token'
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 17, in test_all
    self.register()
  File "loadtest.py", line 36, in register
    resp.headers['hawk-session-token'])
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/structures.py", line 77, in __getitem__
    return self._store[key.lower()][1]
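
The KeyError here comes from requests' CaseInsensitiveDict: resp.headers['hawk-session-token'] raises when the registration response carries no Hawk-Session-Token header at all, which is what you would expect if the request itself failed. A hedged sketch of a more defensive register step follows; the '/registration' path and the error message are my assumptions, not the actual loadtest.py code:

def register_debug(session, server_url):
    # The real path and payload live in loadtest.py; '/registration' is an
    # assumption used here only for illustration.
    resp = session.post(server_url + '/registration')

    # headers.get() returns None for a missing header, so a failed
    # registration reports the status and body rather than a bare
    # 'hawk-session-token' KeyError on the dashboard.
    token = resp.headers.get('hawk-session-token')
    if token is None:
        raise AssertionError('registration -> %d with no hawk-session-token '
                             'header, body: %r'
                             % (resp.status_code, resp.text[:200]))
    return token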

On this same load test, I get these stats:
Successes     192230
Failures     0
Errors     90936
addFailure     13

REF: https://loads.services.mozilla.com/run/b6b41a0e-2af6-460b-b6bf-2e08f985f687


There was a much higher count of 404s in this last load test, as seen in the nginx access.log file, with a long listing of these:
1402014532.516 "-" "PUT /push HTTP/1.1" 404 28 "-" "-" 0.040 0.000
Yes, it looks like we have some issues creating the session or the call-url, which is why you can see some 404s.

The errors are not detailed enough to explain what is going on. We need to catch them and raise useful information instead.
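
Since the loads dashboard only groups failures by the exception text, one way to do that is to funnel every request the load test makes through a small helper that raises with the method, URL, status code, and (truncated) body whenever the status is unexpected. A minimal sketch of that idea; the helper name, expected-status default, and message format are mine, not the actual debug patch:

def checked_request(session, method, url, expected=(200, 204), **kwargs):
    # Issue the request and fail with something the dashboard can group on:
    # method, URL, status code, and the start of the body.
    resp = session.request(method, url, **kwargs)
    if resp.status_code not in expected:
        raise AssertionError('%s %s -> %d, body: %r'
                             % (method, url, resp.status_code,
                                resp.text[:200]))
    return resp

# e.g. checked_request(self.session, 'POST', self.server_url + '/calls/%s' % token)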
Going to try some load tests in a bit.
:natim added some debug code.
:whd redeployed Stage to include some changes to get Production ready to be hit by Nightly.
(to keep Stage and Production in parallel right now)
Status: NEW → ASSIGNED
After deployment:
Two 30min tests, one 60min test. No errors on the dashboard.

Logs:
/var/log/nginx/access.log
There is the usual mix of 200s, 204s, and 404s seen before.
But now, we also have the following:
a large number of 405s, example: 1402105795.552 "-" "PUT /push HTTP/1.1" 405 172 "-" "-" 0.001 -
a large number of 503s and 502s

Some of this could be related to the redeployment/upgrade, except for the 405s, which did show up throughout the load testing.


/var/log/nginx/error.log (NEW)
There are a lot of "connection refused" errors, which probably correlate with the 503s.
The rest of them look like errors from before the load tests, mostly of this type:
2014/06/06 23:41:11 [error] 2745#0: *226681 upstream prematurely closed connection while reading response header from upstream...


/var/log/loop_err.log
I am not sure where these come from (possibly a deprecation warning from the connect middleware):
connect: res.on("header"): use on-headers module directly

And there is some traceback:
events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: connect ECONNRESET
    at errnoException (net.js:904:11)
    at Object.afterConnect [as oncomplete] (net.js:895:19)


The /var/log/loop_out.log was rolled/archived several times. The most recent file shows the following:
a good mix of calleeToken, callerToken, and sessionId entries
The 404s are expected; that's because the simple push callback is set to localhost to avoid a lookup.
Will look at these tomorrow.
All of this seems okay to me.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
OK. I can open new bugs on specific errors for mock testing vs. live testing of the perf servers.
Status: RESOLVED → VERIFIED