Closed Bug 1021119 Opened 10 years ago Closed 10 years ago

Investigate the load test errors for loop-server

Categories

(Hello (Loop) :: Server, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Unassigned)

Details

(Whiteboard: [qa+])

We can break these items out as we see fit, but for now, I want to record them here for further investigation by the Dev team.


For each load test I run, I am seeing two types of errors on the loads dashboard, and some errors in the loop_out.log.

Additionally, we are seeing a mix of 200s, 204s, and 404s in the nginx access.log file, but these are all expected.

Finally, hekad is not currently running in Stage, so we are not getting the better logging and aggregation that would be helpful for debugging.


From the dashboard, I am seeing 1 or 2 errors of this type during each load test:
string indices must be integers
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 19, in test_all
    call_id, session, caller_token, api_key = self.initiate_call(token)
  File "loadtest.py", line 51, in initiate_call
    return (res['callId'], res['sessionId'], res['sessionToken'],


From the dashboard, I am also seeing several errors of this type (for example, 23 for a 30min load test):

No JSON object could be decoded
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 19, in test_all
    call_id, session, caller_token, api_key = self.initiate_call(token)
  File "loadtest.py", line 50, in initiate_call
    res = self.session.post(self.server_url + '/calls/%s' % token).json()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
    return json.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")


On the Stage host (there is only one right now), for each load test, I am seeing one or more lines like this in /var/log/loop_out.log:
{ [Error: connect ETIMEDOUT] code: 'ETIMEDOUT', errno: 'ETIMEDOUT', syscall: 'connect' }
{ [Error: socket hang up] code: 'ECONNRESET' }
Whiteboard: [qa+]
I continue to get the "No JSON object could be decoded" errors with each load test, but on the most recent load test I also see these:
82 occurrences:

'hawk-session-token'
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 17, in test_all
    self.register()
  File "loadtest.py", line 36, in register
    resp.headers['hawk-session-token'])
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/structures.py", line 77, in __getitem__
    return self._store[key.lower()][1]
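
The KeyError here comes from requests' CaseInsensitiveDict: resp.headers['hawk-session-token'] raises when the registration response carries no Hawk-Session-Token header at all, which is what you would expect if the request itself failed. A hedged sketch of a more defensive register step follows; the '/registration' path and the error message are my assumptions, not the actual loadtest.py code:

def register_debug(session, server_url):
    # The real path and payload live in loadtest.py; '/registration' is an
    # assumption used here only for illustration.
    resp = session.post(server_url + '/registration')

    # headers.get() returns None for a missing header, so a failed
    # registration reports the status and body rather than a bare
    # 'hawk-session-token' KeyError on the dashboard.
    token = resp.headers.get('hawk-session-token')
    if token is None:
        raise AssertionError('registration -> %d with no hawk-session-token '
                             'header, body: %r'
                             % (resp.status_code, resp.text[:200]))
    return token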

On this same load test, I get these stats:
Successes     192230
Failures     0
Errors     90936
addFailure     13

REF: https://loads.services.mozilla.com/run/b6b41a0e-2af6-460b-b6bf-2e08f985f687


There was a much higher count of 404s in this last load test, as seen in the nginx access.log file, with a long listing of these:
1402014532.516 "-" "PUT /push HTTP/1.1" 404 28 "-" "-" 0.040 0.000
Yes, it looks like we have some issues creating the session or the call-url, which is why you can see some 404s.

The errors are not detailed enough to explain what is going on. We need to catch them and raise useful information instead.
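
Since the loads dashboard only groups failures by the exception text, one way to do that is to funnel every request the load test makes through a small helper that raises with the method, URL, status code, and (truncated) body whenever the status is unexpected. A minimal sketch of that idea; the helper name, expected-status default, and message format are mine, not the actual debug patch:

def checked_request(session, method, url, expected=(200, 204), **kwargs):
    # Issue the request and fail with something the dashboard can group on:
    # method, URL, status code, and the start of the body.
    resp = session.request(method, url, **kwargs)
    if resp.status_code not in expected:
        raise AssertionError('%s %s -> %d, body: %r'
                             % (method, url, resp.status_code,
                                resp.text[:200]))
    return resp

# e.g. checked_request(self.session, 'POST', self.server_url + '/calls/%s' % token)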
Going to try some load tests in a bit.
:natim added some debug code.
:whd redeployed Stage to include some changes to get Production ready to be hit by Nightly.
(to keep Stage and Production in parallel right now)
Status: NEW → ASSIGNED
After deployment:
Two 30min tests, one 60min test. No errors on the dashboard.

Logs:
/var/log/nginx/access.log
There is the usual mix of 200s, 204s, and 404s seen before.
But now, we also have the following:
a large number of 405s, example: 1402105795.552 "-" "PUT /push HTTP/1.1" 405 172 "-" "-" 0.001 -
a large number of 503s and 502s

Some of this could be related to the redeployment/upgrade, except for the 405s, which did show up throughout the load testing.


/var/log/nginx/error.log (NEW)
There are a lot of "connection refused" errors, which probably correlate with the 503s.
The rest of them look like errors from before the load tests, mostly of this type:
2014/06/06 23:41:11 [error] 2745#0: *226681 upstream prematurely closed connection while reading response header from upstream...


/var/log/loop_err.log
I am not sure where these come from (possibly a deprecation warning from the connect middleware):
connect: res.on("header"): use on-headers module directly

And there is some traceback:
events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: connect ECONNRESET
    at errnoException (net.js:904:11)
    at Object.afterConnect [as oncomplete] (net.js:895:19)


The /var/log/loop_out.log was rolled/archived several times. The most recent file shows the following:
a good mix of calleeToken, callerToken, and sessionId entries
The 404s are expected; that's because the simple push callback is set to localhost to avoid a lookup.
Will look at these tomorrow.
All of this seems okay to me.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
OK. I can open new bugs on specific errors for mock testing vs. live testing of the perf servers.
Status: RESOLVED → VERIFIED