Closed Bug 1205005 Opened 4 years ago Closed 4 years ago

Talos-e10s(tp) occasionally fails with TALOSDATA was seen 0 times, expected 1.

Categories

(Testing :: Talos, defect)

defect
Not set

Tracking

(e10s+)

RESOLVED FIXED
Tracking Status
e10s + ---

People

(Reporter: KWierso, Assigned: jmaher)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Blocks: e10s-tests
tracking-e10s: --- → +
a few things here:
1) xperf is finding issues:
10:27:52     INFO -  TEST-UNEXPECTED-FAIL : xperf: File '{profile}\prefs-1.js' was accessed and we were not expecting it.  DiskReadCount: 0, DiskWriteCount: 10, DiskReadBytes: 0, DiskWriteBytes: 53950
10:27:52     INFO -  TEST-UNEXPECTED-FAIL : xperf: File 'c:\$logfile' was accessed and we were not expecting it.  DiskReadCount: 0, DiskWriteCount: 4, DiskReadBytes: 0, DiskWriteBytes: 73728

2) blobber upload is experiencing a memory error:
10:28:05     INFO -  (blobuploader) - CRITICAL - Unexpected error in client: Traceback (most recent call last):
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\Scripts\blobberc.py", line 144, in upload_file
10:28:05     INFO -      blobhash, compress)
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\Scripts\blobberc.py", line 215, in post_file
10:28:05     INFO -      verify=cert.where())
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\lib\site-packages\requests\api.py", line 88, in post
10:28:05     INFO -      return request('post', url, data=data, **kwargs)
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\lib\site-packages\requests\api.py", line 44, in request
10:28:05     INFO -      return session.request(method=method, url=url, **kwargs)
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\lib\site-packages\requests\sessions.py", line 324, in request
10:28:05     INFO -      prep = req.prepare()
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\lib\site-packages\requests\models.py", line 225, in prepare
10:28:05     INFO -      p.prepare_body(self.data, self.files)
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\lib\site-packages\requests\models.py", line 385, in prepare_body
10:28:05     INFO -      (body, content_type) = self._encode_files(files, data)
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\lib\site-packages\requests\models.py", line 133, in _encode_files
10:28:05     INFO -      body, content_type = encode_multipart_formdata(new_fields)
10:28:05     INFO -    File "C:\slave\test-pgo\build\venv\lib\site-packages\requests\packages\urllib3\filepost.py", line 92, in encode_multipart_formdata
10:28:05     INFO -      body.write(b'\r\n')
10:28:05     INFO -  MemoryError
10:28:05     INFO -  (blobuploader) - INFO - Done attempting.


I assume that is because we are uploading test.etl which is huge and large.


a few fixes we can do:
* not upload test.etl on failure- this is causing confusion
* ensure annotations in xperf_whitelist.json match the above data: https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/xtalos/xperf_whitelist.json?from=xperf_whitelist.json

This should reduce the failures from this by 60-70%.  Then we can focus on the other issues.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8673274 - Flags: review?(aklotz)
Comment on attachment 8673274 [details] [diff] [review]
clean up xperf to work nicely

Review of attachment 8673274 [details] [diff] [review]:
-----------------------------------------------------------------

ok
Attachment #8673274 - Flags: review?(aklotz) → review+
The error message is really misleading, if you look at most logs (e.g. https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=15830813#L2100), you see something like this:

 20:12:12     INFO -  2015-10-18 20:12:12,228 DEBUG : BROWSER_OUTPUT: __WARNTimeout (18/20) exceeded on http://localhost:53742/tests/tp5n/cnn.com/www.cnn.com/index.html__WARN
 20:12:12     INFO -  2015-10-18 20:12:12,229 DEBUG : BROWSER_OUTPUT: __WARNTimeout (19/20) exceeded on http://localhost:53742/tests/tp5n/cnn.com/www.cnn.com/index.html__WARN
 20:12:12     INFO -  2015-10-18 20:12:12,236 DEBUG : BROWSER_OUTPUT: Cycle 1(4): loaded http://localhost:53742/tests/tp5n/cnn.com/www.cnn.com/index.html (next: http://localhost:53742/tests/tp5n/dailymail.co.uk/www.dailymail.co.uk/ushome/index.html)
 20:12:12     INFO -  2015-10-18 20:12:12,478 DEBUG : BROWSER_OUTPUT: RSS: Main: 240283648
 20:12:12     INFO -  2015-10-18 20:12:12,478 DEBUG : BROWSER_OUTPUT:
 20:12:12     INFO -  2015-10-18 20:12:12,481 DEBUG : BROWSER_OUTPUT: RSS: Main: 240287744
 20:12:12     INFO -  2015-10-18 20:12:12,482 DEBUG : BROWSER_OUTPUT:
 20:12:12     INFO -  2015-10-18 20:12:12,725 DEBUG : BROWSER_OUTPUT: RSS: Main: 239329280
 20:12:12     INFO -  2015-10-18 20:12:12,726 DEBUG : BROWSER_OUTPUT:
 20:12:13     INFO -  2015-10-18 20:12:13,150 DEBUG : BROWSER_OUTPUT: Cycle 1(5): loaded http://localhost:53742/tests/tp5n/cnn.com/www.cnn.com/index.html (next: http://localhost:53742/tests/tp5n/dailymail.co.uk/www.dailymail.co.uk/ushome/index.html)
 20:12:13     INFO -  2015-10-18 20:12:13,510 DEBUG : BROWSER_OUTPUT: RSS: Main: 240885760
 20:12:13     INFO -  2015-10-18 20:12:13,510 DEBUG : BROWSER_OUTPUT:
 20:12:13     INFO -  2015-10-18 20:12:13,873 DEBUG : BROWSER_OUTPUT: Cycle 1(6): loaded http://localhost:53742/tests/tp5n/cnn.com/www.cnn.com/index.html (next: http://localhost:53742/tests/tp5n/dailymail.co.uk/www.dailymail.co.uk/ushome/index.html)
 20:12:13     INFO -  2015-10-18 20:12:13,970 DEBUG : BROWSER_OUTPUT: __FAILTimeout in tp5n__FAIL
 20:12:13     INFO -  2015-10-18 20:12:13,970 DEBUG : BROWSER_OUTPUT: __FAILTimeout (20/20) exceeded on http://localhost:53742/tests/tp5n/cnn.com/www.cnn.com/index.html__FAIL
 20:12:13     INFO -  2015-10-18 20:12:13,970 DEBUG : BROWSER_OUTPUT: RSS: Main: 239149056
 20:12:13     INFO -  2015-10-18 20:12:13,970 DEBUG : BROWSER_OUTPUT:
 20:12:15     INFO -  2015-10-18 20:12:15,959 DEBUG : Terminating psutil.Process(pid=3435, name='firefox')
 20:12:16     INFO -  2015-10-18 20:12:16,093 INFO : Browser exited with error code: None
 20:12:16    ERROR -  Traceback (most recent call last):
 20:12:16     INFO -    File "/builds/slave/test/build/tests/talos/talos/run_tests.py", line 196, in run_tests
 20:12:16     INFO -      talos_results.add(mytest.runTest(browser_config, test))
 20:12:16     INFO -    File "/builds/slave/test/build/tests/talos/talos/ttest.py", line 61, in runTest
 20:12:16     INFO -      return self._runTest(browser_config, test_config, setup)
 20:12:16     INFO -    File "/builds/slave/test/build/tests/talos/talos/ttest.py", line 233, in _runTest
 20:12:16     INFO -      else None)
 20:12:16     INFO -    File "/builds/slave/test/build/tests/talos/talos/results.py", line 121, in add
 20:12:16     INFO -      global_counters=self.global_counters
 20:12:16     INFO -    File "/builds/slave/test/build/tests/talos/talos/results.py", line 335, in __init__
 20:12:16     INFO -      self.error(match.group(1))
 20:12:16     INFO -    File "/builds/slave/test/build/tests/talos/talos/results.py", line 349, in error
 20:12:16     INFO -      raise utils.TalosError(message)
 20:12:16     INFO -  TalosError: Timeout in tp5n
 20:12:16    ERROR - Return code: 2
 20:12:16  WARNING - setting return code to 2

I suspect Talos has some sort of problem when run in e10s mode.
Note I have a patch in bug 1218453 which will likely change the error message to be more meaningful, but does nothing to address the root cause here.
See Also: → 1218453
I'm going to resolve this, as bug 1218453 is going to make this bogus error summary go away, to replaced with something more meaningful. I assume the sheriffs will file a new intermittent orange then. I don't think there's a lot of value to be had in keeping this bug around, most of the error reports are just noise. Feel free to reopen if my assumption is wrong.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.