Closed Bug 1220327 Opened 9 years ago Closed 4 years ago

Intermittent TalosError: Timeout in tp5n

Categories

(Testing :: Talos, defect, P3)

defect

Tracking

(firefox45 affected, b2g-v2.5 fixed)

RESOLVED INCOMPLETE
Tracking Status
firefox45 --- affected
b2g-v2.5 --- fixed

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(2 files)

Okay, fine, steal away the dumping-ground bug and make me file this thing which has been failing for months.

https://treeherder.mozilla.org/logviewer.html#?job_id=16582820&repo=mozilla-inbound
we get 40-50 of these per week in tp5n.  Inside of Talos there is the pageloader addon which manages loading pages and recording metrics from it, in this case pageload times.  What is interesting is that we allow for up to 20 missed pages (we have an internal timeout that is 5000ms, I rarely see pages load >2000ms)- if we miss a page we try again and it usually works well for quite a few more loads.

I do wonder why we are hitting these issues in general- it is spread across win7/win8 (oddly no winxp) for the most part with a few linux64/osx10.10 instances.

could this be an issue with the python webserver?  Maybe we need to make it more robust, high performance?  Maybe we need to experiment with the pageloader?  I suspect there are timing issues with pageloading, and the webserver is either loading too fast, or too slow.

One step to look into this is add some simple logging to determine when we have a timeout if we have received a load event or a mozafterpaint event.  Doing that would help determine what we are seeing.  Ideally dumping some info when we call loadFail() (http://hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos/pageloader/chrome/pageloader.js#l396), to determine what is loaded in the browser.
:parkouss, do you have other thoughts?
Flags: needinfo?(j.parkouss)
Well adding some logs into the tests is probably a good idea.

I don't know if the issue is related to the web server. As discussed on irc, we could also add some system measurement (ram, swap, cpu usage) to know if we are facing overloaded systems.

I wrote a simple patch for this, https://pastebin.mozilla.org/8852984. Tell me if you want to move on with it.


Also, I played a bit with talos logs. The failure we see are caused by "__FAILTimeout (20/20) exceeded on" meaning a timeout in tp5. These log lines are always preceded by 19 attemps, looking like "__WARNTimeout (18/20) exceeded on". I wanted to know if green builds (on success) also have these warn lines, and using activedata and a script I found that in the first 1000 talos green build logs from november 1th (almost 7 days of data), there is only two builds that have these lines:

http://archive.mozilla.org/pub/firefox/tinderbox-builds/b2g-inbound-linux64-pgo/1446373821/b2g-inbound_ubuntu64_hw_test-tp5o-pgo-bm103-tests1-linux-build917.txt.gz
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win64/1446597333/mozilla-inbound_win8_64_test-tp5o-e10s-bm127-tests1-windows-build16.txt.gz

and each time only one retry was sufficient. Not sure how that can help, but I wanted to share this.
Flags: needinfo?(j.parkouss)
So, unless I was lucky for the builds, I think I found the issue. Here are more than 70 tp builds, all green:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ad2fb74a2e1

patch: https://hg.mozilla.org/try/rev/6aaae4fa621f

So, basically the idea is to be sure that when we use clearTimeout to unregister the fail method, we do that with the timeout id that was registered at setTimeout call.

I believe we were calling clearTimeout with a wrong id.

I suppose what we wanted is the following flow:

> tid = setTimeout(onFail, timeout);
> clearTimeout(tid);

Instead, I suppose sometimes we are doing:

> tid = setTimeout(onFail, timeout);
> tid = setTimeout(onFail, timeout);
> clearTimeout(tid);

So we only clear the latest registered event, but not the previous one(s). And then we call the failure method. What made me think that is the logs on your latest push try, https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb2f0e45d820&selectedJob=14113615. You can see:

> 11:59:21     INFO -  2015-11-25 11:59:21,529 DEBUG : BROWSER_OUTPUT: JMAHER: inside _contentLoadHandler, load received.  JMAHER: inside _contentLoadHandler, defining contentPaintHandler.  JMAHER: inside _contentLoadHandler, calling _contentPaintHandler.  JMAHER: inside _contentPaintHandler, MozAfterPaint received.  JMAHER: plLoadHandlerMessage: top
> 11:59:21     INFO -  2015-11-25 11:59:21,529 DEBUG : BROWSER_OUTPUT: JMAHER: plLoadHandlerMessage: going to _loadHandler instead
> 11:59:21     INFO -  2015-11-25 11:59:21,529 DEBUG : BROWSER_OUTPUT: JMAHER: loadHandler: top
> 11:59:21     INFO -  2015-11-25 11:59:21,529 DEBUG : BROWSER_OUTPUT: JMAHER: loadHandler: recordTime: 294
> 11:59:21     INFO -  2015-11-25 11:59:21,529 DEBUG : BROWSER_OUTPUT: Cycle 1(14): loaded http://localhost:49864/tests/tp5n/guardian.co.uk/www.guardian.co.uk/index.html (next: > http://localhost:49864/tests/tp5n/homeway.com.cn/www.hexun.com/index.html)
> 11:59:21     INFO -  2015-11-25 11:59:21,529 DEBUG : BROWSER_OUTPUT: JMAHER: loadHandler: nextPage
> 11:59:21     INFO -  2015-11-25 11:59:21,857 DEBUG : BROWSER_OUTPUT: RSS: Main: 160030720
> 11:59:21     INFO -  2015-11-25 11:59:21,857 DEBUG : BROWSER_OUTPUT:
> 11:59:22     INFO -  2015-11-25 11:59:22,127 DEBUG : BROWSER_OUTPUT: __FAILTimeout in tp5n__FAIL

Basicaly, we have a load event (and that should clear the fail handler), and less than a second after we fail - which does not make sense - unless we are not clearing the handler we think, as explained above.

Tell me what you think Joel!
Flags: needinfo?(jmaher)
Bug 1220327 - Intermittent TalosError: Timeout in tp5n. r=jmaher
Attachment #8692840 - Flags: review?(jmaher)
Here is a push to try with the patch I want you to review: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c4cd42138ea0 (same kind of logic, but better looking code to me).

Here again, more than 70 tp builds all green. :)

Clearing the NI as I ask for a review.
Flags: needinfo?(jmaher)
Comment on attachment 8692840 [details]
MozReview Request: Bug 1220327 - Intermittent TalosError: Timeout in tp5n. r=jmaher

https://reviewboard.mozilla.org/r/26341/#review23769

r=me; just one question.

::: testing/talos/talos/pageloader/chrome/pageloader.js:93
(Diff revision 1)
> +      that.timeoutEvent = undefined;

why do we need to define that?  could we not call:
clearTimeout(this.timeoutEvent);
callback();
Attachment #8692840 - Flags: review?(jmaher) → review+
https://reviewboard.mozilla.org/r/26341/#review23769

> why do we need to define that?  could we not call:
> clearTimeout(this.timeoutEvent);
> callback();

Well, two things:

- the callback has been called when we are here, so clearTimeout would be a no-op. The current line just define our timeoutEvent to the undefined value, so next call to register for example it will know that nothing is registered and just register a new callback (without attempting to clear first). That may work without this, but I find it more robust/readable to track the timeoutEvent value everywhere. (it is defined when a callback is going to be called, undefined for every other reason).
- "that" seems required, https://developer.mozilla.org/en-US/docs/Web/API/WindowTimers/setTimeout#The_this_problem. I did not tested, but I often had issues with this and context.

Tell me what do you think, and if something needs an adjustment.
thanks for answering, mostly just questions- lets leave it as is!
Assignee: nobody → j.parkouss
Status: NEW → ASSIGNED
I think after landing this we should leave the bug open, to see if it is totally fixed.
Keywords: leave-open
this doesn't look like bug 1345735, but until that bug is resolved I don't think we have the bandwidth to look at this.
this is not the windows 8 crashes like we see typically- this is a mix of a few failures.

1) windows 7 xperf*, failure to parse values (all seem to be on t-w732-ix-097)
2) windows 8 tp5 - timeout while loading page- this has been around forever, would like to ignore for now.


regarding the xperf failures, I see this in the log [0]:
14:34:38     INFO -  Merged Etl: test.etl
14:36:50     INFO -  c:/Program Files/Microsoft Windows Performance Toolkit/xperf: warning: applying restriction of access for trace processing
14:37:42     INFO -  reading etl filename: test.etl
14:37:42     INFO -  etlparser: in readfile: test.etl.csv
14:37:42    ERROR -  Traceback (most recent call last):
14:37:42     INFO -    File "C:\slave\test\build\tests\talos\talos/xtalos/parse_xperf.py", line 74, in <module>
14:37:42     INFO -      main()
14:37:42     INFO -    File "C:\slave\test\build\tests\talos\talos/xtalos/parse_xperf.py", line 69, in main
14:37:42     INFO -      **args.__dict__)
14:37:42     INFO -    File "C:\slave\test\build\tests\talos\talos/xtalos/parse_xperf.py", line 56, in stop_from_config
14:37:42     INFO -      processID=kwargs['processID'])
14:37:42     INFO -    File "C:\slave\test\build\tests\talos\talos\xtalos\etlparser.py", line 498, in etlparser_from_config
14:37:42     INFO -      etlparser(**args)
14:37:42     INFO -    File "C:\slave\test\build\tests\talos\talos\xtalos\etlparser.py", line 355, in etlparser
14:37:42     INFO -      wl_temp = json.load(fHandle)
14:37:42     INFO -    File "c:\mozilla-build\python27\Lib\json\__init__.py", line 278, in load
14:37:42     INFO -      **kw)
14:37:42     INFO -    File "c:\mozilla-build\python27\Lib\json\__init__.py", line 326, in loads
14:37:42     INFO -      return _default_decoder.decode(s)
14:37:42     INFO -    File "c:\mozilla-build\python27\Lib\json\decoder.py", line 366, in decode
14:37:42     INFO -      obj, end = self.raw_decode(s, idx=_w(s, 0).end())
14:37:42     INFO -    File "c:\mozilla-build\python27\Lib\json\decoder.py", line 382, in raw_decode
14:37:42     INFO -      obj, end = self.scan_once(s, idx)
14:37:42     INFO -  ValueError: Expecting property name: line 58 column 1 (char 5987)
14:37:51     INFO -  extending with xperf!
14:37:51     INFO -  TEST-UNEXPECTED-ERROR | tp5n | Timeout in tp5n
14:37:51    ERROR -  Traceback (most recent call last):
14:37:51     INFO -    File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests
14:37:51     INFO -      talos_results.add(mytest.runTest(browser_config, test))
14:37:51     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest
14:37:51     INFO -      return self._runTest(browser_config, test_config, setup)
14:37:51     INFO -    File "C:\slave\test\build\tests\talos\talos\ttest.py", line 236, in _runTest
14:37:51     INFO -      else None)
14:37:51     INFO -    File "C:\slave\test\build\tests\talos\talos\results.py", line 94, in add
14:37:51     INFO -      global_counters=self.global_counters
14:37:51     INFO -    File "C:\slave\test\build\tests\talos\talos\results.py", line 308, in __init__
14:37:51     INFO -      self.error(match.group(1))
14:37:51     INFO -    File "C:\slave\test\build\tests\talos\talos\results.py", line 322, in error
14:37:51     INFO -      raise utils.TalosError(message)
14:37:51     INFO -  TalosError: Timeout in tp5n
14:37:51     INFO -  TEST-INFO took 475489ms


I believe fixing the xperf errors would do wonders here.  First step is to handle the error properly and move on.  Second step is to maybe figure out what is going on.


Ionut, can you look into the xperf code to make sure it is stable and handling errors properly?


[0] https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=91988370&lineNumber=1863
Flags: needinfo?(ionut.goldan)
Whiteboard: [stockwell needswork]
Yes, will do!
Flags: needinfo?(ionut.goldan)
Assignee: j.parkouss → ionut.goldan
:jmaher Looks like the talos/xtalos/xperf_whitelist.json has an invalid JSON.

Specifically, the last attribute-value pair ends in comma. The JSON standard doesn't allow this, only the Python syntax:
{
...
"{profile}\\favicons.sqlite-shm": {"mincount": 0, "maxcount": 4, "minbytes": 65536, "maxbytes": 131072},
"{profile}\\favicons.sqlite-wal": {"mincount": 0, "maxcount": 32, "minbytes": 0, "maxbytes": 131072},
}

This bug got introduced on the 30th of March, from a change required by bug 977177.
great, can you fix that Ionut?
Whiteboard: [stockwell needswork] → [stockwell unknown]
we need to add:
TEST-UNEXPECTED-FAIL : xperf: File '{profile}\handlers.json' was accessed and we were not expecting it. DiskReadCount: 4, DiskWriteCount: 0, DiskReadBytes: 1388, DiskWriteBytes: 0

I think we had that show up while this was broken, so lets not worry about where it came from and add it to the list and push again.
your latest try run looks great.
Comment on attachment 8879501 [details]
Bug 1220327 - properly handle invalid json error : -b o -p win32 -u none -t xperf-e10s

https://reviewboard.mozilla.org/r/150814/#review155780

our current problems are unrelated to xperf_whitelist.json (I had accidentally fixed this while debugging xperf).  This code is good.  Can you somehow push to try with a bad xperf_whitelist.json and show me a failure message so we can see this work.

Also this doesn't solve this bug as what we see is tp5n page loads timing out, but I would like to see this code landed.

::: commit-message-e9683:1
(Diff revision 1)
> +Bug 1220327 - properly handle invalid json error r?jmaher; try: -b o -p win32 -u none -t xperf-e10s

remove the try: syntax from the commit message.

::: testing/talos/talos/xtalos/etlparser.py:386
(Diff revision 1)
> +        except ValueError:
> +            # upload erroneous file for investigation
> +            mud = os.environ.get('MOZ_UPLOAD_DIR', None)
> +            if mud:
> +                invalid_mud_file = os.path.join(mud, filename)
> +                shutil.copyfile(filename, invalid_mud_file)

when uploading files to blobber, they need specific file extensions like .log or .txt, so lets make sure this works.
Attachment #8879501 - Flags: review?(jmaher) → review-
Comment on attachment 8879501 [details]
Bug 1220327 - properly handle invalid json error : -b o -p win32 -u none -t xperf-e10s

https://reviewboard.mozilla.org/r/150814/#review155792
There will be many errors coming up on linux64-ccov next week. I am increasing the timeout for that build in bug 1377915 as we are only looking for coverage, not performance. Linux64-ccov is a "slow" build, so the modification I'll be making is necessary for this build and will only affect this one and nothing else.
Assignee: igoldan → nobody
Status: ASSIGNED → NEW
Priority: -- → P3
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE

This seems to happen again, recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=244758256&repo=mozilla-central&lineNumber=5333

21:59:19     INFO -  PID 10136 | [Child 1556, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
21:59:19     INFO -  PID 10136 | [Parent 6520, Gecko_IOThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
21:59:19     INFO -  PID 10136 | [Child 7028, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
21:59:19     INFO -  PID 10136 | [Child 7028, Chrome_ChildThread] WARNING: pipe error: 109: fil[Parent 6520, Gecko_IOThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
21:59:19     INFO -  PID 10136 | [GPU
21:59:19     INFO -  PID 10136 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
21:59:19     INFO -  PID 10136 |
21:59:19     INFO -  PID 10136 |  6228, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
21:59:19     INFO -  PID 10136 |
21:59:19     INFO -  PID 10136 | ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv
21:59:19     INFO -  PID 10136 |
21:59:20     INFO -  TEST-INFO | 10136: exit 0
21:59:24     INFO -  TEST-UNEXPECTED-ERROR | tp5o_webext | Timeout in tp5n
21:59:24    ERROR -  Traceback (most recent call last):
21:59:24     INFO -    File "C:\tasks\task_1557082600\build\tests\talos\talos\run_tests.py", line 300, in run_tests
21:59:24     INFO -      talos_results.add(mytest.runTest(browser_config, test))
21:59:24     INFO -    File "C:\tasks\task_1557082600\build\tests\talos\talos\ttest.py", line 64, in runTest
21:59:24     INFO -      return self._runTest(browser_config, test_config, setup)
21:59:24     INFO -    File "C:\tasks\task_1557082600\build\tests\talos\talos\ttest.py", line 278, in _runTest
21:59:24     INFO -      else None)
21:59:24     INFO -    File "C:\tasks\task_1557082600\build\tests\talos\talos\results.py", line 95, in add
21:59:24     INFO -      global_counters=self.global_counters
21:59:24     INFO -    File "C:\tasks\task_1557082600\build\tests\talos\talos\results.py", line 323, in __init__
21:59:24     INFO -      self.error(match.group(1))
21:59:24     INFO -    File "C:\tasks\task_1557082600\build\tests\talos\talos\results.py", line 337, in error
21:59:24     INFO -      raise utils.TalosError(message)
21:59:24     INFO -  TalosError: Timeout in tp5n
21:59:24     INFO -  TEST-INFO took 1303288ms
21:59:24     INFO -  SUITE-END | took 1387s
21:59:24    ERROR - Return code: 2
21:59:24  WARNING - setting return code to 2
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: