Closed Bug 946922 Opened 12 years ago Closed 12 years ago

Talos sometimes eats the error message

Categories

(Testing :: Talos, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: ehsan.akhgari, Unassigned)

Details

See this try run for example: <https://tbpl.mozilla.org/?tree=Try&rev=44c5ad990849> Here I have a bug which prevents a bunch of shared libraries from being packaged, hence xpcshell and firefox cannot be started. But Talos shows up as green. The following is in the logs: 12:54:24 INFO - INFO : Raw results:XPCOMGlueLoad error for file /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: 12:54:24 INFO - /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: cannot open shared object file: No such file or directory 12:54:24 INFO - Couldn't load XPCOM. 12:54:24 INFO - INFO : Initialization of new profile failed 12:54:24 INFO - INFO : XPCOMGlueLoad error for file /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: 12:54:24 INFO - /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: cannot open shared object file: No such file or directory 12:54:24 INFO - Couldn't load XPCOM. 12:54:24 INFO - DEBUG : Terminating: firefox, plugin-container, crashreporter 12:54:24 INFO - XPCOMGlueLoad error for file /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: 12:54:24 INFO - /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: cannot open shared object file: No such file or directory 12:54:24 INFO - Couldn't load XPCOM. 12:54:24 INFO - Failed tresize: 12:54:24 INFO - Stopped Thu, 05 Dec 2013 12:54:24 12:54:24 ERROR - Traceback (most recent call last): 12:54:24 INFO - File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/run_tests.py", line 279, in run_tests 12:54:24 INFO - talos_results.add(mytest.runTest(browser_config, test)) 12:54:24 INFO - File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/ttest.py", line 311, in runTest 12:54:24 INFO - self.initializeProfile(profile_dir, browser_config) 12:54:24 INFO - File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/ttest.py", line 113, in initializeProfile 12:54:24 CRITICAL - raise talosError("failed to initialize browser") 12:54:24 CRITICAL - talosError: failed to initialize browser 12:54:24 ERROR - Return code: 512 12:54:24 INFO - Running post-action listener: _resource_record_post_action 12:54:24 INFO - Running post-run listener: _resource_record_post_run 12:54:25 INFO - Total resource usage - Wall time: 15s; CPU: 9.0%; Read bytes: 1351680; Write bytes: 85069824; Read time: 292; Write time: 64084 12:54:25 INFO - install - Wall time: 10s; CPU: 13.0%; Read bytes: 0; Write bytes: 84807680; Read time: 0; Write time: 64052 12:54:25 INFO - run-tests - Wall time: 6s; CPU: 1.0%; Read bytes: 688128; Write bytes: 262144; Read time: 196; Write time: 32 12:54:25 INFO - Running post-run listener: _upload_blobber_files 12:54:25 INFO - Blob upload gear active. 12:54:25 INFO - There are no files to upload in the directory. Skipping the blob upload mechanism ... 12:54:25 INFO - Copying logs to upload dir... 12:54:25 INFO - mkdir: /builds/slave/talos-slave/test/build/upload/logs program finished with exit code 0 I guess we should not be exiting with a 0 error code there.
Joel, dupe of bug 938543?
Flags: needinfo?(jmaher)
jlund: I assume we can fix this in the mozharness script?
Flags: needinfo?(jmaher) → needinfo?(jlund)
I wonder if this is the same cause as https://bugzilla.mozilla.org/show_bug.cgi?id=823787#c7 i.e. the return codes are being returned modulo 256. So 512 would also result in a return code of 0.
could be, any chance there is a hard and fast rule for return codes somewhere on a wiki?
(In reply to Joel Maher (:jmaher) from comment #2) > jlund: I assume we can fix this in the mozharness script? I would assume so too. In fact, I wonder if my talos trigger retry patch may have solved the issue already? https://bug940699.bugzilla.mozilla.org/attachment.cgi?id=8342605 after the talos suite runs we - + if self.return_code not in [0, 1]: # not a success or warning and if it's not one of those two numbers, we escalate the buildbot status to a failure (unless our parser obj parsed a line that was worse, eg: a retry): + parser.worst_tbpl_status = parser.worst_level( + TBPL_FAILURE, parser.worst_tbpl_status, + levels=TBPL_WORST_LEVEL_TUPLE + ) finally we: + self.buildbot_status(parser.worst_tbpl_status, + level=parser.worst_log_level) which sets the return code to worst_tbpl_status: http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/buildbot.py#94 which has to be one of these 5 numbers: EXIT_STATUS_DICT = { TBPL_SUCCESS: 0, TBPL_WARNING: 1, TBPL_FAILURE: 2, TBPL_EXCEPTION: 3, TBPL_RETRY: 4, } This patch landed in production today. Maybe we could try to reproduce?
Flags: needinfo?(jlund)
(In reply to Joel Maher (:jmaher) from comment #4) > could be, any chance there is a hard and fast rule for return codes > somewhere on a wiki? I don't think so. I based my 0 and 1 return code condition and buildbot status off of the lines around here: http://hg.mozilla.org/build/talos/file/93e170af571f/talos/run_tests.py#l289 and what I found here in our parser's regex lines: http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/testing/talos.py#26 I'd be happy to try and get something started. I'm sure someone with more talos experience could fill in my holes. /me adds it to todos
(In reply to comment #4) > could be, any chance there is a hard and fast rule for return codes somewhere > on a wiki? Hmm, shouldn't we use 0 to indicate success and anything else to indicate failure? Note that the return code is usually a 32-bit number.
I am not sure how talos could return a 512, I looked at the return codes and all of them are either 0,1,2. It could be that raising an exception is causing this and avoiding a return/sys.exit path, but testing locally yields a value of 1, not 512. Is it possible that there is a different shell or the mozharness script is somehow getting a value of 512?
The waitpid system call can return > 255 if the process exited due to a signal. That could be the case here?
(In reply to Chris AtLee [:catlee] from comment #9) > The waitpid system call can return > 255 if the process exited due to a > signal. That could be the case here? So I think my patch is fixing this? It is grabbing high return codes and setting them to fail(2). For example, this used to be green before today: https://tbpl.mozilla.org/php/getParsedLog.php?id=31698140&full=1&branch=mozilla-b2g26_v1_2f
Is this fixed now?
I believe so.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.