Closed Bug 946922 Opened 11 years ago Closed 11 years ago

Talos sometimes eats the error message

Categories

(Testing :: Talos, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: ehsan.akhgari, Unassigned)

Details

See this try run for example: <https://tbpl.mozilla.org/?tree=Try&rev=44c5ad990849>

Here I have a bug which prevents a bunch of shared libraries from being packaged, hence xpcshell and firefox cannot be started.  But Talos shows up as green.

The following is in the logs:

12:54:24     INFO -  INFO : Raw results:XPCOMGlueLoad error for file /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50:
12:54:24     INFO -  /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: cannot open shared object file: No such file or directory
12:54:24     INFO -  Couldn't load XPCOM.
12:54:24     INFO -  INFO : Initialization of new profile failed
12:54:24     INFO -  INFO : XPCOMGlueLoad error for file /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50:
12:54:24     INFO -  /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: cannot open shared object file: No such file or directory
12:54:24     INFO -  Couldn't load XPCOM.
12:54:24     INFO -  DEBUG : Terminating: firefox, plugin-container, crashreporter
12:54:24     INFO -  XPCOMGlueLoad error for file /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50:
12:54:24     INFO -  /builds/slave/talos-slave/test/build/application/firefox/libicudata.so.50: cannot open shared object file: No such file or directory
12:54:24     INFO -  Couldn't load XPCOM.
12:54:24     INFO -  Failed tresize:
12:54:24     INFO -  		Stopped Thu, 05 Dec 2013 12:54:24
12:54:24    ERROR -  Traceback (most recent call last):
12:54:24     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/run_tests.py", line 279, in run_tests
12:54:24     INFO -      talos_results.add(mytest.runTest(browser_config, test))
12:54:24     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/ttest.py", line 311, in runTest
12:54:24     INFO -      self.initializeProfile(profile_dir, browser_config)
12:54:24     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/ttest.py", line 113, in initializeProfile
12:54:24 CRITICAL -      raise talosError("failed to initialize browser")
12:54:24 CRITICAL -  talosError: failed to initialize browser
12:54:24    ERROR - Return code: 512
12:54:24     INFO - Running post-action listener: _resource_record_post_action
12:54:24     INFO - Running post-run listener: _resource_record_post_run
12:54:25     INFO - Total resource usage - Wall time: 15s; CPU: 9.0%; Read bytes: 1351680; Write bytes: 85069824; Read time: 292; Write time: 64084
12:54:25     INFO - install - Wall time: 10s; CPU: 13.0%; Read bytes: 0; Write bytes: 84807680; Read time: 0; Write time: 64052
12:54:25     INFO - run-tests - Wall time: 6s; CPU: 1.0%; Read bytes: 688128; Write bytes: 262144; Read time: 196; Write time: 32
12:54:25     INFO - Running post-run listener: _upload_blobber_files
12:54:25     INFO - Blob upload gear active.
12:54:25     INFO - There are no files to upload in the directory. Skipping the blob upload mechanism ...
12:54:25     INFO - Copying logs to upload dir...
12:54:25     INFO - mkdir: /builds/slave/talos-slave/test/build/upload/logs
program finished with exit code 0

I guess we should not be exiting with a 0 error code there.
Joel, dupe of bug 938543?
Flags: needinfo?(jmaher)
jlund: I assume we can fix this in the mozharness script?
Flags: needinfo?(jmaher) → needinfo?(jlund)
I wonder if this is the same cause as https://bugzilla.mozilla.org/show_bug.cgi?id=823787#c7 i.e. the return codes are being returned modulo 256. So 512 would also result in a return code of 0.
could be, any chance there is a hard and fast rule for return codes somewhere on a wiki?
(In reply to Joel Maher (:jmaher) from comment #2)
> jlund: I assume we can fix this in the mozharness script?

I would assume so too. In fact, I wonder if my talos trigger retry patch may have solved the issue already?
https://bug940699.bugzilla.mozilla.org/attachment.cgi?id=8342605

after the talos suite runs we -
+        if self.return_code not in [0, 1]:   # not a success or warning

and if it's not one of those two numbers, we escalate the buildbot status to a failure (unless our parser obj parsed a line that was worse, eg: a retry):
+            parser.worst_tbpl_status = parser.worst_level(
+                TBPL_FAILURE, parser.worst_tbpl_status,
+                levels=TBPL_WORST_LEVEL_TUPLE
+            )

finally we:
+        self.buildbot_status(parser.worst_tbpl_status,
+                             level=parser.worst_log_level)

which sets the return code to worst_tbpl_status:
http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/buildbot.py#94

which has to be one of these 5 numbers:
EXIT_STATUS_DICT = {
    TBPL_SUCCESS: 0,
    TBPL_WARNING: 1,
    TBPL_FAILURE: 2,
    TBPL_EXCEPTION: 3,
    TBPL_RETRY: 4,
}

This patch landed in production today. Maybe we could try to reproduce?
Flags: needinfo?(jlund)
(In reply to Joel Maher (:jmaher) from comment #4)
> could be, any chance there is a hard and fast rule for return codes
> somewhere on a wiki?

I don't think so. I based my 0 and 1 return code condition and buildbot status off of the lines around here:
   http://hg.mozilla.org/build/talos/file/93e170af571f/talos/run_tests.py#l289

and what I found here in our parser's regex lines:
   http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/testing/talos.py#26

I'd be happy to try and get something started. I'm sure someone with more talos experience could fill in my holes. /me adds it to todos
(In reply to comment #4)
> could be, any chance there is a hard and fast rule for return codes somewhere
> on a wiki?

Hmm, shouldn't we use 0 to indicate success and anything else to indicate failure?  Note that the return code is usually a 32-bit number.
I am not sure how talos could return a 512, I looked at the return codes and all of them are either 0,1,2.  It could be that raising an exception is causing this and avoiding a return/sys.exit path, but testing locally yields a value of 1, not 512.

Is it possible that there is a different shell or the mozharness script is somehow getting a value of 512?
The waitpid system call can return > 255 if the process exited due to a signal. That could be the case here?
(In reply to Chris AtLee [:catlee] from comment #9)
> The waitpid system call can return > 255 if the process exited due to a
> signal. That could be the case here?

So I think my patch is fixing this? It is grabbing high return codes and setting them to fail(2). 

For example, this used to be green before today:
https://tbpl.mozilla.org/php/getParsedLog.php?id=31698140&full=1&branch=mozilla-b2g26_v1_2f
Is this fixed now?
I believe so.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.