Closed Bug 1048920 Opened 5 years ago Closed 5 years ago

OrangeFactor/logparser not ingesting data since 29th July

Categories

(Tree Management :: OrangeFactor, defect, critical)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

Details

On:
http://brasstacks.mozilla.com/orangefactor/?display=OrangeFactor&endday=2014-08-05&startday=2014-07-28&tree=trunk

There are no new data points since 29th July, and the page reports:
"plus 1646 oranges with no daily test-run count"
At the end of /home/webtools/apps/logparser/savelogs.err was:

2014-08-05 08:17:24,939 - BuildLogMonitor - ERROR - HTTP Error 404: Not Found
Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 170, in on_build_complete
    buildername=buildername)
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 134, in _download_and_parse_log
    remote = urllib2.urlopen(logurl)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
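
For illustration only: the traceback above shows the HTTPError from urlopen propagating out of _download_and_parse_log. A minimal sketch of treating a 404 as "log not available" might look like the following. It assumes Python 3 (urllib.request / urllib.error rather than the Python 2.6 urllib2 in the traceback), and fetch_log and its opener parameter are hypothetical names, not code from savelogs.py.

```python
import urllib.error
import urllib.request


def fetch_log(logurl, opener=urllib.request.urlopen):
    """Return the log body as bytes, or None if the server answers 404.

    `opener` is a hypothetical injection point so the function can be
    exercised without network access.
    """
    try:
        remote = opener(logurl)
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            # The log is missing on the server: report "nothing to parse"
            # instead of letting the exception reach the monitor loop.
            return None
        raise  # any other HTTP error is still surfaced to the caller
    try:
        return remote.read()
    finally:
        remote.close()
```

A caller would skip the build when fetch_log returns None and carry on with the next message; whether silently skipping is acceptable for OrangeFactor's counts is a separate question.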

I tried restarting the logparser service and then started seeing the following in savelogs.err:

...2014-08-05 08:23:03,942 - BuildLogMonitor - ERROR - [Errno 2] No such file or directory: u'/home/webtools/apps/logparser/incoming-logs/mozilla-inbound-panda_android-android-opt-1406667129-jsreftest-1.txt.gz'
Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 60, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 103, in parseFiles
    raise inst
IOError: [Errno 2] No such file or directory: u'/home/webtools/apps/logparser/incoming-logs/mozilla-inbound-panda_android-android-opt-1406667129-jsreftest-1.txt.gz'
2014-08-05 08:23:06,298 - BuildLogMonitor - ERROR - error parsing file /home/webtools/apps/logparser/incoming-logs/mozilla-inbound-panda_android-android-debug-1406667129-mochitest-7.txt.gz
Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 99, in parseFiles
    testdata = self._parseSingleFile(logname)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 70, in _parseSingleFile
    fp = open(log, "rb")
IOError: [Errno 2] No such file or directory: u'/home/webtools/apps/logparser/incoming-logs/mozilla-inbound-panda_android-android-debug-1406667129-mochitest-7.txt.gz'
2014-08-05 08:23:06,298 - BuildLogMonitor - ERROR - [Errno 2] No such file or directory: u'/home/webtools/apps/logparser/incoming-logs/mozilla-inbound-panda_android-android-debug-1406667129-mochitest-7.txt.gz'
Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 60, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 103, in parseFiles
    raise inst
IOError: [Errno 2] No such file or directory: u'/home/webtools/apps/logparser/incoming-logs/mozilla-inbound-panda_android-android-debug-1406667129-mochitest-7.txt.gz'
[root@orangefactor1.dmz.phx1 logparser]# ps aux | egrep '^webtools'
webtools 29966 46.7  1.6 319408 32532 ?        Sl   08:15   6:07 /home/webtools/apps/logparser/bin/python /home/webtools/apps/logparser/bin/savelogs --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=/home/webtools/apps/logparser/incoming-logs --outputdir=/home/webtools/apps/logparser/finished-logs --outputlog=/home/webtools/apps/logparser/savelogs.out --errorlog=/home/webtools/apps/logparser/savelogs.err --pidfile=/home/webtools/apps/logparser/logparser.pid

webtools 29973 47.0  0.9 224696 18924 ?        S    08:15   6:09 /home/webtools/apps/logparser/bin/python /home/webtools/apps/logparser/bin/savelogs --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=/home/webtools/apps/logparser/incoming-logs --outputdir=/home/webtools/apps/logparser/finished-logs --outputlog=/home/webtools/apps/logparser/savelogs.out --errorlog=/home/webtools/apps/logparser/savelogs.err --pidfile=/home/webtools/apps/logparser/logparser.pid

webtools 29974 45.6  0.9 224792 19040 ?        R    08:15   5:58 /home/webtools/apps/logparser/bin/python /home/webtools/apps/logparser/bin/savelogs --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=/home/webtools/apps/logparser/incoming-logs --outputdir=/home/webtools/apps/logparser/finished-logs --outputlog=/home/webtools/apps/logparser/savelogs.out --errorlog=/home/webtools/apps/logparser/savelogs.err --pidfile=/home/webtools/apps/logparser/logparser.pid

[root@orangefactor1.dmz.phx1 logparser]# cat logparser.pid
29966

Are we supposed to have multiple processes running?
Flags: needinfo?(jgriffin)
Yes, there are 3 processes: 1 parent process and 2 child processes that handle the logs themselves. I'll take a look and see what else might be going wrong.
Ah ok - thank you.
With the "No such file or directory" IOError in comment 1, my first thought was a race condition between the processes (for that exception at least).
For the 404s, the stdout log seems to show test and build logs being parsed successfully, so perhaps it's just the odd log (for special job types, e.g. fuzzer?) that we're not finding.
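
If the IOError really is two workers racing for the same file (unconfirmed in this bug), one common way to avoid it is to have each worker claim a file with an atomic rename before opening it. A minimal sketch, assuming Python 3 and a POSIX filesystem; claim_log and claimed_dir are hypothetical names, not part of logparser:

```python
import os


def claim_log(path, claimed_dir):
    """Try to take exclusive ownership of a log by renaming it.

    Returns the new path on success, or None if the file is already gone
    (for example because another worker claimed it first).
    """
    target = os.path.join(claimed_dir, os.path.basename(path))
    try:
        # rename is atomic on POSIX when source and target are on the
        # same filesystem, so only one worker can succeed per file.
        os.rename(path, target)
    except FileNotFoundError:
        return None
    return target
```

A worker would then parse only the path returned by claim_log and silently move on when it gets None, instead of raising out of parseFiles.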
So the logparser is behaving correctly, but for some reason it's far behind, and is currently processing logs from July 30.

I'm not sure why this is the case.  We can see if it catches up naturally, or I can kill the pending queue.  Any preferences?
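
To put a number on "far behind", one could decode the 10-digit field in the log file names from comment 1. This sketch assumes that field is a Unix timestamp for the build, which the bug does not state; log_timestamp and backlog are hypothetical helper names.

```python
import re
from datetime import datetime, timezone

# Assumption: a 10-digit field between hyphens is a Unix timestamp.
_EPOCH_FIELD = re.compile(r"-(\d{10})-")


def log_timestamp(filename):
    """Return the UTC datetime encoded in a log file name, or None."""
    match = _EPOCH_FIELD.search(filename)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


def backlog(filename, now):
    """Return how far behind `now` the given log is, as a timedelta."""
    stamp = log_timestamp(filename)
    return None if stamp is None else now - stamp
```

For mozilla-inbound-panda_android-android-opt-1406667129-jsreftest-1.txt.gz, log_timestamp gives 2014-07-29 20:52:09 UTC, which lines up with the date the data points stopped.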
Do we think that the additional suites/platforms/repos added in bug 817269 might be the cause of the backlog?

Given that the actual log parsing itself (beyond the reading of the header of the log) isn't actually resulting in anything we use at the moment - could we perhaps skip parsing the whole log to speed things up? I'm also unsure as to why we extract quite so much from the log, when many of the fields are already present in the pulse data?
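
As a rough idea of what "header only" parsing could look like: the sketch below stops reading at the first blank line of the gzipped log. It assumes the header is a block of "key: value" lines terminated by a blank line, which is a guess at the log layout rather than something confirmed here, and read_log_header is a hypothetical name (Python 3).

```python
import gzip


def read_log_header(path):
    """Parse only the leading 'key: value' lines of a gzipped log."""
    header = {}
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fp:
        for line in fp:
            line = line.rstrip("\n")
            if not line.strip():
                break  # first blank line ends the (assumed) header block
            key, sep, value = line.partition(": ")
            if not sep:
                break  # not a header line; stop rather than guess
            header[key] = value
    return header
```

Because gzip decompresses as a stream, stopping early means only the start of the file is decompressed, which is where the time saving would come from.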

Either way, roll on OrangeFactor v2 based on the treeherder API! :-)
(In reply to Ed Morley [:edmorley] from comment #6)
> Do we think that the additional suites/platforms/repos added in bug 817269
> might be the cause of the backlog?
> 
Possibly.  I'm going to turn off Talos tests again and that should help things a bit.
(In reply to Ed Morley [:edmorley] from comment #6)
> Given that the actual log parsing itself (beyond the reading of the header
> of the log) isn't actually resulting in anything we use at the moment -
> could we perhaps skip parsing the whole log to speed things up? I'm also
> unsure as to why we extract quite so much from the log, when many of the
> fields are already present in the pulse data?

It's because we used to display the actual failure data from the logs in OF, but that hasn't worked in a long time due to problems with ES.

We could certainly hook things up more efficiently now, e.g., by writing to ES directly based on pulse data.
Seems like the logparser is slowly catching up; it's now parsing logs from Aug 2.
This has caught up now; I'm not sure if turning off Talos test parsing caused this or not, but it seems resolved.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Great, thank you :-)
Product: Testing → Tree Management