Closed Bug 1121507 Opened 9 years ago Closed 9 years ago

Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions after dev ES upgrade

Categories

(Tree Management Graveyard :: OrangeFactor, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rwatson, Assigned: emorley)

References

Details

The MOC is getting alerts for orangefactor1.dmz.phx1.mozilla.com. It's running low on disk space (10% free). Can this be looked at?
[root@orangefactor1.dmz.phx1 logparser]# df
Filesystem                      1K-blocks     Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00  39091656 33149620   3956552  90% /
tmpfs                              961348       12    961336   1% /dev/shm
/dev/sda1                           99150    73630     20400  79% /boot

Virtually all of that is under:
/home/webtools/apps/logparser/incoming-logs

[root@orangefactor1.dmz.phx1 logparser]# du -sh *
76K     bin
5.8M    finished-logs
4.0K    include
30G     incoming-logs
11M     lib
0       lib64
4.0K    logparser.pid
4.0K    pending_jobs.txt
4.0K    pulse.ini
210M    savelogs.err
0       savelogs.log
592K    savelogs.out
688K    savelogs.out.1
672K    savelogs.out.2
980K    savelogs.out.3
27M     src

savelogs.err contains lots of:

2015-01-14 07:48:08,301 - BuildLogMonitor - ERROR - Unknown exception type
Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
    self.postResultsToElasticSearch(testdata)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 196, in postResultsToElasticSearch
    self._post_testgroup_to_elasticsearch(data)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 183, in _post_testgroup_to_elasticsearch
    testgroup.submit()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 80, in submit
    self._generate_testrun()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 58, in _generate_testrun
    doc_type = [self.doc_type])
  File "/home/webtools/apps/logparser/src/mozautoeslib/mozautoeslib/eslib.py", line 186, in query
    doc_types=self.doc_type)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 793, in count
    return self._query_call("_count", query, indexes, doc_types, **query_params)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 246, in _query_call
    response = self._send_request('GET', path, body, querystring_args)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 218, in _send_request
    raise_if_error(response.status, decoded)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/convert_errors.py", line 57, in raise_if_error
    raise pyes.exceptions.ElasticSearchException("Unknown exception type", status, result)
ElasticSearchException: Unknown exception type

and

2015-01-14 07:48:06,402 - BuildLogMonitor - ERROR - global name 'json' is not defined
Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
    self.postResultsToElasticSearch(testdata)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 198, in postResultsToElasticSearch
    self._post_buildlog_to_elasticsearch(data)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 190, in _post_buildlog_to_elasticsearch
    AutologBuildLog(index='logs', server=server, **postdata).submit()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 87, in submit
    build_id = self._add_doc(json, self.doc_type, self.id)
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 57, in _add_doc
    raise Exception(json.dumps(result))
NameError: global name 'json' is not defined


-> Presumably the logparser is failing, so we've ended up with a backlog of unparsed logs taking up 30G on disk?
Yeah, finished-logs is empty (it's purged periodically [1]) and has a last-modified of Jan 7th (presumably the last time we parsed/purged logs).

And incoming-logs has some really old logs in it:

[root@orangefactor1.dmz.phx1 incoming-logs]# ls -altr | head
total 30566200
-rw-r--r-- 1 webtools webtools    47710 Dec  4 14:15 mozilla-inbound-ubuntu32_vm-linux-opt-1417725651-mochitest-e10s-4.txt.gz
-rw-r--r-- 1 webtools webtools   145470 Dec  4 14:42 b2g-inbound-win7-ix-win32-debug-1417725708-mochitest-gl.txt.gz
-rw-r--r-- 1 webtools webtools    27301 Dec  5 08:52 mozilla-inbound-mountainlion-macosx64-opt-1417791709-mochitest-browser-chrome-2.txt.gz
-rw-r--r-- 1 webtools webtools    47427 Dec  5 13:47 mozilla-inbound-ubuntu64_vm-linux64-opt-1417814675-mochitest-e10s-4.txt.gz
-rw-r--r-- 1 webtools webtools    27734 Dec  5 14:17 mozilla-inbound-ubuntu64_vm-linux64-opt-1417816652-mochitest-browser-chrome-2.txt.gz
-rw-r--r-- 1 webtools webtools   633480 Dec  5 15:38 mozilla-inbound-win7-ix-win32-debug-1417817912-mochitest-3.txt.gz
-rw-r--r-- 1 webtools webtools   244373 Dec  5 15:39 fx-team-ubuntu32_vm-linux-opt-1417818155-mochitest-3.txt.gz
-rw-r--r-- 1 webtools webtools   149176 Dec  5 15:39 mozilla-inbound-win8_64-win64-debug-1417816652-mochitest-gl.txt.gz
-rw-r--r-- 1 webtools webtools    29591 Dec  5 15:39 mozilla-inbound-ubuntu64-asan_vm-linux64-asan-opt-1417819050-mochitest-browser-chrome-2.txt.gz

I'm confused why we have the odd old log from Dec (presumably we're not importing/cleaning those up for some reason; I'll check the error logs once we sort this other problem out), but most of the contents of incoming-logs are from Jan 7th onwards.



[1] [root@orangefactor1.dmz.phx1 incoming-logs]# crontab -l | grep 'finished-logs'
0 1 * * * rm -rf /home/webtools/apps/logparser/finished-logs/*
Depends on: 1121517
With bug 1121517 fixed, that exception became:

Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
    self.postResultsToElasticSearch(testdata)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 198, in postResultsToElasticSearch
    self._post_buildlog_to_elasticsearch(data)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 190, in _post_buildlog_to_elasticsearch
    AutologBuildLog(index='logs', server=server, **postdata).submit()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 88, in submit
    build_id = self._add_doc(json, self.doc_type, self.id)
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 58, in _add_doc
    raise Exception(json.dumps(result))
Exception: {"_type": "buildlogs", "_id": "1421253183-b89c420c8b7069254a54e82828667c1db2e6913c", "created": true, "_version": 1, "_index": "logs"}

There's also still the "ElasticSearchException: Unknown exception type" (comment 2).
(In reply to Ricky Rosario [:rrosario, :r1cky] from comment #20)
> January 7 sounds great to me!

That quote is from "Bug 1071044 - Upgrade ElasticSearch PHX1 Development cluster to 1.2.x", which matches the timeline here - even though that was only supposed to affect dev, not prod.

@cyliang, were any other changes made that might have affected the use of elasticsearch-zlb.webapp.scl3.mozilla.com ?
Flags: needinfo?(cliang)
OS: Mac OS X → All
Hardware: x86 → All
Summary: Low disk on orangefactor1.dmz.phx1.mozilla.com → Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions
Since this bug was filed, free disk space has already dropped from 10% to 9%, so I've purged all logs older than 3 days from incoming-logs to buy us some time. The OrangeFactor scores look no different from Jan 7th onwards compared to the days before, so it seems unlikely that we'd need to re-parse the logs anyway (I presume the failures must occur after the insert into ES that we need for OF).
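(For reference, the purge was just a find over incoming-logs; I didn't record the exact invocation, so this is a sketch of it:)

[root@orangefactor1.dmz.phx1 logparser]# find /home/webtools/apps/logparser/incoming-logs -type f -mtime +3 -delete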
There were no changes made to the production ES cluster.  The only production modification I had to make was to update the script that is used to copy indexes from production to the development ES instances in PHX.  

Would it be possible to clarify which ES cluster and which index you're trying to write to?  Based on the mention of a "logs" index, I'm assuming that you're writing to the cluster in SCL3.  There is both a "logs" index and an "of_logs" index. 

The only errors that I'm seeing on the SCL3 instance have to do with bzcache. These occur before January 4th, though, so I don't think they are related.
Flags: needinfo?(cliang)
(In reply to Ed Morley [:edmorley] from comment #4)
> "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line
> 58, in _add_doc
>     raise Exception(json.dumps(result))
> Exception: {"_type": "buildlogs", "_id":
> "1421253183-b89c420c8b7069254a54e82828667c1db2e6913c", "created": true,
> "_version": 1, "_index": "logs"}

That line corresponds to:

    if not 'ok' in result or not result['ok'] or not '_id' in result:
      raise Exception(json.dumps(result))

Looking at the response, it appears successful; there's just no "ok" key in the response any more. A bit of digging found this upstream ElasticSearch change (which removed 'ok', since it's redundant given the HTTP response code):
https://github.com/elasticsearch/elasticsearch/commit/d23f640cd1488727c8337a888527d662c6135fac

However that's only supposed to have landed in ES v1.0 and later (see the tags on the GitHub commit), but elasticsearch-zlb.webapp.scl3.mozilla.com is on 0.9.x:
[root@orangefactor1.dmz.phx1 logparser]# curl elasticsearch-zlb.webapp.scl3.mozilla.com:9200
...
  "name" : "elasticsearch1_scl3",
  "version" : {
    "number" : "0.90.10",

Grepping for ES server references (in case we're using the wrong instance):

./logparser/src/mozautolog/mozautolog/buildlog.py:      self.server = 'elasticsearch-zlb.webapp.scl3.mozilla.com:9200'

./logparser/src/mozautolog/mozautolog/mozautolog.py:      self.server = 'elasticsearch-zlb.webapp.scl3.mozilla.com:9200'

./logparser/src/logparser/scripts/savelogs-daemon.sh:     start-stop-daemon --start --user ${USER} --chuid ${USER}:${USER} --pidfile ${PIDFILE} --nicelevel 10 --make-pidfile --background --exec ${SAVELOGS} -- --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} > savelogs.log 2>&1

./logparser/src/logparser/scripts/savelogs.sh:     nohup ${SAVELOGS} --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} > ${ENV}/savelogs.log 2>&1 &

./logparser/bin/savelogs-gen.sh:     nohup ${SAVELOGS} --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} --pulse-cfg=${PULSECFG} > ${ENV}/savelogs.log 2>&1 &

./logparser/bin/savelogs-daemon-gen.sh:     start-stop-daemon --start --user ${USER} --chuid ${USER}:${USER} --pidfile ${PIDFILE} --nicelevel 10 --make-pidfile --background --exec ${SAVELOGS} -- --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pulse-cfg=${PULSECFG} > savelogs.log 2>&1

Now elasticsearch1.metrics.scl3 is decommissioned AIUI, and either way it doesn't respond to requests from the orangefactor1 box, so the scripts that reference it must be unused. Which seems to be confirmed by:

[root@orangefactor1.dmz.phx1 logparser]# egrep 'savelogs.*sh' /etc/init.d/logparser
DAEMON="$BASEDIR/bin/savelogs-gen.sh"

So we don't appear to be inadvertently using dev/another v1.0+ instance, but we're still seeing the behaviour from the upstream ES change that supposedly only landed in v1.0+?

Bizarre. Let's just start checking for "created" rather than "ok" for now anyway.
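
For what it's worth, a minimal sketch of what that check in _add_doc could become - accepting either the old pre-1.0 'ok' flag or the newer 'created' key, and tolerating update responses where 'created' is false - not necessarily what will land in the actual fix:

    # Sketch only: treat the response as successful if it has either the old
    # 'ok' flag or the newer 'created' key, and still require an '_id'.
    if not (result.get('ok') or 'created' in result) or '_id' not in result:
      raise Exception(json.dumps(result))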
Depends on: 1121941
(In reply to Ed Morley [:edmorley] from comment #4)
> There's also still the "ElasticSearchException: Unknown exception type"
> (comment 2).

We're currently using pyes v0.16 (was v0.15 until bug 1121517 comment ), but v0.19+ gives much better output for this exception:
https://github.com/aparo/pyes/commit/b61b4be45804abbec6fdd0f43e6127d3643e1e8e

However there are lots of breaking changes, so I've backported that fix by hand to logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/convert_errors.py temporarily.

This revealed:

Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
    self.postResultsToElasticSearch(testdata)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 196, in postResultsToElasticSearch
    self._post_testgroup_to_elasticsearch(data)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 183, in _post_testgroup_to_elasticsearch
    testgroup.submit()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 80, in submit
    self._generate_testrun()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 58, in _generate_testrun
    doc_type = [self.doc_type])
  File "/home/webtools/apps/logparser/src/mozautoeslib/mozautoeslib/eslib.py", line 186, in query
    doc_types=self.doc_type)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 893, in count
    return self._query_call("_count", query, indexes, doc_types, **query_params)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 251, in _query_call
    response = self._send_request('GET', path, body, querystring_args)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 223, in _send_request
    raise_if_error(response.status, decoded)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/convert_errors.py", line 58, in raise_if_error
    raise pyes.exceptions.ElasticSearchException(u"Unknown exception type: %d, %s" % (status, result), status, result)
ElasticSearchException: Unknown exception type: 400, {u'count': 0, u'_shards': {u'successful': 0, u'failed': 5, u'total': 5, u'failures': [{u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][0] ]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 0}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][2] ]; nested: RemoteTransportException[[elasticsearch2_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 2}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][1] ]; nested: RemoteTransportException[[elasticsearch2_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 1}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][4] ]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 4}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][3] ]; nested: RemoteTransportException[[elasticsearch1_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 3}]}}

I've removed the IP addresses, but even without them it's clear that we're accessing elasticsearch2_dev_phx1.

Re-reading comment 8 shows that I missed:

./logparser/bin/savelogs-gen.sh:     nohup ${SAVELOGS} --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} --pulse-cfg=${PULSECFG} > ${ENV}/savelogs.log 2>&1 &

Looking at the shell history shows:

  436  2014-10-09 11:45:46 vi bin/savelogs-gen.sh
  437  2014-10-09 11:46:34 vi bin/savelogs-
  438  2014-10-09 11:46:36 vi bin/savelogs-daemon-gen.sh

Jonathan, was something being tested then? I'm presuming we're not intentionally using the dev ES instance? (elasticsearch-zlb.dev.vlan81.phx1.mozilla.com)
Flags: needinfo?(jgriffin)
(In reply to Ed Morley [:edmorley] from comment #9)
> Jonathan, was something being tested then? I'm presuming we're not
> intentionally using the dev ES instance?
> (elasticsearch-zlb.dev.vlan81.phx1.mozilla.com)

More to the point - I don't think logparser should be submitting to both prod and dev ES, seeing as AIUI there is already a script for mirroring from prod to dev. Either way, at least we now know that these failures are just for the submission to dev, which we can switch off for now.

I've removed the additional "--es-server" param specifying the dev ES instance from savelogs-gen.sh. Whilst I was there, I fixed the out-of-date ES instance listed in the unused savelogs-daemon-gen.sh. I've cleared the contents of incoming-logs (since it was only the submission to dev that failed) and restarted the logparser service.
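In other words, the savelogs invocation in savelogs-gen.sh should now look roughly like this (the original line quoted above minus the dev --es-server param; reconstructed here rather than copied from the box):

nohup ${SAVELOGS} --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} --pulse-cfg=${PULSECFG} > ${ENV}/savelogs.log 2>&1 &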

All seems good now: no errors in savelogs.err, seemingly normal output in savelogs.out & logs appearing in finished-logs/.
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(jgriffin)
Resolution: --- → FIXED
Summary: Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions → Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions after dev ES upgrade
Meant to say: I'm looking forward to when we can get OrangeFactor v2 created and using Treeherder's API, so we can switch all of this off! The use of customised, not-in-the-repo scripts that pass config params on the command line makes this type of issue much harder than it needs to be. Even using a not-checked-in settings.local.py in the repo is more discoverable, since an |hg status --ignored| surfaces it pretty easily :-)
That is *awesome* detective work.
Thanks Ed.  At some point in the past, the production instance wasn't being mirrored to dev, thus the double posting.
Product: Tree Management → Tree Management Graveyard