Closed Bug 1121507 Opened 10 years ago Closed 10 years ago

Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions after dev ES upgrade

Categories

(Tree Management Graveyard :: OrangeFactor, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rwatson, Assigned: emorley)

References

Details

The MOC is getting alerts for orangefactor1.dmz.phx1.mozilla.com. Running low on disk (only 10% free). Can this be looked at?
[root@orangefactor1.dmz.phx1 logparser]# df
Filesystem                      1K-blocks     Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00  39091656 33149620   3956552  90% /
tmpfs                              961348       12    961336   1% /dev/shm
/dev/sda1                           99150    73630     20400  79% /boot

Virtually all of that is under: /home/webtools/apps/logparser/incoming-logs

[root@orangefactor1.dmz.phx1 logparser]# du -sh *
76K     bin
5.8M    finished-logs
4.0K    include
30G     incoming-logs
11M     lib
0       lib64
4.0K    logparser.pid
4.0K    pending_jobs.txt
4.0K    pulse.ini
210M    savelogs.err
0       savelogs.log
592K    savelogs.out
688K    savelogs.out.1
672K    savelogs.out.2
980K    savelogs.out.3
27M     src
savelogs.err contains lots of:

2015-01-14 07:48:08,301 - BuildLogMonitor - ERROR - Unknown exception type
Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
    self.postResultsToElasticSearch(testdata)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 196, in postResultsToElasticSearch
    self._post_testgroup_to_elasticsearch(data)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 183, in _post_testgroup_to_elasticsearch
    testgroup.submit()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 80, in submit
    self._generate_testrun()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 58, in _generate_testrun
    doc_type = [self.doc_type])
  File "/home/webtools/apps/logparser/src/mozautoeslib/mozautoeslib/eslib.py", line 186, in query
    doc_types=self.doc_type)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 793, in count
    return self._query_call("_count", query, indexes, doc_types, **query_params)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 246, in _query_call
    response = self._send_request('GET', path, body, querystring_args)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 218, in _send_request
    raise_if_error(response.status, decoded)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/convert_errors.py", line 57, in raise_if_error
    raise pyes.exceptions.ElasticSearchException("Unknown exception type", status, result)
ElasticSearchException: Unknown exception type

and

2015-01-14 07:48:06,402 - BuildLogMonitor - ERROR - global name 'json' is not defined
Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
    self.postResultsToElasticSearch(testdata)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 198, in postResultsToElasticSearch
    self._post_buildlog_to_elasticsearch(data)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 190, in _post_buildlog_to_elasticsearch
    AutologBuildLog(index='logs', server=server, **postdata).submit()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 87, in submit
    build_id = self._add_doc(json, self.doc_type, self.id)
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 57, in _add_doc
    raise Exception(json.dumps(result))
NameError: global name 'json' is not defined

-> Presumably the logparser is failing, so we've ended up with a backlog of unparsed logs taking up 30G on disk?
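The NameError itself looks like nothing more than a missing module-level import in mozautolog's buildlog.py; a rough, hypothetical sketch of that kind of fix (not an actual patch from this bug):

# Hypothetical sketch only (not the real buildlog.py, nor the actual patch):
# the error path calls json.dumps() but the module is never imported, so the
# fix would simply be a module-level import.
import json


def raise_es_error(result):
    # With the import in place this raises the intended Exception rather
    # than a NameError ("global name 'json' is not defined").
    raise Exception(json.dumps(result))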
Yeah finished-logs is empty (it is purged periodically [1]) and has a last-modified of Jan 7th (presumably the last time we parsed/purged logs). And incoming-logs has some really old logs in it:

[root@orangefactor1.dmz.phx1 incoming-logs]# ls -altr | head
total 30566200
-rw-r--r-- 1 webtools webtools  47710 Dec  4 14:15 mozilla-inbound-ubuntu32_vm-linux-opt-1417725651-mochitest-e10s-4.txt.gz
-rw-r--r-- 1 webtools webtools 145470 Dec  4 14:42 b2g-inbound-win7-ix-win32-debug-1417725708-mochitest-gl.txt.gz
-rw-r--r-- 1 webtools webtools  27301 Dec  5 08:52 mozilla-inbound-mountainlion-macosx64-opt-1417791709-mochitest-browser-chrome-2.txt.gz
-rw-r--r-- 1 webtools webtools  47427 Dec  5 13:47 mozilla-inbound-ubuntu64_vm-linux64-opt-1417814675-mochitest-e10s-4.txt.gz
-rw-r--r-- 1 webtools webtools  27734 Dec  5 14:17 mozilla-inbound-ubuntu64_vm-linux64-opt-1417816652-mochitest-browser-chrome-2.txt.gz
-rw-r--r-- 1 webtools webtools 633480 Dec  5 15:38 mozilla-inbound-win7-ix-win32-debug-1417817912-mochitest-3.txt.gz
-rw-r--r-- 1 webtools webtools 244373 Dec  5 15:39 fx-team-ubuntu32_vm-linux-opt-1417818155-mochitest-3.txt.gz
-rw-r--r-- 1 webtools webtools 149176 Dec  5 15:39 mozilla-inbound-win8_64-win64-debug-1417816652-mochitest-gl.txt.gz
-rw-r--r-- 1 webtools webtools  29591 Dec  5 15:39 mozilla-inbound-ubuntu64-asan_vm-linux64-asan-opt-1417819050-mochitest-browser-chrome-2.txt.gz

I'm confused why we have the odd old log from Dec (presumably we're not importing/cleaning those up for some reason; I'll check the error logs once we sort this other problem out), but most of the contents of incoming-logs is from Jan 7th onwards.

[1]
[root@orangefactor1.dmz.phx1 incoming-logs]# crontab -l | grep 'finished-logs'
0 1 * * * rm -rf /home/webtools/apps/logparser/finished-logs/*
Depends on: 1121517
With bug 1121517 fixed, that exception became:

Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
    self.postResultsToElasticSearch(testdata)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 198, in postResultsToElasticSearch
    self._post_buildlog_to_elasticsearch(data)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 190, in _post_buildlog_to_elasticsearch
    AutologBuildLog(index='logs', server=server, **postdata).submit()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 88, in submit
    build_id = self._add_doc(json, self.doc_type, self.id)
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 58, in _add_doc
    raise Exception(json.dumps(result))
Exception: {"_type": "buildlogs", "_id": "1421253183-b89c420c8b7069254a54e82828667c1db2e6913c", "created": true, "_version": 1, "_index": "logs"}

There's also still the "ElasticSearchException: Unknown exception type" (comment 2).
(In reply to Ricky Rosario [:rrosario, :r1cky] from comment #20)
> January 7 sounds great to me!

The above being from "Bug 1071044 - Upgrade ElasticSearch PHX1 Development cluster to 1.2.x", which matches the timeline here - even though that upgrade was only supposed to affect dev, not prod.

@cyliang, were any other changes made that might have affected the use of elasticsearch-zlb.webapp.scl3.mozilla.com?
Flags: needinfo?(cliang)
OS: Mac OS X → All
Hardware: x86 → All
Summary: Low disk on orangefactor1.dmz.phx1.mozilla.com → Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions
Since this bug was filed, free disk space had already dropped from 10% to 9%, so I've purged all logs older than 3 days from incoming-logs, to buy us some time. The OrangeFactor scores look no different from Jan 7th onwards compared to the days before, so it seemed unlikely that we'd need to re-parse the logs anyway (I presume the failures must occur after the insert into ES that we need for OF).
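(The exact purge invocation isn't recorded here; in rough terms it amounted to something like this hypothetical sketch:)

# Hypothetical sketch of the purge - the exact command used isn't recorded in this bug.
import os
import time

INCOMING = '/home/webtools/apps/logparser/incoming-logs'
cutoff = time.time() - 3 * 24 * 60 * 60  # keep only the last 3 days of logs

for name in os.listdir(INCOMING):
    path = os.path.join(INCOMING, name)
    if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
        os.remove(path)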
There were no changes made to the production ES cluster. The only production modification I had to make was to update the script that is used to copy indexes from production to the development ES instances in PHX. Would it be possible to clarify which ES cluster and which index you're trying to write to? Based on the mention of a "logs" index, I'm assuming that you're writing to the cluster in SCL3. There is both a "logs" index and an "of_logs" index. The only errors that I'm seeing on the SCL3 instance have to do with bzcache. These occur before January 4th, tho, so I don't think they are related.
Flags: needinfo?(cliang)
(In reply to Ed Morley [:edmorley] from comment #4)
> "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line
> 58, in _add_doc
>     raise Exception(json.dumps(result))
> Exception: {"_type": "buildlogs", "_id":
> "1421253183-b89c420c8b7069254a54e82828667c1db2e6913c", "created": true,
> "_version": 1, "_index": "logs"}

That line corresponds to:

    if not 'ok' in result or not result['ok'] or not '_id' in result:
      raise Exception(json.dumps(result))

Looking at the response, it appears successful; there's just no "ok" key in the response any more. A bit of digging found this upstream ElasticSearch change (which removed 'ok' since it's redundant given the HTTP response code):
https://github.com/elasticsearch/elasticsearch/commit/d23f640cd1488727c8337a888527d662c6135fac

However that's only supposed to have landed in ES versions above v1.0 (see the tags on the GitHub commit), yet elasticsearch-zlb.webapp.scl3.mozilla.com is on 0.9.x:

[root@orangefactor1.dmz.phx1 logparser]# curl elasticsearch-zlb.webapp.scl3.mozilla.com:9200
...
  "name" : "elasticsearch1_scl3",
  "version" : {
    "number" : "0.90.10",

Grepping for ES server references (in case we're using the wrong instance):

./logparser/src/mozautolog/mozautolog/buildlog.py:    self.server = 'elasticsearch-zlb.webapp.scl3.mozilla.com:9200'
./logparser/src/mozautolog/mozautolog/mozautolog.py:    self.server = 'elasticsearch-zlb.webapp.scl3.mozilla.com:9200'
./logparser/src/logparser/scripts/savelogs-daemon.sh:    start-stop-daemon --start --user ${USER} --chuid ${USER}:${USER} --pidfile ${PIDFILE} --nicelevel 10 --make-pidfile --background --exec ${SAVELOGS} -- --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} > savelogs.log 2>&1
./logparser/src/logparser/scripts/savelogs.sh:    nohup ${SAVELOGS} --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} > ${ENV}/savelogs.log 2>&1 &
./logparser/bin/savelogs-gen.sh:    nohup ${SAVELOGS} --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} --pulse-cfg=${PULSECFG} > ${ENV}/savelogs.log 2>&1 &
./logparser/bin/savelogs-daemon-gen.sh:    start-stop-daemon --start --user ${USER} --chuid ${USER}:${USER} --pidfile ${PIDFILE} --nicelevel 10 --make-pidfile --background --exec ${SAVELOGS} -- --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pulse-cfg=${PULSECFG} > savelogs.log 2>&1

Now elasticsearch1.metrics.scl3 is decommissioned aiui, and either way there's no response from it on the orangefactor1 box, so the scripts that reference it must be unused. Which seems to be confirmed by:

[root@orangefactor1.dmz.phx1 logparser]# egrep 'savelogs.*sh' /etc/init.d/logparser
DAEMON="$BASEDIR/bin/savelogs-gen.sh"

So we don't appear to be using dev/another v1.0+ instance inadvertently, but we're still seeing the behaviour from the upstream ES change that supposedly only landed in v1.0+? Bizarre. Let's just start checking for "created" rather than "ok" for now anyway.
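Roughly, that check would become something like the following hypothetical sketch (not the actual patch; the helper name is made up):

# Hypothetical sketch only - the helper name is made up and this isn't the
# real mozautolog code. ES 1.x index responses drop the "ok" field but still
# include "created" and "_id", so accept either indicator of success.
import json


def check_index_response(result):
    if '_id' not in result or not (result.get('ok') or result.get('created')):
        raise Exception(json.dumps(result))
    return result['_id']

# The (successful) response from the exception quoted above now passes:
check_index_response({"_type": "buildlogs",
                      "_id": "1421253183-b89c420c8b7069254a54e82828667c1db2e6913c",
                      "created": True, "_version": 1, "_index": "logs"})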
Depends on: 1121941
(In reply to Ed Morley [:edmorley] from comment #4)
> There's also still the "ElasticSearchException: Unknown exception type"
> (comment 2).

We're currently using pyes v0.16 (was v0.15 until bug 1121517 comment ), but v0.19+ gives much better output for this exception:
https://github.com/aparo/pyes/commit/b61b4be45804abbec6fdd0f43e6127d3643e1e8e

However there are lots of breaking changes, so I've backported that fix by hand to logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/convert_errors.py temporarily. This revealed:

Traceback (most recent call last):
  File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
    lp.parseFiles()
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
    self.postResultsToElasticSearch(testdata)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 196, in postResultsToElasticSearch
    self._post_testgroup_to_elasticsearch(data)
  File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 183, in _post_testgroup_to_elasticsearch
    testgroup.submit()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 80, in submit
    self._generate_testrun()
  File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 58, in _generate_testrun
    doc_type = [self.doc_type])
  File "/home/webtools/apps/logparser/src/mozautoeslib/mozautoeslib/eslib.py", line 186, in query
    doc_types=self.doc_type)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 893, in count
    return self._query_call("_count", query, indexes, doc_types, **query_params)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 251, in _query_call
    response = self._send_request('GET', path, body, querystring_args)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 223, in _send_request
    raise_if_error(response.status, decoded)
  File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/convert_errors.py", line 58, in raise_if_error
    raise pyes.exceptions.ElasticSearchException(u"Unknown exception type: %d, %s" % (status, result), status, result)
ElasticSearchException: Unknown exception type: 400, {u'count': 0, u'_shards': {u'successful': 0, u'failed': 5, u'total': 5, u'failures': [{u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][0] ]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 0}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][2] ]; nested: RemoteTransportException[[elasticsearch2_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 2}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][1] ]; nested: RemoteTransportException[[elasticsearch2_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 1}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][4] ]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 4}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][3] ]; nested: RemoteTransportException[[elasticsearch1_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 3}]}}

I've removed the IP addresses, but even without them it's clear that we're accessing elasticsearch2_dev_phx1. Re-reading comment 8 shows that I missed:

./logparser/bin/savelogs-gen.sh:    nohup ${SAVELOGS} --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} --pulse-cfg=${PULSECFG} > ${ENV}/savelogs.log 2>&1 &

Looking at history shows:

  436  2014-10-09 11:45:46 vi bin/savelogs-gen.sh
  437  2014-10-09 11:46:34 vi bin/savelogs-
  438  2014-10-09 11:46:36 vi bin/savelogs-daemon-gen.sh

Jonathan, was something being tested then? I'm presuming we're not intentionally using the dev ES instance? (elasticsearch-zlb.dev.vlan81.phx1.mozilla.com)
Flags: needinfo?(jgriffin)
(In reply to Ed Morley [:edmorley] from comment #9)
> Jonathan, was something being tested then? I'm presuming we're not
> intentionally using the dev ES instance?
> (elasticsearch-zlb.dev.vlan81.phx1.mozilla.com)

Or more to the point - I don't think logparser should be submitting to both prod and dev ES, seeing as aiui there is already a script for mirroring from prod to dev. Either way, at least we now know that these failures affect only the submission to dev, which we can switch off for now.

I've removed the additional "--es-server" param specifying the dev ES instance from savelogs-gen.sh. Whilst I was there, I fixed the out-of-date ES instance listed in the unused savelogs-daemon-gen.sh. I've cleared the contents of incoming-logs (since it was only the submission to dev that was failing) and restarted the logparser service.

All seems good now: no errors in savelogs.err, seemingly normal output in savelogs.out, and logs appearing in finished-logs/.
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jgriffin)
Resolution: --- → FIXED
Summary: Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions → Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions after dev ES upgrade
Meant to say: I'm looking forward to when we can get OrangeFactor v2 created and using Treeherder's API, so we can switch all of this off! The use of customised, not-in-the-repo scripts that pass config params on the command line makes this type of issue much harder than it needs to be. Even using a not-checked-in settings.local.py in the repo would be more discoverable, since an |hg status --ignored| surfaces it pretty easily :-)
That is *awesome* detective work.
Thanks Ed. At some point in the past, the production instance wasn't being mirrored to dev, thus the double posting.
Product: Tree Management → Tree Management Graveyard