Closed Bug 906840 Opened 7 years ago Closed 7 years ago

[mozillians-dev] Elasticsearch indexing often fails on dev

Categories

(Infrastructure & Operations Graveyard :: WebOps: Engagement, task, P4)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sancus, Assigned: bburton)

Details

For some reason, there are generic celery timeouts quite often when manage.py index_all_profiles runs on dev.

This completes in ~1.3 seconds on stage, so I'm wondering if dev needs a network flow opened or if something else odd is going on. Also note that it doesn't always time out.
Severity: normal → minor
OS: Windows 7 → All
Hardware: x86_64 → All
Hello,

To confirm: mozillians-dev is configured to use the "mozillians_dev" and "mozillians_public_dev" indices, both of which currently contain zero (0) documents. This would be logical if the indexing were *always* timing out, but I'm somewhat confused by your assertion that the timeout behaviour is inconsistent. Could you be more specific about the behaviour?
Assignee: server-ops-webops → dmaher
Priority: -- → P4
Something keeps annihilating the indexes, but when I go into the admin interface and trigger index_all_profiles from the button there, it fixes them. I'm really not sure what could be deleting the entire indexes this way; it's hard to imagine it's broken code, because stage/prod don't have this problem.
Actually, to be clear, the problem is that the index_all_profiles function deletes the index before recreating it. So if reindexing fails for any reason, including a celery timeout, the index stays broken until the next time the cron runs.
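The delete-then-recreate pattern described above has exactly that failure window. A common mitigation is to build the new index under a fresh name and only repoint an alias once indexing succeeds. Below is a minimal sketch of that pattern; the dict-based "cluster" and the function name `reindex_with_alias_swap` are hypothetical stand-ins for illustration, not the actual mozillians code or the Elasticsearch client API.

```python
import time

def reindex_with_alias_swap(cluster, alias, build_index):
    """Build a fresh index under a new name, then atomically repoint the
    alias. If build_index() raises, the old index is left untouched."""
    new_name = "%s_%d" % (alias, int(time.time()))
    docs = build_index()                   # may raise (e.g. a celery timeout)
    cluster["indices"][new_name] = docs
    old = cluster["aliases"].get(alias)
    cluster["aliases"][alias] = new_name   # the only visible switch
    if old:
        del cluster["indices"][old]        # clean up the stale index

# Toy in-memory stand-in for the cluster state.
cluster = {"indices": {}, "aliases": {}}
reindex_with_alias_swap(cluster, "mozillians_dev", lambda: ["profile1"])
```

With this shape, a timeout during indexing leaves searches pointed at the previous index instead of at nothing.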
:phrawzty, can you take another look, when you get a chance?  This is plaguing our automation, and making us spin needless cycles; thanks!
Flags: needinfo?(dmaher)
Status: NEW → ASSIGNED
Flags: needinfo?(dmaher)
It would appear that mozillians-dev was configured to use an old (and recently decommissioned) rabbitmq broker - hence the celery issues.  I updated the setting to use the appropriate node.

08:09:40 < phrawzty> sancus: could you please test a celery transaction ?
08:09:47 < phrawzty> sancus: i changed the BROKER_HOST
08:10:13 < sancus> I could manually trigger indexing, I guess
08:11:21 < sancus> seems to have worked
08:18:04 < phrawzty> sancus: does this appear to have solved the errant behaviour ?
08:18:39 < sancus> phrawzty: hard to be sure given the intermittent nature, but lets say it has and I'll reopen the bug if it hasn't
08:18:50 < phrawzty> fair enough.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I hate to be the bearer of bad tidings, but the person index on mozillians-dev.allizom.org has disappeared again, making automation fail.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Kicking back to Web Ops -- we need traction on this, and I understand :phrawzty is out for a while.
Assignee: dmaher → server-ops-webops
At :stephend's request, I took a look at things, since :phrawzty is on leave for a bit.

I thought it might be a configuration problem, because we recently rebuilt and replaced the generic-celery server for dev, but a review of the configs shows everything is in order:

* The application is configured to use the RabbitMQ daemon running on generic-celery1.dev.webapp.phx1, which is also where the celery jobs run
* RabbitMQ is configured and the app has the correct auth and permissions information
* The app is configured to use ES on elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200, which it is able to reach, as shown below
* The celeryd logs don't show any errors that would seem to point at a source for this issue


----------
ES test

[root@generic-celery1.dev.webapp.phx1 ~]# curl elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200
{
  "ok" : true,
  "status" : 200,
  "name" : "elasticsearch3_dev_phx1",
  "version" : {
    "number" : "0.20.5",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search"
}
-----

So nothing jumps out at me from a configuration or network flow point of view, but it's not clear to me how these jobs are run on a schedule. Usually we'd have a cron job that runs this, but the only cron job I see is the one that updates the code on dev.

Can you provide some details on what causes this job to be run regularly?

Thanks
Assignee: server-ops-webops → bburton
Severity: minor → normal
Status: REOPENED → ASSIGNED
We had a cronjob updating the index every hour *and* the site update script, which runs every 15 minutes, updated it as well. Every hour the two update jobs ran in parallel, and maybe that's the root of the problem.

We did remove the cronjob (bug 909664), and AFAIK we haven't seen this problem again.
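Beyond removing the duplicate cronjob, one common guard against two update jobs colliding is a non-blocking file lock, so a run that finds the lock held skips instead of racing. A stdlib-only sketch (the lock path and function name are hypothetical, not from the mozillians codebase; `fcntl` makes this Unix-specific, which matches the RHEL6 hosts here):

```python
import fcntl
import os

def run_exclusively(lock_path, job):
    """Run job() while holding an exclusive flock on lock_path.
    Returns False without running job() if another process holds the lock."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError):
            return False          # another reindex is in flight: skip this run
        job()
        return True
    finally:
        os.close(fd)              # closing the fd releases the lock
```

Wrapping both the hourly cron and the 15-minute update script's reindex call in something like this would serialize them even if both schedules were kept.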
Solarce, when you get a chance, can you look again?  <3
Flags: needinfo?(bburton)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #9)
> We had a cronjob updating every hour *and* the site update script which run
> every 15' updated the index as well. Every hour the two update jobs run in
> parallel, and maybe that's the root of the problem.
> 
> We did remove the cronjob (bug 909664) and AFAIK we didn't have this problem
> again.

Ok, so I found the relevant code in https://github.com/mozilla/mozillians/blob/c786a8025aa1b71f809bf76de24a3687d4e4fdb5/mozillians/users/cron.py#L15

When I run the command by hand, it acts as if it can't connect to RabbitMQ, but the connection info in local.py is correct:

[root@genericadm.private.phx1 mozillians]# python2.6 manage.py cron index_all_profiles
/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/decorators.py:34: CDeprecationWarning:
The `celery.decorators` module along with the magic keyword arguments,
are deprecated, and will be removed in version 3.0.

Please use the `celery.task` module instead of `celery.decorators`,
and the `task.request` should be used instead of the magic keyword arguments:

    from celery.task import task

See http://bit.ly/celery22major for more information.


  """))
Traceback (most recent call last):
  File "manage.py", line 23, in <module>
    manage.main()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/src/funfactory/funfactory/manage.py", line 143, in main
    execute_manager(current_settings)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/__init__.py", line 459, in execute_manager
    utility.execute()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/__init__.py", line 382, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/base.py", line 196, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/base.py", line 232, in execute
    output = self.handle(*args, **options)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/src/django-cronjobs/cronjobs/management/commands/cron.py", line 64, in handle
    registered[script](*args)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/mozillians/users/cron.py", line 41, in index_all_profiles
    TaskSet(ts).apply_async()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/task/sets.py", line 138, in apply_async
    pub = publisher or self.Publisher(connection=conn)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/app/amqp.py", line 328, in TaskPublisher
    return TaskPublisher(*args, **self.app.merge(defaults, kwargs))
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/app/amqp.py", line 158, in __init__
    super(TaskPublisher, self).__init__(*args, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/compat.py", line 61, in __init__
    super(Publisher, self).__init__(connection, self.exchange, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/messaging.py", line 79, in __init__
    self.revive(self.channel)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/messaging.py", line 168, in revive
    channel = channel.default_channel
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 581, in default_channel
    self.connection
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 574, in connection
    self._connection = self._establish_connection()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 533, in _establish_connection
    conn = self.transport.establish_connection()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/transport/amqplib.py", line 279, in establish_connection
    connect_timeout=conninfo.connect_timeout)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/transport/amqplib.py", line 89, in __init__
    super(Connection, self).__init__(*args, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/connection.py", line 129, in __init__
    self.transport = create_transport(host, connect_timeout, ssl)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/transport.py", line 281, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/transport.py", line 85, in __init__
    raise socket.error, msg
socket.timeout: timed out

Where in the code does it decide how to connect to RabbitMQ?
Flags: needinfo?(bburton)
Flags: needinfo?(giorgos)
I guess the code that connects to RabbitMQ is somewhere in the amqplib library that django-celery uses. These are "standard" celery libraries, and AFAIK we don't use any beta/old releases anymore (we did until recently; now we're following standard playdoh).

Since the conf in local.py is correct, are we sure that there is an open network flow between the machine and RabbitMQ?
Flags: needinfo?(giorgos)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #12)
> I guess the code that connects to rabbitmq is somewhere in the amqblib
> library that django-celery uses. These are "standard" celery libraries and
> afaik we don't use any beta / old releases anymore (we did until recently,
> now we're following standard playdoh).
> 
> Since the conf in local.py is correct, we are sure that there is an open
> network flow between the machine and rabbitmq?

Yes, I confirmed all flows when I first saw the error. Can you add some debug code that dumps the connection settings when I run it from the CLI, or give me a patch I can apply to do that? At this point we need some insight into exactly what's happening at runtime.
Flags: needinfo?(giorgos)
Let's start with printing the broker URL in the Python shell:

./manage.py shell
> from django.conf import settings
> settings.BROKER_URL

Is that correct?
Flags: needinfo?(giorgos)
After much debugging, including editing vendor-local/lib/python/celery/task/sets.py to dump its connection information, I finally found the problem: the BROKER_VHOST and BROKER_USER settings were set to 'mozillians_dev' but should have been 'mozillians-dev', a detail my eyes just skipped over when checking the settings before.
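For reference, the mismatch was of this shape in the Django settings (values reconstructed from the description above; this is not the actual contents of local.py):

```python
# Reconstructed illustration, not the real local.py.
# An underscore where a hyphen belongs is easy to skim past:
BROKER_VHOST = "mozillians-dev"   # was "mozillians_dev" (wrong vhost)
BROKER_USER = "mozillians-dev"    # was "mozillians_dev" (wrong user)
```

A wrong vhost or user makes the broker reject or stall the connection even though the host and port are reachable, which is consistent with the socket timeout in the traceback above.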

With that fixed and the config pushed out, I'm seeing tasks flow:

[2013-09-13 18:03:26,908: INFO/MainProcess] Task mozillians.users.tasks.index_objects[773cd331-e3b3-4c8b-a74e-f0abe034fc2f] succeeded in 18.2593507767s: None
[2013-09-13 18:03:26,913: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[98e746eb-7e48-4232-a58a-7f73eccf7cb9]
[2013-09-13 18:03:27,700: INFO/MainProcess] Task mozillians.users.tasks.index_objects[b76b39b1-25c2-46a4-85ee-9edb5e0274c6] succeeded in 18.2177550793s: None
[2013-09-13 18:03:27,704: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[09f3fa1e-0157-4275-834a-a2baee680e02]
[2013-09-13 18:03:30,898: INFO/MainProcess] Task mozillians.users.tasks.index_objects[50706296-9a3d-4226-854c-000bae96cbc9] succeeded in 17.6964728832s: None
[2013-09-13 18:03:30,901: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[3b244828-38d3-4826-abe3-2fec75845199]

There is a backlog of reindexing tasks, so I am going to monitor the task queue. Once it is done, I'll confirm the re-indexing worked as expected; if so, I'll resolve the bug as FIXED.

Thanks so much for your patience
Thank you for tracking this down!
Indexing is working now; it seems to take 15-20 minutes from when the indexes are dropped until search works again.

Since there is very little data in the dev DB, there aren't a lot of results to search on; https://mozillians-dev.allizom.org/en-US/search/?q=&limit= only returns five people total. But I did see the index get dropped and searching return 0 results, and then after a few minutes all five results came back.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Cool. 

For posterity's sake, I'll just point out that those five results are the only five with /public/ profiles. The same query while logged in brings up 20 pages of results.
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard