Closed
Bug 906840
Opened 11 years ago
Closed 11 years ago
[mozillians-dev] Elasticsearch indexing often fails on dev
Categories
(Infrastructure & Operations Graveyard :: WebOps: Engagement, task, P4)
Infrastructure & Operations Graveyard
WebOps: Engagement
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: sancus, Assigned: bburton)
Details
For some reason, there are generic celery timeouts quite often when manage.py index_all_profiles runs on dev. This completes in ~1.3 seconds on stage, so I'm wondering if dev needs a network flow opened or if something else odd is going on. Also note that it doesn't always time out.
Reporter
Updated•11 years ago
Severity: normal → minor
OS: Windows 7 → All
Hardware: x86_64 → All
Comment 1•11 years ago
Hello. To confirm, mozillians-dev is configured to use the "mozillians_dev" and "mozillians_public_dev" indices, both of which currently contain zero (0) documents. This would be logical if the indexing is *always* timing out, but I'm somewhat confused by your assertion that the timeout behaviour is inconsistent - is it possible to be more specific about the behaviour?
Assignee: server-ops-webops → dmaher
Priority: -- → P4
Reporter
Comment 2•11 years ago
Something keeps annihilating the indexes, but when I go into the admin interface and run index_all_profiles from the button that triggers it, that fixes it. I'm really not sure what could be deleting the entire indexes this way; it's hard to imagine it's broken code, because stage/prod don't have this problem.
Reporter
Comment 3•11 years ago
Actually, to be clear, the problem is that the index_all_profiles function deletes the index before recreating it. So, if reindexing fails for any reason, including a celery timeout, it will wreck the index until the next time the cron runs.
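The delete-before-recreate pattern described above is what makes a single failed run so destructive. Here is a minimal, self-contained sketch (not the mozillians code; the dict-based "cluster", function names, and documents are invented for illustration) contrasting it with a build-then-swap approach, where a failure leaves the old index serving:

```python
# Hypothetical sketch: why deleting an index before reindexing is fragile.
# The "cluster" dict stands in for Elasticsearch; all names are illustrative.

cluster = {"profiles": ["doc1", "doc2"]}   # live index with data

def reindex_delete_first(docs, fail=False):
    """Fragile pattern: the old data is gone before the new data exists."""
    cluster.pop("profiles", None)          # index deleted up front
    cluster["profiles"] = []
    for doc in docs:
        if fail:                           # e.g. a celery timeout mid-run
            raise TimeoutError("indexing task timed out")
        cluster["profiles"].append(doc)

def reindex_build_then_swap(docs, fail=False):
    """Safer pattern: build a fresh index, swap it in only on success."""
    new_index = []
    for doc in docs:
        if fail:
            raise TimeoutError("indexing task timed out")
        new_index.append(doc)
    cluster["profiles"] = new_index        # swap happens only at the end

# A failure mid-run with delete-first empties the index until the next cron:
try:
    reindex_delete_first(["doc1", "doc2"], fail=True)
except TimeoutError:
    pass
after_fragile = list(cluster["profiles"])  # [] -- searches now break

# The same failure with build-then-swap leaves the old index untouched:
cluster["profiles"] = ["doc1", "doc2"]
try:
    reindex_build_then_swap(["doc1", "doc2"], fail=True)
except TimeoutError:
    pass
print(after_fragile, cluster["profiles"])  # [] ['doc1', 'doc2']
```

In real Elasticsearch deployments the swap is usually done with an index alias pointed at the freshly built index, so readers never see an empty index.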
Comment 4•11 years ago
:phrawzty, can you take another look, when you get a chance? This is plaguing our automation, and making us spin needless cycles; thanks!
Flags: needinfo?(dmaher)
Updated•11 years ago
Status: NEW → ASSIGNED
Flags: needinfo?(dmaher)
Comment 5•11 years ago
It would appear that mozillians-dev was configured to use an old (and recently decommissioned) RabbitMQ broker - hence the celery issues. I updated the setting to use the appropriate node.

08:09:40 < phrawzty> sancus: could you please test a celery transaction ?
08:09:47 < phrawzty> sancus: i changed the BROKER_HOST
08:10:13 < sancus> I could manually trigger indexing, I guess
08:11:21 < sancus> seems to have worked
08:18:04 < phrawzty> sancus: does this appear to have solved the errant behaviour ?
08:18:39 < sancus> phrawzty: hard to be sure given the intermittent nature, but lets say it has and I'll reopen the bug if it hasn't
08:18:50 < phrawzty> fair enough.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 6•11 years ago
I hate to be the bearer of bad tidings, but the person index on mozillians-dev.allizom.org has disappeared again, making automation fail.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 7•11 years ago
Kicking back to Web Ops -- we need traction on this, and I understand :phrawzty is out for a while.
Assignee: dmaher → server-ops-webops
Assignee
Comment 8•11 years ago
I took a look at things at the request of :stephend, since :phrawzty is on leave for a bit. I thought it might be a configuration problem, because we recently rebuilt and replaced the generic-celery server for dev, but a review of the configs shows everything is in order:

* The application is configured to use the RabbitMQ daemon running on generic-celery1.dev.webapp.phx1, which is where the celery jobs also run
* RabbitMQ is configured and the app has the correct auth and permissions information
* The app is configured to use ES on elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200, which it is able to reach, as shown below
* The celeryd logs don't show any errors that would seem to point at a source for this issue

---------- ES test ----------

[root@generic-celery1.dev.webapp.phx1 ~]# curl elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200
{
  "ok" : true,
  "status" : 200,
  "name" : "elasticsearch3_dev_phx1",
  "version" : {
    "number" : "0.20.5",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search"
}

So I can't see anything that jumps out at me from a configuration and network-flow point of view, but it's not clear to me how these jobs are run on a schedule. Usually we'd have a cron job that runs this, but I don't see any except the cron job that updates the code on dev.

Can you provide some details on what causes this job to be run regularly? Thanks
Assignee: server-ops-webops → bburton
Severity: minor → normal
Status: REOPENED → ASSIGNED
Comment 9•11 years ago
We had a cronjob updating the index every hour *and* the site update script, which runs every 15 minutes, updated the index as well. Every hour the two update jobs ran in parallel, and maybe that's the root of the problem. We did remove the cronjob (bug 909664) and AFAIK we haven't had this problem again.
Solarce, when you get a chance, can you look again? <3
Flags: needinfo?(bburton)
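The overlapping hourly cron and 15-minute update script are a classic concurrent-reindex hazard. One common guard (a hedged sketch, not what mozillians actually does; the lockfile path and function names are invented) is an atomic lockfile so an overlapping run backs off instead of racing:

```python
# Hypothetical sketch: serialize reindex runs with an atomic lockfile.
# os.O_CREAT | os.O_EXCL makes creation fail if the file already exists,
# so only one job at a time can hold the "lock".
import os
import tempfile

LOCK_DIR = tempfile.mkdtemp()                       # unique dir per run
LOCKFILE = os.path.join(LOCK_DIR, "index_all_profiles.lock")  # illustrative path

def try_acquire_lock(path=LOCKFILE):
    """Create the lockfile atomically; fail if another run holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def release_lock(path=LOCKFILE):
    os.unlink(path)

# First "job" gets the lock; an overlapping second job backs off.
first = try_acquire_lock()
second = try_acquire_lock()
print(first, second)  # True False
release_lock()
```

In production a stale-lock check (e.g. comparing the file's mtime against a maximum expected runtime) would also be needed so a crashed job doesn't block reindexing forever.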
Assignee
Comment 11•11 years ago
(In reply to Giorgos Logiotatidis [:giorgos] from comment #9)
> We had a cronjob updating every hour *and* the site update script which run
> every 15' updated the index as well. Every hour the two update jobs run in
> parallel, and maybe that's the root of the problem.
>
> We did remove the cronjob (bug 909664) and AFAIK we didn't have this problem
> again.

Ok, so I found the relevant code in https://github.com/mozilla/mozillians/blob/c786a8025aa1b71f809bf76de24a3687d4e4fdb5/mozillians/users/cron.py#L15

When I run the command by hand it acts like it can't connect to RabbitMQ, but the connection info in local.py is correct:

[root@genericadm.private.phx1 mozillians]# python2.6 manage.py cron index_all_profiles
/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/decorators.py:34: CDeprecationWarning:
The `celery.decorators` module along with the magic keyword arguments,
are deprecated, and will be removed in version 3.0.
Please use the `celery.task` module instead of `celery.decorators`,
and the `task.request` should be used instead of the magic keyword
arguments:
    from celery.task import task
See http://bit.ly/celery22major for more information.
"""))
Traceback (most recent call last):
  File "manage.py", line 23, in <module>
    manage.main()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/src/funfactory/funfactory/manage.py", line 143, in main
    execute_manager(current_settings)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/__init__.py", line 459, in execute_manager
    utility.execute()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/__init__.py", line 382, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/base.py", line 196, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/base.py", line 232, in execute
    output = self.handle(*args, **options)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/src/django-cronjobs/cronjobs/management/commands/cron.py", line 64, in handle
    registered[script](*args)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/mozillians/users/cron.py", line 41, in index_all_profiles
    TaskSet(ts).apply_async()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/task/sets.py", line 138, in apply_async
    pub = publisher or self.Publisher(connection=conn)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/app/amqp.py", line 328, in TaskPublisher
    return TaskPublisher(*args, **self.app.merge(defaults, kwargs))
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/app/amqp.py", line 158, in __init__
    super(TaskPublisher, self).__init__(*args, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/compat.py", line 61, in __init__
    super(Publisher, self).__init__(connection, self.exchange, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/messaging.py", line 79, in __init__
    self.revive(self.channel)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/messaging.py", line 168, in revive
    channel = channel.default_channel
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 581, in default_channel
    self.connection
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 574, in connection
    self._connection = self._establish_connection()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 533, in _establish_connection
    conn = self.transport.establish_connection()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/transport/amqplib.py", line 279, in establish_connection
    connect_timeout=conninfo.connect_timeout)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/transport/amqplib.py", line 89, in __init__
    super(Connection, self).__init__(*args, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/connection.py", line 129, in __init__
    self.transport = create_transport(host, connect_timeout, ssl)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/transport.py", line 281, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/transport.py", line 85, in __init__
    raise socket.error, msg
socket.timeout: timed out

Where in the code does it decide how to connect to RabbitMQ?
Assignee
Updated•11 years ago
Flags: needinfo?(bburton)
Assignee
Updated•11 years ago
Flags: needinfo?(giorgos)
Comment 12•11 years ago
I guess the code that connects to RabbitMQ is somewhere in the amqplib library that django-celery uses. These are "standard" celery libraries and AFAIK we don't use any beta / old releases anymore (we did until recently; now we're following standard playdoh). Since the conf in local.py is correct, are we sure that there is an open network flow between the machine and RabbitMQ?
Flags: needinfo?(giorgos)
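The network-flow question can be answered independently of celery with a plain TCP connect from the app host to the broker port. A minimal sketch (the helper name is invented; 5672 is the conventional AMQP port, and the demo below uses a local listener so it runs anywhere):

```python
# Hedged sketch: check whether a host:port is reachable over TCP,
# independent of celery/kombu, to isolate "network flow" from app config.
import socket

def port_reachable(host, port, timeout=3.0):
    """Try a plain TCP connect; True means the flow and a listener exist."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a listener we control, so the check is deterministic:
server = socket.socket()
server.bind(("127.0.0.1", 0))          # OS picks a free port
server.listen(1)
host, port = server.getsockname()
reachable = port_reachable(host, port)
print(reachable)  # True
server.close()
```

In this bug's setting the real check would be something like `port_reachable("generic-celery1.dev.webapp.phx1", 5672)`: True rules out the firewall, leaving credentials, vhost, or broker config as suspects.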
Assignee
Comment 13•11 years ago
(In reply to Giorgos Logiotatidis [:giorgos] from comment #12)
> I guess the code that connects to rabbitmq is somewhere in the amqblib
> library that django-celery uses. These are "standard" celery libraries and
> afaik we don't use any beta / old releases anymore (we did until recently,
> now we're following standard playdoh).
>
> Since the conf in local.py is correct, we are sure that there is an open
> network flow between the machine and rabbitmq?

Yes, I confirmed all flows when I first saw the error. Can you add some debug code that dumps the connection settings when I run it from the CLI, or give me a patch I can apply to do that? At this point we need some insight into exactly what's happening at runtime.
Assignee
Updated•11 years ago
Flags: needinfo?(giorgos)
Comment 14•11 years ago
Let's start with printing the broker URL in the Python shell:
./manage.py shell
> from django.conf import settings
> settings.BROKER_URL
Is that correct?
Flags: needinfo?(giorgos)
Assignee
Comment 15•11 years ago
After much debugging, including editing vendor-local/lib/python/celery/task/sets.py to dump its connection information, I finally found the problem. The BROKER_VHOST and BROKER_USER settings were set to 'mozillians_dev' but should have been 'mozillians-dev', something that my eyes just skipped over when checking the settings before. With that fixed and the config pushed out, I'm seeing tasks flow:

[2013-09-13 18:03:26,908: INFO/MainProcess] Task mozillians.users.tasks.index_objects[773cd331-e3b3-4c8b-a74e-f0abe034fc2f] succeeded in 18.2593507767s: None
[2013-09-13 18:03:26,913: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[98e746eb-7e48-4232-a58a-7f73eccf7cb9]
[2013-09-13 18:03:27,700: INFO/MainProcess] Task mozillians.users.tasks.index_objects[b76b39b1-25c2-46a4-85ee-9edb5e0274c6] succeeded in 18.2177550793s: None
[2013-09-13 18:03:27,704: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[09f3fa1e-0157-4275-834a-a2baee680e02]
[2013-09-13 18:03:30,898: INFO/MainProcess] Task mozillians.users.tasks.index_objects[50706296-9a3d-4226-854c-000bae96cbc9] succeeded in 17.6964728832s: None
[2013-09-13 18:03:30,901: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[3b244828-38d3-4826-abe3-2fec75845199]

There is a backlog of reindexing tasks, so I am going to monitor the task queue and, once it is done, confirm the re-indexing worked as expected; if so, I'll resolve the bug.

Thanks so much for your patience
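Typos like 'mozillians_dev' vs 'mozillians-dev' are easy for eyes to skip and surface only as opaque connection failures. A small startup check (a hedged sketch; the `EXPECTED` values and function name are illustrative, not part of the mozillians codebase) would make such mismatches fail loudly:

```python
# Hypothetical sketch: validate broker settings against expected values at
# startup, so an underscore/hyphen typo fails with a clear message instead
# of an opaque connection error at task-dispatch time.
EXPECTED = {"BROKER_USER": "mozillians-dev", "BROKER_VHOST": "mozillians-dev"}

def check_broker_settings(settings):
    """Return a list of human-readable mismatches (empty means OK)."""
    problems = []
    for key, expected in EXPECTED.items():
        actual = settings.get(key)
        if actual != expected:
            problems.append(f"{key}={actual!r}, expected {expected!r}")
    return problems

# The misconfiguration from this bug is reported explicitly:
bad = {"BROKER_USER": "mozillians_dev", "BROKER_VHOST": "mozillians_dev"}
print(check_broker_settings(bad))
# The corrected settings pass cleanly:
good = {"BROKER_USER": "mozillians-dev", "BROKER_VHOST": "mozillians-dev"}
print(check_broker_settings(good))  # []
```

Hard-coding expected values is only workable per-environment; a lighter variant would merely warn when user and vhost mix underscores and hyphens.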
Reporter
Comment 16•11 years ago
Thank you for tracking this down!
Assignee
Comment 17•11 years ago
Indexing is working now; it seems to take 15-20 minutes from when the indexes are dropped until search works again. Since there is very little data in the dev DB, there aren't a lot of results to search on - https://mozillians-dev.allizom.org/en-US/search/?q=&limit= only returns five people total - but I did see the index get dropped (searching returned 0 results), and then after a few minutes all five results came back.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 18•11 years ago
Cool. For posterity's sake, I'll just point out that those five results are the only five with /public/ profiles. The same query while logged-in brings up 20 pages of results.
Updated•8 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard