Closed Bug 906840 Opened 7 years ago Closed 7 years ago

[mozillians-dev] Elasticsearch indexing often fails on dev

Categories

(Infrastructure & Operations Graveyard :: WebOps: Engagement, task, P4)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sancus, Assigned: bburton)

Details

For some reason, there are generic celery timeouts quite often when manage.py index_all_profiles runs on dev.

This completes in ~1.3 seconds on stage, so I'm wondering if dev needs a network flow opened or if something else odd is going on. Also note that it doesn't always time out.
Severity: normal → minor
OS: Windows 7 → All
Hardware: x86_64 → All
Hello,

To confirm: mozillians-dev is configured to use the "mozillians_dev" and "mozillians_public_dev" indices, both of which currently contain zero (0) documents. This would be logical if the indexing were *always* timing out, but I'm somewhat confused by your assertion that the timeout behaviour is inconsistent. Could you be more specific about the behaviour?
Assignee: server-ops-webops → dmaher
Priority: -- → P4
Something keeps annihilating the indexes, but when I go into the admin interface and trigger index_all_profiles from the button there, it fixes them. I'm really not sure what could be deleting the entire indexes this way; it's hard to imagine it's broken code, because stage/prod don't have this problem.
Actually, to be clear, the problem is that the index_all_profiles function deletes the index before recreating it. So if reindexing fails for any reason, including a celery timeout, the index stays broken until the next time the cron runs.
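The delete-then-recreate pattern described above has exactly that failure window. A common mitigation is to build the new index under a fresh name and only repoint an alias once indexing succeeds. Below is a minimal sketch of that pattern; the dict-based "cluster" and the function name `reindex_with_alias_swap` are hypothetical stand-ins for illustration, not the actual mozillians code or the Elasticsearch client API.

```python
import time

def reindex_with_alias_swap(cluster, alias, build_index):
    """Build a fresh index under a new name, then atomically repoint the
    alias. If build_index() raises, the old index is left untouched."""
    new_name = "%s_%d" % (alias, int(time.time()))
    docs = build_index()                   # may raise (e.g. a celery timeout)
    cluster["indices"][new_name] = docs
    old = cluster["aliases"].get(alias)
    cluster["aliases"][alias] = new_name   # the only visible switch
    if old:
        del cluster["indices"][old]        # clean up the stale index

# Toy in-memory stand-in for the cluster state.
cluster = {"indices": {}, "aliases": {}}
reindex_with_alias_swap(cluster, "mozillians_dev", lambda: ["profile1"])
```

With this shape, a timeout during indexing leaves searches pointed at the previous index instead of at nothing.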
:phrawzty, can you take another look, when you get a chance?  This is plaguing our automation, and making us spin needless cycles; thanks!
Flags: needinfo?(dmaher)
Status: NEW → ASSIGNED
Flags: needinfo?(dmaher)
It would appear that mozillians-dev was configured to use an old (and recently decommissioned) rabbitmq broker - hence the celery issues.  I updated the setting to use the appropriate node.

08:09:40 < phrawzty> sancus: could you please test a celery transaction ?
08:09:47 < phrawzty> sancus: i changed the BROKER_HOST
08:10:13 < sancus> I could manually trigger indexing, I guess
08:11:21 < sancus> seems to have worked
08:18:04 < phrawzty> sancus: does this appear to have solved the errant behaviour ?
08:18:39 < sancus> phrawzty: hard to be sure given the intermittent nature, but lets say it has and I'll reopen the bug if it hasn't
08:18:50 < phrawzty> fair enough.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I hate to be the bearer of bad tidings, but the person index on mozillians-dev.allizom.org has disappeared again, making automation fail.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Kicking back to Web Ops -- we need traction on this, and I understand :phrawzty is out for a while.
Assignee: dmaher → server-ops-webops
At :stephend's request, I took a look at things, since :phrawzty is on leave for a bit.

I thought it might be a configuration problem, because we recently rebuilt and replaced the generic-celery server for dev, but a review of the configs shows everything is in order:

* The application is configured to use the RabbitMQ daemon running on generic-celery1.dev.webapp.phx1, which is also where the celery jobs run
* RabbitMQ is configured and the app has the correct auth and permissions information
* The app is configured to use ES on elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200, which it is able to reach, as shown below
* The celeryd logs don't show any errors that would seem to point at a source for this issue


----------
ES test

[root@generic-celery1.dev.webapp.phx1 ~]# curl elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200
{
  "ok" : true,
  "status" : 200,
  "name" : "elasticsearch3_dev_phx1",
  "version" : {
    "number" : "0.20.5",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search"
}
-----

So nothing jumps out at me from a configuration or network flow point of view, but it's not clear to me how these jobs are run on a schedule. Usually we'd have a cron job that runs this, but the only cron job I see is the one that updates the code on dev.

Can you provide some details on what causes this job to be run regularly?

Thanks
Assignee: server-ops-webops → bburton
Severity: minor → normal
Status: REOPENED → ASSIGNED
We had a cronjob updating the index every hour *and* the site update script, which runs every 15 minutes, updated it as well. Every hour the two update jobs ran in parallel, and maybe that's the root of the problem.

We did remove the cronjob (bug 909664), and AFAIK we haven't seen this problem again.
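Beyond removing the duplicate cronjob, one common guard against two update jobs colliding is a non-blocking file lock, so a run that finds the lock held skips instead of racing. A stdlib-only sketch (the lock path and function name are hypothetical, not from the mozillians codebase; `fcntl` makes this Unix-specific, which matches the RHEL6 hosts here):

```python
import fcntl
import os

def run_exclusively(lock_path, job):
    """Run job() while holding an exclusive flock on lock_path.
    Returns False without running job() if another process holds the lock."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError):
            return False          # another reindex is in flight: skip this run
        job()
        return True
    finally:
        os.close(fd)              # closing the fd releases the lock
```

Wrapping both the hourly cron and the 15-minute update script's reindex call in something like this would serialize them even if both schedules were kept.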
Solarce, when you get a chance, can you look again?  <3
Flags: needinfo?(bburton)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #9)
> We had a cronjob updating every hour *and* the site update script which run
> every 15' updated the index as well. Every hour the two update jobs run in
> parallel, and maybe that's the root of the problem.
> 
> We did remove the cronjob (bug 909664) and AFAIK we didn't have this problem
> again.

Ok, so I found the relevant code in https://github.com/mozilla/mozillians/blob/c786a8025aa1b71f809bf76de24a3687d4e4fdb5/mozillians/users/cron.py#L15

When I run the command by hand, it acts as if it can't connect to RabbitMQ, but the connection info in local.py is correct:

[root@genericadm.private.phx1 mozillians]# python2.6 manage.py cron index_all_profiles
/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/decorators.py:34: CDeprecationWarning:
The `celery.decorators` module along with the magic keyword arguments,
are deprecated, and will be removed in version 3.0.

Please use the `celery.task` module instead of `celery.decorators`,
and the `task.request` should be used instead of the magic keyword arguments:

    from celery.task import task

See http://bit.ly/celery22major for more information.


  """))
Traceback (most recent call last):
  File "manage.py", line 23, in <module>
    manage.main()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/src/funfactory/funfactory/manage.py", line 143, in main
    execute_manager(current_settings)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/__init__.py", line 459, in execute_manager
    utility.execute()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/__init__.py", line 382, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/base.py", line 196, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/lib/python/django/core/management/base.py", line 232, in execute
    output = self.handle(*args, **options)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor/src/django-cronjobs/cronjobs/management/commands/cron.py", line 64, in handle
    registered[script](*args)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/mozillians/users/cron.py", line 41, in index_all_profiles
    TaskSet(ts).apply_async()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/task/sets.py", line 138, in apply_async
    pub = publisher or self.Publisher(connection=conn)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/app/amqp.py", line 328, in TaskPublisher
    return TaskPublisher(*args, **self.app.merge(defaults, kwargs))
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/celery/app/amqp.py", line 158, in __init__
    super(TaskPublisher, self).__init__(*args, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/compat.py", line 61, in __init__
    super(Publisher, self).__init__(connection, self.exchange, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/messaging.py", line 79, in __init__
    self.revive(self.channel)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/messaging.py", line 168, in revive
    channel = channel.default_channel
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 581, in default_channel
    self.connection
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 574, in connection
    self._connection = self._establish_connection()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/connection.py", line 533, in _establish_connection
    conn = self.transport.establish_connection()
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/transport/amqplib.py", line 279, in establish_connection
    connect_timeout=conninfo.connect_timeout)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/kombu/transport/amqplib.py", line 89, in __init__
    super(Connection, self).__init__(*args, **kwargs)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/connection.py", line 129, in __init__
    self.transport = create_transport(host, connect_timeout, ssl)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/transport.py", line 281, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/data/genericrhel6-dev/src/mozillians-dev.allizom.org/mozillians/vendor-local/lib/python/amqplib/client_0_8/transport.py", line 85, in __init__
    raise socket.error, msg
socket.timeout: timed out

Where in the code does it decide how to connect to RabbitMQ?
Flags: needinfo?(bburton)
Flags: needinfo?(giorgos)
I guess the code that connects to RabbitMQ is somewhere in the amqplib library that django-celery uses. These are "standard" celery libraries, and AFAIK we don't use any beta/old releases anymore (we did until recently; now we're following standard playdoh).

Since the conf in local.py is correct, are we sure that there is an open network flow between the machine and RabbitMQ?
Flags: needinfo?(giorgos)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #12)
> I guess the code that connects to rabbitmq is somewhere in the amqblib
> library that django-celery uses. These are "standard" celery libraries and
> afaik we don't use any beta / old releases anymore (we did until recently,
> now we're following standard playdoh).
> 
> Since the conf in local.py is correct, we are sure that there is an open
> network flow between the machine and rabbitmq?

Yes, I confirmed all flows when I first saw the error. Can you add some debug code that dumps the connection settings when I run it from the CLI, or give me a patch I can apply to do that? At this point we need some insight into exactly what's happening at runtime.
Flags: needinfo?(giorgos)
Let's start with printing the broker URL in the Python shell:

./manage.py shell
> from django.conf import settings
> settings.BROKER_URL

Is that correct?
Flags: needinfo?(giorgos)
After much debugging, including editing vendor-local/lib/python/celery/task/sets.py to dump its connection information, I finally found the problem: the BROKER_VHOST and BROKER_USER settings were set to 'mozillians_dev' but should have been 'mozillians-dev', a detail my eyes just skipped over when checking the settings before.
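For reference, the mismatch was of this shape in the Django settings (values reconstructed from the description above; this is not the actual contents of local.py):

```python
# Reconstructed illustration, not the real local.py.
# An underscore where a hyphen belongs is easy to skim past:
BROKER_VHOST = "mozillians-dev"   # was "mozillians_dev" (wrong vhost)
BROKER_USER = "mozillians-dev"    # was "mozillians_dev" (wrong user)
```

A wrong vhost or user makes the broker reject or stall the connection even though the host and port are reachable, which is consistent with the socket timeout in the traceback above.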

With that fixed and the config pushed out, I'm seeing tasks flow:

[2013-09-13 18:03:26,908: INFO/MainProcess] Task mozillians.users.tasks.index_objects[773cd331-e3b3-4c8b-a74e-f0abe034fc2f] succeeded in 18.2593507767s: None
[2013-09-13 18:03:26,913: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[98e746eb-7e48-4232-a58a-7f73eccf7cb9]
[2013-09-13 18:03:27,700: INFO/MainProcess] Task mozillians.users.tasks.index_objects[b76b39b1-25c2-46a4-85ee-9edb5e0274c6] succeeded in 18.2177550793s: None
[2013-09-13 18:03:27,704: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[09f3fa1e-0157-4275-834a-a2baee680e02]
[2013-09-13 18:03:30,898: INFO/MainProcess] Task mozillians.users.tasks.index_objects[50706296-9a3d-4226-854c-000bae96cbc9] succeeded in 17.6964728832s: None
[2013-09-13 18:03:30,901: INFO/MainProcess] Got task from broker: mozillians.users.tasks.index_objects[3b244828-38d3-4826-abe3-2fec75845199]

There is a backlog of reindexing tasks, so I am going to monitor the task queue. Once it is done, I'll confirm the re-indexing worked as expected; if so, I'll resolve the bug as FIXED.

Thanks so much for your patience
Thank you for tracking this down!
Indexing is working now; it seems to take 15-20 minutes from when the indexes are dropped until search works again.

Since there is very little data in the dev DB, there aren't a lot of results to search on; https://mozillians-dev.allizom.org/en-US/search/?q=&limit= only returns five people total. But I did see the index get dropped and searching return 0 results, and then after a few minutes all five results came back.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Cool. 

For posterity's sake, I'll just point out that those five results are the only five with /public/ profiles. The same query while logged in brings up 20 pages of results.
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard