Closed Bug 654567 Opened 11 years ago Closed 11 years ago

Implement ElasticSearch API in middleware

Categories

(Socorro :: General, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: adrian, Assigned: adrian)

References

Details

(Whiteboard: [qa?])

Attachments

(3 files, 6 obsolete files)

We should have a simple REST API in the middleware for querying ElasticSearch.
Assignee: nobody → agaudebert
The documentation for this API is in the wiki: https://wiki.mozilla.org/Socorro/ElasticSearch_API

A first "implementation plan" can be seen here: http://adrian.gaudebert.fr/downloads/images/SocorroMiddlewareAPIImplementation.png

Please make any comment or suggestion!
Attached patch Middleware API - ES search (obsolete) — Splinter Review
Current version of the ES API in Socorro.
Attachment #535143 - Flags: review?(rhelmer)
Comment on attachment 535143 [details] [diff] [review]
Middleware API - ES search

Review of attachment 535143 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great overall, I think we need some changes before taking this (comments inline below).

::: socorro/webapi/apiInterface.py
@@ +1,1 @@
> +

interfaces feel pretty un-pythonic to me. It is one thing if it's a parent class that implements functionality that you are then extending/overriding, but I think in this case you could do away with the APIInterface and getImplInstance and just make the two classes happen to implement the same interface.

If they have any actual code in common, it should be factored out to a common class that both inherit of course.

::: socorro/webapi/elasticSearchAPI.py
@@ +11,5 @@
> +		"""
> +		Default constructor
> +		"""
> +		self.http = httpClient.HttpClient("node14.generic.metrics.sjc1.mozilla.com", "9200")
> +

put the above in a config.dist file and use config for hostname, port

@@ +58,5 @@
> +		to_date 	= kwargs.get("to", now)
> +		fields 		= kwargs.get("in", "_all")
> +		os 			= kwargs.get("os", "_all")
> +		version 	= kwargs.get("version", "_all")
> +		#branches 	= kwargs.get("branches", "_all")	# what am I supposed to search in?

We should discuss this, I am not sure if we store branches right now, I suspect we have a version=>branch mapping somewhere in Socorro rather than in the crash report but we can check.

::: socorro/webapi/postgreAPI.py
@@ +1,1 @@
> +import logging

I think postgresAPI.py would be a better name, postgres or postgresql is the usual name.

Also not sure this really needs to be a standalone web service that returns SQL; I think it'd be ok to just build and execute the SQL in services/search.py

@@ +57,5 @@
> +
> +		sql_query = "SELECT *"
> +		sql_query += "FROM reports"
> +		sql_query += "WHERE signature='" + terms + "'"
> +

We should not build SQL queries with string concatenation, especially if it contains user-generated content (even if not now, someone may unknowingly pass it in the future, assuming it's safe). We should use parameterized queries instead (we use the psyopg2 library for this elsewhere).

Also, shouldn't this be against top_crashes_by_signature table and not the reports table?
Attachment #535143 - Flags: review-
Attachment #535143 - Flags: review?(rhelmer)
About the interface: this is something that comes from my experience with C++, and as you know I'm pretty new to python. But I'll definitely remove it if it's the good way to do that!

About postgresAPI: this class is supposed to implement the exact same API than ES ( https://wiki.mozilla.org/Socorro/ElasticSearch_API ), but there is still nothing in there. It will in the future implement query, search and reports so people can choose whether they want to use ES or Postgres. So the search function is supposed to perform a search, and not a top crasher report, so I think it is the good thing to go against reports table. But please tell me if I'm wrong!

Thanks for the review, I'll report the changes here once they're done.
(In reply to comment #4)
> About postgresAPI: this class is supposed to implement the exact same API
> than ES ( https://wiki.mozilla.org/Socorro/ElasticSearch_API ), but there is
> still nothing in there. It will in the future implement query, search and
> reports so people can choose whether they want to use ES or Postgres. So the
> search function is supposed to perform a search, and not a top crasher
> report, so I think it is the good thing to go against reports table. But
> please tell me if I'm wrong!

Well if it's going to be a small number of fairly straightforward SQL queries, and we don't intend on adding many more search providers, I think there's value in just keeping things in one place in socorro.services.search and not putting up too much scaffolding. We can revisit this if we find ourselves adding more search providers, but I don't think that's anticipated anytime in the near future.

> Thanks for the review, I'll report the changes here once they're done.

Thank you!
I'm going add some unsolicited comments, too.

We need to get some meaningful names in here. "api.py" is like naming all your variables "data".  If this is about "search" then perhaps it should be called the "SearchAPI".  

It looks like the intent is to create a new hierarchy of implementations of a search api.  Since these files that define the api do not themselves implement web services, they should not live within the webapi package.  They should, perhaps, have their own package at the same level as the webapi package.

I'm confused in this search API how one would choose the alternate Postgresql implementation without having to actually change the source code in api.py.  If it is to be that only one or the other implementation is to be used, I would suggest that the ConfigurationManager be charged with the selection and eliminate api.py entirely.  I can provide further details on how to do that. 

I'm noticing some private methods for datetime conversions.  Since this sort of thing is useful elsewhere, I suggest that they be included in the socorro.lib.datetimeutil module.

The use of __ for method names within classes precludes future overriding through inheritance.  I suggest a single underscore for indication of private methods.  That signals future coders wishing to subclass to be careful, but doesn't artificially place limits on their freedom.

I also discourage the use of old-style python classes.  
  class Alpha:          # old style == bad
  class Alpha(object):  # new style == good

I see instances of boolean logic with "not" factored outward.
  not x == y                    # x != y
  not type(thing) is someType   # type(thing) is not someType 

list comprehensions and generator expressions are much more succinct ways of  list and string processing.  The following two functions do exactly the same thing:

def __array_to_string(self, array, separator, prefix = "", suffix = ""):
  string = ""
  first = True
  for term in array:
    if not first:
      string += separator
    string += prefix + term + suffix
    first = False
  return string

def _array_to_string (self, array, separator, prefix = "", suffix = ""):
    return separator.join("%s%s%s" % (prefix, x, suffix) for x in array)

I have more comments, but I'll refrain for a future iteration of this code.
An important comment regarding the HttpClient class:

The __del__ method is not the equivalent of a C++ destructor.  You cannot be guaranteed that __del__ will be called by the garbage collector in a timely manner; it won't just automatically happen when a instance goes out of scope.  In a long running app, you may end up with connections hanging around for far longer than you want.

For classes that need to hold and then release a resource, a modern Pythonic implementation would be to use a contextmanager.  see http://docs.python.org/library/contextlib.html for a simple decorator for doing this.  Reimplementing in this way would require some refactoring to move the http connection out of the constructor and into a more local context.  Use of the HttpClient class with contextmanager semantics would look like this:

with HttpClient(host, port) as h:
  h.post(...)

Alternatively, I suggest renaming the __del__ to simply "close" and then require the use of a try: finally: block.
(In reply to comment #5)
> Well if it's going to be a small number of fairly straightforward SQL
> queries, and we don't intend on adding many more search providers, I think
> there's value in just keeping things in one place in socorro.services.search
> and not putting up too much scaffolding. We can revisit this if we find
> ourselves adding more search providers, but I don't think that's anticipated
> anytime in the near future.

I don't think the search function will be straightforward in the Postgres version of the API, the current PHP implementation looks pretty complex. 

I think I didn't explain well enough what I am trying to achieve, so let we give an other try. I love schemata so here we go: http://adrian.gaudebert.fr/downloads/images/Socorro-middleware-API.png

So when /query, /search or /report is called, the classes Query, Search or Report catch this call. They then use getImplInstance to get the right class to use to perform the request. getImplInstance uses the configuration to know if we want to use ES or PostgeSQL (or whatever solution we have), and returns an instance of this class. 

Basically, elasticSearchAPI and postgresAPI are supposed to do the exact same thing, and implement the same interface. But one does it with ES and the other with Postgres. We want to do that so people can choose not to use ES if they don't want to. And this way would also make this API more time resilient: if for any reason we need to use something else than ES, we can simply implement this interface for the new system, and change the configuration. 

(In reply to comment #6)
> I'm confused in this search API how one would choose the alternate
> Postgresql implementation without having to actually change the source code
> in api.py.  If it is to be that only one or the other implementation is to
> be used, I would suggest that the ConfigurationManager be charged with the
> selection and eliminate api.py entirely.  I can provide further details on
> how to do that. 

The choice of the implementation to use should definitely be in the configuration, and it is not the case yet only because it was easier and faster for me to do things this way. I would like to hear more about the ConfigurationManager though, as I don't know what it does and how it works. 

> It looks like the intent is to create a new hierarchy of implementations of
> a search api.  Since these files that define the api do not themselves
> implement web services, they should not live within the webapi package. 
> They should, perhaps, have their own package at the same level as the webapi
> package.

I was wondering what would be the best way to integrate those new modules into the existing architecture. The Search, Query and Report classes are services in that they are the one directly called when an HTTP request is done. But they do not implement the logic... I don't really know what kind of semantic we should use here. I put the classes where they are actually because it seemed to me to be the best way to integrate with the existing without creating a new module, but I would like to have your suggestion on that point. 

> I'm going add some unsolicited comments, too.

Your comments were secretly solicited, and I thank you very much for your help! I will make sure to fix everything you listed.
To have the ConfigurationManager handle selecting and loading the search implementation class, I suggest the following:

1) add a new config option to the .../scripts/config/webapiconf.py.dist file in this form:
  searchImplClass = cm.Option()
  searchImplClass.doc = 'the name of the class for search'
  searchImplClass.default = 'socorro.search.ElasticSearch'
  searchImplClass.fromStringConverter = cm.classConverter

2) add new config options to the .../scripts/config/webapiconf.py.dist for the requirements of your ElasticSearch Class.

3) in your Search class replace this line:

       self.apiImpl = api.getImplInstance(configContext)

   with

       self.apiImpl = configContext.searchImplClass(configContext)

  on execution of this line, the ConfigurationManager will load the appropriate class dynamically and then return that class to you.  Then by invoking that class and passing in the the 'configContext' you are instantiating the dynamically loaded class.

4) remove the api.py from the project


This use of the ConfigurationManager in 1.7.x is a primitive version of dynamic loading that I've implemented for the Socorro 2.0.  The newer 2.0 version is even easier to use and will query the class after loading to figure out any additional configuration requirements needed by the class.
Summary: Implement ElasticSerarch API in middleware → Implement ElasticSearch API in middleware
Attached patch Middleware API - ES search .2 (obsolete) — Splinter Review
This is a work in progress. This patch integrates lots of code improvements, following the good advices of :lars and :rhelmer. 

The PostgresAPI class is not to be reviewed yet, nothing is done in here.
Attachment #535143 - Attachment is obsolete: true
Attachment #535752 - Flags: review?(rhelmer)
Attachment #535752 - Flags: review?(lars)
Target Milestone: --- → 2.0
Attached patch Middleware API - ES search .3 (obsolete) — Splinter Review
Ok, now this begins to look like something. The PostgreSQL class is all new and should be reviewed. The ElasticSearch had some improvements and features added, and some functions were factored into SearchAPI. 

Anurag, Daniel: please review the ES query I build in ElasticSearchAPI. I would like to know if there is a way to improve performances by building it differently. 

Thanks all for reviewing!
Attachment #535752 - Attachment is obsolete: true
Attachment #535752 - Flags: review?(rhelmer)
Attachment #535752 - Flags: review?(lars)
Attachment #537910 - Flags: review?(rhelmer)
Attachment #537910 - Flags: review?(lars)
Attachment #537910 - Flags: review?(deinspanjer)
Attachment #537910 - Flags: review?(aphadke)
Comment on attachment 537910 [details] [diff] [review]
Middleware API - ES search .3

Overall

* Unit tests don't exist – they are required for each method

Focusing on the PostgresAPI class

* constructor – there is no need to instantiate a database.  The base class already does this, you just have to use it.  self.database is always available for your use.
* the 'search' method is huge and monolithic.  Consider breaking it up into smaller chunks – it'll make unit tests much easier to write.
* comment says optional parameters for **kwargs in the 'search' method include 'terms', but the code implies that it is actually called 'for'.
* a variable called query_string is actually a dict.  Types as suffixes should match the actual type of the variable
* many instances of using + to concatenate strings.  This is very inefficient in Python.  Use join instead:
    sqlQuery = sqlSelect + sqlFrom + sqlWhere + sqlGroup + sqlOrder + sqlLimit
    sqlQuery = ' '.join( sqlSelect, sqlFrom, sqlWhere, sqlGroup, sqlOrder, sqlLimit)
* many instances of using  += for accumulating strings – also inefficient.  Consider using list of strings:
    sqlWhere.append('r.hangid IS NULL')
    then later, join them together with a boolean operator:
    ' AND '.join(sqlWhere)
* many places where variables are written as literals into sql statements – this technique is ripe for sql injection attacks.  Consider using positional bound variables instead.  
* database connections in postgres are by default transactional.  Any localized use of a connection should be bounded in a try:finally construct:
    conn = self.database.connection():
    try:
        cur = conn.cursor()
       # execute sql statements
       conn.commit()
   except Exception:
       conn.rollback()
       # report or recover from error
   finally:
      conn.close()
Attachment #537910 - Flags: review?(lars) → review-
Comment on attachment 537910 [details] [diff] [review]
Middleware API - ES search .3

Why use query string parsing for this instead of the query DSL?
:dre: Do you mean using an external library to generate the query, or splitting the "query_string" of my json query in smaller parts?

:lars: Thanks for your review, I'll integrate those changes and make unit tests as soon as I finish integrating the new API in the UI.
The implementation of ElasticSearch can be tested on khan: http://agaudebert.khan.mozilla.org/query

Note that:
* only the search uses ES ;
* the code is not fully optimized ;
* the ES instance is not as full of data as it will be in prod ;
* the code is not yet unit-tested.
Attached patch Middleware API - ES search .4 (obsolete) — Splinter Review
Last patch to review before the final one which will be landing tonight. 

What's new:
* Integration in the UI
* Refactoring of the middleware code
* Fixing bugs

This code is testable on khan: http://agaudebert.khan.mozilla.org/query (using ES)
Attachment #537910 - Attachment is obsolete: true
Attachment #537910 - Flags: review?(rhelmer)
Attachment #537910 - Flags: review?(deinspanjer)
Attachment #537910 - Flags: review?(aphadke)
Attachment #539639 - Flags: review?(rhelmer)
Attachment #539639 - Flags: review?(laura)
Attachment #539639 - Flags: review?(lars)
Attached patch Middleware API - ES search final (obsolete) — Splinter Review
Final patch for ES in Socorro 2.0.
Attachment #539639 - Attachment is obsolete: true
Attachment #539639 - Flags: review?(rhelmer)
Attachment #539639 - Flags: review?(laura)
Attachment #539639 - Flags: review?(lars)
Attachment #539691 - Flags: review?(rhelmer)
Attachment #539691 - Flags: review?(laura)
Attachment #539691 - Flags: review?(lars)
Comment on attachment 539691 [details] [diff] [review]
Middleware API - ES search final

I'm going to r+ to keep the project moving.  However, I still have some reservations. My issues are mainly stylistic rather than algorithmic.  

There's still some string catenation of the form "a + b + c" that ought to use format strings.  The PostgresAPI.search method, while partitioned and commented, is still too monolithic.  It ought to be broken up into a set of separately testable methods. When you get around to the unit tests for this code, there's going to be some nasty surprises because the method is so long and involved.

It is ironic that I be the one to say this, but we ought to adopt PEP8 coding standards for all new code in the project.  I know someone is going to quote me and I'll never hear the end of it...
Attachment #539691 - Flags: review?(lars) → review+
Attachment #539691 - Flags: review?(rhelmer) → review+
Comment on attachment 539691 [details] [diff] [review]
Middleware API - ES search final

After irc discussion, looks like we still need to figure out pagination.
Attachment #539691 - Flags: review+ → review-
* Integrating the new API in the middleware
* Support for both ElasticSearch and PostgreSQL
* Integration with the UI

What's new since last patch:
* Some bug fixed
* New way of retrieving data with ES
* Rewriting of some parts of the code, especially the PG search() method
Attachment #539691 - Attachment is obsolete: true
Attachment #539691 - Flags: review?(laura)
Attachment #539922 - Flags: review?(rhelmer)
Attachment #539922 - Flags: review?(laura)
Attachment #539922 - Flags: review?(lars)
Comment on attachment 539922 [details] [diff] [review]
ElasticSearch in the middleware - the return of the final patch

r+  once we get to unit tests, we ought to do some refactoring and optimization.  Otherwise, from my cursory inspection, it looks fine.  Send it to QA for acceptance testing.
Attachment #539922 - Flags: review?(lars) → review+
Attachment #539922 - Flags: review?(rhelmer) → review+
Patch created from an up-to-date repo, should be working now. Nothing new.
Attachment #539922 - Attachment is obsolete: true
Attachment #539922 - Flags: review?(laura)
Attachment #539933 - Flags: review?(rhelmer)
Attachment #539933 - Flags: review?(laura)
Attached patch SVN-born patchSplinter Review
:rhelmer can you try to apply this patch? It is generated from an up-to-date SVN repo. Hope this one will work.
Attachment #539953 - Flags: review?(rhelmer)
Attachment #539953 - Attachment is patch: true
Attachment #539953 - Attachment mime type: text/x-patch → text/plain
Attachment #539953 - Flags: review?(rhelmer) → review+
Comment on attachment 539953 [details] [diff] [review]
SVN-born patch

Landed this for Adrian:
Committed revision 3219.
sys.maxint is too big on a 64bits machine, ES cannot accept it. Now using a numeric value.
Attachment #540147 - Flags: review?(rhelmer)
Comment on attachment 540147 [details] [diff] [review]
Fixing maxint bug when on 64bits machines

Committed revision 3226.
Attachment #540147 - Flags: review?(rhelmer) → review+
Blocks: 651279
New middleware API is on stage. http://crash-stats.allizom.org/query is using ElasticSearch, http://crash-stats-dev.allizom.org/query is using PostgreSQL, both through the new API. Both versions have to be tested. 

It is normal that the results of searches are not the same with ES and with PG. 

Search with PostgreSQL should return the exact same results than last version. Search with ElasticSearch, apart from having incomplete data, does not implement branches nor search by plugin. This will come in 2.1. 

There is no unit-test for this code for the moment, should be coming soon. 

Please feel free to ask me any question you may have.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [qa?]
(In reply to comment #27)
> It is normal that the results of searches are not the same with ES and with
> PG. 

BTW filed bug 665225 to track fixing this :)
Attachment #539933 - Flags: review?(rhelmer)
Attachment #539933 - Flags: review?(laura)
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.