Closed
Bug 654567
Opened 14 years ago
Closed 13 years ago
Implement ElasticSearch API in middleware
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
2.0
People
(Reporter: adrian, Assigned: adrian)
References
Details
(Whiteboard: [qa?])
Attachments
(3 files, 6 obsolete files)
46.09 KB,
patch
|
Details | Diff | Splinter Review | |
45.52 KB,
patch
|
rhelmer
:
review+
|
Details | Diff | Splinter Review |
689 bytes,
patch
|
rhelmer
:
review+
|
Details | Diff | Splinter Review |
We should have a simple REST API in the middleware for querying ElasticSearch.
Assignee | ||
Updated•14 years ago
|
Assignee: nobody → agaudebert
Assignee | ||
Comment 1•14 years ago
|
||
The documentation for this API is in the wiki: https://wiki.mozilla.org/Socorro/ElasticSearch_API
A first "implementation plan" can be seen here: http://adrian.gaudebert.fr/downloads/images/SocorroMiddlewareAPIImplementation.png
Please make any comment or suggestion!
Assignee | ||
Comment 2•14 years ago
|
||
Current version of the ES API in Socorro.
Assignee | ||
Updated•14 years ago
|
Attachment #535143 -
Flags: review?(rhelmer)
Comment 3•14 years ago
|
||
Comment on attachment 535143 [details] [diff] [review]
Middleware API - ES search
Review of attachment 535143 [details] [diff] [review]:
-----------------------------------------------------------------
This looks great overall, I think we need some changes before taking this (comments inline below).
::: socorro/webapi/apiInterface.py
@@ +1,1 @@
> +
interfaces feel pretty un-pythonic to me. It is one thing if it's a parent class that implements functionality that you are then extending/overriding, but I think in this case you could do away with the APIInterface and getImplInstance and just make the two classes happen to implement the same interface.
If they have any actual code in common, it should be factored out to a common class that both inherit of course.
::: socorro/webapi/elasticSearchAPI.py
@@ +11,5 @@
> + """
> + Default constructor
> + """
> + self.http = httpClient.HttpClient("node14.generic.metrics.sjc1.mozilla.com", "9200")
> +
put the above in a config.dist file and use config for hostname, port
@@ +58,5 @@
> + to_date = kwargs.get("to", now)
> + fields = kwargs.get("in", "_all")
> + os = kwargs.get("os", "_all")
> + version = kwargs.get("version", "_all")
> + #branches = kwargs.get("branches", "_all") # what am I supposed to search in?
We should discuss this, I am not sure if we store branches right now, I suspect we have a version=>branch mapping somewhere in Socorro rather than in the crash report but we can check.
::: socorro/webapi/postgreAPI.py
@@ +1,1 @@
> +import logging
I think postgresAPI.py would be a better name, postgres or postgresql is the usual name.
Also not sure this really needs to be a standalone web service that returns SQL; I think it'd be ok to just build and execute the SQL in services/search.py
@@ +57,5 @@
> +
> + sql_query = "SELECT *"
> + sql_query += "FROM reports"
> + sql_query += "WHERE signature='" + terms + "'"
> +
We should not build SQL queries with string concatenation, especially if it contains user-generated content (even if not now, someone may unknowingly pass it in the future, assuming it's safe). We should use parameterized queries instead (we use the psyopg2 library for this elsewhere).
Also, shouldn't this be against top_crashes_by_signature table and not the reports table?
Attachment #535143 -
Flags: review-
Updated•14 years ago
|
Attachment #535143 -
Flags: review?(rhelmer)
Assignee | ||
Comment 4•14 years ago
|
||
About the interface: this is something that comes from my experience with C++, and as you know I'm pretty new to python. But I'll definitely remove it if it's the good way to do that!
About postgresAPI: this class is supposed to implement the exact same API than ES ( https://wiki.mozilla.org/Socorro/ElasticSearch_API ), but there is still nothing in there. It will in the future implement query, search and reports so people can choose whether they want to use ES or Postgres. So the search function is supposed to perform a search, and not a top crasher report, so I think it is the good thing to go against reports table. But please tell me if I'm wrong!
Thanks for the review, I'll report the changes here once they're done.
Comment 5•14 years ago
|
||
(In reply to comment #4)
> About postgresAPI: this class is supposed to implement the exact same API
> than ES ( https://wiki.mozilla.org/Socorro/ElasticSearch_API ), but there is
> still nothing in there. It will in the future implement query, search and
> reports so people can choose whether they want to use ES or Postgres. So the
> search function is supposed to perform a search, and not a top crasher
> report, so I think it is the good thing to go against reports table. But
> please tell me if I'm wrong!
Well if it's going to be a small number of fairly straightforward SQL queries, and we don't intend on adding many more search providers, I think there's value in just keeping things in one place in socorro.services.search and not putting up too much scaffolding. We can revisit this if we find ourselves adding more search providers, but I don't think that's anticipated anytime in the near future.
> Thanks for the review, I'll report the changes here once they're done.
Thank you!
Comment 6•14 years ago
|
||
I'm going add some unsolicited comments, too.
We need to get some meaningful names in here. "api.py" is like naming all your variables "data". If this is about "search" then perhaps it should be called the "SearchAPI".
It looks like the intent is to create a new hierarchy of implementations of a search api. Since these files that define the api do not themselves implement web services, they should not live within the webapi package. They should, perhaps, have their own package at the same level as the webapi package.
I'm confused in this search API how one would choose the alternate Postgresql implementation without having to actually change the source code in api.py. If it is to be that only one or the other implementation is to be used, I would suggest that the ConfigurationManager be charged with the selection and eliminate api.py entirely. I can provide further details on how to do that.
I'm noticing some private methods for datetime conversions. Since this sort of thing is useful elsewhere, I suggest that they be included in the socorro.lib.datetimeutil module.
The use of __ for method names within classes precludes future overriding through inheritance. I suggest a single underscore for indication of private methods. That signals future coders wishing to subclass to be careful, but doesn't artificially place limits on their freedom.
I also discourage the use of old-style python classes.
class Alpha: # old style == bad
class Alpha(object): # new style == good
I see instances of boolean logic with "not" factored outward.
not x == y # x != y
not type(thing) is someType # type(thing) is not someType
list comprehensions and generator expressions are much more succinct ways of list and string processing. The following two functions do exactly the same thing:
def __array_to_string(self, array, separator, prefix = "", suffix = ""):
string = ""
first = True
for term in array:
if not first:
string += separator
string += prefix + term + suffix
first = False
return string
def _array_to_string (self, array, separator, prefix = "", suffix = ""):
return separator.join("%s%s%s" % (prefix, x, suffix) for x in array)
I have more comments, but I'll refrain for a future iteration of this code.
Comment 7•14 years ago
|
||
An important comment regarding the HttpClient class:
The __del__ method is not the equivalent of a C++ destructor. You cannot be guaranteed that __del__ will be called by the garbage collector in a timely manner; it won't just automatically happen when a instance goes out of scope. In a long running app, you may end up with connections hanging around for far longer than you want.
For classes that need to hold and then release a resource, a modern Pythonic implementation would be to use a contextmanager. see http://docs.python.org/library/contextlib.html for a simple decorator for doing this. Reimplementing in this way would require some refactoring to move the http connection out of the constructor and into a more local context. Use of the HttpClient class with contextmanager semantics would look like this:
with HttpClient(host, port) as h:
h.post(...)
Alternatively, I suggest renaming the __del__ to simply "close" and then require the use of a try: finally: block.
Assignee | ||
Comment 8•14 years ago
|
||
(In reply to comment #5)
> Well if it's going to be a small number of fairly straightforward SQL
> queries, and we don't intend on adding many more search providers, I think
> there's value in just keeping things in one place in socorro.services.search
> and not putting up too much scaffolding. We can revisit this if we find
> ourselves adding more search providers, but I don't think that's anticipated
> anytime in the near future.
I don't think the search function will be straightforward in the Postgres version of the API, the current PHP implementation looks pretty complex.
I think I didn't explain well enough what I am trying to achieve, so let we give an other try. I love schemata so here we go: http://adrian.gaudebert.fr/downloads/images/Socorro-middleware-API.png
So when /query, /search or /report is called, the classes Query, Search or Report catch this call. They then use getImplInstance to get the right class to use to perform the request. getImplInstance uses the configuration to know if we want to use ES or PostgeSQL (or whatever solution we have), and returns an instance of this class.
Basically, elasticSearchAPI and postgresAPI are supposed to do the exact same thing, and implement the same interface. But one does it with ES and the other with Postgres. We want to do that so people can choose not to use ES if they don't want to. And this way would also make this API more time resilient: if for any reason we need to use something else than ES, we can simply implement this interface for the new system, and change the configuration.
(In reply to comment #6)
> I'm confused in this search API how one would choose the alternate
> Postgresql implementation without having to actually change the source code
> in api.py. If it is to be that only one or the other implementation is to
> be used, I would suggest that the ConfigurationManager be charged with the
> selection and eliminate api.py entirely. I can provide further details on
> how to do that.
The choice of the implementation to use should definitely be in the configuration, and it is not the case yet only because it was easier and faster for me to do things this way. I would like to hear more about the ConfigurationManager though, as I don't know what it does and how it works.
> It looks like the intent is to create a new hierarchy of implementations of
> a search api. Since these files that define the api do not themselves
> implement web services, they should not live within the webapi package.
> They should, perhaps, have their own package at the same level as the webapi
> package.
I was wondering what would be the best way to integrate those new modules into the existing architecture. The Search, Query and Report classes are services in that they are the one directly called when an HTTP request is done. But they do not implement the logic... I don't really know what kind of semantic we should use here. I put the classes where they are actually because it seemed to me to be the best way to integrate with the existing without creating a new module, but I would like to have your suggestion on that point.
> I'm going add some unsolicited comments, too.
Your comments were secretly solicited, and I thank you very much for your help! I will make sure to fix everything you listed.
Comment 9•14 years ago
|
||
To have the ConfigurationManager handle selecting and loading the search implementation class, I suggest the following:
1) add a new config option to the .../scripts/config/webapiconf.py.dist file in this form:
searchImplClass = cm.Option()
searchImplClass.doc = 'the name of the class for search'
searchImplClass.default = 'socorro.search.ElasticSearch'
searchImplClass.fromStringConverter = cm.classConverter
2) add new config options to the .../scripts/config/webapiconf.py.dist for the requirements of your ElasticSearch Class.
3) in your Search class replace this line:
self.apiImpl = api.getImplInstance(configContext)
with
self.apiImpl = configContext.searchImplClass(configContext)
on execution of this line, the ConfigurationManager will load the appropriate class dynamically and then return that class to you. Then by invoking that class and passing in the the 'configContext' you are instantiating the dynamically loaded class.
4) remove the api.py from the project
This use of the ConfigurationManager in 1.7.x is a primitive version of dynamic loading that I've implemented for the Socorro 2.0. The newer 2.0 version is even easier to use and will query the class after loading to figure out any additional configuration requirements needed by the class.
Updated•14 years ago
|
Summary: Implement ElasticSerarch API in middleware → Implement ElasticSearch API in middleware
Assignee | ||
Comment 10•14 years ago
|
||
This is a work in progress. This patch integrates lots of code improvements, following the good advices of :lars and :rhelmer.
The PostgresAPI class is not to be reviewed yet, nothing is done in here.
Attachment #535143 -
Attachment is obsolete: true
Attachment #535752 -
Flags: review?(rhelmer)
Attachment #535752 -
Flags: review?(lars)
Updated•14 years ago
|
Target Milestone: --- → 2.0
Assignee | ||
Comment 11•14 years ago
|
||
Ok, now this begins to look like something. The PostgreSQL class is all new and should be reviewed. The ElasticSearch had some improvements and features added, and some functions were factored into SearchAPI.
Anurag, Daniel: please review the ES query I build in ElasticSearchAPI. I would like to know if there is a way to improve performances by building it differently.
Thanks all for reviewing!
Attachment #535752 -
Attachment is obsolete: true
Attachment #535752 -
Flags: review?(rhelmer)
Attachment #535752 -
Flags: review?(lars)
Attachment #537910 -
Flags: review?(rhelmer)
Attachment #537910 -
Flags: review?(lars)
Attachment #537910 -
Flags: review?(deinspanjer)
Attachment #537910 -
Flags: review?(aphadke)
Comment 12•14 years ago
|
||
Comment on attachment 537910 [details] [diff] [review]
Middleware API - ES search .3
Overall
* Unit tests don't exist – they are required for each method
Focusing on the PostgresAPI class
* constructor – there is no need to instantiate a database. The base class already does this, you just have to use it. self.database is always available for your use.
* the 'search' method is huge and monolithic. Consider breaking it up into smaller chunks – it'll make unit tests much easier to write.
* comment says optional parameters for **kwargs in the 'search' method include 'terms', but the code implies that it is actually called 'for'.
* a variable called query_string is actually a dict. Types as suffixes should match the actual type of the variable
* many instances of using + to concatenate strings. This is very inefficient in Python. Use join instead:
sqlQuery = sqlSelect + sqlFrom + sqlWhere + sqlGroup + sqlOrder + sqlLimit
sqlQuery = ' '.join( sqlSelect, sqlFrom, sqlWhere, sqlGroup, sqlOrder, sqlLimit)
* many instances of using += for accumulating strings – also inefficient. Consider using list of strings:
sqlWhere.append('r.hangid IS NULL')
then later, join them together with a boolean operator:
' AND '.join(sqlWhere)
* many places where variables are written as literals into sql statements – this technique is ripe for sql injection attacks. Consider using positional bound variables instead.
* database connections in postgres are by default transactional. Any localized use of a connection should be bounded in a try:finally construct:
conn = self.database.connection():
try:
cur = conn.cursor()
# execute sql statements
conn.commit()
except Exception:
conn.rollback()
# report or recover from error
finally:
conn.close()
Attachment #537910 -
Flags: review?(lars) → review-
Comment 13•13 years ago
|
||
Comment on attachment 537910 [details] [diff] [review]
Middleware API - ES search .3
Why use query string parsing for this instead of the query DSL?
Assignee | ||
Comment 14•13 years ago
|
||
:dre: Do you mean using an external library to generate the query, or splitting the "query_string" of my json query in smaller parts?
:lars: Thanks for your review, I'll integrate those changes and make unit tests as soon as I finish integrating the new API in the UI.
Assignee | ||
Comment 15•13 years ago
|
||
The implementation of ElasticSearch can be tested on khan: http://agaudebert.khan.mozilla.org/query
Note that:
* only the search uses ES ;
* the code is not fully optimized ;
* the ES instance is not as full of data as it will be in prod ;
* the code is not yet unit-tested.
Assignee | ||
Comment 16•13 years ago
|
||
Last patch to review before the final one which will be landing tonight.
What's new:
* Integration in the UI
* Refactoring of the middleware code
* Fixing bugs
This code is testable on khan: http://agaudebert.khan.mozilla.org/query (using ES)
Attachment #537910 -
Attachment is obsolete: true
Attachment #537910 -
Flags: review?(rhelmer)
Attachment #537910 -
Flags: review?(deinspanjer)
Attachment #537910 -
Flags: review?(aphadke)
Attachment #539639 -
Flags: review?(rhelmer)
Attachment #539639 -
Flags: review?(laura)
Attachment #539639 -
Flags: review?(lars)
Assignee | ||
Comment 17•13 years ago
|
||
Final patch for ES in Socorro 2.0.
Attachment #539639 -
Attachment is obsolete: true
Attachment #539639 -
Flags: review?(rhelmer)
Attachment #539639 -
Flags: review?(laura)
Attachment #539639 -
Flags: review?(lars)
Attachment #539691 -
Flags: review?(rhelmer)
Attachment #539691 -
Flags: review?(laura)
Attachment #539691 -
Flags: review?(lars)
Comment 18•13 years ago
|
||
Comment on attachment 539691 [details] [diff] [review]
Middleware API - ES search final
I'm going to r+ to keep the project moving. However, I still have some reservations. My issues are mainly stylistic rather than algorithmic.
There's still some string catenation of the form "a + b + c" that ought to use format strings. The PostgresAPI.search method, while partitioned and commented, is still too monolithic. It ought to be broken up into a set of separately testable methods. When you get around to the unit tests for this code, there's going to be some nasty surprises because the method is so long and involved.
It is ironic that I be the one to say this, but we ought to adopt PEP8 coding standards for all new code in the project. I know someone is going to quote me and I'll never hear the end of it...
Attachment #539691 -
Flags: review?(lars) → review+
Updated•13 years ago
|
Attachment #539691 -
Flags: review?(rhelmer) → review+
Comment 19•13 years ago
|
||
Comment on attachment 539691 [details] [diff] [review]
Middleware API - ES search final
After irc discussion, looks like we still need to figure out pagination.
Attachment #539691 -
Flags: review+ → review-
Assignee | ||
Comment 20•13 years ago
|
||
* Integrating the new API in the middleware
* Support for both ElasticSearch and PostgreSQL
* Integration with the UI
What's new since last patch:
* Some bug fixed
* New way of retrieving data with ES
* Rewriting of some parts of the code, especially the PG search() method
Attachment #539691 -
Attachment is obsolete: true
Attachment #539691 -
Flags: review?(laura)
Attachment #539922 -
Flags: review?(rhelmer)
Attachment #539922 -
Flags: review?(laura)
Attachment #539922 -
Flags: review?(lars)
Comment 21•13 years ago
|
||
Comment on attachment 539922 [details] [diff] [review]
ElasticSearch in the middleware - the return of the final patch
r+ once we get to unit tests, we ought to do some refactoring and optimization. Otherwise, from my cursory inspection, it looks fine. Send it to QA for acceptance testing.
Attachment #539922 -
Flags: review?(lars) → review+
Updated•13 years ago
|
Attachment #539922 -
Flags: review?(rhelmer) → review+
Assignee | ||
Comment 22•13 years ago
|
||
Patch created from an up-to-date repo, should be working now. Nothing new.
Attachment #539922 -
Attachment is obsolete: true
Attachment #539922 -
Flags: review?(laura)
Attachment #539933 -
Flags: review?(rhelmer)
Attachment #539933 -
Flags: review?(laura)
Assignee | ||
Comment 23•13 years ago
|
||
:rhelmer can you try to apply this patch? It is generated from an up-to-date SVN repo. Hope this one will work.
Attachment #539953 -
Flags: review?(rhelmer)
Updated•13 years ago
|
Attachment #539953 -
Attachment is patch: true
Attachment #539953 -
Attachment mime type: text/x-patch → text/plain
Attachment #539953 -
Flags: review?(rhelmer) → review+
Comment 24•13 years ago
|
||
Comment on attachment 539953 [details] [diff] [review]
SVN-born patch
Landed this for Adrian:
Committed revision 3219.
Assignee | ||
Comment 25•13 years ago
|
||
sys.maxint is too big on a 64bits machine, ES cannot accept it. Now using a numeric value.
Attachment #540147 -
Flags: review?(rhelmer)
Comment 26•13 years ago
|
||
Comment on attachment 540147 [details] [diff] [review]
Fixing maxint bug when on 64bits machines
Committed revision 3226.
Attachment #540147 -
Flags: review?(rhelmer) → review+
Assignee | ||
Comment 27•13 years ago
|
||
New middleware API is on stage. http://crash-stats.allizom.org/query is using ElasticSearch, http://crash-stats-dev.allizom.org/query is using PostgreSQL, both through the new API. Both versions have to be tested.
It is normal that the results of searches are not the same with ES and with PG.
Search with PostgreSQL should return the exact same results than last version. Search with ElasticSearch, apart from having incomplete data, does not implement branches nor search by plugin. This will come in 2.1.
There is no unit-test for this code for the moment, should be coming soon.
Please feel free to ask me any question you may have.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [qa?]
Comment 28•13 years ago
|
||
(In reply to comment #27)
> It is normal that the results of searches are not the same with ES and with
> PG.
BTW filed bug 665225 to track fixing this :)
Updated•13 years ago
|
Attachment #539933 -
Flags: review?(rhelmer)
Assignee | ||
Updated•13 years ago
|
Attachment #539933 -
Flags: review?(laura)
Updated•13 years ago
|
Component: Socorro → General
Product: Webtools → Socorro
You need to log in
before you can comment on or make changes to this bug.
Description
•