Closed Bug 1814312 Opened 1 year ago Closed 2 months ago

Optimise the run time of a SELECT query

Categories

(Tree Management :: Treeherder, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wezhou, Unassigned)

References

(Blocks 1 open bug)

Details

In the production environment, we found a SELECT query that runs frequently (more than 75K times in the past 6 hours) but slowly. On average the query takes 5+ seconds to finish, and it is the major reason the database sat at close to 100% CPU usage during today's incident.

One sample of the query, taken from the slow query log, looks like the following:

# Time: 2023-01-31T17:29:45.869793Z
# User@Host: treeherder_prod_admin[treeherder_prod_admin] @  [10.4.4.254]  thread_id: 1914777  server_id: 1866177960
# Query_time: 5.539673  Lock_time: 0.000257 Rows_sent: 0  Rows_examined: 23207
SET timestamp=1675186185;
SELECT id, summary, crash_signature, keywords, resolution, status, dupe_of,
             MATCH (`summary`) AGAINST ('127.0.0.1     [31/Jan/2023 17:14:35]  ‘\0ý\' €TFX' IN BOOLEAN MODE) AS relevance
              FROM bugscache
             WHERE 1
               AND `summary` LIKE CONCAT ('%%', '127.0.0.1 - - [31/Jan/2023 17:14:35] \"‘\0ý\'~€TFX', '%%') ESCAPE '='
          ORDER BY relevance DESC
             LIMIT 0,50;

In Google's Query Insights tool, it looks like the most time-consuming part of the query is the ORDER BY clause.
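
One way to double-check that independently of Query Insights is to run EXPLAIN on a representative query and look for "Using filesort" in the Extra column. A minimal sketch, assuming Django's database connection is available; the search term is a placeholder, not real data:

# Minimal EXPLAIN check; the search term is a placeholder, not real data.
from django.db import connection

SEARCH_TERM = "example failure line"

EXPLAIN_SQL = """
    EXPLAIN
    SELECT id,
           MATCH (`summary`) AGAINST (%s IN BOOLEAN MODE) AS relevance
      FROM bugscache
     WHERE `summary` LIKE CONCAT('%%', %s, '%%')
     ORDER BY relevance DESC
     LIMIT 0, 50
"""

with connection.cursor() as cursor:
    cursor.execute(EXPLAIN_SQL, [SEARCH_TERM, SEARCH_TERM])
    for row in cursor.fetchall():
        # On MySQL 5.7, "Using filesort" in the Extra column means the
        # ORDER BY sorts every row that survives the LIKE filter before
        # the LIMIT is applied.
        print(row)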

The purpose of this ticket is to see whether this frequent query can be optimized in any way, so that it doesn't become the bottleneck for DB performance.

A few thoughts here, based on some discussions at our work week:

  1. We could remove the ORDER BY and LIMIT and let Python do the sorting and limiting; some Redash queries show we typically return only a few matches (see the sketch after this list).
    1a) Alternatively, keep a LIMIT of 100 and remove only the ORDER BY.
  2. Reduce the number of calls to this query. If a few checks in Python can determine that a failure line refers to a single tracking bug, and should therefore have only a single match, we can run a cheaper query instead. Initial numbers suggest this covers about 2/3 of the cases, which might be enough to reduce our load.
  3. We could add some filters to the search terms and ignore input that doesn't match our top 10 patterns. This introduces a risk that we might end up not showing a relevant bug, but at that point the error message is not actionable without more context, and a previous message probably has the correct information.
  4. Find database engineers at Mozilla and have them review our query, our indexing, and our SQL configuration.
  5. Consider upgrading MySQL if there is a newer version that meets our needs (what are our needs?). We currently run 5.7.40.
  6. Use Elasticsearch for the text-heavy searching, i.e. the bugscache table would live in Elasticsearch rather than MySQL.
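
A rough sketch of what options 1 (with the LIMIT from 1a) and 2 could look like together. This assumes raw cursor access through Django's connection; the heuristic helper and the exact column list are illustrative assumptions, not the actual Treeherder code.

# Hedged sketch of options 1 and 2; names here are assumptions.
from django.db import connection

def looks_like_single_tracking_bug(term):
    # Hypothetical heuristic: the real checks would look for the handful
    # of failure-line patterns that, per option 2, cover ~2/3 of calls.
    return False

def search_bugs(search_term):
    # Option 2: for a single-tracking-bug line we expect exactly one
    # match, so a LIKE with LIMIT 1 and no MATCH/ORDER BY is much cheaper.
    if looks_like_single_tracking_bug(search_term):
        sql = """
            SELECT id, summary, resolution, status, dupe_of
              FROM bugscache
             WHERE `summary` LIKE CONCAT('%%', %s, '%%')
             LIMIT 1
        """
        with connection.cursor() as cursor:
            cursor.execute(sql, [search_term])
            return cursor.fetchall()

    # Option 1/1a: keep a LIMIT of 100 but drop the ORDER BY, so MySQL no
    # longer has to filesort every candidate row; Python then sorts the
    # small result set by relevance and truncates to 50.
    sql = """
        SELECT id, summary, resolution, status, dupe_of,
               MATCH (`summary`) AGAINST (%s IN BOOLEAN MODE) AS relevance
          FROM bugscache
         WHERE `summary` LIKE CONCAT('%%', %s, '%%')
         LIMIT 100
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, [search_term, search_term])
        rows = cursor.fetchall()
    # relevance is the last selected column; sort descending in Python.
    return sorted(rows, key=lambda row: row[-1], reverse=True)[:50]

Note the semantic difference in the second branch: without ORDER BY, the LIMIT 100 returns an arbitrary 100 candidates, so the Python sort only ranks those. Per the Redash numbers above we rarely have more than a few matches, so in practice this should not change the results.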

> Consider upgrading MySQL if there is a newer version that meets our needs (what are our needs?). We currently run 5.7.40.

We should definitely consider doing that. v5.7 will be EOL by October 2023 anyway. (BTW, the production database is on 5.7.39.)

> Use Elasticsearch for the text-heavy searching, i.e. the bugscache table would live in Elasticsearch rather than MySQL.

That might be the way to go if the query optimization doesn't pan out. I know AMO (a Django app) also has an Elasticsearch component.
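
For reference, a rough sketch of what the Elasticsearch route could look like with the elasticsearch-py client (8.x-style API). The index name, mapping, and URL are assumptions, not an existing setup:

# Hedged sketch: assumes a "bugscache" index already populated with summaries.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed dev instance

def search_bugs_es(search_term):
    # A match query gives full-text relevance scoring out of the box,
    # replacing both the MATCH ... AGAINST and the LIKE scan; hits come
    # back sorted by _score, so no ORDER BY equivalent is needed.
    response = es.search(
        index="bugscache",
        query={"match": {"summary": search_term}},
        size=50,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]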

A quick update here: all of the calls we make to this query are for each entry in text_log_error. I have since reduced the volume of incoming data by 50%:
https://sql.telemetry.mozilla.org/queries/90375/source#223688

I believe it is realistic to reduce it by a further 25-40% (target of 60K lines/weekday; see bug 1816144).

I am going to mark this as resolved, as we have moved further along the path of replacing MySQL with Postgres.

Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED