Closed Bug 269025 Opened 20 years ago Closed 17 years ago

Database should be optimized further for performance

Categories

(Webtools :: Bouncer, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: morgamic, Assigned: morgamic)

Details

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20041102 Firefox/1.0RC1 (Debian package 0.99+1.0RC1-3)
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20041102 Firefox/1.0RC1 (Debian package 0.99+1.0RC1-3)

Generating statistics, given the size of the log table will become increasingly
taxing as time progresses.  Optimization would help with this.  Proper indexing
and possibly even a shadow DB should be considered. This should probably be
discussed as a group so we come to a general consensus on what approach to take.

Reproducible: Always
Steps to Reproduce:
Status: UNCONFIRMED → NEW
Ever confirmed: true
Tonight I updated the live app to increment count fields rather than do an
INSERT on an auto_increment row.  This seems to solve the problem of having
millions of rows in one table.

Admins can see hit stats quickly now from the "mirrors" and "products" pages. 
It has become more and more apparent that using MySQL for logging is just a bad
idea in general.

Thus, Scott and I agreed that we should utilize our apache transfer logs for
long-term reporting.  The data is all there, we will have to develop some
scripts to parse GET strings in order to periodically generate statistical
distributions.  But really, it's all there for a rainy day.

On the DB side of things, we may have increased load on the DB server and Apache
some by removing an INSERT and replacing it with two simple UPDATE's.  If it
becomes a problem, we can always disable flip LOGGING in cfg/config.php.

Aside from this change, I'd like to leave the current branch alone and focus on
v2.0.  Whether or not this eliminates the DB's optimization issues regarding
logging, I am not 100% sure - but it certainly makes them less of a concern if
we aren't doing a SELECT on a table with 2 million rows.
Status: NEW → ASSIGNED
(In reply to comment #1)
> Tonight I updated the live app to increment count fields rather than do an
> INSERT on an auto_increment row.  This seems to solve the problem of having
> millions of rows in one table.

Unfortunately this change creates other serious problems.

> Admins can see hit stats quickly now from the "mirrors" and "products" pages. 
> It has become more and more apparent that using MySQL for logging is just
> a bad idea in general.

Why?  MySQL has been successfully used for such applications before.  In fact,
mozilla.org's existing download stats (http://download-stats.mozilla.org/) get
generated from a MySQL database currently containing 129,523,401 entries.

It's just necessary to get the schema right and properly optimized.

> Thus, Scott and I agreed that we should utilize our apache transfer logs for
> long-term reporting.  The data is all there, we will have to develop some
> scripts to parse GET strings in order to periodically generate statistical
> distributions.  But really, it's all there for a rainy day.

Even when we were only directing users to the primary mirror network, a
significant portion of traffic was uncounted because our mirrors didn't provide
us with their logs.  We had to extrapolate based on partial data and make
difficult guesses about what portion of the data we're missing.

Now that we're also directing users to secondary mirrors, an even greater
portion of mirrors won't be providing us data, so even more data will be missing
and extrapolation will be even harder.

> Whether or not this eliminates the DB's optimization issues regarding
> logging, I am not 100% sure - but it certainly makes them less of a concern
> if we aren't doing a SELECT on a table with 2 million rows.

I think this change represents a significant loss in functionality for the
download tool and our stats in general.  It means we will no longer be able to
easily get date-based download figures and will have much less information than
before about total downloads.  I would really like to have us instead optimize
the schema and queries so that generating routine data from the table takes
little time no matter how many records it contains.
Thanks for your comments, Myk.  I will see what I can do as far as optimizing
the schema/queries.  I have a good set of test data to play with.

By creating an index for the timestamps on logs, periodic reports should be
quick (don't have to traverse it all if it is ordered).  Lifetime reports would
still take _some_ time (<10 seconds) though.

Overall I agree that dropping dates was a bad long-term decision and I will try
to address that as soon as possible.
Thanks Mike.  I also have a test DB set up on my own system and can help you
with the optimization.
I updated myself on indexing to make sure I had this right.  I concluded that
things look pretty solid.  I plan to work with OSL sysadmins to tweak indexing
under direct testing using AB.  Using AB I will be able to spot SNAFUs with the
indexing.  We will see.

I will update this bug once adequate staging has taken place.

Overall, the basic solution for logging that we have come up with is a hybrid -
1) updating incremental counts as before and 2) throwing everything into the
_log table.

The latter is made more feasible by not having the counts be generated for the
entire lifetime of Bouncer, but rather by a limited date range on the stats page
-- which is the predicted most common use anyway.  By limiting the range and
properly indexing the DATETIME we should allow for snappy reporting despite the
length of the table.

I am finishing up the daterange filter along with the other filters this week. 
More to come.
QA Contact: kveton → bouncer
Indexes are fine -- this was fixed about 2 years ago :)
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.