Closed Bug 564540 Opened 14 years ago Closed 14 years ago

Make our fancy new CSVs and JSON faster

Categories

(addons.mozilla.org Graveyard :: Code Quality, defect, P5)


Tracking

(Not tracked)

RESOLVED FIXED
5.12.3

People

(Reporter: clouserw, Assigned: ianbicking)

References


Details

(Keywords: perf)

We have a pile of new CSVs and JSON, and from my click-and-wait testing strategy I think they're all being built on the fly with minimal caching afterwards. Now that we have more infra options we should be building these smarter:

1) All CSVs and JSON are sent to celery to be built on the first request. HTTP 202 is returned to the user until the output is ready.

2) When celery finishes building the output it is cached to disk in a shared location with a structure of /stats/$addonid/$name.$type. The shared space is the place remora currently calls NETAPP_STORAGE. Example: NETAPP_STORAGE/stats/1865/os-week-20100101-20100601.json

3) When a file is used from the disk cache it should also be touched. We'll run a GC cron job that removes all files that haven't been touched in a month.
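Rough sketch of what I have in mind for the request/task flow (hypothetical names, not actual zamboni code; assumes celery and the shared NETAPP_STORAGE path):

import os
from django.conf import settings
from django.http import HttpResponse
from celery.decorators import task  # celery import style may differ by version

def stats_export(request, addon_id, name, fmt):
    path = os.path.join(settings.NETAPP_STORAGE, 'stats',
                        str(addon_id), '%s.%s' % (name, fmt))
    if os.path.exists(path):
        os.utime(path, None)  # touch so the GC cron keeps it around
        content_type = 'text/csv' if fmt == 'csv' else 'application/json'
        return HttpResponse(open(path).read(), content_type=content_type)
    build_stats_export.delay(addon_id, name, fmt)
    return HttpResponse(status=202)  # come back later

@task
def build_stats_export(addon_id, name, fmt):
    # run the query, serialize to CSV/JSON, write the result to `path`
    pass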
Blocks: 564541
Target Milestone: 5.11 → 5.12
Random notes: We've decided to drop the "-week-" part of this, which should make our calculations and caching easier. All exports will be daily, and people can extrapolate week/month from that. The current stats models/views/urls should be modified for this. The celery jobs will need locking so they don't duplicate work.
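For the locking, one common pattern is a cache-based mutex around the task body (a sketch, assuming memcached is shared by all celery workers; names are illustrative):

from django.core.cache import cache
from celery.decorators import task

LOCK_EXPIRE = 60 * 10  # don't hold a stale lock forever

@task
def build_export(addon_id, name, fmt):
    lock_key = 'build-export-%s-%s-%s' % (addon_id, name, fmt)
    # cache.add is atomic on memcached: only one worker wins the lock.
    if not cache.add(lock_key, 'locked', LOCK_EXPIRE):
        return  # someone else is already building this export
    try:
        pass  # do the actual work here
    finally:
        cache.delete(lock_key)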
Assignee: nobody → ianb
Target Milestone: 5.12 → 5.11.3
Target Milestone: 5.11.3 → 5.11.4
Target Milestone: 5.11.4 → 5.12
I have been trying to figure out where the performance problems in this code come from. The query is quite fast and well indexed, and shows only modest time increases as the amount of data goes up. Running locally with a complete data set, I can't reproduce any timing problem.

So I tried looking more closely on a production box, specifically pm-app-amodev01, adding bits of code to see where the time goes. So far I haven't figured it out. For instance, a request for downloads can take 1.5 seconds of wall-clock time (it's not entirely consistent). This hits zamboni.apps.stats.views.downloads_series. I looked into that code, expecting to find something slow, but the function takes 0.02 seconds to complete. I looked at TimingMiddleware, to see if something it was tracking might be slower, but it gave basically the same number. I tried disabling all the middleware (except amo.middleware.LocaleAndAppURLMiddleware, which is required) and it didn't speed up. LocaleAndAppURLMiddleware doesn't appear very complicated; the only thing that wasn't clear in that code was tower.activate(), and commenting that out made no change. To be extra sure, I put code in zamboni.wsgi to check the time, and there it was only 0.2 seconds.

So... I really have no idea where in the stack this slowness is happening. It's not in Python anywhere; at least, not in a way I can reproduce. I don't know much about what lies between mod_wsgi and the completion of an HTTP request in this configuration, so I don't know where else it could be slow.
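The timing check in zamboni.wsgi was along these lines (a sketch, not the committed code):

import sys
import time

def timing_app(application):
    """Wrap a WSGI app and report how long each call takes."""
    def wrapper(environ, start_response):
        start = time.time()
        try:
            return application(environ, start_response)
        finally:
            # Note: this only measures until the app hands back its iterable;
            # anything mod_wsgi does after that (sending the body, flushing)
            # isn't counted, which is exactly the gap I can't see into.
            sys.stderr.write('%s took %.3fs\n'
                             % (environ.get('PATH_INFO'), time.time() - start))
    return wrapper

# application = timing_app(application)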
ianb: Is there anything else besides http://github.com/jbalogh/zamboni/commit/a84f18d6ca? Can we close this?
Since the URL from comment 4 still times out, I suspect we have memory/object problems. If these queries are creating db.Model objects, that's going to spend a ton of time building objects we probably don't need. And then all those objects are going to get pickled and cached. Caching will fail anyway if we go over memcached's 1MB limit. Using .values or .values_list will probably help. http://github.com/jbalogh/zamboni/blob/master/apps/stats/views.py#L268 doesn't help, but JSON times out too, so that's not the only culprit.
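The kind of change I mean by .values/.values_list, sketched with illustrative field names (the columns are from the download_counts table):

# Builds a full model object per row (slow, memory-heavy, and the whole
# result list gets pickled into memcached):
rows = DownloadCount.objects.filter(addon=addon_id)

# Pulls only the columns the CSV/JSON output needs, as plain tuples:
rows = (DownloadCount.objects.filter(addon=addon_id)
        .values_list('date', 'count', 'src'))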
Serving up even quite large content should be fine, and the CSV writer and JSON writer are really quite fast (CSV is faster than JSON in my tests, but the query happens in slightly different places in those two codepaths, which obscures that). So if it's timing out it's probably elsewhere. Obviously the query itself could time out, which we can probably determine simply by running that query manually and seeing how long it takes. If it's the cache system, we could avoid the ORM entirely and run SQL directly.
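For context, a rough way to sanity-check the serializer speeds locally (a standalone sketch with made-up data, not the actual test harness):

import csv, json, time
from cStringIO import StringIO

rows = [(i, 'firefox', i * 3) for i in xrange(100000)]

start = time.time()
out = StringIO()
csv.writer(out).writerows(rows)
print 'csv:  %.3fs' % (time.time() - start)

start = time.time()
json.dumps(rows)
print 'json: %.3fs' % (time.time() - start)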
ianb is on vacation next week. bumping
Target Milestone: 5.11.8 → 5.12
Priority: -- → P5
Local performance testing indicates this commit brings the requests into an acceptable time limit: http://github.com/ianb/zamboni/commit/d1795ec224f6b3a481bff005d753c19227d340d2 (changes database data from phpserialize to JSON)

Performance increases require a data migration, which can be performed with:

python manage.py convert_stats_to_json

It can't run in SQL, and will likely take a very long time, but can be applied incrementally and will pick up where it left off.

We need to check if the PHP side can produce JSON for stats, and if there is any other code consuming the phpserialized data.
(In reply to comment #10)
> Local performance testing indicates this commit brings the requests into an
> acceptable time limit:
> http://github.com/ianb/zamboni/commit/d1795ec224f6b3a481bff005d753c19227d340d2
> (changes database data from phpserialize to JSON)
>
> Performance increases require a data migration, which can be performed with:
>
> python manage.py convert_stats_to_json
>
> It can't run in SQL, and will likely take a very long time, but can be applied
> incrementally and will pick up where it left off.

This is looking great. The best way to do long-running offline stuff is with celeryd. That lets us rate limit it and distribute it if needed. You can see an example of a manage script doing that at http://github.com/jbalogh/zamboni/commit/1758532dfdad6935ee547a813178cecb7af83c4c. The apply_async() call is the interesting part there.

> We need to check if the PHP side can produce JSON for stats, and if there is
> any other code consuming the phpserialized data.

I'll poke around in the PHP and see.
One caveat which might be a problem for Celery is that when I was running this migration locally, after a few hours a memory leak stalled out the process. MySQLdb is known to leak memory... or it could be Python, whose memory management can fall down in some circumstances. Either way, eventually the command will break the process it runs in. I put in an option to limit the number of updates it runs, and the command exits with a non-zero status if the limit is reached, so it can be looped until done. Does Celery recycle processes or have some maximum lifetime on processes? (I guess I could actually just exit too, and I'm guessing Celery will restart?)
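(For reference, newer celery releases have a worker setting for exactly this -- recycling pool processes after N tasks -- though I'm not sure the version deployed here supports it:)

# settings_local.py -- assumption: the installed celery understands this option
CELERYD_MAX_TASKS_PER_CHILD = 50  # replace each pool process after 50 tasks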
Celery conversion here: http://github.com/ianb/zamboni/commit/536d87f5b280a87854b117795a4e21f1de3d19d1 I haven't tested it, as I don't have a running Celery setup in my development environment (the non-celery version still works; --queue implements it as a task). Also I'm not sure about the lifecycle of celery processes, it's still a concern that memory will be leaked if those processes are reused for multiple tasks.
Thanks. I tried to run this tonight, and I agree it's using crazy amounts of RAM and CPU. It looks like I have to specify max-objs, and since I don't want to sit here all night I went for:

./manage.py convert_stats_to_json --queue --max-objs=999999999

It returns super fast, but celery thrashes. If I do max-objs 1 or 2 it doesn't die, but 20 to 200 just makes it thrash. I glanced at the code a bit - are you sure you're not recursing forever? It looks like you pass max_objs into update_to_json, and then pass it in again from after_max_redo. I didn't dig into it deeper since I'm falling asleep.

To use celery, the idea is to loop over your objects in chunks, and then pass each chunk to apply_async (like the script in comment 11; see the sketch below). I think the way you're doing it is passing the entire max_objs into celery, so celery is doing the work, but we're not getting any of the benefit of rate limiting or parallel processing. (Unless you run the command a zillion times on the command line, I guess.)
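The chunking pattern I mean, roughly (hypothetical task and model import; the linked commit has the real example):

from celery.decorators import task
from stats.models import DownloadCount  # assumption: rough model location

CHUNK = 500

def chunked(seq, n):
    """Yield successive n-sized slices of seq."""
    for i in xrange(0, len(seq), n):
        yield seq[i:i + n]

def queue_all():
    ids = list(DownloadCount.objects.values_list('id', flat=True))
    for chunk in chunked(ids, CHUNK):
        convert_chunk.apply_async(args=[chunk])

@task
def convert_chunk(ids):
    for obj in DownloadCount.objects.filter(id__in=ids):
        pass  # re-encode obj's phpserialized field as JSON and save it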
That definitely shouldn't happen. I'll look at the logic there. I can't loop over any chunks ahead of time as simply finding the things that need to be updated takes a significant amount of the work. I've been running with max-objs at 100000 or so without problem, and the way I intended the Celery task to work is that it should restart its own task after reaching that limit. The whole task is safely rerunnable (but not useful to run in parallel), so the intention is to simply restart the task after doing a bunch of work.
(In reply to comment #11)
> > I'll poke around in the PHP and see.

The only code I saw consuming that data was the statistics pages themselves, so that should make this transition manageable. Once we make the transition and confirm the speedup, we'll have to let metrics know about the change.

(In reply to comment #15)
> That definitely shouldn't happen. I'll look at the logic there. I can't loop
> over any chunks ahead of time as simply finding the things that need to be
> updated takes a significant amount of the work. I've been running with
> max-objs at 100000 or so without problem, and the way I intended the Celery
> task to work is that it should restart its own task after reaching that limit.
> The whole task is safely rerunnable (but not useful to run in parallel), so the
> intention is to simply restart the task after doing a bunch of work.

Whatever works is fine with me. Keep in mind that this is a run-once script - it doesn't need to be beautiful. The more work we keep off the db and onto the celery boxes the better. If the migration takes longer than a couple hours, we'll have to figure out a plan for an extended stats outage or some method of keeping both versions available. Since historical stats don't change we can do that, it'll just be more coordinating.
Well, if we're going to keep this simple, I've tested it as a simple shell script with a while loop to do the retrying. It took about two full days on my laptop to do the update, and I only have a partial database dump. I should clarify though that the migration can run with the application live, so nothing has to be taken offline. It can simply run as long as it needs to, and be interrupted at any time and restarted. The change I made to the code still allows phpserialized data to be read, it just detects and decodes JSON data as well. We could do a smaller migration in staging to test as well, e.g., convert just the records from that known-slow URL. The script doesn't support that kind of selectivity, but I could add that.
(In reply to comment #17)
> The change I made to the code
> still allows phpserialized data to be read, it just detects and decodes JSON
> data as well.

It detects it in Python. Our problem is that the PHP side is still expecting what is in the db, so if it changes it'll break. We'd either need to modify the PHP code to see both, or do the switch all at once.
I thought you said the PHP wasn't reading those values, only writing them? Or maybe I misunderstood you. Changing the PHP side seems reasonable -- and necessary, if we want the PHP side to read the new JSON values too, right? The detection the Python code does is just looking for value.startswith('{') or value.startswith('['), and if it matches one of those, using JSON.
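In other words, the decoding boils down to something like this (a simplified sketch, not the exact committed code):

import json
import phpserialize  # legacy decoder for the old rows

def decode_stats_blob(value):
    """Decode a stats blob that may be JSON (new) or phpserialized (legacy)."""
    if not value:
        return {}
    if value.startswith('{') or value.startswith('['):
        return json.loads(value)
    return phpserialize.loads(value)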
PHP is definitely reading the values, but I can look into modifying that since we are on a new enough version of PHP. I've tried to run your command with max-objs at 100000, 10000, and 1000 and it fails every time saying it's out of memory:

(zamboni)clouserw@khan:~/dev/zamboni$ ./manage.py convert_stats_to_json --max-objs=1000
/home/clouserw/.virtualenvs/zamboni/lib/python2.6/site-packages/celery/loaders/default.py:93: NotConfigured: No celeryconfig.py module found! Please make sure it exists and is available to Python.
  NotConfigured)
11:54:1284663249 nose.plugins.manager:DEBUG DefaultPluginManager load plugin sqlalchemy = sqlalchemy.test.noseplugin:NoseSQLAlchemy :/home/clouserw/.virtualenvs/zamboni/lib/python2.6/site-packages/nose/plugins/manager.py:359
11:54:1284663249 z.command:DEBUG Updating model Contribution.post_data (1/8) :/home/clouserw/dev/zamboni/apps/stats/tasks.py:221
11:54:1284663249 z.command:DEBUG SELECT `stats_contributions`.`id`, `stats_contributions`.`addon_id`, `stats_contributions`.`amount`, `stats_contributions`.`source`, `stats_contributions`.`source_locale`, `stats_contributions`.`annoying`, `stats_contributions`.`created`, `stats_contributions`.`uuid`, `stats_contributions`.`is_suggested`, `stats_contributions`.`suggested_amount`, `stats_contributions`.`comment`, `stats_contributions`.`transaction_id`, `stats_contributions`.`post_data` FROM `stats_contributions` WHERE (NOT ((`stats_contributions`.`post_data` LIKE BINARY {% AND NOT (`stats_contributions`.`post_data` IS NULL))) AND NOT ((`stats_contributions`.`post_data` LIKE BINARY [% AND NOT (`stats_contributions`.`post_data` IS NULL))) AND NOT ((`stats_contributions`.`post_data` = AND NOT (`stats_contributions`.`post_data` IS NULL)))) :/home/clouserw/dev/zamboni/apps/stats/tasks.py:221
11:54:1284663257 z.command:DEBUG Updating model DownloadCount.sources (2/8) :/home/clouserw/dev/zamboni/apps/stats/tasks.py:221
11:54:1284663257 z.command:DEBUG SELECT `download_counts`.`id`, `download_counts`.`addon_id`, `download_counts`.`count`, `download_counts`.`date`, `download_counts`.`src` FROM `download_counts` WHERE (NOT ((`download_counts`.`src` LIKE BINARY {% AND NOT (`download_counts`.`src` IS NULL))) AND NOT ((`download_counts`.`src` LIKE BINARY [% AND NOT (`download_counts`.`src` IS NULL))) AND NOT ((`download_counts`.`src` = AND NOT (`download_counts`.`src` IS NULL)))) :/home/clouserw/dev/zamboni/apps/stats/tasks.py:221
Exception MemoryError: MemoryError() in <bound method Cursor.__del__ of <MySQLdb.cursors.Cursor object at 0xc4e512c>> ignored
Traceback (most recent call last):
  File "./manage.py", line 57, in <module>
    execute_manager(settings)
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/core/management/__init__.py", line 438, in execute_manager
    utility.execute()
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/core/management/__init__.py", line 379, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/core/management/base.py", line 191, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/core/management/base.py", line 220, in execute
    output = self.handle(*args, **options)
  File "/home/clouserw/dev/zamboni/apps/stats/management/commands/convert_stats_to_json.py", line 30, in handle
    updater.run()
  File "/home/clouserw/dev/zamboni/apps/stats/tasks.py", line 204, in run
    self.update_all_models()
  File "/home/clouserw/dev/zamboni/apps/stats/tasks.py", line 297, in update_all_models
    self.update_objs(model, field_name)
  File "/home/clouserw/dev/zamboni/apps/stats/tasks.py", line 265, in update_objs
    for obj in qs:
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/db/models/query.py", line 106, in _result_iter
    self._fill_cache()
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/db/models/query.py", line 760, in _fill_cache
    self._result_cache.append(self._iter.next())
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/db/models/query.py", line 282, in iterator
    obj = self.model(*row[index_start:aggregate_start])
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/db/models/base.py", line 273, in __init__
    setattr(self, field.attname, val)
  File "/home/clouserw/.virtualenvs/zamboni/src/django/django/db/models/fields/subclassing.py", line 98, in __set__
    obj.__dict__[self.field.name] = self.field.to_python(value)
MemoryError
We could run this on preview to see how it helps speed and do loadtests. We should be decommissioning all of remora's stats pages when we flip to potch's, so I don't think remora needs a compat layer.
I'm not sure why that MemoryError is occurring... I haven't been able to trigger that without doing a lot of work, and 1000 objects is practically nothing. Are you running exactly "./manage.py convert_stats_to_json --max-objs=1000" -- no --queue option? Does it do much work before you get the MemoryError? For handling the transition Jeff's idea seems reasonable -- I can run the migration on preview (maybe just on select objects, so it doesn't take too long), we can be sure this actually resolves the problem, and if the changes to apps.stats.db are deployed then the migration can be done sometime in the future when we do the flip (and feel confident it doesn't need to be reverted).
The command I ran is what I pasted above. It takes about 5 minutes to hit the memory error; I'm not sure what it's doing during that time. This is on khan, fwiw. I'd like to run the migration on my dev database before doing preview, but yeah, we'll definitely need to do some perf testing there before this goes live. What else do you need as far as debugging output for the MemoryError? Have you tried running this on khan?
I did not realize there was a setup on khan to run it on, that'd be very useful! Is the khan setup documented anywhere, or is there anything important I should know about it? It seems like I can just copy your settings_local into a checkout/venv and work from that... I can't tell from the settings where the running instance actually lives though. Also how is the database managed? Like if I mess it up, or simply want to retest what I'm doing, is there a way to restore or revert my changes?
(In reply to comment #24)
> I did not realize there was a setup on khan to run it on, that'd be very
> useful! Is the khan setup documented anywhere, or is there anything important
> I should know about it? It seems like I can just copy your settings_local into
> a checkout/venv and work from that... I can't tell from the settings where the
> running instance actually lives though.

My running instance is in my home directory, you should make your own venv and configure it. You're welcome to grab my settings_local, just don't use my db. ;) The dbs all live on cm-webdev01-master-01. I gave you credentials for that a long time ago, so I think you still have a complete dataset on there. Do you still have them?
We're making good progress on this - the script can currently run. Ian: can you paste in the command you use to convert an entire database?
Target Milestone: 5.12 → 5.12.1
Using the code from my branch (http://github.com/ianb/zamboni), run:

while ! python ~/dev/zamboni/manage.py convert_stats_to_json --max-objs=10000 ; do echo "Another run" ; done
Here are some numbers for the current pages off PAMO.

All the remora /statistics/csv/* pages:
Good: 663
Max/Avg/Min/StdDev: 492737ms/152936ms/102ms/135337ms
Bad: 160
Total Time: 494.109225988
Req/s: 1.27299788572

Is that an average time of 152 seconds? Yes, it is.

And the /statistics/json pages:
Good: 169
Max/Avg/Min/StdDev: 97401ms/65573ms/1893ms/35759ms
Bad: 0
Total Time: 97.4137899876
Req/s: 1.64247792864
Alright, I reran this to give remora more of a chance and grabbed some pleasantly good zamboni numbers as well. The following is for 20 simultaneous requests on PAMO.

Remora:
Max/Avg/Min/StdDev: 38604ms/10794ms/146ms/8841ms
Total Time: 521.80540514
Req/s: 1.20543021173

Zamboni CSV (1000 URLs):
Max/Avg/Min/StdDev: 11729ms/1396ms/36ms/1787ms
Total Time: 110.506951809
Req/s: 8.48815377355

Zamboni JSON (1000 URLs):
Max/Avg/Min/StdDev: 8406ms/1143ms/43ms/1558ms
Total Time: 93.6633141041
Req/s: 10.0145933226

Please clean up/rebase your migration and I'll land it on our trunk, and tomorrow night I'll run it and we'll grab some new numbers. Thanks Ian.
Ran this on PAMO last night with:

while ! python26 manage.py convert_stats_to_json --max-objs=10000; do echo -n "sleeping..."; sleep 10; echo "gogo"; done

Still ran out of memory after a few hours. Restarted again, but it's done a lot, we'll probably be done today.
Update: No luck now, even with --max-objs=100 we run out of memory on the first run. The box has 3G free before running.
Target Milestone: 5.12.1 → 5.12.2
(In reply to comment #29)
> Zamboni CSV (1000 URLs):
> Max/Avg/Min/StdDev: 11729ms/1396ms/36ms/1787ms
> Total Time: 110.506951809
> Req/s: 8.48815377355
>
> Zamboni JSON (1000 URLs):
> Max/Avg/Min/StdDev: 8406ms/1143ms/43ms/1558ms
> Total Time: 93.6633141041
> Req/s: 10.0145933226

The script finished running yesterday at ~10ish. New load test results:

CSV:
Max/Avg/Min/StdDev: 12745ms/1594ms/34ms/2194ms
Total Time: 127.117345095
Req/s: 7.37900873639

JSON:
Max/Avg/Min/StdDev: 10041ms/1245ms/64ms/1615ms
Total Time: 100.236711979
Req/s: 9.3578488508

So, overall worse performance, but within a margin of error that makes it essentially no change. At this point I suppose the question is - do we want to continue working on this bug? Using the remora results from comment 29 it looks like we're already substantially faster and we should close the bug. However, the average response time is 1.2-1.5 seconds, which is a really long time, especially when the front page does 4 stats requests. Ian: were there other areas that looked like they could give promising results?
From discussion, it appears the simplejson C speedups aren't installed in the environment used to test this, and without them the JSON decoding is only modestly faster than phpserialize decoding.
oremj installed simplejson, there was no noticeable change:

Max/Avg/Min/StdDev: 12478ms/1757ms/39ms/2116ms
Total Time: 135.53547287
Req/s: 6.92069743912

Max/Avg/Min/StdDev: 8746ms/1167ms/61ms/1675ms
Total Time: 95.0672271252
Req/s: 9.86670199989

I re-ran the test directly afterwards to see how the mysql cache would affect us and got pretty good results:

Max/Avg/Min/StdDev: 176ms/62ms/18ms/26ms
Total Time: 5.4371099472
Req/s: 172.518122515

Max/Avg/Min/StdDev: 165ms/61ms/15ms/28ms
Total Time: 4.57465982437
Req/s: 205.042568412

I think that points to the bottleneck being an unprimed database, not the code processing json.
Note the json conversion was specifically to address certain entries that were timing out (taking over 15 seconds), rendering the data inaccessible. For those records the response time was consistent and caching didn't provide any improvement, and the changes were made after profiling those specific requests.
For the record, I didn't get any timeouts in my benchmark, and the URL in comment 4 is working.
I'm not sure if timeouts are consistent across preview and the live site. Whether I got a timeout or just a long response time depended on exactly the URL I hit, and what servers I went through. I don't believe pieces like Zeus are in place on preview. Also, while the performance of that specific case with phpserialize could have been addressed, I asked and the consensus seemed to be that people wanted to generally remove phpserialize data and use json instead, which this implements.
They should be consistent, and this test case was. (and zeus is on preview) I think we're making good progress - I'm happy we moved to json. I'm just asking now if there are other areas we can improve with a reasonable amount of investment.
The best thing to do then would be to use a profiler with that same test (though the profiler slows things down by nearly an order of magnitude, so a trimmed test would be best). I don't think there's any reason that needs to be done on preview instead of a local dev box. I did that for the one request, but I never committed any of the profile rigging. I didn't notice this page before but it seems like a reasonable approach (specifically python -m cProfile): http://code.djangoproject.com/wiki/ProfilingDjango One caveat in my experience: some results can be surprising or misleading because queries and memcached can be hit at unexpected times, making code look slow when it's actually just the first thing to access a lazily computed value.
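One way to rig that up (hypothetical URL; the actual profiling code was never committed, and the wiki page above has more complete options):

import cProfile
import pstats
from django.test.client import Client

client = Client()
# hypothetical stats URL; substitute the slow request from comment 4
url = '/en-US/firefox/addon/1865/statistics/downloads-day-20100101-20100601.csv'

cProfile.runctx('client.get(url)', globals(), locals(), 'stats.prof')
pstats.Stats('stats.prof').sort_stats('cumulative').print_stats(30)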
I've started putting together some profiling results locally, but looks like I'll need to run it on khan to get access to an updated full database.
(In reply to comment #41)
> I've started putting together some profiling results locally, but looks like
> I'll need to run it on khan to get access to an updated full database.

How's that going?
Target Milestone: 5.12.2 → 5.12.3
I have the profiles running with csvs-zz.log, and I'm trying to interpret the results at this time.
(I'm not sure these profiling results would be the same in the presence of caching.) Most of the time is going into database-related parts of the code. About 20% of the time is actually spent in the database (MySQLdb routines). Database-related code as a whole takes about 70% of the time (conveniently this all funnels through cache-machine's caching.base, though the overhead of that routine itself isn't high -- it's simply a convenient point of aggregation). The remaining 50% is the Django ORM, deserializing fields, and other miscellaneous but fairly generic bits of database-related code. The most notable query, which accounts for 60% of the time, is getting the addon object for the queried data (get_object_or_404(Addon, id=addon_id)). (This particular query might be successfully cached in a non-development environment.) The object itself is only used to check permissions; otherwise the addon_id itself is sufficient.
(In reply to comment #44)
> The most notable query, which accounts for 60% of the time, is getting the
> addon object for the queried data (get_object_or_404(Addon, id=addon_id)).

Using Addon.objects.all().no_transforms() as the queryset skips all the extra Addon baggage.
Using Addon.objects.all().no_transforms().filter(id=addon_id) as suggested appears to make a substantial difference, bringing addon fetching down to 20% of the total time and halving request times; a sketch of the change is below. The MySQL query cache also has a substantial effect -- leaving the code overnight, all the database caching expired and overall time per request went up about tenfold. In that case MySQL was taking about 65% of the request time.
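The change in the view, roughly (a sketch, assuming no_transforms() behaves as described in comment 45; get_object_or_404 accepts a queryset):

from django.shortcuts import get_object_or_404

# Before: builds a fully transformed Addon (translations and friends) just
# to check permissions.
# addon = get_object_or_404(Addon, id=addon_id)

# After: fetch the bare row.
addon = get_object_or_404(Addon.objects.all().no_transforms(), id=addon_id)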
Any recommendations for next steps on this bug or have we done all we can?
Upon reflection, the worry about performance here is primarily the case where the MySQL query cache isn't warm. I believe performance is fairly decent once the query is cached (plus browser caching already catches this case frequently), and most of the perceived performance problems will come from the uncached case (which is about 10x as slow). Resolving the database bottleneck would require either persistent caching (e.g., precalculating some values) or other kinds of database tuning. Precalculating values seems likely to be too resource-intensive considering that most stats probably go unviewed anyway. I don't know much about database tuning, so I can't suggest much there. I did check the indexes early on and they seemed generally good (without them the queries would probably be slow even after warming up the database), though it's possible the indexes are not identical between all database deployments (since indexes are essentially invisible, and I saw some small mismatches in previous testing). So in summary, I don't believe there's anything worth doing to improve general performance, unless there is a particular situation that introduces specific problems (which I don't know of at this time).
Ok, thanks for investigating and the patches for json'ing the data. We'll see how things go.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: addons.mozilla.org → addons.mozilla.org Graveyard