Status

support.mozilla.org
Army of Awesome
RESOLVED FIXED
7 years ago
7 years ago

People

(Reporter: wenzel, Assigned: wenzel)

Tracking

unspecified
2.2.5

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

7 years ago
Per bug 584886, we need to collect tweets.

As we found out, Metrics' HIVE database might not be sufficiently real-time for the purpose.

We could run a cron job to collect tweets, but depending on how much we collect, the rate limit [1] might kick in. We *can* probably use a single "OR" query, though, to reduce the number of requests.
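
A minimal sketch of the single "OR" query idea. The endpoint path, the `q` and `rpp` parameter names, and the second search term are assumptions for illustration; this bug only names search.twitter.com and "firefox".

```python
from urllib.parse import urlencode

# Assumption: endpoint path and parameter names are illustrative only.
SEARCH_URL = "http://search.twitter.com/search.json"


def build_search_url(terms, per_page=100):
    """Join several search terms into one OR query so one request covers all."""
    query = " OR ".join(terms)
    return SEARCH_URL + "?" + urlencode({"q": query, "rpp": per_page})


# "fennec" is a hypothetical second term, not taken from this bug.
url = build_search_url(["firefox", "fennec"])
```

One request per cron run then covers every term, which keeps the request count independent of how many terms are tracked.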

Do we need an update more than once a minute? If so, cron is insufficient and we need to write a daemon, but I'd like to avoid that, because then we also need a watchdog, so it'll all be much more complicated.

Or we use Celery[2] and update from cron as well as every time someone visits the page -- with a rate limit. That might actually be our best bet to stay as real-time as possible.

Whacha think?


[1] http://dev.twitter.com/pages/rate-limiting
[2] http://celeryproject.org/
If we're not reading these from the HIVE database, do we plan on sticking all the "firefox" related tweets directly into our database? If so, how long are we storing them? As someone who has to download the thing regularly, these things matter.

Is there another sort of data store we might be able to use that's more suited to this kind of data?
(Assignee)

Comment 2

7 years ago
Another thing to keep in mind is the sheer number of tweets: from an unscientific search on search.twitter.com, I'd guesstimate one Firefox tweet per second. We might need to build in some sort of aggregate-and-purge step very early on in the process.
(Assignee)

Comment 3

7 years ago
Hm, we mid-aired there, but of course I agree with James.

If we want to, we can drop the tweets into redis instead. Though I am not sure how much sense that makes.
(Assignee)

Updated

7 years ago
Blocks: 591934
(Assignee)

Updated

7 years ago
Blocks: 591942
(Assignee)

Updated

7 years ago
No longer blocks: 591934, 591942
(Assignee)

Updated

7 years ago
Blocks: 591946
(Assignee)

Comment 4

7 years ago
Anurag, do you have some input for this? Will Metrics need any of our data to work with, or will you just pull it all off twitter?
(Assignee)

Comment 5

7 years ago
Okay, we'll store this in the regular DB, but make sure to limit the number of tweets in the database. We're not currently looking to retain tweets; Metrics will collect and analyze them separately, regardless of whether they were tweeted through our page.

We'll collect tweets once a minute off search.twitter.com for now.

I'm planning on collecting tweets mentioning "firefox", while applying rudimentary filters to exclude links, retweets, replies.
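
A rough sketch of what such rudimentary filters might look like. This is not the code from the commit; the field names `text` and `to_user_id` follow the search result format, and the exact heuristics (URL pattern, "RT" detection) are assumptions.

```python
import re

# Assumption: any http(s) URL in the text counts as a link.
URL_RE = re.compile(r"https?://", re.IGNORECASE)


def is_usable(tweet):
    """Return True if the tweet mentions firefox and is not a link, retweet, or reply."""
    text = tweet.get("text", "")
    if "firefox" not in text.lower():
        return False
    if URL_RE.search(text):
        return False  # contains a link
    if text.startswith("RT ") or " RT @" in text:
        return False  # old-style retweet
    if tweet.get("to_user_id") is not None or text.startswith("@"):
        return False  # reply to another user
    return True
```

For example, `is_usable({"text": "Firefox is slow today", "to_user_id": None})` passes, while a tweet starting with "RT @someone" is dropped.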

Anything else?
(Assignee)

Updated

7 years ago
Assignee: nobody → fwenzel
Status: NEW → ASSIGNED
(Assignee)

Comment 6

7 years ago
All right, I got a first version up and running.

http://github.com/fwenzel/kitsune/commit/7b83bc18

Alex: r?


It collects tweets as mentioned above and, unless they match the filter criteria mentioned there, saves them as a JSON blob. The only metadata I extract for now is the tweet ID and creation timestamp.

In one go, we collect up to 100 tweets (most of which are discarded). I am unscientifically guesstimating we get about 10 to 20 retained tweets a minute with the current criteria.

After the update is done, I discard all but the most recent 500 tweets in the database.
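
The save-then-purge step could be sketched roughly as follows. This uses an in-memory SQLite table rather than the project's actual models; the table and column names are hypothetical, and ordering by tweet ID as a stand-in for recency is an assumption.

```python
import json
import sqlite3

MAX_TWEETS = 500  # retention limit mentioned above


def save_and_purge(conn, tweets, keep=MAX_TWEETS):
    """Store each tweet as a JSON blob, then drop all but the newest `keep` rows."""
    conn.executemany(
        "INSERT OR IGNORE INTO tweet (tweet_id, created, raw_json) VALUES (?, ?, ?)",
        [(t["id"], t["created_at"], json.dumps(t)) for t in tweets],
    )
    # Assumption: higher tweet IDs are newer, so ID order approximates recency.
    conn.execute(
        "DELETE FROM tweet WHERE tweet_id NOT IN "
        "(SELECT tweet_id FROM tweet ORDER BY tweet_id DESC LIMIT ?)",
        (keep,),
    )
    conn.commit()


# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweet (tweet_id INTEGER PRIMARY KEY, created TEXT, raw_json TEXT)"
)
```

Running the purge in the same step as the insert keeps the table bounded after every cron run.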

QUESTIONS:
- Do we want to restrict incoming tweets to language code "en" only for now, or do we not care what language they are in?
- Do the numbers sound reasonable?
- Anything else we want to filter for?


For the record, here's an example search result JSON blob:
{
    "profile_image_url": "http://a2.twimg.com/profile_images/508149026/bdsr_normal.jpg",
    "created_at": "Fri, 17 Sep 2010 10:08:22 +0000",
    "from_user": "bdsr85",
    "metadata": {"result_type": "recent"},
    "to_user_id": null,
    "text": "FireFoxはキャッシュが残って紛らわしい件",
    "id": 24745823579,
    "from_user_id": 76053573,
    "geo": null,
    "iso_language_code": "ja",
    "source": "<a href=\"http://www.echofon.com/\" rel=\"nofollow\">Echofon</a>"
}
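
As an illustration of pulling the two metadata fields (tweet ID and creation timestamp) out of such a blob, here is a small sketch. The function name is hypothetical; the `created_at` value is in RFC 2822 date format, which the standard library can parse.

```python
import json
from email.utils import parsedate_to_datetime


def extract_metadata(raw):
    """Return (tweet_id, created) from a raw search-result JSON string."""
    tweet = json.loads(raw)
    # "Fri, 17 Sep 2010 10:08:22 +0000" parses to a timezone-aware datetime.
    return tweet["id"], parsedate_to_datetime(tweet["created_at"])


raw = '{"id": 24745823579, "created_at": "Fri, 17 Sep 2010 10:08:22 +0000"}'
tweet_id, created = extract_metadata(raw)
```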
(Assignee)

Comment 7

7 years ago
I've been collecting some tweets for a while to test, and to fill up 500 usable (i.e., only the ones we don't discard) tweets, it took about 2.5 hours. That's 200 tweets per hour, a little more than 3 a minute. Probably varies, depending on time of day. This was early in the morning, Pacific time.
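
The arithmetic behind those figures, as a quick check (the function name is just for illustration):

```python
def fill_rate(tweets, hours):
    """Return (tweets per hour, tweets per minute) for a collection run."""
    per_hour = tweets / hours
    return per_hour, per_hour / 60


per_hour, per_minute = fill_rate(500, 2.5)  # 200 per hour, about 3.3 per minute
```

That measured rate is well below the 10 to 20 retained tweets a minute guessed in comment 6.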
(In reply to comment #7)
> I've been collecting some tweets for a while to test, and to fill up 500 usable
> (i.e., only the ones we don't discard) tweets, it took about 2.5 hours. That's
> 200 tweets per hour, a little more than 3 a minute. Probably varies, depending
> on time of day. This was early in the morning, Pacific time.

I guess we should think about pre-filling the DB when we launch then?
(Assignee)

Comment 9

7 years ago
Yes. At least, we should set up the cron job on stage and keep it running, then copy the data over at launch. Are we doing pagination? Should be easy to fill the first five pages of tweets in no time. If we launch in the evening Pacific time, our DB will be filled by morning.

Alternatively, I can change the code so it downloads as many results as twitter will give it if the DB is not populated yet.
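
That alternative could look roughly like this. It is only a sketch: `fetch_page` is a hypothetical callable returning one page of results, and the page cap of 15 is an assumption, not a confirmed API limit.

```python
def collect(fetch_page, db_is_empty, max_pages=15):
    """Fetch one page normally, or page through results when the DB is empty.

    fetch_page(page) is a hypothetical callable returning a list of tweets.
    """
    pages = max_pages if db_is_empty else 1
    results = []
    for page in range(1, pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # no more results available
        results.extend(batch)
    return results
```

On a populated DB this behaves like the regular once-a-minute run; only the first run pays for the extra requests.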
Sorry to chime in late here, but it looks like most questions have been answered.

Since this page will only be in English for launch, I'm okay with just showing tweets in English. Ideally, we'd like to show users tweets in their own language (similar to how we do locale detection for l10n). Another option is to show English only by default and use the Filter menu to show all languages.

Chatted with Alex last week about doing pagination, and we think that would be a good idea.

Mary, Kadir, David: any feedback?

Comment 11

7 years ago
Hey there:  Does it make it more complex to have additional languages?  If not, we should go ahead with it.  I think it might make folks who are proficient in English (but a little shy) more apt to help.
Fred or Alex, any thoughts on comment 11? Thanks
(In reply to comment #11)
> Hey there:  Does it make it more complex to have additional languages?  If not,
> we should go ahead with it.  I think it might make folks who are proficient in
> English (but a little shy) more apt to help.

Yes.

Also, I should note that just because Twitter marks a tweet with a certain language doesn't mean it's actually in that language. I saw this when working on Firefox Cup tweets: Twitter would mark a tweet as English, but it'd be in Japanese.

So, I'm not sure separating languages will do much for us.
Alright, let's go with English only (based on how Twitter marks it).
(In reply to comment #6)
> All right, I got a first version up and running.
> 
> http://github.com/fwenzel/kitsune/commit/7b83bc18
> 
> Alex: r?

Left a couple comments, otherwise r+
(Assignee)

Comment 16

7 years ago
All right, I addressed all comments and pushed the results to the customercare branch again (I also rebased it onto jsocol/kitsune:master again so merging will be easier later).

Notes:
- I am now exposing the locale in the database. If you only want to show ``en`` on the front end, filter accordingly.
- I also don't think we need to make an effort to prefill our DB before the push. From scratch, I get about 40 tweets during the first run, so our 500 tweets should fill up in no time.
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.2.5