Closed Bug 678405 Opened 13 years ago Closed 13 years ago

Gloda shouldn't generate rows in messageAttributes for unimportant attributes

Categories

(MailNews Core :: Database, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 10.0

People

(Reporter: protz, Assigned: protz)

Details

(Keywords: perf, Whiteboard: [gloda key])

Attachments

(1 file, 2 obsolete files)

In global-messages-db.sqlite:
- attributeDefinitions defines the available message attributes for each message (read, starred, etc.),
- messageAttributes contains message id / attribute id / attribute value triplets, each one being covered by an index, so that we can efficiently query on "all starred messages" for instance.

However, the usefulness of some of these attributes is somehow dubious.
- Do we really want to query "all read messages"?
- Do we really want to query "all message toMe"?

Apart from the ones that Thunderbird Conversations and Gloda use, it will be a win to remove these attributes from messageAttributes (and attributeDefinitions too?): less indexes, less size, better performance. Non-exhaustive list: read, fromMe, toMe, to, cc, bcc, isEncrypted, attachmentInfos, repliedTo, forwarded.

Also, we should make sure having an index and being searchable is opt-in when using defineAttributes, so that consumers don't accidentally cripple the database with extra information.

We won't be losing these attributes since they're stored in the JSON blob in the messages table. Simply, we'll lose the ability to query in these attributes.
Yes, this will be a vontastic improvement!

They will need to stay in attributeDefinitions because that's how they get their unique integer identifier that we also use in the JSON blob in an attempt to perform some limited compression.  (And if we wanted to use a more compact encoding than JSON, we would probably still benefit from the integer id's.)
Would this affect Gloda result filtering ? (ie when I click on to me or from me on the gloda result page )
Keywords: perf
I will need to review the search code to actually figure out which of these attributes are used in Gloda queries proper (meaning I cannot drop them from the messageAttributes table), and which are used in subsequent refining without issuing a newer gloda query (meaning I can drop them). Breaking the search feature is not an explicit goal of that bug. If you wish to see this breakage implemented, I suggest you file a new bug and assign to me, it should be fairly straightforward to achieve.
Attached patch Patch v1 (obsolete) — Splinter Review
Preliminary version of the patch. I intend to rebuild my index tonight see if I can manage to render my Thunderbird unusable or trim the fat off my global-messages-db.sqlite file!
Attachment #552817 - Flags: feedback?(bugmail)
This removed 14% off the size of my global-messages-db.sqlite (went from 197MB to 169MB). I had 1 million rows in messageAttributes, now I have:

sqlite> select count(*) from messages;
62827
sqlite> select count(*) from messageAttributes;
350036

This is for this patch only.
Attached patch Patch v2 (obsolete) — Splinter Review
This patch makes the following code fail:

let q = Gloda.newQuery(...);
q.attrNonSearchable(val);

which is what we want to happen. This allowed me to see precisely which tests were failing and which were not, and which attributes were actually required and which were not.

I've put together a list of gloda attribute that are about to go, there's about a dozen of them: https://spreadsheets.google.com/spreadsheet/ccc?key=0AkkK5d8EHoWRdEFDOXBEeE92WUwzYUQ2dV94V3g5NEE&hl=en_US ; feel free to comment here.

The "keep only if true" is for attributes that we would like to know about only when they're actually. We'll never issue a query for a message that is not starred or not to me. But that can be addressed later in a subsequent patch.
Attachment #552817 - Attachment is obsolete: true
Attachment #552817 - Flags: feedback?(bugmail)
Attachment #552933 - Flags: feedback?(bugmail)
"from" will be useful to me (one day), since I'd like to be able to query for all attachments sent by a particular user. Technically, I could query for messages involving that user and then filter on "from", but I can imagine that having a lot of false positives, especially if attachments are being sent to both you and the user in question.
Once you get the results of your query, you can still refine the results before moving on. Roughly:

onQueryCompleted: function (aItems) {
  // aItems is all messages involving user X and having attachments
  aItems = aItems.filter(function (glodaMsg) glodaMsg.to == userX);
  // aItems is now what you want
},

Since Gloda.newQuery(Gloda.NOUN_MESSAGE).involves(userX).attachmentTypes() is already fairly restrictive, I don't think you'll have to filter on a lot of results. Would such a solution be enough for you?
Sorry, meant to filter on glodaMsg.from, of course.
(In reply to Jonathan Protzenko [:protz] from comment #6)
> I've put together a list of gloda attribute that are about to go, there's
> about a dozen of them:
> https://spreadsheets.google.com/spreadsheet/
> ccc?key=0AkkK5d8EHoWRdEFDOXBEeE92WUwzYUQ2dV94V3g5NEE&hl=en_US ; feel free to
> comment here.

For messages:
- headerMessageID should be marked special.  It's also very important to our conversation logic (more than just "tests" :)

For contacts:
- popularity is special and is not going away because it is used by the autocomplete code.  As such, it makes sense to leave this queryable even though it's not an absolute value and so is sorta boring.
- frecency is unused but otherwise in the same boat as popularity.  we can remove the attribute, its column, and index, entirely (as well as the logic that touches it).  you might want to just treat it like popularity for now and then remove it in a separate bug.
- freetags is used by the autocompleter (ContactTagCompleter), even though data is never populated using anything in core.  Given mconley's excellent progress on improving contact stuff, I think we should keep this around because a) it will be useful soon b) no space is used if no data is present, so it isn't costing us anything.
(In reply to Jim Porter (:squib) from comment #7)
> "from" will be useful to me (one day), since I'd like to be able to query
> for all attachments sent by a particular user. Technically, I could query
> for messages involving that user and then filter on "from", but I can
> imagine that having a lot of false positives, especially if attachments are
> being sent to both you and the user in question.

I agree that this is a useful distinction, performance-wise, since I can easily imagine there being an order of magnitude difference between the set size of involves(X) and from(X).  If you are planning to write and popularize an extension or :bwinton has a feature planned that would be targeted in the Thunderbird 9-11 range, I think we can/should keep it, but if the odds are low, I think we should ditch from and add it back when needed.  (The time rationale being that I would not want us to trigger a gloda re-index more frequently than once every 3 months.)
Attached patch Patch v3Splinter Review
So I've made freetag, popularity and frecency searchable; I'm not sure for the identities attribute on GlodaContact. I tend to think that this one can go away, as queries usually go like this:
- have an email address,
- find its related GlodaIdentity,
- from the GlodaIdentity, move up to the GlodaContact.

It would be really dumb to query for a GlodaContact using a GlodaIdentity. I don't remember why I requested feedback only and not the full review, since there's tests.
Attachment #552933 - Attachment is obsolete: true
Attachment #552933 - Flags: feedback?(bugmail)
Attachment #555161 - Flags: review?(bugmail)
Comment on attachment 555161 [details] [diff] [review]
Patch v3

Indeed, "identities" does not need to be user-visible since it was only really intended to be used under the hood for loading.  (When we load identity "a1", we then go on to load its owning contact "A", and if that contact has any other identities associated like "a2"/"a3"/etc. we go and load those by joining on that.)  So if someone has an identity id, they can load the identity and get the contact for free.  Of course, it's all somewhat academic as you must keep the database column we use for the join, so there is no disk savings, just avoiding user confusion :)

This is a great optimization for us, thanks much!
Attachment #555161 - Flags: review?(bugmail) → review+
http://hg.mozilla.org/comm-central/rev/87b0510db268
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 10.0
You need to log in before you can comment on or make changes to this bug.