Indexing on subdocuments and returning just the subdocuments on query


(Ben S) #1

For reasons obvious to anyone who's used really large data sets, it's
often advantageous to store data in buckets as opposed to individual documents
for each entry. For instance, let's say you're designing a Twitter-like
application. If every time a user tweets something we simply store it as
its own document, that means that each time a user logs in and checks his
inbox of tweets we'll have to query on all tweets that match the IDs of the
users he's following. But this can lead to an unacceptably high number
(dozens, perhaps hundreds) of random disk seeks. On the other hand, if,
when someone posts a tweet, we fan it out by writing it to a bucket of
tweets for each of his followers, we need only fetch the bucket of tweets
held in a single inbox document (or perhaps a few) when displaying a
user's inbox. This reduces disk seeks to just a couple. There's a good
MongoDB blog entry about this very topic here:

http://blog.mongodb.org/post/65612078649/schema-design-for-social-inboxes-in-mongodb
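
To make the fan-out idea concrete, here is a minimal in-memory sketch in Python. It is not taken from the blog entry and uses no real database; the follower graph, the BUCKET_SIZE cap, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical cap on how many tweets one inbox bucket document holds.
BUCKET_SIZE = 3

# Made-up follower graph: author -> list of followers.
followers = {"user99": ["user01", "user02"], "user98": ["user01"]}

# recipient -> list of bucket documents, newest bucket last.
inboxes = defaultdict(list)


def post_tweet(sender, text):
    """Fan out on write: append the tweet to each follower's newest bucket."""
    for recipient in followers.get(sender, []):
        buckets = inboxes[recipient]
        # Start a new bucket when there is none yet or the newest one is full.
        if not buckets or len(buckets[-1]["inbox"]) >= BUCKET_SIZE:
            buckets.append({"recipient": recipient, "inbox": []})
        buckets[-1]["inbox"].append({"sender": sender, "tweet": text})


def read_inbox(recipient):
    """Reading an inbox only touches that user's own bucket documents."""
    return [
        entry
        for bucket in inboxes.get(recipient, [])
        for entry in bucket["inbox"]
    ]


post_tweet("user99", "This is a tweet")
post_tweet("user98", "Hello World!")
post_tweet("user98", "It's a mad mad world")
post_tweet("user99", "One more")
```

The cost moves to write time: one tweet becomes one append per follower, but showing an inbox means reading a handful of buckets instead of one document per tweet.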

Let's say a bucketed tweet inbox JSON document looks like this (I left out
the timestamp for simplicity):

{
  "recipient": "user01",
  "inbox": [
    { "sender": "user99", "tweet": "This is a tweet" },
    { "sender": "user98", "tweet": "Hello World!" },
    { "sender": "user98", "tweet": "It's a mad mad world" },
    ...
  ]
}

Now, the problem with storing your data in blocks like that is that it makes
the data more difficult to query. If all tweets are stored as separate
documents, then it's easy to, say, run a search for all tweets in your inbox
that contain the word "world". And from what I understand, you can easily
index subdocuments and the elements of an array in Elasticsearch. But is
there any way to know which of those subdocuments (or elements of an array
in this case) within the one JSON document led to the search hit? Even if
it's not directly supported by Elasticsearch, is there any clever way to
return just the fields (or array elements) that matched rather than
returning the entire document? Ideally in this case I'd like to pull the
sender's ID too so I can fully construct the tweet entries that match.
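
To show the result I'm after, here is a rough sketch in plain Python of the filtering I could do client-side after fetching the whole inbox document. This is not an Elasticsearch feature, only an illustration of the desired output; the matching_entries helper is hypothetical and the whole-word, case-insensitive match is my own assumption about how the search should behave.

```python
import re

# The bucketed inbox document from above, with the elided entries left out.
inbox_doc = {
    "recipient": "user01",
    "inbox": [
        {"sender": "user99", "tweet": "This is a tweet"},
        {"sender": "user98", "tweet": "Hello World!"},
        {"sender": "user98", "tweet": "It's a mad mad world"},
    ],
}


def matching_entries(doc, word):
    """Return only the inbox entries whose tweet contains `word`.

    Assumption: whole-word, case-insensitive matching.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [entry for entry in doc["inbox"] if pattern.search(entry["tweet"])]


# Each hit keeps its sender, so the full tweet entry can be rebuilt.
hits = matching_entries(inbox_doc, "world")
```

Doing this on the client still means pulling back the entire document for every hit, which is exactly what I'd like to avoid.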

Thanks,
Ben


