Nested vs. multi-valued for facet performance


(Andrew Clegg) #1

Hi,

I've heard in one or two places, like this thread (2nd post):

https://groups.google.com/group/elasticsearch/browse_thread/thread/78f2600ad069cabb/19a688a270fa80cf?lnk=raot

that using nested documents instead of multi-valued fields can reduce
memory requirements for faceting by quite a lot.

But I wasn't sure I understood what it meant.

Is it just replacing documents like this:

{
  "user" : "bart",
  "friends" : [ "homer", "maggie", "marge" ]
}

with something like this?

{
  "user" : "bart",
  "friends" : [
    { "user" : "homer" },
    { "user" : "maggie" },
    { "user" : "marge" }
  ]
}

If it is that simple, why does anyone use list-valued fields at all when
this is more efficient? Is there any other downside (apart from bigger JSON
messages)?

(If it makes a difference, our application is write-heavy and largely
facet-based)

Can you add a field with nested documents at indexing time, or do you have
to map it in advance manually?

Many thanks!

Andrew.


(Andrew Clegg) #2

Appending this discussion from IRC for the benefit of future searchers --
thanks Clinton.

4:46 PM <clintongormley> so...
4:47 PM <clintongormley> for multi-value fields, eg { tag: ['foo','bar']}
4:47 PM <clintongormley> it has to create an array based on the max number of terms in that field
4:48 PM <clintongormley> so if it needs to create an array with 10 values (where max values per field = 1), then for one doc that has 5 values, it creates 10 x 5
4:48 PM <andrewclegg> ouch
4:48 PM <clintongormley> yes
4:49 PM <clintongormley> so the 1% of docs with many values can greatly increase memory usage
4:49 PM <andrewclegg> so am I thinking along right lines here: https://groups.google.com/forum/#!topic/elasticsearch/lIcFPQ2HoQc
4:49 PM <andrewclegg> apologies for bumping via irc ;)
4:49 PM <clintongormley> not quite
4:49 PM <clintongormley> those would be type: "object"
4:49 PM <clintongormley> you have to make them type: "nested"
4:50 PM <clintongormley> which actually indexes them internally as SEPARATE documents
4:50 PM <clintongormley> but because each of these docs has only one value per field, the memory usage in this case is much lower
4:50 PM <clintongormley> i hadn't thought about that consequence before
4:50 PM <andrewclegg> ah ok. do you know if it's possible to set that up via the default mappings somehow? or do you have to map explicitly for every field where you're doing that?
4:51 PM <clintongormley> nested docs have their own overhead (eg performance - they're separate docs after all)
4:51 PM <clintongormley> so i'd advise doing it only where you need to
4:52 PM <clintongormley> also, the way you query your docs changes as well
4:52 PM <clintongormley> there are special nested queries/filters/facets
4:52 PM <andrewclegg> ok, thanks
4:52 PM <clintongormley> np
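
To make the arithmetic in the log concrete, here is a toy Python model. It is only a sketch, not Elasticsearch's actual field data code. It assumes the multi-valued array gets one slot per document, padded to the largest number of values any single document has, and that in the nested layout every parent and every nested document takes exactly one slot. The function names and the sample value counts are made up for illustration.

```python
def multi_valued_slots(value_counts):
    """Slots needed if every doc is padded to the largest per-doc value count."""
    if not value_counts:
        return 0
    return len(value_counts) * max(value_counts)


def nested_slots(value_counts):
    """Slots needed if each value becomes its own single-valued nested doc.

    Assumption: parents still occupy one slot each, plus one per nested doc.
    """
    return len(value_counts) + sum(value_counts)


# The example from the log: 10 docs, nine with one value, one with 5 values.
small = [1] * 9 + [5]
print(multi_valued_slots(small))  # 10 * 5 = 50
print(nested_slots(small))        # 10 + 14 = 24

# "The 1% of docs with many values": 990 docs with 1 value, 10 docs with 100.
skewed = [1] * 990 + [100] * 10
print(multi_valued_slots(skewed))  # 1000 * 100 = 100000
print(nested_slots(skewed))        # 1000 + 1990 = 2990
```

Under these assumptions the padded layout grows with the worst-case document, while the nested layout grows with the total number of values, which is why a handful of outliers matters so much.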

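For the `type: "nested"` part, this is roughly what an explicit mapping and a facet scoped to the nested documents might look like, written as Python dicts that serialize to JSON request bodies. Treat it as a sketch: the type name `person`, the facet name `friend_names`, and the `not_analyzed` string settings are assumptions for illustration, and the exact facet syntax (in particular the `nested` key and the field path) should be checked against the documentation for your Elasticsearch version.

```python
import json

# Hypothetical mapping for a "person" type. The important part is
# "type": "nested" on "friends"; without it the inner objects are
# treated as plain type "object".
mapping = {
    "person": {
        "properties": {
            "user": {"type": "string", "index": "not_analyzed"},
            "friends": {
                "type": "nested",
                "properties": {
                    "user": {"type": "string", "index": "not_analyzed"}
                },
            },
        }
    }
}

# Hypothetical search body: a terms facet run against the nested docs.
facet_request = {
    "query": {"match_all": {}},
    "facets": {
        "friend_names": {
            "terms": {"field": "friends.user"},
            "nested": "friends",
        }
    },
}

print(json.dumps(mapping, indent=2))
print(json.dumps(facet_request, indent=2))
```
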
(In reply to Andrew Clegg's original post of Monday, 11 June 2012 16:14:46 UTC+1.)


(Jack Chen) #3

I found this really useful in diagnosing why our facet performance was
horrible for one of our indexes. Thanks for posting this!

Hopefully there are plans to make this better without having to resort to
nested documents. I'm experimenting with indexing a string with tags
delimited by spaces and using the whitespace analyzer to see if this helps.
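
A minimal sketch of that experiment on the client side, assuming the tags themselves never contain whitespace. Python's `str.split()` stands in for the whitespace analyzer here, which is only an approximation, and the helper names and sample tags are made up. Whether this layout actually reduces facet memory is exactly what still needs measuring.

```python
def to_tag_string(tags):
    """Join tags into one space-delimited string for a single analyzed field."""
    for tag in tags:
        if any(ch.isspace() for ch in tag):
            # A tag like "new york" would be split into two tokens.
            raise ValueError(f"tag contains whitespace: {tag!r}")
    return " ".join(tags)


def whitespace_tokens(text):
    """Approximate whitespace tokenization: split on runs of whitespace."""
    return text.split()


doc = {"tags": to_tag_string(["foo", "bar", "baz"])}
print(doc)                             # {'tags': 'foo bar baz'}
print(whitespace_tokens(doc["tags"]))  # ['foo', 'bar', 'baz']
```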

