Mapping Size limitations

Hi,

I am using dynamic templates and I am dealing with a couple thousand
such dynamic fields. Each of these dynamic fields is an object with 2 to 3
subfields of type "byte". My question is: is there any performance penalty
for having a large mapping? Right now I have a couple thousand fields in the
mapping, but in the future it could grow to maybe 10k-20k fields. Will I see
performance degradation with a large mapping file? If so, what will be
affected? FYI, I am planning to use facets over these fields.
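
For reference, the dynamic template is roughly of this shape; the path pattern and field name below are just placeholders, not my real ones:

{
    "my_type": {
        "dynamic_templates": [
            {
                "byte_subfields": {
                    "path_match": "stats.*.*",
                    "mapping": { "type": "byte" }
                }
            }
        ]
    }
}

So every new object that shows up under "stats" gets its 2 to 3 byte subfields mapped this way, and each one adds new entries to the mapping.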

Thanks!

--

Hi,

The first thing that jumps out at me is, do you really need that many
fields? Are you able to give us a bit of information about what you're
doing and maybe we can come up with a way to do it without 20,000 fields.

--

It's nearly impossible to manage 20k fields. The reasons are: each field
consumes a few MB of resident memory in Lucene, and each field mapping
creation causes cluster blocks and propagation of the mapping settings
throughout the cluster nodes. Even if you manage to get 20k fields created,
constructing a query over 20k fields, plus the lookup time for each field's
settings and mappings, will eat up your performance.

Rule of thumb: facets are designed to perform well over a small number of
fields with a high cardinality of values. They do not perform well over a
high number of fields with low cardinality.

I would also be curious to learn about the scenario in which such a high
number of fields is required.

Jörg

--

Thanks Jörg and Chris for the quick replies. Let me explain the situation,
which should make clear what I am trying to accomplish. I am not giving the
exact domain example but a similar example in a different domain.

Scenario: Let's say I am storing a list of restaurant reviews from all over
the web. Each review document can have the following fields:

review_id (long)
review_ratings (object)
    aspect_1_name (string) : rating (float)
    aspect_2_name (string) : rating (float)
    aspect_3_name (string) : rating (float)
    ...
date (datetime)
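
So a concrete review might look roughly like this (values and aspect names made up):

{
    "review_id": 12345,
    "review_ratings": {
        "taste": { "rating": 4.5 },
        "smell": { "rating": 3.0 }
    },
    "date": "2012-11-01T10:30:00"
}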

Requirement: The goal is to be able to calculate facets on "review aspects"
and get the average rating given to each aspect across all reviews within a
given period. In this case, aspects can be things like "visual appeal of
dish", "taste", "smell", etc. Hypothetically, let's assume the number of
aspects can grow to 20k. Note that a single review might have only a dozen
or so aspects defined, but millions of reviews over a period might
collectively have thousands of different aspects. So here our mapping
becomes huge because of the "review_ratings" object.

Now, to achieve this, I can use a date histogram facet with the date field
as the key and the aspect's rating as the value. To get facets over, say,
100 aspects, I can create 100 facets, one per aspect. So at any one time I
will only be querying around 100 aspects to get their average ratings.
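
Roughly, one such facet request would look like this (the aspect name is just a placeholder, and I haven't verified the exact syntax):

{
    "query": { "match_all": {} },
    "facets": {
        "taste_by_month": {
            "date_histogram": {
                "key_field": "date",
                "value_field": "review_ratings.taste.rating",
                "interval": "month"
            }
        }
    }
}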

Now that you know a sample scenario, can you guys tell me if my approach is
correct, or whether I am doing something fundamentally wrong?

Thanks a lot again for the help, guys!
Vinay

--

Do you already have some histogram data exhibiting the 20k spread? I can see the possibility of 20k aspects, and I can also see a very long tail. Can the aspects be sub-grouped into doc_types, thereby reducing the set of aspects for each sub-group? E.g. 'jasmine flavor' won't apply to fries/steaks but is fine for tea.

You mean, create a separate index type for each subgroup? I can do that,
but that's the last resort I want to take, since it will require quite a lot
of code to make things look seamless to outside clients for both querying
and indexing.

Thanks for pitching that idea tho.

--

Practically, I can't see how you're going to be able to support so many
fields. Jörg is right: memory consumption is going to be immense, and things
will get unmanageable.

What if, along the lines of what es_learner suggested, you have a document
per aspect? Each document could consist of review_id, aspect_name, rating
and datetime. You could then filter on aspect_name to choose which aspects
you wanted to facet on.
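
As a rough sketch (field names made up), each aspect document might look like:

{ "review_id": 12345, "aspect_name": "taste", "rating": 4.5, "date": "2012-11-01" }

and then you could run, say, one statistical facet per aspect you care about, each with a facet_filter on aspect_name (untested, just to show the shape):

{
    "query": {
        "filtered": {
            "query": { "match_all": {} },
            "filter": { "range": { "date": { "from": "2012-01-01", "to": "2012-12-31" } } }
        }
    },
    "facets": {
        "taste_avg": {
            "statistical": { "field": "rating" },
            "facet_filter": { "term": { "aspect_name": "taste" } }
        },
        "smell_avg": {
            "statistical": { "field": "rating" },
            "facet_filter": { "term": { "aspect_name": "smell" } }
        }
    }
}

That keeps the mapping down to a handful of fields no matter how many aspects exist.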

--

hmm, let me tell you one thing I tried but it did not work:

I changed my doc from this

review_id (long)
review_ratings (object)
    aspect_1_name (string) : rating (float)
    aspect_2_name (string) : rating (float)
    aspect_3_name (string) : rating (float)
    ...
date (datetime)

to

review_id (long)
review_ratings (array)
[
    { "name": aspect_1_name (string), "rating": aspect_rating (float) },
    { "name": aspect_2_name (string), "rating": aspect_rating (float) },
    ...
]
date (datetime)

Then I applied a facet filter to filter aspects by their name, i.e.
"facet_filter": { "term": { "name": "aspect_1_name" } }, but the result was
calculated over all elements within the "review_ratings" array for every
review that had "aspect_1_name" anywhere in the review_ratings array.
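
The facet request was roughly this (from memory, so the exact shape may be slightly off):

{
    "query": { "match_all": {} },
    "facets": {
        "aspect_1_avg": {
            "statistical": { "field": "review_ratings.rating" },
            "facet_filter": { "term": { "review_ratings.name": "aspect_1_name" } }
        }
    }
}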

For example, take this rating_array:

{
    "rating_array": [
        { "name": "a", "rating": 3 },
        { "name": "b", "rating": 4 },
        { "name": "c", "rating": 5 }
    ]
}

If I filter by "name": "b", ES calculates the total as 12 and the mean as
12/3, rather than just returning a total of 4 and a mean of 4.

Can I get it to calculate over just the array elements whose name is
"aspect_1_name"? Or am I doing something wrong? :slight_smile:
thx!

--

Is your (new) Array of review_ratings a nested type?

http://www.elasticsearch.org/guide/reference/mapping/nested-type.html
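
If it isn't, something along these lines might do it. This is an untested sketch, and the exact interaction of "nested" facets with facet_filter is worth double-checking.

Mapping:

{
    "review": {
        "properties": {
            "review_ratings": {
                "type": "nested",
                "properties": {
                    "name": { "type": "string", "index": "not_analyzed" },
                    "rating": { "type": "float" }
                }
            }
        }
    }
}

Facet:

{
    "query": { "match_all": {} },
    "facets": {
        "aspect_1_avg": {
            "statistical": { "field": "review_ratings.rating" },
            "nested": "review_ratings",
            "facet_filter": { "term": { "review_ratings.name": "aspect_1_name" } }
        }
    }
}

With nested documents, each { name, rating } pair is matched and aggregated on its own, so the facet should only pick up the ratings for the aspect you asked for.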

Not right now, but that might be the solution I was looking for! :slight_smile:
Let me experiment with nested types.
Thanks a lot, es_learner!

--