Best mapping advice in storing logs for statistical facet aggregation

Kaspars_Sprogis · May 20, 2012, 10:19am

Hi,

We are building a project that will store approximately 10 million docs
per month.
Each document contains the following main keys:

Title (contains specific data that shouldn't be analyzed as
separate words, so I plan to use the Keyword analyzer) (String)
Date (DateOptionalTime)
Time. Time in seconds. (Integer)
UserID (Integer)

The aim of the project is to collect specific user data and let users of our
application define filters using simple "Begins with" and "Contains"
rules, combined with a date range on the Date field and a UserID. Based on
these rules we must query the data and return the aggregated sum of the Time
field.

Some questions regarding the mapping:

Is it good practice, and is it worthwhile, to use short field names to save
some storage space? For example title -> t, date -> d, time -> tm, and so on.

I didn't fully understand the store=yes mapping parameter. The docs say:
"Set to yes the store actual field in the index, no to not store it.
Defaults to no"
However, I don't understand the performance advantages and disadvantages of
this setting. In our case, if I want to aggregate the sum of Time using
Statistical Facets, should I use store=yes so that aggregation is faster
and the value is read from the index rather than from the store, or does it
not affect aggregation?

What other advice would you give to improve performance, keeping in
mind that aggregation will be done using Statistical Facets? Each
document will be quite small: just 5-6 keys, and the Title field will
have at most 300 characters.

The current index config and mapping are here: https://gist.github.com/2757570
We are still working on the mapping, so for development we are using a
dev server with no replicas at all.

Thank you.

kimchy · May 22, 2012, 10:19am

On Sun, May 20, 2012 at 12:19 PM, Kaspars Sprogis darklow@gmail.com wrote:

> Is it good practice, and is it worthwhile, to use short field names to save
> some storage space? For example title -> t, date -> d, time -> tm, and so
> on.

It doesn't matter much.

> I didn't fully understand the store=yes mapping parameter. The docs say:
> "Set to yes the store actual field in the index, no to not store it.
> Defaults to no"
> However, I don't understand the performance advantages and disadvantages of
> this setting. In our case, if I want to aggregate the sum of Time using
> Statistical Facets, should I use store=yes so that aggregation is faster
> and the value is read from the index rather than from the store, or does it
> not affect aggregation?

There is no need to set store to yes for facet aggregations. Setting store
to yes simply means that the field will be stored on its own, which can
come in handy when you are not storing the _source, for example.
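
To make this concrete, here is a minimal sketch in Python of a mapping that leaves store at its default, plus a search body that filters by user, date range and title prefix and sums the time field with a statistical facet. It assumes the facet and filtered-query DSL of the Elasticsearch versions current at the time of this thread; the type name "log", the field names, and the helper build_time_stats_request are illustrative assumptions, not taken from the gist.

```python
import json

# Hypothetical mapping for a "log" type. No field sets "store":
# the statistical facet works on the indexed values, so storing
# the field separately is not required for aggregation.
mapping = {
    "log": {
        "properties": {
            "title": {"type": "string", "analyzer": "keyword"},
            "date": {"type": "date", "format": "dateOptionalTime"},
            "time": {"type": "integer"},
            "user_id": {"type": "integer"},
        }
    }
}


def build_time_stats_request(user_id, date_from, date_to, title_prefix):
    """Build a search body that filters logs and aggregates the time field."""
    return {
        "size": 0,  # only the facet result is needed, not the hits
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "and": [
                        {"term": {"user_id": user_id}},
                        {"range": {"date": {"from": date_from, "to": date_to}}},
                        {"prefix": {"title": title_prefix}},
                    ]
                },
            }
        },
        # The "total" value in the facet response is the sum of "time".
        "facets": {"time_stats": {"statistical": {"field": "time"}}},
    }


if __name__ == "__main__":
    body = build_time_stats_request(42, "2012-05-01", "2012-05-31", "foo")
    print(json.dumps(body, indent=2))
```

The body would be sent to the index's _search endpoint; only the "Begins with" rule is shown here, since a "Contains" rule depends on how the title is tokenized.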

> What other advice would you give to improve performance, keeping in
> mind that aggregation will be done using Statistical Facets? Each
> document will be quite small: just 5-6 keys, and the Title field will
> have at most 300 characters.

Make sure you use rolling indices, like an index per month.
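
As a rough sketch of what an index per month could look like on the application side, the Python helpers below derive a monthly index name from a document's date and list the indices that a date-range query would need to touch. The "logs" prefix and the logs-YYYY-MM naming pattern are assumptions made for illustration; any consistent scheme would do.

```python
from datetime import date


def monthly_index_name(day, prefix="logs"):
    """Return the index a document dated `day` would be written to."""
    return f"{prefix}-{day.year:04d}-{day.month:02d}"


def indices_for_range(start, end, prefix="logs"):
    """Return the monthly indices covering the inclusive range start..end."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    # Walk month by month until the month containing `end` is included.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            month = 1
            year += 1
    return names
```

A search can then be limited to the relevant months by joining the returned names with commas in the search URL, so older indices are never touched by a recent date range.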

> The current index config and mapping are here: https://gist.github.com/2757570

I'm confused: I see you use ngram on the title. Just wanted to double-check.


Kaspars_Sprogis · May 22, 2012, 10:30am

Thanks for the reply, now I finally understand the store=yes parameter.
About indices: yes, I searched through a lot of docs and discussions and
found that rolling indices are what we need; aliases are really helpful
here.
About the config and nGram: my initial idea was that nGram was the right
tokenizer, because I also wanted to do wildcard searches on both sides.
However, I quickly realized it is not needed and the Pattern tokenizer does
the job well too (I just corrected the gist).
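
For anyone reading later, this is roughly how an alias can follow the rolling indices. It is a sketch in Python that only builds the request body for the _aliases endpoint; the alias name "logs-current", the index names, and the helper alias_rollover_actions are hypothetical.

```python
def alias_rollover_actions(alias, old_index, new_index):
    """Build an _aliases body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Example: at the start of June, point the write alias at the new month.
body = alias_rollover_actions("logs-current", "logs-2012-05", "logs-2012-06")
```

Posting this body to /_aliases applies both actions in one request, so the application can keep indexing into the alias without knowing the current month's index name.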

