Best mapping advice in storing logs for statistical facet aggregation


(Kaspars Sprogis) #1

Hi,

We are building a project where the approximate amount of data will be 10
million docs per month.
Each document contains four major keys:

  1. Title (containing specific data that shouldn't be analyzed as
    separate words, therefore I plan to use the Keyword analyzer) (String)
  2. Date (DateOptionalTime)
  3. Time. Time in seconds. (Integer)
  4. UserID (Integer)

The aim of the project is to collect specific user data and let users of our
application define filters using simple "Begin with" and "Contains" rules,
combined with a date range on the Date field and a UserID. Based on these
rules we must query the data and return the aggregated sum of the Time field.
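A request of the kind described above could be sketched roughly as follows. This is only an illustration, not the poster's actual query: the field names (title, date, time, user_id) and the facet name time_stats are assumptions, and the body uses the facet-era query DSL (filtered query, and filter, statistical facet), which later Elasticsearch versions replaced with aggregations. It also assumes the title is indexed as a single term, as the Keyword analyzer would do.

```python
from datetime import date


def build_time_sum_request(user_id, start, end, begins_with=None, contains=None):
    """Build a search body that sums `time` for one user within a date range."""
    # Title rule: a prefix query for "Begin with", a wildcard query for "Contains".
    # Field names here are hypothetical.
    if begins_with is not None:
        title_query = {"prefix": {"title": begins_with}}
    elif contains is not None:
        title_query = {"wildcard": {"title": "*" + contains + "*"}}
    else:
        title_query = {"match_all": {}}

    return {
        "size": 0,  # only the facet result is needed, not the hits
        "query": {
            "filtered": {
                "query": title_query,
                "filter": {
                    "and": [
                        {"term": {"user_id": user_id}},
                        {"range": {"date": {"from": start.isoformat(),
                                            "to": end.isoformat()}}},
                    ]
                },
            }
        },
        # The statistical facet reports count, total, min, max and mean;
        # "total" is the sum of the field over the matching documents.
        "facets": {"time_stats": {"statistical": {"field": "time"}}},
    }
```

The user and date restrictions sit inside the filtered query so that the facet is computed only over matching documents. Note that a wildcard pattern with a leading * has to scan the term dictionary and can be slow on large indices.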

Some questions regarding the mapping:

  1. Is it good practice, and is it worth it, to use short field names to save
    some storage space? For example title -> t, date -> d, time -> tm, and so on.
  2. I don't fully understand the store=yes mapping parameter. The docs say:
    "Set to yes the store actual field in the index, no to not store it.
    Defaults to no"

    However, I don't understand the performance advantages/disadvantages of
    this parameter. In our case, if I want to aggregate the sum of Time using
    Statistical Facets, should I use store=yes so that aggregation is faster
    and the value is taken from the index and not from the store, or does it
    not affect aggregation?
  3. What other advice could you give to improve performance, keeping in
    mind that aggregation will be done using Statistical Facets? A single
    document will be quite small: just 5-6 keys, and the Title field will
    have at most 300 characters.

The current index config and mapping are here: https://gist.github.com/2757570
We are still working on the mapping, so right now for development we are using
a dev server with no replication at all.

Thank you.


(Shay Banon) #2

On Sun, May 20, 2012 at 12:19 PM, Kaspars Sprogis darklow@gmail.com wrote:

  1. Is it good practice, and is it worth it, to use short field names to save
    some storage space? For example title -> t, date -> d, time -> tm, and so
    on.

It doesn't matter much.

  2. I don't fully understand the store=yes mapping parameter. The docs say:
    "Set to yes the store actual field in the index, no to not store it.
    Defaults to no"

    However, I don't understand the performance advantages/disadvantages of
    this parameter. In our case, if I want to aggregate the sum of Time using
    Statistical Facets, should I use store=yes so that aggregation is faster
    and the value is taken from the index and not from the store, or does it
    not affect aggregation?

There is no need to set store to yes for any of the facet aggregations.
Setting store to yes simply means that the field will be stored on its own,
which can come in handy when you are not storing the _source, for example.
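A mapping in line with this answer might look like the sketch below. It is not the mapping from the gist: the type name (log) and the field names are assumptions made for illustration. The point it shows is that no field sets store, since the facet does not need it and the original JSON remains available through _source.

```python
import json

# Hypothetical type name ("log") and field names; none of them come from the gist.
LOG_MAPPING = {
    "log": {
        # _source is kept (the default), so the original JSON can still be returned.
        "_source": {"enabled": True},
        "properties": {
            # The keyword analyzer keeps the whole title as a single term.
            "title": {"type": "string", "analyzer": "keyword"},
            "date": {"type": "date", "format": "dateOptionalTime"},
            # No "store" key on any field: it defaults to "no".
            "time": {"type": "integer"},
            "user_id": {"type": "integer"},
        },
    }
}

print(json.dumps(LOG_MAPPING, indent=2))
```

Setting store to yes on individual fields would mainly become interesting if _source were disabled and specific fields still had to be returned with the hits.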

  3. What other advice could you give to improve performance, keeping in
    mind that aggregation will be done using Statistical Facets? A single
    document will be quite small: just 5-6 keys, and the Title field will
    have at most 300 characters.

Make sure you use rolling indices, like an index per month.
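As a rough sketch of what an index per month can mean on the client side, the helpers below derive the index name from a document's date and list the indices a date-range query has to touch. The logs-YYYY-MM naming scheme and both function names are assumptions, not something prescribed in this thread.

```python
from datetime import date


def monthly_index(day, prefix="logs"):
    """Return the index name for the month that contains `day`."""
    return "%s-%04d-%02d" % (prefix, day.year, day.month)


def indices_for_range(start, end, prefix="logs"):
    """List the monthly indices a query over [start, end] has to touch."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append("%s-%04d-%02d" % (prefix, year, month))
        # After December, continue with January of the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names
```

A document dated 2012-05-20 would then be indexed into monthly_index(date(2012, 5, 20)), and a search over several months can name the listed indices, comma-separated, in the search URL, so that months outside the requested range are never touched.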

Current index config and mapping is here: https://gist.github.com/2757570

I'm confused: I see you use ngram on the title. I just wanted to double-check.

We are still working on the mapping, so right now for development we are using
a dev server with no replication at all.

Thank you.


(Kaspars Sprogis) #3

Thanks for the reply, now I finally understand the store=yes parameter.
About indices: yes, I searched through a lot of docs and discussions and found
out that rolling indices are what we need; aliases are really helpful here.
About the config and nGram: it was just an idea that nGram was the right
tokenizer, because I also wanted to do wildcard search on both sides.
However, I quickly realized it is not needed and the Pattern tokenizer does
the job quite well too (I just corrected the gist).
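To illustrate how aliases fit with rolling indices, here is a minimal sketch that builds the body for the index aliases API (POST /_aliases). The alias and index names are hypothetical, and the helper name is made up for this example.

```python
def roll_alias_actions(alias, new_index, old_index=None):
    """Build the /_aliases body that points `alias` at `new_index`."""
    actions = []
    if old_index is not None:
        # Detach the alias from the previous month's index first.
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Example: move a hypothetical "logs-current" alias from May to June 2012.
body = roll_alias_actions("logs-current", "logs-2012-06", old_index="logs-2012-05")
```

Sending the remove and add actions in one request lets the application keep writing to a single alias name while the underlying monthly index changes.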

On Tuesday, May 22, 2012 1:19:54 PM UTC+3, kimchy wrote:


