We are building a project that is expected to ingest roughly 10 million
documents per month.
Each document contains four main fields:
Title (String): specific data that shouldn't be analyzed as separate
words, so I plan to use the Keyword analyzer
Date (DateOptionalTime)
Time (Integer): time in seconds
UserID (Integer)
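For reference, the field list above could map to something like the following. This is only a minimal sketch in the string-type mapping syntax of the Elasticsearch versions current at the time of this thread; the type name `entry` and the lowercase field names are assumptions, and the real mapping is the one in the gist linked below.

```python
import json

# Hypothetical type name; not taken from the thread or the gist.
DOC_TYPE = "entry"

# Hypothetical field names mirroring Title, Date, Time and UserID.
mapping = {
    "mappings": {
        DOC_TYPE: {
            "properties": {
                # The keyword analyzer keeps the whole title as one term.
                "title": {"type": "string", "analyzer": "keyword"},
                "date": {"type": "date", "format": "dateOptionalTime"},
                "time": {"type": "integer"},  # time in seconds
                "user_id": {"type": "integer"},
            }
        }
    }
}

# This JSON would be sent as the body of an index-creation request.
print(json.dumps(mapping, indent=2))
```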
The aim of the project is to collect specific user data and let users of
our application define filters using simple "Begin with" and "Contains"
rules, combined with a date range on the Date field and a UserID. Based on
these rules we must query the data and return the aggregated sum of the
Time field.
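To make the requirement concrete, here is a rough sketch of how such a request body could be assembled. It assumes the rules apply to a keyword-analyzed field named `title` and that the other fields are named `date`, `time` and `user_id` (all hypothetical names), and it uses the facet-era query DSL this thread refers to; later Elasticsearch versions replaced facets with aggregations.

```python
import json

def build_search(user_id, date_from, date_to, begins_with=None, contains=None):
    """Build a search body: one title rule, user/date filters, statistical facet."""
    if begins_with is not None:
        # "Begin with" rule: prefix match against the single keyword term.
        title_query = {"prefix": {"title": begins_with}}
    elif contains is not None:
        # "Contains" rule: wildcard on both sides (can be slow on large indices).
        title_query = {"wildcard": {"title": "*" + contains + "*"}}
    else:
        title_query = {"match_all": {}}
    return {
        "size": 0,  # only the facet result is needed, not the hits
        "query": {
            "filtered": {
                "query": title_query,
                "filter": {
                    "and": [
                        {"term": {"user_id": user_id}},
                        {"range": {"date": {"from": date_from, "to": date_to}}},
                    ]
                },
            }
        },
        # The statistical facet reports count, min, max, mean and the sum.
        "facets": {"time_stats": {"statistical": {"field": "time"}}},
    }

body = build_search(42, "2012-05-01", "2012-05-31", begins_with="Report")
print(json.dumps(body, indent=2))
```

In the response, the sum of Time should appear under the facet's `total` key.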
Some questions regarding the mapping:
Is it good practice, and is it worthwhile, to use short field names to
save storage space? For example title -> t, date -> d, time -> tm, and so
on.
I don't fully understand the store=yes mapping parameter. The docs say:
"Set to yes to store actual field in the index, no to not store it.
Defaults to no"
However, I don't understand the performance advantages and disadvantages
of this setting. In our case, if I want to aggregate the sum of Time using
Statistical Facets, should I use store=yes so that the aggregation is
faster, with the value read from the index rather than from the store, or
does it not affect aggregation?
What other advice can you offer for improving performance, keeping in
mind that aggregation with Statistical Facets will be required? Each
document will be quite small: just 5-6 keys, with a Title field of at most
300 characters.
The current index config and mapping are here: https://gist.github.com/2757570
We are still working on the mapping, so for development we are using a
dev server with no replication at all.
On Sun, May 20, 2012 at 12:19 PM, Kaspars Sprogis darklow@gmail.com wrote:
> Some questions regarding the mapping:
> Is it good practice, and is it worthwhile, to use short field names to
> save storage space? For example title -> t, date -> d, time -> tm, and
> so on.
It doesn't matter much.
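A back-of-the-envelope estimate gives a feel for the scale involved. This is only a sketch under stated assumptions, not the reasoning behind the answer above: it measures raw JSON bytes for a made-up sample document, ignores compression and index structures, and uses hypothetical long and short field names.

```python
import json

DOCS_PER_MONTH = 10_000_000  # volume quoted in the original post

# Made-up sample values; only the key names differ between the two documents.
long_doc = {"title": "example", "date": "2012-05-20T12:19:00", "time": 90, "user_id": 7}
short_doc = {"t": "example", "d": "2012-05-20T12:19:00", "tm": 90, "u": 7}

def json_size(doc):
    """Size in bytes of the compact JSON encoding of a document."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))

saved_per_doc = json_size(long_doc) - json_size(short_doc)
saved_per_month_mb = saved_per_doc * DOCS_PER_MONTH / 1e6

print(saved_per_doc, "bytes per document")
print(saved_per_month_mb, "MB per month of raw JSON")
```

With these names the difference is 15 bytes per document, or about 150 MB of raw JSON per month at the stated volume.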
> I don't fully understand the store=yes mapping parameter. The docs say:
> "Set to yes to store actual field in the index, no to not store it.
> Defaults to no"
> However, I don't understand the performance advantages and
> disadvantages of this setting. In our case, if I want to aggregate the
> sum of Time using Statistical Facets, should I use store=yes so that the
> aggregation is faster, with the value read from the index rather than
> from the store, or does it not affect aggregation?
There is no need to set store to yes for any of the facet aggregations.
Setting store to yes simply means that the field will be stored on its
own, which can come in handy when you are not storing the _source, for
example.
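As an illustration of that case, here is a sketch of a mapping with _source disabled and one field stored on its own, in the mapping syntax of that era. The type name `entry` and the field names are hypothetical.

```python
import json

# Hypothetical type and field names, for illustration only.
mapping_without_source = {
    "entry": {
        # With _source disabled, the original JSON document is not kept.
        "_source": {"enabled": False},
        "properties": {
            # Stored on its own, so it can still be returned with search hits.
            "title": {"type": "string", "analyzer": "keyword", "store": "yes"},
            # Not stored: facets work on the indexed values, so this is enough.
            "time": {"type": "integer"},
        },
    }
}

# A search that asks for the stored field explicitly instead of _source.
search_body = {"query": {"match_all": {}}, "fields": ["title"]}

print(json.dumps(mapping_without_source, indent=2))
print(json.dumps(search_body))
```

If _source stays enabled, which is the default, stored fields are usually unnecessary because the values can be read back from the stored document.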
> What other advice can you offer for improving performance, keeping in
> mind that aggregation with Statistical Facets will be required? Each
> document will be quite small: just 5-6 keys, with a Title field of at
> most 300 characters.
Make sure you use rolling indices, such as one index per month.
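One way to picture an index-per-month scheme is a small helper that derives the index name from a document's date and lists the indices a date-range query has to touch. This is a sketch only; the `docs` prefix and the `docs-YYYY-MM` naming pattern are assumptions, not something specified in the thread.

```python
from datetime import date

INDEX_PREFIX = "docs"  # hypothetical prefix

def index_for(day):
    """Name of the monthly index a document dated `day` would be written to."""
    return "%s-%04d-%02d" % (INDEX_PREFIX, day.year, day.month)

def indices_for_range(start, end):
    """Names of all monthly indices overlapping the inclusive range start..end."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append("%s-%04d-%02d" % (INDEX_PREFIX, year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

print(index_for(date(2012, 5, 20)))
print(",".join(indices_for_range(date(2011, 11, 3), date(2012, 2, 1))))
```

Elasticsearch accepts a comma-separated list of index names in the search URL, so the joined string can be used directly as the index part of the request path.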
Thanks for the reply; now I finally understand the store=yes parameter.
About indices: yes, I searched through a lot of docs and discussions and
found that rolling indices are what we need, and aliases are really
helpful here.
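A sketch of how an alias could be moved as monthly indices roll over, built as the request body for the `_aliases` endpoint. The alias name `docs-recent` and the index names are hypothetical.

```python
import json

def alias_actions(alias, add_indices, remove_indices=()):
    """Build an _aliases request body that detaches old indices and attaches new ones."""
    actions = [{"remove": {"index": name, "alias": alias}} for name in remove_indices]
    actions += [{"add": {"index": name, "alias": alias}} for name in add_indices]
    return {"actions": actions}

# Hypothetical rollover: the alias drops March and picks up June.
body = alias_actions(
    "docs-recent",
    add_indices=["docs-2012-06"],
    remove_indices=["docs-2012-03"],
)
print(json.dumps(body, indent=2))
```

The application can then always search `docs-recent` without needing to know which monthly indices currently sit behind it.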
About the config and nGram: it was just an idea that nGram was the right
tokenizer, because I also wanted to do wildcard search on both sides.
However, I quickly realized it is not needed and the Pattern tokenizer
does the job well too (I just corrected the gist).
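For context on the nGram idea, the following plain-Python sketch shows what an n-gram tokenizer conceptually emits and why it can turn a "contains" search into an exact term lookup. It is an illustration of the concept, not Elasticsearch's implementation, and the gram sizes are arbitrary.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of `text` whose length is in min_gram..max_gram."""
    grams = set()
    for n in range(min_gram, max_gram + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

terms = ngrams("hello", 2, 3)
# A "contains" check becomes a set lookup, as long as the search string's
# length falls within the configured gram sizes.
print("ell" in terms)
print(sorted(terms))
```

The trade-off is index size: every title is expanded into many terms, which is one reason to drop n-grams when two-sided wildcard search is not actually required.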
(In reply to kimchy's message of Tuesday, May 22, 2012 1:19:54 PM UTC+3.)