Hello,
Currently I have one index (per day) which contains logs from several applications. The size is ~50-80 GB/day.
We often search/aggregate documents by application.
So would it be better to split this index into smaller indexes (going from 1 index to 10-15 indexes of about 2-10 GB each)?
Would response time improve a lot (considering that I can cache 64 GB in memory)?
And what would be the penalty for cross-application searches?
Another concern: it seems that Elasticsearch doesn't like many indexes.
With the split, I could end up with 400 indexes per month.
Is that too much?
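To illustrate what I mean (just a rough sketch with the Python client; the index names are made up): with the split, a per-application search only touches one small index, but a cross-application search has to fan out over all of them.

```python
# Rough sketch with the official Python client; index names are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Today: one daily index shared by all applications, filtered per application.
es.search(index="logs-2015.06.01",
          body={"query": {"term": {"application": "app1"}}})

# After the split: one index per application and per day,
# so a per-application query targets only its own small index...
es.search(index="logs-app1-2015.06.01",
          body={"query": {"match_all": {}}})

# ...while a cross-application search has to fan out over all of them
# (comma-separated list or wildcard pattern).
es.search(index="logs-*-2015.06.01",
          body={"query": {"match": {"message": "error"}}})
```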
The most important figure is the total number of shards per node.
Having too many shards on a node could lead to some issues (file descriptors, memory...). So, at some point, if you need to manage more shards, you should think of adding more nodes.
1 index with 5 shards is exactly the same as 5 indexes with 1 shard each.
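For example (a minimal sketch with the Python client; the names and settings are only illustrative), both of the layouts below put the same 5 primary shards on the cluster:

```python
# Minimal sketch with the official Python client; names/settings are examples.
from elasticsearch import Elasticsearch

es = Elasticsearch()

# One daily index with 5 primary shards...
es.indices.create(index="logs-2015.06.01",
                  body={"settings": {"number_of_shards": 5,
                                     "number_of_replicas": 1}})

# ...or 5 per-application daily indexes with 1 primary shard each.
for app in ["app1", "app2", "app3", "app4", "app5"]:
    es.indices.create(index="logs-%s-2015.06.01" % app,
                      body={"settings": {"number_of_shards": 1,
                                         "number_of_replicas": 1}})

# Either way, each node ends up holding the same number of shards,
# and that per-node shard count is what you have to watch.
```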
Hope this helps.
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
If you have 50-80 GB/day, you need quite a number of machines. Smaller indexes and a higher shard count on the same number of machines do not help; search performance will be worse.
ES is fine with many indexes. Use big indexes that span many machines, and once an index is complete, run an optimize on it for faster search.
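Something along these lines (a sketch with the Python client; the index name is just an example), once yesterday's index no longer receives writes:

```python
# Sketch with the official Python client; the index name is just an example.
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Once the daily index is complete (no more writes), merge it down to a
# single segment so searches have fewer per-segment structures to visit.
# Older releases expose this as the "optimize" API; recent clients call it
# "force merge".
es.indices.forcemerge(index="logs-2015.06.01", max_num_segments=1)
```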
Well, thank you.
Based on the answers, I understand this: put everything in one big index, with one shard per server.
When the shards get too big, add another server.
Coming from the DBMS world, this is "strange" to me.
For example, in MySQL we create one table per application, so each table is faster to scan/query, including its indexes (full-text or not).
This way, we don't keep the other tables in memory if they are never accessed.
If I put everything in one big table, the indexes are bigger and more expensive to query.
Admin operations take longer and impact all logging applications.
So I thought that by isolating independent data into multiple indexes, the local queries would be faster.
If I have 3 servers with 64 GB of memory each, 1 shard/day/server with 1 month of history, and the majority of queries hitting the last 7 days:
Do the last 7 days of shards on each server have to be fully cached in memory for good response times?
Even if I really only query 10% of the data from the last 7 days, and the rest only rarely?
> Based on the answers, I understand this: put everything in one big index, with one shard per server.
> When the shards get too big, add another server.
If you have one shard per node, adding a new node will have no effect...
If you add new shards, so that you have more than one shard per node, then adding new nodes will help.
> If I have 3 servers with 64 GB of memory each, 1 shard/day/server with 1 month of history, and the majority of queries hitting the last 7 days:
> Do the last 7 days of shards on each server have to be fully cached in memory for good response times?
Actually, Elasticsearch doesn't cache the full data, only filter bitsets, fielddata, etc. The OS cache will cache the Lucene files efficiently.
> Even if I really only query 10% of the data from the last 7 days, and the rest only rarely?
If you don't query all shards (for example, only the last day's index/shards), you might end up using less memory.
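For example (a sketch with the Python client; index names are illustrative), restricting the search to the daily indexes you actually need keeps the older shards out of the hot path:

```python
# Sketch with the official Python client; index names are illustrative.
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Build the list of daily indexes for the last 7 days and query only those,
# so the shards holding the older days are never touched by this search.
last_7_days = ",".join(
    "logs-%s" % (date.today() - timedelta(days=n)).strftime("%Y.%m.%d")
    for n in range(7))

es.search(index=last_7_days,
          body={"query": {"match": {"message": "error"}}})
```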