On Sat, Jun 25, 2011 at 7:51 PM, Clinton Gormley firstname.lastname@example.org wrote:
On Sat, 2011-06-25 at 19:17 +0530, Hari Shankar wrote:
Thanks a lot for responding. In fact I'd need to query all-time data,
so maybe instead of time-based groups, we'd have to develop some other
grouping strategy to split indices, maybe grouping customers such that
the total index size of each group is approximately the same.
Index aliases can point to more than one index (for read purposes). So
you could have two aliases:
- index_read: [index_2009, index_2010, index_2011]
- index_write: index_2011
So your app would always write to 'index_write' and read from
'index_read'. All you need to do then is to create a new index and
update the aliases once a year.
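
A minimal sketch of that yearly alias update, assuming the alias and index names from the example above. The helper name `yearly_rollover_actions` and its parameters are made up for illustration; it only builds the JSON body that would be POSTed to the `_aliases` endpoint, and it assumes the new index has already been created.

```python
def yearly_rollover_actions(new_year, read_alias="index_read",
                            write_alias="index_write", prefix="index_"):
    """Build the request body for POST /_aliases (hypothetical helper).

    Moves the write alias from last year's index to the new one and
    adds the new index to the read alias, all in a single request.
    """
    old_index = f"{prefix}{new_year - 1}"
    new_index = f"{prefix}{new_year}"
    return {
        "actions": [
            # Stop writing to last year's index.
            {"remove": {"index": old_index, "alias": write_alias}},
            # Start writing to the new index.
            {"add": {"index": new_index, "alias": write_alias}},
            # Keep all years searchable through the read alias.
            {"add": {"index": new_index, "alias": read_alias}},
        ]
    }
```

Calling `yearly_rollover_actions(2012)` produces actions that drop 'index_write' from index_2011 and add index_2012 to both aliases, so the app keeps using the same two alias names.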
Won't this reduce read/search efficiency, since it now has to query more
indices and merge the results, e.g. if I have to sort? Or is this overhead small?
The biggest issue I am facing with having a single large index is
update times: it takes almost a minute to update 200,000 records. I
have replication set to 3, but reducing it to 1 did not have a huge
impact on update times. I have 7 shards on 7 machines currently. What
other parameters can I tweak to improve update times?
You could increase the refresh_interval from its default of 1 second, add
more primary shards, play with the translog settings, get faster
machines... I'm running out of options here.
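
One way to apply the refresh_interval suggestion only while a bulk load runs can be sketched as follows. This is an assumption-laden sketch, not a tested recipe: `put_settings` is a hypothetical callable standing in for whatever client call sends a body to the index's `_settings` endpoint, and the restore value of "1s" assumes the default mentioned above.

```python
from contextlib import contextmanager

@contextmanager
def refresh_disabled(put_settings, index, restore="1s"):
    """Disable periodic refresh on `index` for the duration of a bulk load.

    `put_settings(index, body)` is a hypothetical callable that sends
    `body` to the index settings endpoint.
    """
    # -1 turns periodic refresh off.
    put_settings(index, {"index": {"refresh_interval": "-1"}})
    try:
        yield
    finally:
        # Restore the interval even if the load fails part-way.
        put_settings(index, {"index": {"refresh_interval": restore}})
```

Usage would look like `with refresh_disabled(put_settings, "index_2011"): run_bulk_load()`, where `run_bulk_load` is again a placeholder for the actual indexing code.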
I tried setting refresh_interval to -1 but it did not seem to have much of
an impact. I was using the bulk indexer Java API. I guess the bulk indexer
handles the refresh itself? Can I force it to use all cores of the CPU?
I did an htop on all machines and found that only one CPU was usually being
used ~100% whereas others were relatively idle. Increasing ES_MAX_MEM helped
until a point, but now it has no impact so I guess RAM is sufficient.
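
If the single busy core turns out to be the client sending bulk requests one after another (an assumption; the thread does not establish where the bottleneck is), one thing to try is sending several bulk batches concurrently. A rough sketch in Python, where `send_bulk` is a hypothetical callable that submits one batch, and the batch size and worker count are arbitrary starting points:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(records, size):
    """Yield consecutive slices of `records` with at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def index_concurrently(records, send_bulk, batch_size=1000, workers=4):
    """Send batches through `send_bulk` from several threads.

    `send_bulk(batch)` is a hypothetical callable that submits one bulk
    request and returns its response. Results come back in batch order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_bulk, chunked(records, batch_size)))
```

Whether this helps depends on where the time actually goes; if the nodes themselves are the limit, more concurrent requests will not speed things up.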
One thing I wanted to confirm: which will be more advantageous for
improving indexing speed, adding more machines but fixing the number of
shards at 7, or adding more machines and also increasing the number of shards?
Can I make replication asynchronous?
3,200 records per second sounds quite fast to me, but maybe I'm not
aiming high enough.
Yeah, I know... it is fast if you ask me, considering this is an index and
not a DB, but who can explain that to the marketing guys? Our earlier
system was completely based on an RDBMS; now we are trying to move to HBase
(for writes and batch processing) + ES (for the front-end reads). On most
fronts the new system seems to be performing much better than the older one
(as expected), but there are a few areas where the earlier system seems
stronger, like updates and joins. We are just trying to close the
gap in these areas as much as possible.
Thanks for your time.