Hello,
I have been working on a Cassandra river that triggers periodically and
indexes all data in a Cassandra column family. The implementation for now
spawns 10 threads and processes 10k documents (with 13 columns each) per thread.
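
To make the threading pattern concrete, here is a minimal sketch of it, not the actual river code. The class name, the batches() helper and the indexBatch() stand-in are hypothetical; a real river would read rows from Cassandra and send a bulk request where indexBatch() merely counts rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RiverBatchSketch {
    static final int THREADS = 10;        // worker threads, as described above
    static final int BATCH_SIZE = 10_000; // documents handed to a thread at a time

    /** Splits [0, totalRows) into consecutive [start, end) ranges of at most batchSize rows. */
    static List<int[]> batches(int totalRows, int batchSize) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < totalRows; start += batchSize) {
            out.add(new int[] {start, Math.min(start + batchSize, totalRows)});
        }
        return out;
    }

    /** Hypothetical stand-in for "read these rows from Cassandra and bulk-index them". */
    static int indexBatch(int start, int end) {
        return end - start; // pretend every row in the range was indexed
    }

    /** Runs all batches on a fixed pool of THREADS workers and returns the number of rows indexed. */
    static long run(int totalRows) {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int[] b : batches(totalRows, BATCH_SIZE)) {
                results.add(pool.submit(() -> indexBatch(b[0], b[1])));
            }
            long indexed = 0;
            for (Future<Integer> f : results) {
                indexed += f.get(); // blocks until that batch has finished
            }
            return indexed;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("rows indexed: " + run(100_000));
    }
}
```
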
The performance was very good initially: it indexed 1M documents in 10 minutes.
But after about an hour, indexing became very slow; it has indexed around 8M
documents so far. I am trying to index a total of 50M documents.
I have attached a screenshot of the memory and CPU usage. I noticed that
a lot of merge threads were spawned, which reduced the speed considerably:
"elasticsearch[Doppelganger][[prodinfo][1]: Lucene Merge Thread #329]"
daemon prio=10 tid=0x2a630000 nid=0x4c28 runnable [0x246bd000]
So, I believe this has to do with some configuration that I can tweak to
improve bulk indexing. I am running 1 node with 5 shards, ES_HEAP_SIZE set
to 2GB, and no replicas for now.
Shay mentioned some tips in 2011 here:
https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/APWxRLrMOeU
I wanted to know whether there have been any bulk indexing performance
improvements since then.
I am also using bulk.execute().addListener() (async) in place of
bulk.execute().actionGet() (sync).
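
One consequence of the async variant is that nothing waits for a response before the next bulk request goes out. The sketch below shows one way to cap the number of unanswered requests with a Semaphore. This is an assumption for illustration, not what the river currently does: sendBulkAsync() is a hypothetical stand-in built on CompletableFuture, not the Elasticsearch client API, and the class and parameter names are made up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class AsyncBulkSketch {
    /** Hypothetical stand-in for an async bulk call; completes on a worker thread with the doc count. */
    static CompletableFuture<Integer> sendBulkAsync(ExecutorService es, int docs) {
        return CompletableFuture.supplyAsync(() -> docs, es);
    }

    /**
     * Sends `requests` bulk requests while never allowing more than maxInFlight
     * unanswered ones. Returns {documents acknowledged, peak requests in flight}.
     */
    static int[] run(int requests, int docsPerRequest, int maxInFlight) {
        ExecutorService es = Executors.newFixedThreadPool(4);
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger acked = new AtomicInteger();
        try {
            for (int i = 0; i < requests; i++) {
                permits.acquireUninterruptibly(); // blocks once maxInFlight requests are pending
                peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                sendBulkAsync(es, docsPerRequest).whenComplete((docs, err) -> {
                    if (err == null) {
                        acked.addAndGet(docs);
                    }
                    inFlight.decrementAndGet();
                    permits.release(); // the listener frees a slot for the next request
                });
            }
            // Taking every permit back means every callback has run.
            permits.acquireUninterruptibly(maxInFlight);
            return new int[] {acked.get(), peak.get()};
        } finally {
            es.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] result = run(50, 1_000, 3);
        System.out.println("acked=" + result[0] + " peakInFlight=" + result[1]);
    }
}
```

With maxInFlight set to 1 this behaves like the sync version, and larger values allow more overlap.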
I am planning to share the cassandra-river as soon as it achieves acceptable
performance.
Thanks,
-Utkarsh