I have been working on a Cassandra river which triggers periodically and
indexes all the data in a Cassandra column family. The implementation for now
spawns 10 threads and processes 10k documents (with 13 columns each) per thread.
Initially the performance was very good: it indexed 1M documents in 10 minutes.
But after about an hour, indexing became very slow, and it has indexed around
8M documents so far. I am trying to index a total of 50M documents.
I have attached a screenshot of the memory and CPU usage. What I noticed
was that a lot of merge threads were spawned, which reduced the speed considerably:
"elasticsearch[Doppelganger][[prodinfo]: Lucene Merge Thread #329]"
daemon prio=10 tid=0x2a630000 nid=0x4c28 runnable [0x246bd000]
So I believe this has to do with some configuration that I can tweak to
improve bulk indexing. I am running 1 node with 5 shards, an ES_HEAP_SIZE
of 2GB, and no replicas for now.
Shay mentioned some tips here:
I wanted to know whether there are any other bulk indexing performance improvements I could apply.
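In case it helps frame the question: one general idea for bulk loads is to cap how many bulk requests are in flight at once, so the producer slows down when the node falls behind (for example while merges are running). The following is only a minimal sketch of that idea, not the river's actual code. `sendBulkAsync` is a made-up stand-in for the real bulk call, and the pool size, the cap and the 5 ms delay are arbitrary assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedBulkSubmitter {

    // Hypothetical stand-in for an async bulk call; a real river would
    // send a bulk request to Elasticsearch here instead of sleeping.
    static CompletableFuture<Void> sendBulkAsync(int batchId, ExecutorService pool) {
        return CompletableFuture.runAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(5); // pretend the node is busy indexing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, pool);
    }

    // Submits totalBatches batches while never allowing more than
    // maxInFlight unfinished requests. Returns {completed, peakInFlight}.
    static int[] run(int totalBatches, int maxInFlight) {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        try {
            for (int i = 0; i < totalBatches; i++) {
                permits.acquire(); // blocks the producer while the cap is reached
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                sendBulkAsync(i, pool).whenComplete((ok, err) -> {
                    inFlight.decrementAndGet();
                    completed.incrementAndGet();
                    permits.release(); // free the slot on success or failure
                });
            }
            permits.acquire(maxInFlight); // wait for the remaining requests
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while submitting", e);
        } finally {
            pool.shutdown();
        }
        return new int[] {completed.get(), peak.get()};
    }

    public static void main(String[] args) {
        int[] result = run(50, 4);
        System.out.println("completed=" + result[0] + " peakInFlight=" + result[1]);
    }
}
```

The semaphore is released in the completion callback, so a slow node automatically throttles the threads that read from Cassandra instead of letting requests queue up without limit.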
I am also using: bulk.execute().addListener() (async) in place of
I am planning to share the cassandra-river as soon as it achieves acceptable