I am trying to index 80M tweets with a bulk loader that I wrote using
the Java API on 0.16.4.
When I get to about 2M tweets, indexing performance drops from 4K
indexed tweets per second to under 300 tweets per second. CPU load
goes from 30% to 3%. Heap memory bounces between 300MB and 600MB and
then spikes to 900MB.
Why are you setting index.engine.robin.refresh_interval? Don't set it (you are actually using a refresh interval of 10 milliseconds...). Setting index.refresh_interval is enough.
How many threads are you indexing with?
I assume it's on a single server? Merging will start to kick in and slow things down a bit; that's how it works...
How do you index the data? Which client?
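
For illustration, a minimal sketch of what the suggestion above might look like in elasticsearch.yml, assuming index-level defaults are set in the node config file. The 30s value is an assumption chosen for a bulk load, not something stated in this thread; pick whatever interval suits your search-freshness needs.

```yaml
# Hypothetical elasticsearch.yml fragment (values are assumptions, not from the thread).
# Leave index.engine.robin.refresh_interval unset, as advised above,
# and control refresh frequency with the index-level setting only.
index.refresh_interval: 30s
```

A longer interval means newly indexed tweets take longer to become searchable, which is usually an acceptable trade-off while a bulk load is running.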
On Friday, July 15, 2011 at 2:51 AM, Derrick wrote:
BTW, I noticed over 3000 open files, mostly segment files of the form: