Yet another day playing with this awesome product.
I have a question regarding bulk indexing.
Right now, we have three nodes, running with 22 GB of RAM devoted to ES.
Our docs are big, with between 300 and 500 fields (let's say an average of
400), several nested structures, and many analyzed strings.
We are storing the _source, but are not indexing the _all field.
We have an indexing batch job in Java, in which we use, of course, the bulk
API to improve performance.
Right now, our bulks, due to factors outside of our control, may vary in
size between 2000 and 5000 of these docs per bulk.
Of course, the refresh_interval is disabled (-1).
Our performance lies somewhere between 2 and 4 minutes per bulk.
I have read a lot about indexing speeds of several thousand docs per second,
and we are pretty far from there.
So, is there something we are doing wrong? Are those times due to the
complexity of our docs?
Thanks in advance!
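For reference, translating those bulk times into docs per second (just back-of-envelope arithmetic from the numbers above, with a hypothetical helper):

```java
// Back-of-envelope throughput from the bulk sizes and times quoted above.
public class BulkThroughput {
    // Convert "docs per bulk" and "minutes per bulk" into docs per second.
    public static double docsPerSecond(int docsPerBulk, double minutesPerBulk) {
        return docsPerBulk / (minutesPerBulk * 60.0);
    }

    public static void main(String[] args) {
        // Worst case: smallest bulk (2000 docs) at the slowest time (4 min).
        System.out.printf("worst: %.1f docs/s%n", docsPerSecond(2000, 4.0));
        // Best case: largest bulk (5000 docs) at the fastest time (2 min).
        System.out.printf("best:  %.1f docs/s%n", docsPerSecond(5000, 2.0));
    }
}
```

That works out to roughly 8 to 42 docs/s, which indeed is far from the "thousands per second" figures quoted around the web.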
Well, the code is pretty straightforward.
It's a loop that reads a JSON doc from a NoSQL database, fiddles with it a
bit, and then puts it in a bulk.
After the loop, the bulk is executed - rinse and repeat.
The number of docs in the bulk varies because each doc has to be indexed
2-5 times, each time with altered fields, and into 2 different
indices.
Those 2000-5000 docs always come from 1000 original docs.
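For the record, the loop described above looks roughly like this sketch. The expansion factor and the bulk executor are hypothetical stand-ins (the real code would call the NoSQL reader and the Elasticsearch client), but it shows the shape: expand each original doc into its variants, accumulate them, and flush a bulk whenever a size cap is reached instead of letting the bulk grow unbounded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the indexing loop: each original doc expands into several
// variants, which are accumulated and flushed in capped bulks.
public class BulkLoop {
    // Hypothetical stand-in: each original doc becomes 2 altered copies
    // (the real job produces 2-5 variants across 2 indices).
    static List<String> expand(String doc) {
        List<String> variants = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            variants.add(doc + "#v" + i);
        }
        return variants;
    }

    // Accumulate expanded docs; hand each full bulk to the executor
    // (in the real code, the Elasticsearch bulk API call goes there).
    static int runBatch(List<String> originals, int maxBulkSize,
                        Consumer<List<String>> executeBulk) {
        List<String> bulk = new ArrayList<>();
        int bulksSent = 0;
        for (String doc : originals) {
            bulk.addAll(expand(doc));
            if (bulk.size() >= maxBulkSize) { // cap reached: flush now
                executeBulk.accept(bulk);
                bulk = new ArrayList<>();
                bulksSent++;
            }
        }
        if (!bulk.isEmpty()) {                // flush the remainder
            executeBulk.accept(bulk);
            bulksSent++;
        }
        return bulksSent;
    }
}
```

Splitting the 2000-5000 expanded docs into smaller capped sub-bulks like this also makes it easier to experiment with bulk size, since very large bulks are not necessarily faster.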