I am using pyes... My script walks a directory tree looking for XML docs; for
each file found, it:
- parses it using a Python lib (lxml.objectify)
- indexes a JSON dump of the object.
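For reference, the per-document indexing step above can be batched using Elasticsearch's newline-delimited `_bulk` format instead of one request per document. A minimal sketch using only the standard library (the index name, type name, and field names here are illustrative, not taken from my script):

```python
import json

def build_bulk_payload(docs, index="docs", doc_type="page"):
    """Serialize (id, source) pairs into Elasticsearch _bulk format:
    one action line followed by one source line per document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

# Two tiny example documents, mimicking an attribute plus an HTML body:
payload = build_bulk_payload([
    ("1", {"title": "a", "html": "<p>hi</p>"}),
    ("2", {"title": "b", "html": "<p>bye</p>"}),
])
print(payload)
```

The whole payload would then be POSTed to `_bulk` in a single request, which cuts per-document HTTP overhead.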
I ran my script with the indexing step commented out, i.e. just walking the
tree and parsing the docs, and it took 22 seconds!
I also noticed, stopping my script after 1000 docs, that using 1, 2, 3, 4 or
5 nodes does not change the total time much!
My documents have half a dozen attributes, one of which is a decent-size
HTML document.
I am using the default 5 shards and 1 replica.
I am very confused.
Thanks,
Mohamed.
On Monday, October 8, 2012 3:23:15 PM UTC-4, David Pilato wrote:
Hey Mohamed,
Where are you losing time? Is it when you fetch and build your docs, or
when you send them?
How do you send them to ES? Are you using bulk requests? With what batch size?
What do your documents look like?
It's best if you can provide more details about what you are doing. A
curl recreation is perfect.
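As an illustration of what such a recreation might look like with bulk requests (the host, index, type, and field names below are assumptions, not taken from this thread), many documents go into one HTTP call:

```shell
# Build a tiny _bulk body: one action line + one source line per document.
cat > /tmp/bulk.json <<'EOF'
{"index": {"_index": "docs", "_type": "page", "_id": "1"}}
{"title": "example", "html": "<p>body</p>"}
{"index": {"_index": "docs", "_type": "page", "_id": "2"}}
{"title": "example2", "html": "<p>body2</p>"}
EOF
# Send the whole batch in one request (uncomment against a live cluster):
# curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @/tmp/bulk.json
wc -l < /tmp/bulk.json
```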
David.
On 8 October 2012 at 21:00, Mohamed Lrhazi <ml...@georgetown.edu>
wrote:
I indexed 20K documents using a 5-node ES setup (RHEL 6.x)
with everything at its default values. It took 15 minutes.
I then doubled the vCPUs on the VMs, from 4 to 8, and the RAM from 4 to 8 GB.
I reran the indexing, and it took 16 minutes!
I then installed the service wrapper on all nodes, and added these lines
at the top of elasticsearch.conf:
set.default.ES_HOME=/opt/elasticsearch-0.19.9
set.default.ES_HEAP_SIZE=2048
set.default.ES_MIN_MEM=4096
set.default.ES_MAX_MEM=4096
I reran my indexing and it took exactly 15 minutes again!
What am I doing wrong? What is my bottleneck here?
Thanks a lot,
Mohamed.
--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs