Confused about tuning


(Mohamed Lrhazi) #1

I indexed 20K documents using a 5-node ES setup (RHEL 6.x)
with everything at its default values. It took 15 minutes.

I then doubled the vCPUs on the VMs from 4 to 8, and the RAM from 4 to 8 GB.
Rerunning the indexing took 16 minutes!

I then installed the service wrapper on all nodes and added these lines at
the top of elasticsearch.conf:

set.default.ES_HOME=/opt/elasticsearch-0.19.9
set.default.ES_HEAP_SIZE=2048
set.default.ES_MIN_MEM=4096
set.default.ES_MAX_MEM=4096
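(For what it's worth: in the service wrapper, ES_HEAP_SIZE is normally shorthand that sets both the minimum and maximum heap, so combining it with different ES_MIN_MEM/ES_MAX_MEM values, 2048 vs 4096 here, is ambiguous. A consistent version of the same intent might be just:)

```
set.default.ES_HOME=/opt/elasticsearch-0.19.9
set.default.ES_HEAP_SIZE=4096
```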

Rerunning my indexing took exactly 15 minutes again!

What am I doing wrong? What is my bottleneck here?

Thanks a lot,
Mohamed.



(David Pilato) #2

Hey Mohamed,

Where are you losing time? Is it when you fetch and build your docs, or when you
send them?
How do you send them to ES? Are you using bulk? What batch size?

What do your documents look like?

It's best if you can provide more details about what you are doing. A curl
recreation is perfect.

David.


--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(Mohamed Lrhazi) #3

I am using pyes... My script walks a directory tree looking for XML docs; for each
file found, it:

  • parses it using a Python lib (lxml.objectify)
  • indexes a JSON dump of the object.

I ran my script with the indexing step commented out, which means it just
walked the tree and parsed the docs... it took 22 seconds!

I also noticed, stopping my script after 1000 docs, that using 1, 2, 3, 4, or
5 nodes does not change the total time much!

My documents have half a dozen attributes, one of which is a decent-sized
HTML document.

I am using the default 5 shards and 1 replica.

I am very, very confused.

Thanks,
Mohamed.
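The loop described above can be sketched roughly as follows (illustrative names only; the stdlib xml.etree stands in for lxml.objectify, and a stub callable stands in for the pyes index call):

```python
import json
import os
import xml.etree.ElementTree as ET


def doc_from_xml(path):
    """Parse an XML file into a flat dict of its top-level child elements."""
    root = ET.parse(path).getroot()
    return {child.tag: (child.text or "") for child in root}


def walk_and_index(top, index_one):
    """Walk a directory tree and index each .xml file, one request per doc."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(".xml"):
                doc = doc_from_xml(os.path.join(dirpath, name))
                index_one(json.dumps(doc))  # one synchronous call per document
                count += 1
    return count
```

Note that with one synchronous request per document, per-request round-trip overhead dominates the run time, which would explain why adding nodes changes nothing.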



(Mohamed Lrhazi) #4

On Monday, October 8, 2012 3:41:33 PM UTC-4, Mohamed Lrhazi wrote:

  • index it a json dump fo the object.

To clarify: that should read "index a json.dumps() of the Python dict object."



(Mohamed Lrhazi) #5

OK, you mentioned "bulk" and I was not using it... Using bulk, I went from 15
minutes to 35 seconds!

Thanks a lot,
Mohamed.
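For reference, the speedup comes from batching many index operations into a single request instead of one request per document (in pyes this is roughly passing bulk=True to index() and flushing at the end; check its docs for the exact call). Under the hood, the _bulk endpoint takes newline-delimited JSON: one action line plus one source line per document. A minimal sketch of building such a body (names are illustrative):

```python
import json


def make_bulk_body(docs, index, doc_type):
    """Build an Elasticsearch _bulk request body: for each document, an
    action line ({"index": ...}) followed by its JSON source line, with a
    trailing newline terminating the body."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

The body is then POSTed to /_bulk; batches of a few hundred to a few thousand documents per request are a common starting point.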



(David Pilato) #6

Check that everything is running quickly: reading, parsing, and generating
JSON.
Check your I/O. If you are using VMs, could that have a bad side effect on I/O?

For comparison, I'm able to index about 300 docs per second on a "small"
Windows instance (1.5 GB allocated to ES).

Are you running 5 nodes on 5 boxes? If you are running 5 nodes on the same
hardware, it's about the same as running 1 node.
On one node, you have 5 Lucene instances (5 shards) on one box. With 5 nodes on
one box, you have 1 shard per node (1 Lucene instance per node), so you still
have 5 Lucene instances on the same hardware!

That said, before growing the number of nodes, I think you should be able to get
better results. You should be able to run it in less than 30 seconds (let's say
under 1 minute).


--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(David Pilato) #7

Cool, that's a respectable time now! :wink:


--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


