Slow indexing performance via HTTP

Hi,

I've been having extremely slow indexing performance via HTTP (from a variety of clients). I've written a quick test case to demonstrate.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import json, httplib, glob, time

i = 0
for file in glob.glob("*.json"):
    products = json.loads(open(file, "r").read())
    starttime = time.time()
    for product in products:
        conn = httplib.HTTPConnection('localhost:9200')
        conn.request("POST", "/product/%d" % i, json.dumps(product))
        conn.close()
        i += 1
        if i % 1000 == 0:
            timetaken = time.time() - starttime
            print "Indexed 1000 docs in %fs" % timetaken
            starttime = time.time()

These are the sort of figures I am getting:-

Indexed 1000 docs in 18.856016s
Indexed 1000 docs in 15.699048s
Indexed 1000 docs in 18.557784s
Indexed 1000 docs in 3.327371s
Indexed 1000 docs in 15.613446s
Indexed 1000 docs in 15.774722s
Indexed 1000 docs in 24.579973s
Indexed 1000 docs in 33.737681s
Indexed 1000 docs in 27.596674s
Indexed 1000 docs in 27.645701s
SIGSEGV

A typical document looks like this:

{"category": "mens shirts", "designer": "hollander & lexer", "name": "Marcus wool and cotton shirt", "url": "http://www.matchesfashion.com/fcp/product/Matches-Fashion/Shirts/hollander-%26-lexer-hol-y-marcus-woolcot-shirts-GREY/44578", "price": "235.00", "site_id": "44578", "image_urls": ["http://www.matchesfashion.com/pws/images/catalogue/products/hol-y-marcus-woolcot_gry/xlarge/hol-y-marcus-woolcot_gry_1.jpg", "http://www.matchesfashion.com/pws/images/catalogue/products/hol-y-marcus-woolcot_gry/small/hol-y-marcus-woolcot_gry_2.jpg", "http://www.matchesfashion.com/pws/images/catalogue/products/hol-y-marcus-woolcot_gry/small/hol-y-marcus-woolcot_gry_3.jpg", "http://www.matchesfashion.com/pws/images/catalogue/products/hol-y-marcus-woolcot_gry/small/hol-y-marcus-woolcot_gry_4.jpg", "http://www.matchesfashion.com/pws/images/catalogue/products/hol-y-marcus-woolcot_gry/small/hol-y-marcus-woolcot_gry_5.jpg"], "delivery_days": " 5 working days for your UK orders to be delivered and up to 7 working days for any International orders (customs dependent).", "unavailable_sizes": ["L"], "available_sizes": ["M", "XL"], "currency_code": "GBP", "description": "Grey round neck shirt. Buttoned down the front. Long sleeves. Buttoned cuffs. Creased effect shirt. Shoulder to hem measures 28in/71cm. The mannequin is 6ft 1in and wearing a medium. 85% cotton, 15% wool. Dry clean."}

I'm running 0.12.1 on a quad-core i5 with 8GB RAM. Heap is 1G/6G (min/max).

java -version
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.1) (6b20-1.9.1-1ubuntu3)
OpenJDK 64-Bit Server VM (build 17.0-b16, mixed mode)

CPU is not maxing out, but I am seeing a segfault after a few thousand documents:-

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f448eb7765a, pid=1185, tid=139925551892240
#
# JRE version: 6.0_20-b20
# Java VM: OpenJDK 64-Bit Server VM (17.0-b16 mixed mode linux-amd64 )
# Derivative: IcedTea6 1.9.1
# Distribution: Ubuntu 10.10, package 6b20-1.9.1-1ubuntu3
#
# Problematic frame:
# V  [libjvm.so+0x2ea65a]
#
# An error report file with more information is saved as:
# /home/paul/search/elasticsearch/elasticsearch-0.12.1/hs_err_pid1185.log
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-6/

Aborted

Am I missing something?

Thanks,

Paul.

Hi,

It seems like you open and close an HTTP connection for each request, and
that will hurt performance. It's better to reuse the same connection. If you
are using Python, have a look at pyes for a good Elasticsearch Python
client. If you can use the bulk API, it should speed things up.
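
Something along these lines (an untested sketch of your script, keeping your original URL scheme, just reusing a single connection) should already help:

import json, httplib, glob, time

conn = httplib.HTTPConnection('localhost:9200')   # one connection, reused for every request
i = 0
for file in glob.glob("*.json"):
    products = json.loads(open(file, "r").read())
    starttime = time.time()
    for product in products:
        conn.request("POST", "/product/%d" % i, json.dumps(product))
        conn.getresponse().read()   # the response must be read before the connection can be reused
        i += 1
        if i % 1000 == 0:
            print "Indexed 1000 docs in %fs" % (time.time() - starttime)
            starttime = time.time()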

Regarding the failure, is there a chance you can provide a link / gist of
the hs_err file? Can you try with the Sun JDK and not the IcedTea one?

-shay.banon


Thanks, I'll try pyes and the bulk API. I switched to the Sun JDK, which seems to have removed the problem with the SEGV. My times now look like this:-

Indexed 1000 docs in 21.770724s
Indexed 1000 docs in 12.785908s
Indexed 1000 docs in 15.674933s
Indexed 1000 docs in 6.275026s
Indexed 1000 docs in 15.630171s
Indexed 1000 docs in 24.699726s
Indexed 1000 docs in 24.596233s
Indexed 1000 docs in 27.625746s
Indexed 1000 docs in 27.751640s
Indexed 1000 docs in 30.660960s
Indexed 1000 docs in 30.869079s

So slightly slower but more reliable.

Thanks for the quick reply!

Paul.

Here are the results using pyes (reusing the same connection):-

Indexed 1000 docs in 40.896121s
Indexed 1000 docs in 40.864152s
Indexed 1000 docs in 40.824686s
Indexed 1000 docs in 14.696204s
Indexed 1000 docs in 40.760250s
Indexed 1000 docs in 40.929896s
Indexed 1000 docs in 40.981183s
Indexed 1000 docs in 40.918429s
Indexed 1000 docs in 40.562371s
Indexed 1000 docs in 41.179249s
Indexed 1000 docs in 40.998809s

I will try the bulk API.
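
For reference, my understanding is that the bulk endpoint takes newline-delimited action/source pairs, so the request would be built roughly like this (an untested sketch; the /_bulk path and the "_index"/"_type"/"_id" field names are my assumption from the docs, not verified against 0.12):

import json, httplib

def bulk_index(conn, docs, index, doctype, start_id):
    # One action line plus one source line per document, newline-delimited
    lines = []
    for n, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_index": index, "_type": doctype, "_id": str(start_id + n)}}))
        lines.append(json.dumps(doc))
    conn.request("POST", "/_bulk", "\n".join(lines) + "\n")   # the body must end with a newline
    return conn.getresponse().read()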

Are you still opening and closing an HTTP connection for each request?


I'm the author of pyes. That performance seems very low.
Can you send me the code? Also try the thrift plugin, which is much faster.
Try the performance scripts in the pyes directory.

I'm interested in feedback.

Hi,
Alberto


Thanks for the reply. I'm afraid I've had very little time to spend on this
at the moment, so this is just some debugging code:

python index.py
Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/pyes-0.13.1-py2.6.egg/pyes/es.py", line 236, in _send_request
    decoded = json.loads(response.body, cls=self.decoder)
  File "/usr/lib/python2.6/json/__init__.py", line 318, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Indexed 1000 docs in 40.842213s
Indexed 1000 docs in 40.609530s
Indexed 1000 docs in 40.750521s
Indexed 1000 docs in 14.469728s
Indexed 1000 docs in 40.601595s
Indexed 1000 docs in 40.678124s
Indexed 1000 docs in 40.749861s
Indexed 1000 docs in 40.600158s
Indexed 1000 docs in 40.699801s
Indexed 1000 docs in 40.430763s
Indexed 1000 docs in 40.489304s

paul@server:$ cat index.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import json, httplib, glob, time
from pyes import *

i = 0
for file in glob.glob("*.json"):
    products = json.loads(open(file, "r").read())
    conn = ES("127.0.0.1:9200")
    starttime = time.time()
    for product in products:
        conn.index({"name": "Bill Baloney"}, "product", "test", i)
        i += 1
        if i % 1000 == 0:
            timetaken = time.time() - starttime
            print "Indexed 1000 docs in %fs" % timetaken
            starttime = time.time()



I changed your code to run a test:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import json, httplib, glob, time
from pyes import *

i = 0
conn = ES("127.0.0.1:9200")
for file in glob.glob("*.json"):
    # products = json.loads(open(file, "r").read())
    starttime = time.time()
    for product in xrange(100000):
        conn.index({"name": "Bill Baloney"}, "product", "test", str(i), bulk=True)
        i += 1
        if i % 1000 == 0:
            timetaken = time.time() - starttime
            print "Indexed 1000 docs in %fs" % timetaken
            starttime = time.time()

MBPAlbertoParo:performance alberto$ python test_paul.py
Indexed 1000 docs in 1.870328s
Indexed 1000 docs in 0.497103s
Indexed 1000 docs in 0.234763s
Indexed 1000 docs in 0.301496s
Indexed 1000 docs in 0.200405s
Indexed 1000 docs in 0.348890s
Indexed 1000 docs in 0.171332s
Indexed 1000 docs in 0.234347s
Indexed 1000 docs in 0.182957s
Indexed 1000 docs in 0.368817s
Indexed 1000 docs in 0.244909s
Indexed 1000 docs in 0.210314s
Indexed 1000 docs in 0.170202s
Indexed 1000 docs in 0.204978s
Indexed 1000 docs in 0.155768s
Indexed 1000 docs in 0.197790s
Indexed 1000 docs in 0.162824s
Indexed 1000 docs in 0.195322s
Indexed 1000 docs in 0.151119s
Indexed 1000 docs in 0.197213s
Indexed 1000 docs in 1.888075s
Indexed 1000 docs in 0.585282s
Indexed 1000 docs in 0.346623s

Using thrift:

conn = ES("127.0.0.1:9500")

MBPAlbertoParo:performance alberto$ python test_paul.py
Indexed 1000 docs in 1.753041s
Indexed 1000 docs in 0.445294s
Indexed 1000 docs in 0.255482s
Indexed 1000 docs in 0.302288s
Indexed 1000 docs in 0.234341s
Indexed 1000 docs in 0.471655s
Indexed 1000 docs in 0.204253s
Indexed 1000 docs in 0.308180s
Indexed 1000 docs in 0.231229s
Indexed 1000 docs in 0.573759s
Indexed 1000 docs in 0.431835s
Indexed 1000 docs in 0.448390s
Indexed 1000 docs in 0.171829s
Indexed 1000 docs in 0.216516s
Indexed 1000 docs in 0.307683s
Indexed 1000 docs in 1.378346s
Indexed 1000 docs in 0.374787s
Indexed 1000 docs in 0.304383s
Indexed 1000 docs in 0.176255s
Indexed 1000 docs in 0.242483s
Indexed 1000 docs in 0.205120s
Indexed 1000 docs in 0.197803s
Indexed 1000 docs in 0.156094s
Indexed 1000 docs in 0.198573s
Indexed 1000 docs in 0.196585s
Indexed 1000 docs in 0.202188s
Indexed 1000 docs in 0.151114s
Indexed 1000 docs in 0.360001s
Indexed 1000 docs in 0.153690s
Indexed 1000 docs in 0.200145s

Typically thrift is much faster with bigger documents.

The earlier tests were done with version 0.12; today's tests were done with 0.13.

Hi,
Alberto
