Which way of collecting data into Elasticsearch is better?


#1

Hi Elasticsearch gurus:

As a newcomer to Elasticsearch, I'm trying to architect a data collection/analysis/visualization center for our e-commerce company.

I've done some research and want to seek some advice here.

I tried three ways of importing data in a simulation (10K documents):

  1. Logstash with a Redis input and an Elasticsearch output
  2. the Python API: elasticsearch.index(), one call per document
  3. the Python helpers: helpers.bulk()
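
For illustration, ways 2 and 3 might look roughly like this with the Python client. This is only a sketch under assumptions, not the actual test script: the index name `events`, the helper names, and the chunk size are made up, `es` is assumed to be an already-connected `Elasticsearch` client, and the keyword arguments of `index()` differ between client versions (older clients also require `doc_type`, newer ones prefer `document=` over `body=`).

```python
def make_actions(docs, index_name="events"):
    """Yield one action dict per document, in the shape helpers.bulk() accepts."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def import_one_by_one(es, docs, index_name="events"):
    """Way 2: one HTTP request per document (assumes a 7.x-era client)."""
    for doc in docs:
        es.index(index=index_name, body=doc)


def import_in_bulk(es, docs, index_name="events", chunk_size=500):
    """Way 3: documents are grouped into chunks, one _bulk request per chunk."""
    # Imported here so the rest of the sketch runs without the client installed.
    from elasticsearch import helpers

    # Returns a tuple: (number of successful actions, list of errors).
    return helpers.bulk(es, make_actions(docs, index_name), chunk_size=chunk_size)
```

With 10K documents and a chunk size of 500, way 3 sends about 20 requests where way 2 sends 10,000, which would go a long way toward explaining the timing gap.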

The results:
ways 1 and 3 -- each takes a little under 8s. I guess they actually work the same way underneath.
way 2 -- takes about 80s.

Then I enhanced way 2 with Python threading:
with 100 threads -- about 30s
with 200 threads -- an 'elasticsearch rejection' exception occurs in some threads.
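
The rejections with 200 threads probably mean that more requests arrived at once than Elasticsearch's request queue could hold. A common way to handle that is to cap the number of workers and retry rejected requests with a growing delay. The sketch below shows the idea using only the standard library; `send` is a hypothetical callable that indexes one document, and `Rejected` is a made-up stand-in for whatever exception the real client raises on a rejection.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class Rejected(Exception):
    """Hypothetical stand-in for the client's 'rejected' error."""


def send_with_retry(send, doc, retries=5, base_delay=0.1):
    """Call send(doc); on rejection, wait with exponential backoff and retry."""
    for attempt in range(retries):
        try:
            return send(doc)
        except Rejected:
            if attempt == retries - 1:
                raise  # out of attempts, let the caller see the rejection
            time.sleep(base_delay * 2 ** attempt)


def index_concurrently(send, docs, workers=16, base_delay=0.1):
    """Index docs with a bounded pool instead of one thread per request."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as docs.
        return list(
            pool.map(lambda d: send_with_retry(send, d, base_delay=base_delay), docs)
        )
```

The worker count of 16 is arbitrary; the right value depends on the cluster and has to be measured.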

In my scenario, I need to add some anchor code (hooks) to existing systems to send events/data to Elasticsearch. Based on my research, the bulk approach is better than the single-document API, since the former has a higher TPS than the latter (actually, I guess all Elasticsearch client APIs are built on the HTTP RESTful interface anyway).

So I think the better solution is for the anchor code to push data into Redis or another supported message queue, have Logstash consume it, and finally insert it into Elasticsearch.
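
On the application side, the anchor code for this design could be as small as the sketch below. It assumes events are JSON-serializable dicts pushed onto a Redis list, and that Logstash's redis input is configured with `data_type => "list"` and the same key. The function name `emit_event` and the key `events` are made up for illustration; `client` is anything with an `rpush(key, value)` method, such as a `redis.Redis` instance from redis-py.

```python
import json


def emit_event(client, event, key="events"):
    """Serialize one event to JSON and append it to a Redis list.

    The caller does not talk to Elasticsearch at all; a consumer such as
    Logstash is assumed to pop entries from the same list and index them.
    """
    payload = json.dumps(event)
    client.rpush(key, payload)
    return payload


# Usage with redis-py (assumed to be installed and a Redis server reachable):
#   import redis
#   emit_event(redis.Redis(host="localhost", port=6379), {"type": "order", "id": 42})
```

This keeps the existing systems decoupled from Elasticsearch: if indexing slows down, events queue up in Redis instead of blocking the application.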

Can anyone advise whether this is the right direction? Or is there a better solution?


(Mark Hanford) #2

Can you edit your post so it's not in a blockquote, so I don't have to scroll to read all the lines? Take the spaces out from in front of all the lines...


#3

Thanks for your kind reminder!

