Can we use Elasticsearch to do analysis work?


(Kramer Li) #1

Hi All,
Thanks for your interest.

I know that Elasticsearch has announced that some companies build clusters of hundreds of computers to do analysis work. I'm not sure what kind of analysis jobs those companies are running, but I have a requirement like this.

I have some structured data. It is very simple:

gender, age, school, name, score
male, 18, s1, kramer, 100
male, 15, s2, dav, 88

I want to run an analysis like this:

SELECT school, avg(score) FROM table GROUP BY school ORDER BY avg(score) 
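In Elasticsearch terms, this would roughly be a `terms` aggregation on `school` with an `avg` sub-aggregation on `score`, ordered by that sub-aggregation. The following is only a sketch of the search body, built as a Python dict. The aggregation names `by_school` and `avg_score`, the helper `build_school_avg_query`, and the `max_schools` limit are made up for illustration, and it assumes `school` is indexed as a not-analyzed string.

```python
import json


def build_school_avg_query(max_schools=100, order="asc"):
    """Build a search body that approximates:
    SELECT school, avg(score) FROM table GROUP BY school ORDER BY avg(score)

    max_schools and the aggregation names are illustrative assumptions.
    """
    return {
        "size": 0,  # only aggregation results are wanted, not document hits
        "aggs": {
            "by_school": {
                "terms": {
                    "field": "school",          # GROUP BY school
                    "size": max_schools,        # how many buckets to return
                    # ORDER BY avg(score); SQL defaults to ascending
                    "order": {"avg_score": order},
                },
                "aggs": {
                    # avg(score) computed per school bucket
                    "avg_score": {"avg": {"field": "score"}}
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to the _search endpoint.
    print(json.dumps(build_school_avg_query(), indent=2))
```

Note that a `terms` aggregation returns only the top `size` buckets, so unlike the SQL it does not list every school unless `size` is at least the number of distinct schools.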

I know ES can do this, but I'm worried about its performance, because the table may contain more than ten billion docs and the actual data structure will be more complicated than this. So I ran a test on a computer like this:

CPU 8core
Memory 16G
OS CentOS 7
elasticsearch 2.1
8G memory for ES heap
no swap

This is how I did my test:
First, find out how many docs one shard can process. On my computer, it is 400 000.
Second, find out how many shards my computer can hold. On my computer, it is 2.
The details of my test are here

So in total, one computer can analyze 800 000 docs. If I want to analyze 10 billion docs, I need 12500 computers. That is ridiculous...
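Spelled out, the extrapolation is plain arithmetic. The sketch below uses only the numbers from my test, and it assumes that capacity scales linearly with the number of machines and that every machine behaves like my test box, which may well not hold.

```python
# Figures measured in the test described above.
docs_per_shard = 400_000
shards_per_machine = 2
total_docs = 10_000_000_000  # 10 billion docs

# Capacity of one machine under the linear-scaling assumption.
docs_per_machine = docs_per_shard * shards_per_machine

# Ceiling division with integers, so a partial machine counts as a whole one.
machines_needed = -(-total_docs // docs_per_machine)

print(docs_per_machine)   # 800000
print(machines_needed)    # 12500
```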

So I am wondering whether ES can really do the analysis above.
Also, I noticed there is a product called Elasticsearch-Hadoop. I haven't read the corresponding docs yet, but it looks like something used for analysis work. So am I using the wrong product?


(Mark Walkom) #2

Yes, ES can do this sort of thing.

You are talking about a laptop here, so your extrapolation is not accurate.
Try running this on an actual dedicated server instead.


(Kramer Li) #3

Hi

Thanks for your reply. So what kind of computers do people normally use when they set up an ES cluster for analysis? I mean the physical memory, the number of CPU cores, etc. I do not have much experience with this. But obviously using ten thousand computers to analyze 10 billion docs is not a good choice.

Regards
Mingwei


(Mark Walkom) #4

I can't answer that because there is no normal.

But start with something dedicated, at least 4 cores, 8GB of RAM and some disk.

