Hi All,
Thanks for your interest.
I know that Elasticsearch has announced that some companies build clusters of hundreds of computers to do analysis work. I'm not sure what kind of analysis jobs those companies are running, but I have a requirement like the following.
My data is structured and very simple, like this:
gender, age, school, name, score
male, 18, s1, kramer, 100
male, 15, s2, dav, 88
I want to run an analysis like this:
SELECT school, avg(score) FROM table GROUP BY school ORDER BY avg(score)
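As far as I understand, the closest Elasticsearch equivalent is a terms aggregation on school with an avg sub-aggregation on score, ordered by that sub-aggregation. Here is a minimal sketch that only builds the request body in Python and does not talk to a cluster. The function name and the aggregation names (by_school, avg_score) are made up for illustration, and the field names are assumed to match the columns above.

```python
import json


def build_avg_score_by_school(order="asc"):
    """Build a search body roughly equivalent to the SQL query above.

    Field names ("school", "score") are assumed from the sample columns;
    the aggregation names are hypothetical labels chosen for this sketch.
    """
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "aggs": {
            "by_school": {
                # GROUP BY school, ORDER BY avg(score)
                "terms": {
                    "field": "school",
                    "order": {"avg_score": order},
                },
                # avg(score) computed per school bucket
                "aggs": {
                    "avg_score": {"avg": {"field": "score"}},
                },
            }
        },
    }


body = build_avg_score_by_school()
print(json.dumps(body, indent=2))
```

Note that a terms aggregation returns only the top 10 buckets by default, so listing every school would need a larger size inside the terms block.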
I know ES can do this, but I'm worried about its performance, because the table may contain more than ten billion docs and the actual data structure will be more complicated than this. So I ran a test on a computer like this:
CPU: 8 cores
Memory: 16 GB
OS: CentOS 7
Elasticsearch: 2.1
ES heap: 8 GB
Swap: none
The way I did my test is:
First, find out how many docs one shard can process. On my computer, it is 400,000.
Second, find out how many shards my computer can hold. On my computer, it is 2.
The details of my test are here
So in total, one computer can analyze 800,000 docs. If I want to analyze 10 billion docs, I need 12,500 computers. That is ridiculous...
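To make the arithmetic behind that estimate explicit, here it is as a small Python sketch. The numbers are the ones from my test above; the variable names are just for illustration, and the estimate assumes capacity scales linearly with the number of machines.

```python
# Figures measured in the test described above.
docs_per_shard = 400_000
shards_per_machine = 2
total_docs = 10_000_000_000  # 10 billion

# Capacity of one machine under these measurements.
docs_per_machine = docs_per_shard * shards_per_machine

# Ceiling division: round up so a partial machine counts as a whole one.
machines_needed = -(-total_docs // docs_per_machine)

print(docs_per_machine, machines_needed)  # prints: 800000 12500
```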
So I am wondering: can ES really do the analysis above?
I also noticed there is a product called Elasticsearch-Hadoop. I haven't read its docs yet, but it looks like something meant for analysis work. So am I using the wrong product?