How should I search data in HDFS?


#1

We have stored a lot of log files in HDFS. How can I search for data in these files with es-hadoop?
I am confused about what I should do.

Method One:

  1. Set up an Elasticsearch environment.
  2. Write and run a MapReduce program with es-hadoop that reads all the files and indexes their data into ES, then run the search and return the results.
    If Method One is correct, what should I do when ES does not have enough storage to hold all the indexes?

So I came up with Method Two:

  1. Set up an Elasticsearch environment.
  2. Write and run a MapReduce program with es-hadoop that reads and indexes some of the files into ES, run the search, return the results, and repeat the process until all files are covered.
    If Method Two is correct, won't the total processing time be too long for a single search?

Could someone clear up my confusion or give me the correct method and a sample, please? Thanks a lot.


#2

Could someone help me please? Thanks a lot!


(Costin Leau) #3

Benny, your assessment is correct.

For ES to work with the data, the data needs to be indexed in ES. As you mentioned, you can try to index either all of the data or only parts of it. Indexing all the data is ideal, since you can then run several searches across all of it.
If you don't have enough space for it, there's not much you can do (by the way, HDFS makes the assumption that space is infinite; running jobs with HDFS capacity over 60-70% often ends up crashing the cluster).

You can try to index only parts of the data but, as you pointed out, if you have to reindex them later, you'll spend CPU cycles and time all over again.
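As a rough illustration of the partial-indexing approach, the sketch below groups files into batches that each fit within a fixed storage budget. The function `plan_batches`, the file names, and the sizes are all hypothetical, and it assumes the index size of a file is roughly proportional to its raw size, which may not hold in practice. It only plans the batches; the actual indexing would still be done by the es-hadoop job.

```python
from typing import Dict, List


def plan_batches(file_sizes: Dict[str, int], capacity: int) -> List[List[str]]:
    """Greedily group files so that each batch's total size fits in `capacity`.

    Hypothetical helper: sizes and capacity use the same (arbitrary) unit.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    # Walk the files in name order so the plan is deterministic.
    for name, size in sorted(file_sizes.items()):
        if size > capacity:
            raise ValueError(f"{name} ({size}) does not fit in capacity {capacity}")
        # Close the current batch when the next file would overflow it.
        if used + size > capacity:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Made-up log files and sizes, purely for illustration.
logs = {"app-01.log": 40, "app-02.log": 50, "app-03.log": 30, "app-04.log": 60}
batches = plan_batches(logs, capacity=100)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Each batch would be indexed, searched, and then dropped before the next one is loaded, so every new search pays the full indexing cost again.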

I'm not sure how much data you have, but unless you have the hardware for it, you can't really use ES for it. Note that indexing / data duplication applies not just to ES but to every system that keeps any type of metadata: you either save the metadata and trade storage for compute, or vice versa, keeping the data in raw format and recreating the metadata (by reindexing) every time you need it.
Pretty much all the time, the former makes more sense.
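To make the storage-versus-compute trade-off concrete, here is a small back-of-the-envelope sketch. The function names and the timings (600 minutes to index everything, 1 minute per search) are made-up assumptions, not measurements from any real cluster.

```python
def minutes_index_once(index_minutes: int, search_minutes: int, searches: int) -> int:
    """Keep the index: pay the indexing cost once, then only the search cost."""
    return index_minutes + searches * search_minutes


def minutes_reindex_each_time(index_minutes: int, search_minutes: int, searches: int) -> int:
    """Keep only raw data: rebuild the index before every search."""
    return searches * (index_minutes + search_minutes)


# Hypothetical timings, purely for illustration.
INDEX_MINUTES = 600
SEARCH_MINUTES = 1

for searches in (1, 5, 20):
    keep = minutes_index_once(INDEX_MINUTES, SEARCH_MINUTES, searches)
    rebuild = minutes_reindex_each_time(INDEX_MINUTES, SEARCH_MINUTES, searches)
    print(f"{searches:>2} searches: keep index = {keep} min, reindex each time = {rebuild} min")
```

Under these assumptions the two options cost the same for a single search, and the gap grows linearly with every additional search, which is why storing the index usually wins when the storage is available.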

