Hadoop / Elasticsearch functionality

Hello all, this is my first post here! I am just starting to learn about Elasticsearch and Hadoop, and there is one thing I don't understand about how they work together.

My (admittedly very simplistic) understanding of how these two work, is that Elasticsearch can handle data storage and indexing/searching, while Hadoop can also do data storage, and also distributed computing. What I have trouble understanding is: who is doing what, when Elasticsearch is run "on top of" Hadoop?

Is the actual data stored in HDFS, or in Elasticsearch?

  • Is Hadoop the primary data storage? When an Elasticsearch query is run, does it run against data stored in Hadoop? How is indexing done in that case?
  • Or, on the contrary, is Elasticsearch used for data storage, and Hadoop only for long-term archiving?
  • How are Hadoop's distributed computing functions used? Is any part of the Elastic server's logic offloaded to Hadoop?

Most information I could find on the Internet seems to assume some basic knowledge about how the two work together or separately, knowledge which I lack, so I turned to this forum for advice. Please keep in mind that I'm probably looking for some extremely basic answers here.

Thanks for your thoughts.

Dan

You don't run ES on Hadoop, and you don't store ES data in Hadoop. Each stores its own data.

The es-hadoop library is a connector that lets you move data from one into the other, and lets each do what it is good at. In Hadoop's case that is massive, batch-style processing and storage. For ES it is (near) real-time analytics and search.

Hi Mark, thanks for responding.

So if I understand you right, both Hadoop and ES would have their own independent data sets? This is a little confusing for me.

Let's say I have a bunch of log data coming in through Logstash. Would one typically configure this data to go to Hadoop, ES, or both, and why?

Dan

Yes

As I mentioned, ES and Hadoop have different use cases, so it depends on those.

Could you give me an example, preferably within Log Management?

Umm, legal retention over multiple years would probably be better stored in Hadoop, with the most recent 1-2 years of data held in ES for analysis and fast reporting.

Why would it be better stored in Hadoop? What's the advantage there?
Why not just store everything in Elasticsearch?

Let's say you have 1,000 data files, where a file can be a Word document, a PDF, a PowerPoint, a text file, etc. You can store these files in the local file system, in the Hadoop Distributed File System (HDFS), as a BLOB in a relational database, in MongoDB GridFS, and so on. When you need to search that data, you index these files: you either develop an application or use an existing tool to read the files, turn them into text, then index them with Elasticsearch (ES). The searchable data is stored in ES so you can search for information.

You do have the option to store the "text" content (not the binary content, such as the Word document or PDF itself) in ES, so that when you perform a search, you can also retrieve the "text" content and display it as part of a search result if you like. When you store the "text" content in ES, the index will be big; in some cases it's worth storing the content there, and in some cases it's not. That is something you can decide based on your data and what you want to do with it.

With a few files, you don't need to store them in HDFS: you don't want to spend the money and effort to set up a Hadoop cluster just to store a handful of files, so just use your local file system. When you have lots of files, I mean millions, billions, or more, and you know your dataset will grow over time, then HDFS is one of many options you can choose. With HDFS, if you run out of disk space, you can add one or more data nodes to the Hadoop cluster to expand the storage, and this can keep growing. With a local file system, you'll need to add more physical drives, and there is a limit on how many physical drives you can add to one machine.
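To make the "read a file, turn it into text, index it" step concrete, here is a minimal Python sketch. The index name "files" and the field names are made up for illustration, and it only builds the bulk-index action rather than talking to a real cluster:

```python
from pathlib import Path

def file_to_action(path, index="files"):
    """Read a text file and build an Elasticsearch bulk-index action.

    The index name "files" and the field names "mypath"/"mycontent"
    are illustrative, not part of any real setup.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {
        "_index": index,
        "_source": {"mypath": str(path), "mycontent": text},
    }
```

A list of such actions is the kind of input you would feed to a bulk-indexing helper; for binary formats (Word, PDF) you would first run a text-extraction library over the file, as described above.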

How should these files be stored in HDFS? Let's save that for a different discussion.

When you want to store files in HDFS for the purpose of processing them, and indexing is one of many reasons to process them, you'll need to develop a process or application that is capable of retrieving files from HDFS (in this case), extracting the content and metadata if available, then indexing them with ES. Elasticsearch-Hadoop (ES-Hadoop) can help you, or you can write your own app if you know what to do.

For example, you can develop a normal command-line application to do this, or you can develop a MapReduce job to do the same thing.

A command-line application is straightforward: it recursively goes through the list of files in a directory and processes (or indexes) them one at a time. When you have a billion files, it will take a while for that command-line application to process them all. This is when you develop a MapReduce job instead, so you can cut the processing time by working in parallel: the job is distributed as tasks to the data nodes in the cluster, each processing some number of files from the same input directory in HDFS. This MapReduce job needs HDFS.
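Purely as an analogy for that fan-out (this is plain Python on one machine, not actual MapReduce code): indexing is mostly I/O-bound, so even a thread pool shows the idea of processing many files in parallel instead of one at a time:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def index_one(path):
    # Stand-in for "extract text and send it to ES"; here we just
    # return the file size so the example stays self-contained.
    return Path(path).stat().st_size

def index_all(paths, workers=4):
    # Fan the files out to workers, the way a MapReduce job fans
    # map tasks out to the data nodes of the cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index_one, paths))
```

The real win of MapReduce is that the workers are separate machines sitting next to the data in HDFS, not threads on one box.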

If you need a MapReduce job to index your data with ES, Elasticsearch-Hadoop can help you with that and save you development time. The only part left to you is extracting the text content out of binary files like Word documents, PDFs, etc., and there are many open-source libraries that can help with that; you don't need to write your own.
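For what it's worth, when you do use ES-Hadoop from a MapReduce job, most of the wiring is configuration. A sketch of the kind of job properties involved (the property names come from the es-hadoop documentation; the host and index name here are made up):

```properties
# Where the Elasticsearch cluster can be reached
es.nodes = localhost
es.port = 9200
# Index (and type) the job writes its documents to
es.resource = files/doc
```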

Hi thn, thanks for the write-up!

Surely this, alone, can't be a reason to keep the data in HDFS, like Mark suggested earlier.

  • Storage virtualization can already be done by many OSes, no reason to use Hadoop for this.
  • ES also can scale.

So... I still don't understand. Why bother with Hadoop? (In the context of my usecases mentioned earlier, of course).

As I said, you don't need HDFS to store your data; it's all about your requirements.

Yes, ES can scale, but differently. Let's say you have an index with 5 shards; when you fill up all the shards (in theory, each shard can hold about 2 billion documents), you can't scale that index out, because once an index is created, you can't change its number of shards. So think about this.
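The back-of-the-envelope math for that ceiling, assuming the "2B" figure is Lucene's per-shard document cap of roughly 2^31:

```python
# Lucene caps a single shard at roughly 2**31 documents
# (the exact limit sits just below Integer.MAX_VALUE).
DOCS_PER_SHARD = 2**31 - 1

def index_capacity(shards):
    # Theoretical ceiling for a fixed-shard index; once the index
    # exists, `shards` cannot be increased, only re-created.
    return shards * DOCS_PER_SHARD

print(index_capacity(5))  # roughly 10.7 billion documents
```

Past that point, the only way out is a new index with more shards and a reindex of everything.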

If you want a MapReduce job to process your data, HDFS is required. If not, HDFS is just one of many options. There is also the matter of how HDFS handles small files vs. large files; you can find information about that online.

Yes, I read what you wrote. And my question remains, what requirements would lead one to choose HDFS for storage, instead of just using ES for the same purpose, in the context of a log management use case?

Here is one of the reasons :)

What's the average log file size? How many log files are you dealing with? Can you afford to lose a file, or a bunch of files, when a disk dies? When a machine dies?

From a search on the internet, you may have already seen the article below; if not, check it out to get a feel for what HDFS is for, compared to other storage technologies.

http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/

That's not a reason to use HDFS for storage. I don't want a MapReduce job to process my data. Why would I choose to store stuff in Hadoop?

Please read the discussion I had with Mark Walkom before you joined the thread, where he suggested storing (not processing) the data in Hadoop. It is the reason behind that recommendation I am trying to understand. Or are you using the terms "storage" and "processing" interchangeably?

Thanks for the link. Do you mean to suggest that Hadoop simply scales better than ES? That would certainly be an answer to my question above.

I said it's one of the reasons, not the only reason. And no, I don't use "storage" and "processing" interchangeably.

Regarding your conversation with Mark: as I said, what kind of dataset size are we talking about here? The reason I ask is that, as in my earlier post, if the data size is "small" (a relative term), it's not worth storing in HDFS; it may be okay to store the data with the index.

As far as storage goes in general, HDFS is a lot better than ES. Hint: which one has "filesystem" in the name? ;)

But as far as your concern goes, I suggest looking at your data size and how it grows over time before deciding to use ES as storage. Since I don't know these numbers, I'd rather use another filesystem as storage, not ES.

ES is a search engine; it uses Lucene under the hood, and Lucene is not intended to be a storage engine either. Yes, you can store "text" content along with the index. In your case, log files are text files, so you can store them along with the index if you'd like to. The index certainly will grow big in this case, and if your index has replicas, the "text" data will be replicated to the replicas too. If you can afford that, use it, but make sure the performance meets your expectations.

Lastly, if you try to use ES as a storage system and make it work like one, you'll discover it causes more problems than it's worth.

There is no data. This is a largely academic question.

Aha! This is extremely interesting. Perhaps this is the very basic thing that I don't understand about ES. I thought that ES can only search text that is actually stored within ES. But if I understand correctly what you are saying, ES could index data that is stored elsewhere (for example in HDFS), and only store the index. Is this correct, what I understood here?

No problem, at least I know where you are heading. When you have very large datasets, you will need to look at something like HDFS or similar, because they are built for that.

If you think ES does that automatically, the answer is no: you either use something like Logstash or write your own application that pulls data from a data source and indexes it with ES. ES will take care of the indexing part; for that part, the answer is yes.

It's precisely this that I don't understand... doesn't that mean storing the data in ES? What exactly do you mean by "indexing"? Can you index data in ES without storing the data in ES also? Logstash, for example, does both - indexing and storage of data - using ES. Isn't that right?

Let's say you have

  • a data source where your data files are stored
  • an instance of ES running

If you want data files from your data source to be indexed, something needs to read/retrieve data files from the data source then pass the text content to ES through a method call to tell ES to index the data.

How do you do that? There are at least two options right now:

  • option 1: use Logstash. If it can be configured to retrieve your data, then it's definitely capable of indexing it, by calling ES to index the data.

  • option 2: if logstash can't be configured to process (retrieve and index) your data, you'll need to write your own application to retrieve and index your data.

In ES, you can configure it not to store the content that you want indexed. For example, say you have a text file (I mean text, not binary) and you tell ES to index its content into a field called "mycontent". By default, once the text content is indexed, you can see this content under the "mycontent" field under _source. You can configure ES to turn off _source so you won't see that data there; this also reduces the physical index size stored on disk.

With regard to the "mycontent" field, you can configure ES to not store the content for this field; you can see how to do so at the link below:

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-store.html

The reason ES lets you configure any field to not store its value is that there will be cases where you want to see the data under some fields but not others. Using the example above: say that besides indexing text content under the "mycontent" field, you also index a second field called "mypath" holding the file's directory path. When you look into ES, you'll see "mypath" under "_source" but not "mycontent". You can search anything that is indexed, but not things that are not indexed; that's why I use the term "index".
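Put together, a mapping along these lines would do it. This is a sketch using the field names above; the exact JSON shape varies between ES versions (older versions nest mappings under a type name):

```json
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "mycontent": { "type": "text",    "store": false },
      "mypath":    { "type": "keyword", "store": true  }
    }
  }
}
```

With _source off and "store": true only on "mypath", a search can still match on the file's content but only ever returns the path, which is exactly the "index without storing" setup being described.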

Aha! This is starting to make a little more sense to me now. So it would be possible to, for example, store the actual data in HDFS (for scalability, say) and use ES just for indexing, possibly storing just the HDFS path to the indexed document rather than its content. It's possible that this type of setup was what the previous poster was suggesting for his legal retention scenario.

tri-man, thank you very much for your time and patience. Our discussion really helped!

Glad to hear that it helps.