I am new to ES. I am using ES via logstash's "embedded" setting for a
standalone logstash setup.
However, I am planning to use ES in a centralized logstash setup, so I
need to install ES in a separate cluster. Along these lines, I have a
couple of questions.
Can I install ES in a Hadoop cluster that has HDFS? Can ES use HDFS as the
file system instead of plain disk? What are the pros and cons of using HDFS
as the filesystem for ES? If there is a performance penalty, what is the
impact - is it 3x or 10x, etc.?
I haven't done any benchmarks, but I would use the local gateway if that's
an option. It's the more recommended and better-tested option.
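For reference, here's a minimal sketch of selecting the local gateway in
elasticsearch.yml, assuming an ES version where the gateway.type setting
still exists:

# elasticsearch.yml - keep cluster state and indices on local disk
gateway.type: local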
I saw from an older thread that by selecting the HDFS gateway, we can
replicate and store the ES index in HDFS. Does "index" mean all the "data",
or is it an index like a MySQL index?
"Index" in the context of Elasticsearch usually refers to something
like "database" in mysql. It might contain the source - which is
default and basically means all data - or not - in which case you only
have the inverted index, like in mysql index terminology. More
information on "source" here:
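As an illustration, if you only need search hits and never the original
documents back, the source can be disabled when the mapping is created.
The index and type names below are made up:

curl -XPUT 'localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "_source": {"enabled": false}
    }
  }
}'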
How do we delete data from ES? Can you give a pointer to a doc/tutorial
on how to delete data indexed in ES?
If it applies to your use case, the recommended option is to have
rolling indices (e.g. one index per day, or per week, month, etc.) and
remove old data by simply deleting old indices. Like:
curl -XDELETE localhost:9200/old_index
This is very fast, basically like removing the corresponding files from disk.
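Since logstash creates one index per day by default, named
logstash-YYYY.MM.DD, a cron job could drop anything older than your
retention window. The date below is just an example:

curl -XDELETE 'localhost:9200/logstash-2013.01.01'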
If you don't have that option, you can use TTL:
and old data will automatically be deleted after the specified time.
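As a rough sketch, TTL is enabled per type in the mapping, assuming an ES
version that supports the _ttl field. The index and type names and the
30-day default below are assumptions:

curl -XPUT 'localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "_ttl": {"enabled": true, "default": "30d"}
    }
  }
}'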
Or, you can manually delete all documents that match a certain query:
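For example, using the query-string form of delete by query (the index
name and field here are hypothetical):

curl -XDELETE 'localhost:9200/my_index/_query?q=status:old'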
Please note that when documents are deleted from an index (as opposed
to when you delete a whole index), they're only marked for deletion,
and will be physically removed when segments are merged. The way it
actually happens depends on the merge policy:
Either way, merging involves quite heavy I/O activity, which is why
it's better for performance to have rolling indices.
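If you need the disk space back sooner, older ES versions also expose an
optimize call that only expunges deleted documents (the index name is an
example, and this triggers the heavy merge I/O mentioned above):

curl -XPOST 'localhost:9200/old_index/_optimize?only_expunge_deletes=true'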