Hello Subramanian,
On Thu, Oct 18, 2012 at 8:14 AM, Subramanian Narayanan
ping2sriram@gmail.com wrote:
Hi,
I am new to ES. I am using ES via logstash "embedded" setting for standalone
logstash setup.
However, I am planning to use ES in a centralized logstash setup. Hence I
need to install ES in a separate cluster.
Along these lines, I have the following couple of questions.
- Can I install ES in a Hadoop cluster that has HDFS? Can ES use HDFS as the
file system instead of plain disk? What are the pros and cons of using HDFS as
the filesystem for ES? If it's a performance issue, what is the impact? Is it
3x or 10x, etc.?
I haven't done any benchmarks, but I would use the local gateway if
that is an option. It's the more recommended and better tested option.
- I saw in an older thread that by selecting the HDFS gateway, we can replicate
and store the ES index in HDFS. Does index mean all the "data", or is it an index
like a MySQL index?
"Index" in the context of Elasticsearch usually refers to something
like a "database" in MySQL. It might contain the source (which is the
default, and basically means all the data) or not, in which case you only
have the inverted index, as in MySQL's index terminology. More
information on "source" here:
- How do we delete data from ES? Can you give a pointer to a doc/tutorial on
how to delete data indexed in ES?
If it applies to your use case, the recommended option is to have
rolling indices (e.g. one index per day, week, or month) and
remove old data by simply deleting old indices. Like:
curl -XDELETE localhost:9200/old_index
This is very fast, basically like removing the corresponding files from disk.
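To make the rolling-index idea a bit more concrete, here is a rough sketch of how the name of an expired daily index could be computed, so the delete can run from cron. Everything specific in it is an assumption on my side: the "logstash-YYYY.MM.DD" naming, the 30-day retention, the localhost:9200 address, and GNU date being available. It only prints the command, so you can check it before running anything:

```shell
#!/bin/sh
# Sketch only: print the DELETE command for the daily index that has aged out.
# Assumed (not from this thread): index names look like logstash-YYYY.MM.DD,
# ES listens on localhost:9200, and `date` is GNU date (supports -d).

ES_URL="localhost:9200"
PREFIX="logstash"
KEEP_DAYS=30

# Print the name of the index that is $1 days older than date $2 (default: today).
old_index_name() {
    days="$1"
    ref="${2:-today}"
    printf '%s-%s\n' "$PREFIX" "$(date -u -d "$ref $days days ago" +%Y.%m.%d)"
}

# Print (not run) the curl command that would delete that index.
delete_command() {
    printf 'curl -XDELETE %s/%s\n' "$ES_URL" "$(old_index_name "$@")"
}

# Dry run: show what would be deleted for the retention window above.
delete_command "$KEEP_DAYS"
```

Once the printed command looks right, you could pipe it to sh or replace the printf with the real curl call.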
If you don't have that option, you can use TTL:
and old data will be automatically deleted after the specified time.
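As an illustration of the TTL route, a minimal sketch of a mapping that turns on _ttl with a default expiry, assuming your ES version supports the _ttl mapping field. The index name "logs", the type name "event" and the 30d value are made up for the example:

```shell
#!/bin/sh
# Sketch only: a mapping body that enables _ttl with a default lifetime.
# Hypothetical names: type "event" (and index "logs" in the comment below);
# the 30d default is just an example value.

ttl_mapping() {
    cat <<'EOF'
{
  "event": {
    "_ttl": { "enabled": true, "default": "30d" }
  }
}
EOF
}

# Against a running node with an existing "logs" index, it could be applied with:
#   ttl_mapping | curl -XPUT localhost:9200/logs/event/_mapping -d @-
# Here the body is only printed so it can be reviewed first.
ttl_mapping
```

Documents indexed after that, without an explicit TTL of their own, would get the default one.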
Or, you can manually delete all documents that match a certain query:
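For example, a sketch of building a delete-by-query call with a query string in the URL, assuming an ES version that exposes the _query endpoint for DELETE. The index name "logs" and the query "host:web01" are hypothetical, and the command is only printed, not executed:

```shell
#!/bin/sh
# Sketch only: print a delete-by-query command for a given index and query.
# Hypothetical values: index "logs", query "host:web01", node at localhost:9200.

ES_URL="localhost:9200"

# $1 = index name, $2 = Lucene query string (must already be URL-safe;
# queries with spaces or special characters would need URL-encoding first).
delete_by_query_command() {
    printf "curl -XDELETE '%s/%s/_query?q=%s'\n" "$ES_URL" "$1" "$2"
}

delete_by_query_command logs 'host:web01'
```

It's worth running the same query as a search first, to see what would be removed.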
Please note that when documents are deleted from an index (as opposed
to when you delete a whole index), they're only marked for deletion,
and will be physically removed when segments are merged. The way it
actually happens depends on the merge policy:
Either way, merging implies quite heavy I/O activity, which is why
it's better for performance to have rolling indices.
Thanks in advance.
Subbu
--
Best regards,
Radu
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene