Have a couple of questions on ES

Hi,

I am new to ES. I currently use ES through logstash's "embedded" setting in a
standalone logstash setup.

However, I am planning to use ES in a centralized logstash setup, so I
need to install ES on a separate cluster.

Along these lines, I have the following questions.

  1. Can I install ES on a Hadoop cluster that has HDFS? Can ES use HDFS as the
    file system instead of plain disk? What are the pros and cons of using HDFS
    as the filesystem for ES? If there is a performance penalty, how large is
    it: 3x, 10x, etc.?

  2. I saw in an older thread that by selecting the HDFS gateway, we can
    replicate and store the ES index in HDFS. Does "index" mean all the data,
    or is it an index like a MySQL index?

  3. How do we delete data from ES? Can you give a pointer to a doc/tutorial on
    how to delete indexed data from ES?

Thanks in advance.

Subbu

--

Hello Subramanian,

On Thu, Oct 18, 2012 at 8:14 AM, Subramanian Narayanan
ping2sriram@gmail.com wrote:

  1. Can I install ES on a Hadoop cluster that has HDFS? Can ES use HDFS as the
    file system instead of plain disk? What are the pros and cons of using HDFS
    as the filesystem for ES? If there is a performance penalty, how large is
    it: 3x, 10x, etc.?

I haven't run any benchmarks, but I would use the local gateway if that
is an option. It is the recommended and better-tested option.

  2. I saw in an older thread that by selecting the HDFS gateway, we can
    replicate and store the ES index in HDFS. Does "index" mean all the data,
    or is it an index like a MySQL index?

"Index" in the context of Elasticsearch usually refers to something
like a "database" in MySQL. It may contain the source (this is the
default, and basically means all the data) or not, in which case you only
have the inverted index, as in MySQL's index terminology. More
information on "source" here:

  3. How do we delete data from ES? Can you give a pointer to a doc/tutorial on
    how to delete indexed data from ES?

If it applies to your use case, the recommended option is to have
rolling indices (e.g. one index per day, week, or month) and
remove old data by simply deleting old indices. Like:

curl -XDELETE localhost:9200/old_index

This is very fast, basically like removing the corresponding files from disk.
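
As a rough sketch of that cleanup, assuming daily indices named
logstash-YYYY.MM.DD (the names and the cutoff below are only sample values)
and ES on localhost:9200, you can filter the index names by a cutoff and
delete what falls before it. The loop only prints the commands; drop the
"echo" to actually send them.

```shell
# Print the index names read from stdin that sort before the cutoff name.
# This works because logstash-YYYY.MM.DD names sort chronologically as strings.
indices_older_than() {
    awk -v cutoff="$1" '($0 "") < (cutoff "")'
}

# Sample input: three hypothetical daily indices, cutoff at 2012.10.17.
printf '%s\n' logstash-2012.10.01 logstash-2012.10.17 logstash-2012.10.18 |
    indices_older_than logstash-2012.10.17 |
    while read -r index; do
        # Dry run: prints the DELETE command instead of executing it.
        echo curl -XDELETE "localhost:9200/$index"
    done
```

With GNU date, the cutoff itself could be computed with something like
date -u -d '30 days ago' +logstash-%Y.%m.%d (other date implementations
take different options).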

If you don't have that option, you can use TTL:

and old data will be deleted automatically after the specified time.
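
A minimal sketch of enabling TTL, assuming the _ttl mapping field as it
exists in Elasticsearch releases of this era (later versions may differ).
The index name "logs", the type name "event" and the 30d default are
made-up values for illustration.

```shell
# Build a mapping body that enables _ttl for a type with a default lifetime.
# $1 = type name (hypothetical), $2 = default time-to-live, e.g. 30d
ttl_mapping() {
    printf '{"%s": {"_ttl": {"enabled": true, "default": "%s"}}}\n' "$1" "$2"
}

# Show the body that would be sent.
ttl_mapping event 30d

# To apply it to a running node (not executed here):
# curl -XPUT localhost:9200/logs/event/_mapping -d "$(ttl_mapping event 30d)"
```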

Or, you can manually delete all documents that match a certain query:
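
For instance, here is a sketch of a delete-by-query call, assuming the
_query endpoint of Elasticsearch releases of this era (later versions may
expose this differently). The index "logs" and the query "level:debug" are
hypothetical, and the query is assumed to be URL-safe already.

```shell
# Build the delete-by-query URL for an index and a query-string query.
# $1 = index name, $2 = query in Lucene query-string syntax
delete_by_query_url() {
    printf 'localhost:9200/%s/_query?q=%s\n' "$1" "$2"
}

# Dry run: print the command rather than sending it.
echo "curl -XDELETE '$(delete_by_query_url logs level:debug)'"
```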

Please note that when documents are deleted from an index (as opposed
to when you delete a whole index), they're only marked for deletion,
and will be physically removed when segments are merged. The way it
actually happens depends on the merge policy:

Either way, merging involves fairly heavy I/O activity, which is why
rolling indices are better for performance.

--

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--