Storing large amounts of data in ES

Hi,

We have a use case in which we need to index petabytes of data into ES. I
was assuming that all the indices would be stored in HDFS using the HDFS
gateway. But from the guide, I understand that each ES node will also
maintain a local copy of the indices. Is this the correct interpretation?
If so, what strategy can I use to distribute the data among the ES nodes,
since having that much storage on a single node is not possible?

Also, since the HDFS gateway is deprecated, is there some other way of
storing the indices on HDFS?

Thanks,
Anand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

If you use the local gateway, you don't need a single location to hold all
of your index data.

Index data always needs to be stored locally for searching/indexing
operations (each node holds only the shards allocated to it, not the whole
index), and the local gateway uses this local store for persistence.
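
To illustrate how the data ends up spread out: an index is split into shards when it is created, and the cluster allocates those shards across the nodes. Below is a minimal sketch of the settings body for an index-creation request. The shard and replica counts are made-up values for illustration, and the helper names are hypothetical; the sketch only builds and prints the JSON, it does not talk to a cluster.

```python
import json


def index_settings(number_of_shards, number_of_replicas):
    """Build the JSON body for an index-creation request.

    number_of_shards: how many primary shards the index is split into.
    number_of_replicas: how many extra copies of each primary shard to keep.
    """
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def total_shard_copies(number_of_shards, number_of_replicas):
    """Primaries plus replicas: the number of shard copies spread over nodes."""
    return number_of_shards * (1 + number_of_replicas)


# Hypothetical values: 200 primary shards, one replica of each.
body = index_settings(number_of_shards=200, number_of_replicas=1)
print(json.dumps(body, indent=2))
print(total_shard_copies(200, 1))  # 400 shard copies in total
```

No single node has to hold the whole index; each one holds only the shard copies allocated to it.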

So you need lots of nodes with lots of local storage available on each
node. Remember to take replicas into account if you want to tolerate
losing a node.
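
As a rough back-of-the-envelope sketch of that sizing, with replicas included: all the numbers below (1 PB of index data, one replica, 10 TB of disk per node, filling disks to at most 80%) are assumptions for illustration only, and `nodes_needed` is a hypothetical helper. Note that the on-disk index size is not the same as the raw data size, so you would have to measure it on a sample first.

```python
def nodes_needed(primary_index_bytes, replicas, disk_per_node_bytes,
                 max_fill_percent=80):
    """Estimate how many data nodes are needed to hold an index.

    primary_index_bytes: on-disk size of all primary shards (measured, not raw data).
    replicas: number of replica copies per primary shard.
    disk_per_node_bytes: local disk available to ES on each node.
    max_fill_percent: how full you are willing to let each disk get.
    """
    total_bytes = primary_index_bytes * (1 + replicas)
    usable_per_node = disk_per_node_bytes * max_fill_percent // 100
    # Integer ceiling division, so a partial node counts as a whole node.
    return -(-total_bytes // usable_per_node)


TB = 10**12
PB = 10**15

# Assumed: 1 PB of primary index data, 1 replica, 10 TB disks, 80% max fill.
print(nodes_needed(1 * PB, replicas=1, disk_per_node_bytes=10 * TB))  # 250
```

With no replicas the same assumptions give 125 nodes, so one replica doubles the storage you need.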

Best Regards,
Paul

(In reply to anand nalya's original message of Monday, April 1, 2013 8:07:22 AM UTC-6.)

Hi,

Regarding the HDFS part: do you really want to store indices on HDFS, or
just the (raw) data? Storing indices in HDFS doesn't have a ton of value
other than treating HDFS as a backup with its replication. But if you want
to do that, you can simply copy the indices to HDFS while no writes are
being done on them. If you want to store the raw data in HDFS, you could do
it at write time. Lots of people hook up Kafka or Storm in the indexing
pipeline and use Kafka's and Storm's (or Flume's) support for writing to
HDFS.
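
A minimal sketch of the "copy indices to HDFS" option, assuming the `hadoop` command-line client is installed on the node and that nothing is writing to the index during the copy. Both paths are hypothetical placeholders (the real index directory depends on your `path.data` setting and ES version), and the helper names are made up. By default it only builds the command and does not run it.

```python
import subprocess


def hdfs_put_command(local_index_dir, hdfs_backup_dir):
    """Build the `hadoop fs -put` command that copies a local directory to HDFS."""
    return ["hadoop", "fs", "-put", local_index_dir, hdfs_backup_dir]


def backup_index(local_index_dir, hdfs_backup_dir, dry_run=True):
    """Copy an index directory to HDFS; with dry_run=True, only return the command."""
    cmd = hdfs_put_command(local_index_dir, hdfs_backup_dir)
    if not dry_run:
        # Raises CalledProcessError if the copy fails.
        subprocess.check_call(cmd)
    return cmd


# Hypothetical paths, for illustration only.
cmd = backup_index("/data/es/indices/logs-2013-04", "/backups/es/logs-2013-04")
print(" ".join(cmd))
```

Making sure writes are really stopped before the copy is up to you; the sketch does not check for that.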

Otis
