Does anyone know whether Elasticsearch is, or has been, used in a production
environment? Specifically, I'm looking at a case where one index would
receive ~3m new records per week, on top of a starting set of ~600m. While
we wouldn't have a huge user base, downtime and data loss would be a huge
issue.
I guess I'm slightly discouraged by the number of issues (likely operator
error) I've been having with both ingest and general cluster health with ES
(I lost an unreplicated shard today), and I was hoping for a "yeah, we've
been there, and we are processing 10x your data needs with no issues..."
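To put the numbers in the question in perspective, here is a rough sizing
sketch. It only uses the figures quoted above (~600m starting documents,
~3m new documents per week); the function names and the 52-week horizon are
made up for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def projected_docs(initial, per_week, weeks):
    """Total document count after `weeks` of steady ingest."""
    return initial + per_week * weeks


def average_docs_per_second(per_week):
    """Average sustained ingest rate implied by a weekly volume."""
    return per_week / SECONDS_PER_WEEK


if __name__ == "__main__":
    # Figures from the question; the one-year horizon is an assumption.
    total = projected_docs(600_000_000, 3_000_000, weeks=52)
    rate = average_docs_per_second(3_000_000)
    print("documents after one year:", total)
    print("average docs/second: %.2f" % rate)
```

Real ingest is rarely this even, so peak rates are probably the more useful
number to plan around.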
I am up to 200M documents inserted per day using HTTP REST API bulk
insertion with refresh settings from 5s to 20s.
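For readers who haven't used bulk insertion, a minimal sketch of how a bulk
request body might be assembled. This is not the poster's actual code: the
index name "events", the type name "event", the sample documents, and the
batch size of 2 are all hypothetical. The sketch only builds the
newline-delimited body and does not send anything to a cluster.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Serialize (doc_id, source) pairs into a newline-delimited bulk body."""
    lines = []
    for doc_id, source in docs:
        # Action line: describes what to do with the source line after it.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical sample data: five small request-log documents.
docs = [(str(i), {"user": "u%d" % i, "path": "/home"}) for i in range(5)]

# One body per batch; each would be sent as the payload of a POST to /_bulk.
bodies = [build_bulk_body("events", "event", chunk) for chunk in batches(docs, 2)]
```

Batching many documents per request is what separates this from indexing
one document per HTTP call.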
The only problems I've had so far:
Transaction logs take up a lot of space, which I believe 0.17.3 will
address.
Some operational issues had catastrophic impact, such as not having enough
open files and the index getting corrupted.
A smaller Java heap ran out of memory, and the resulting crash sometimes
corrupted the index. I tried to reproduce this, but corruption on crash
doesn't occur in all cases.
Index and search response can be slow for a few minutes when starting up
ES with really large indexes.
I would like a way to recover an index short of re-indexing, as 6 months
down the road, I do not want to re-index everything should something
unexpected happen.
Can you (and the other people who responded) say something about your
setup? It would be very interesting to know what configuration you're using
to support this index.
I'm particularly interested in:
How many servers, and roughly what's their hardware configuration?
How many shards?
How many replicas?
Are they all on the same index?
How many documents per index?
What kind of performance are you getting?
Anything else interesting?
We just went into production this week with 0.17.0 on Amazon EC2.
Two large instances running Tomcat, with an Amazon Elastic Load
Balancer out in front.
Two large instances that run MongoDB (using replica sets) and
Elasticsearch. We are using the default settings for shards/replicas
at the moment for Elasticsearch.
We have a micro AMI with only Elasticsearch on it so that we can bring
up or take down one or more Elasticsearch nodes at any given time.
At this point we are only in the thousands as far as documents go, but
our application records every request that a user makes so that will
quickly grow.
All I can say about performance is that it's been fast and it's been
reliable. We had an issue with the transaction log and too many open
files (we had our limit set at 32k) the other day after we imported 10
thousand users without using a bulk request in our Grails application.
0.17.3 should fix the transaction log issues.
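Since open-file limits have bitten more than one person in this thread, here
is a small sketch for checking the limit that applies to the current
process. It uses Python's standard `resource` module (Unix only). The 32000
threshold simply mirrors the 32k figure mentioned above and is not a
recommendation; the helper names are made up.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(soft, required):
    """True if the soft limit is unlimited or at least `required`."""
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = open_file_limits()
    required = 32000  # assumption: taken from the 32k limit mentioned above
    print("soft limit:", soft, "hard limit:", hard)
    if not limit_is_sufficient(soft, required):
        print("soft limit is below", required)
```

Note that this reports the limit for the Python process itself. The limit
that matters is the one in effect for the user and shell that start the
Elasticsearch node.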
Elasticsearch is amazing. It was so easy to set up and scale out. I
think that MongoDB and Elasticsearch make a great couple in an
environment like EC2.
Can you (and the other people who responded) say something about your
setup? It would be very interesting to know what configuration you're using
to support this index.
I'm particularly interested in:
How many servers, and roughly what's their hardware configuration?
How many shards?
Depends on the index size, typically 1-5 (indexes only using meta
fields have 1; indexes using lots of full text extracted from PDFs get
more).
How many replicas?
1, but we are adding more storage to increase this.
Are they all on the same index?
We have ~50 indexes. Some for historical reasons, others for
operational efficiency. We've found that when an index gets very large, the
pain of handling incompatible field mappings, which require a rebuild,
becomes a motivation for capping indexes.
How many documents per index?
Thousands to a few million (depending on the application).
Regarding the crashes, both Lucene itself and Elasticsearch go to great
lengths not to corrupt the data in case of out of memory or other
problems, like running out of open file handles. I actually have several
(non-automated) tests that I run regularly that simulate those problems
with no data loss. If you can help in trying to recreate what you saw, it
would be great!
Regarding recovery of data, I have been thinking about, and have mentioned
on the mailing list, the ability to snapshot an index to shared storage
when using the local gateway (basically, combining the shared gateway
snapshot capabilities with the local gateway). The main thing to note with
this is that it takes time to transfer a large amount of data back to the
nodes in case a full recovery is needed...
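To get a feel for the transfer-time caveat, a back-of-the-envelope sketch.
The index size and throughput below are purely hypothetical inputs, not
measurements from anyone's cluster, and the function name is made up.

```python
def transfer_hours(data_gb, throughput_mb_per_s):
    """Hours needed to move `data_gb` gigabytes at a sustained throughput."""
    seconds = data_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Assumptions for illustration only: a 500 GB index restored from
    # shared storage at a sustained 100 MB/s.
    hours = transfer_hours(500, 100)
    print("estimated restore time: %.1f hours" % hours)
```

The estimate ignores any recovery work the nodes do after the files arrive,
so it is at best a lower bound.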