Elasticsearch as a primary database

How important is the data?

It is important :)
At insert time, losing 50 or 100 packets out of 1,000,000 is acceptable,
but after insert, losing data is not acceptable.

Similar use case, but we never use Elasticsearch as a primary database. Once the data is in our databases (mostly SQL), we transform it and store it in an Elasticsearch cluster for analysis and some ad hoc projects. We do not use ES as the primary store because our systems were built long ago and they are critical. If you are building completely new systems, you have more freedom.


Make sure you read https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

I checked this link before asking the question,
but I don't understand what exactly it means...

Some customers use Elasticsearch as a primary datastore, some set-up comprehensive back-up solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for so many different use cases which is why we have created this page to make sure you are fully informed when you are architecting your system.

Basically, you should be aware that something can go wrong, but we are trying hard to make sure that never happens.

Is the most important issue losing data at insert time?
Or can something go wrong after the data has been indexed in Elasticsearch?

Yes. In some extreme conditions the cluster can become RED, which means that some primary shards are no longer available.

So in this situation, is the data completely lost? Can we not restore it?
What is your recommendation? Should we use Elasticsearch as a primary database, or should we use another database? If we should use another one, which database works best alongside Elasticsearch?

No. Not completely. Some shards might be missing. Not all.

It depends on how the RED status actually occurred, but it might be hard to recover from that situation.

The one I already pasted:

Some customers use Elasticsearch as a primary datastore, some set-up comprehensive back-up solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for so many different use cases which is why we have created this page to make sure you are fully informed when you are architecting your system.

So it really depends on your case. IMO you have 3 options:

  • You absolutely don't care about losing part of your data, say "non-critical logs". Then maybe you will lose 1 day of data at some point, but maybe that's not critical for you and it isn't worth adding another server for storage.
  • You absolutely care about your data and you want to be able to reindex in all cases. For that you need a datastore. A datastore can be a filesystem where you store JSON, HDFS, and/or a database you prefer and are confident with. About how to inject data into it, you may want to read: http://david.pilato.fr/blog/2015/05/09/advanced-search-for-your-legacy-application/.
  • Well. You know that you can lose data in some extreme corner cases but you don't want to pay the price of other servers. Do backups, use Elasticsearch replicas (increase to 2)... But you know the risk.


How are you getting your packet information into ES ? If you have it go through Logstash at some point in your pipeline you could easily just configure another output to a more robust datastore if you're afraid of losing data. If it's just for backup purposes I'd probably just dump it in a compressed file on my SAN through the File output plugin.
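To make the "compressed file as backup" idea concrete, here is a minimal Python sketch of the same pattern outside Logstash. It is not the File output plugin itself: the function names, the file name and the event fields are all made up for illustration. It appends events as newline-delimited JSON to a gzip file and reads them back, which is the piece you would need if you ever had to reindex.

```python
import gzip
import json
import os
import tempfile


def append_events(path, events):
    """Append events to a gzip-compressed, newline-delimited JSON file."""
    # Mode "at" adds a new gzip member each time; readers see one stream.
    with gzip.open(path, "at", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(path):
    """Yield every archived event, e.g. to feed a reindexing job."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Hypothetical packet records; real events would come from the pipeline.
archive_path = os.path.join(tempfile.mkdtemp(), "packets.ndjson.gz")
append_events(archive_path, [{"src": "10.0.0.1", "bytes": 60}])
append_events(archive_path, [{"src": "10.0.0.2", "bytes": 1500}])
restored = list(read_events(archive_path))
```

Appending in small batches keeps the writer simple; a real setup would probably also rotate files by hour or day so that a partial reindex only has to read the relevant files.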

PS : 50K packets per second hardly seems like a small office's activity, that's almost what we get on one of our datacenters for 7K users :smiley:

PPS : also, unless you specifically require to store a copy of each and every network packet, I think you should look into using Netflow/IPFIX instead, that'd probably make it easier on your ES cluster than having all those packets going into it :wink:

I would say use Cassandra as the primary database and Elasticsearch for analysis.

I've consulted on two projects with 15,000 events per second. Elasticsearch puts indexing requests in the Indexing Request Queue. When that overflows, ES will reject indexing requests and lose data. To prevent data loss it is prudent to use Kafka as a buffer to level out traffic.

Read the resiliency page in the right context. Elastic is perfectionist and DataStax is cavalier. The lack of file checksums was identified and has now been fixed by Elastic. I asked a DataStax Solutions Architect if Cassandra used file checksums and he said he didn't know.

Elasticsearch uses two-phase commit to update cluster state. Cassandra gossips cluster state among nodes, which makes the eventual consistency model very complicated and requires repairs. For writing data, the behavior with WRITE_CONSISTENCY=ALL differs between the two. If not all shards can be written, the document-index operation is rolled back by Elasticsearch. Cassandra warns the application to do a rollback.

Unlike Couchbase, Cassandra cannot determine the most recent updates and cannot update a parallel Elasticsearch cluster. Writing a Cassandra secondary-indexing plugin for Elasticsearch is very complicated, and will result in duplicate documents that must be filtered after search. It is almost impossible to use Cassandra with Elasticsearch effectively.

In any system, there is a long multi-minute latency to satisfy a query over 4.2 billion documents. A 7-day query would be 29 billion documents. The only systems capable of handling this use case are Hadoop/Spark and/or Elasticsearch.
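As a rough check on those volumes, the arithmetic can be sketched as below. The 50,000 events per second is an assumption taken from the rate mentioned earlier in the thread; the figures quoted above are a little lower, so they may have been derived from a slightly different rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def documents_in_window(events_per_second, days):
    """Documents accumulated at a constant ingest rate over a number of days."""
    return events_per_second * SECONDS_PER_DAY * days


# Assumed rate: 50,000 events per second (from earlier in the thread).
per_day = documents_in_window(50_000, 1)
per_week = documents_in_window(50_000, 7)
print(f"{per_day:,} documents per day")       # 4,320,000,000
print(f"{per_week:,} documents per 7 days")   # 30,240,000,000
```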

Exact choices depend on more details of the use case.

  • One architecture is HDFS or S3 as event storage with Spark programs that write results into Elasticsearch for visualization.
  • Another choice is Elasticsearch only, with Elasticsearch Scroll programs that run on periodic jobs. (See http://www.leapfire.com/elasticsearchjoin.html)
  • For analysis on time periods for 24 hours or less, consider Spark Streaming for short-term disposable summaries.

For one of my clients, I consulted for just 5 hours and they said I saved them weeks of work in trying to make a good decision. As part of that, I did a one-hour presentation on Time-Series Event Systems.

Consider Kafka, Logstash, Elasticsearch, Hadoop, HDFS, S3, Spark, and Spark Streaming.



It's up to the client to retry. ES has told the client that it cannot accept anything else, so ES is not losing anything.


Well, ES is not losing the data, the ETL pipeline is losing the data. Kafka with a Logstash consumer is a lossless ETL pipeline.

Another reason Elasticsearch is a better primary store is that Snapshots are Consistent at a time point. This is a natural benefit of Lucene's immutable files. Cassandra backups are Eventually Consistent and are difficult to restore. I saw co-workers doing a Cassandra restore; it was not a good day for them.

For complete disaster recovery, restore a Snapshot from time T, then have Logstash consume messages after time T from Kafka. This will result in reading some messages twice. If the application assigns document ids (or Logstash computes them from other fields), then indexing requests will just replace a duplicate in Elasticsearch.
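Why the double reads are harmless can be shown with a small Python sketch. A plain dict stands in for the Elasticsearch index here, and the field names used to build the ID are hypothetical; the only point is that an ID computed from the event's own fields makes replaying a message an overwrite rather than a duplicate.

```python
import hashlib
import json


def document_id(event, id_fields=("timestamp", "src", "dst", "seq")):
    """Derive a stable ID from selected fields (hypothetical field names)."""
    key = json.dumps([event.get(f) for f in id_fields], separators=(",", ":"))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def replay(index, events):
    """Index events into a dict that stands in for an Elasticsearch index."""
    for event in events:
        # Same ID means the document is replaced, never duplicated.
        index[document_id(event)] = event
    return index


snapshot_time = 100
log = [
    {"timestamp": 90, "src": "a", "dst": "b", "seq": 1},
    {"timestamp": 100, "src": "a", "dst": "b", "seq": 2},
    {"timestamp": 110, "src": "a", "dst": "b", "seq": 3},
]

# State as restored from a snapshot taken at time T.
index = replay({}, [e for e in log if e["timestamp"] <= snapshot_time])
# Replay from a little before T so nothing is missed; overlap is re-read.
replay(index, [e for e in log if e["timestamp"] >= snapshot_time - 10])
```

After the replay the stand-in index holds exactly three documents, even though the first two events were processed twice.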

If a shard goes Red and can not be restored, this disaster recovery method will work well.

As a distributed primary store for Time-Series Event data, nothing is more solid than Elasticsearch using these techniques.



I think Kafka does not support random access or filtering messages by time. How do you handle this issue?

Just so we're all clear: this is your product you are promoting.

There's no issue with that, but it's always good to be transparent when suggesting commercial solutions.


No, it's the client's responsibility to back off and resend these requests. They would never be acknowledged by Elasticsearch.
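A minimal sketch of what "back off and resend" can look like on the client side, in Python. The `send` callable and the `Rejected` exception are hypothetical stand-ins for whatever client you use; the assumption is that a rejection (Elasticsearch signals these with HTTP 429) surfaces as an exception that the caller can catch.

```python
import random
import time


class Rejected(Exception):
    """Hypothetical error raised when the cluster rejects a request."""


def send_with_backoff(send, batch, max_attempts=5, base_delay=0.1,
                      sleep=time.sleep):
    """Resend a rejected batch with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except Rejected:
            if attempt == max_attempts - 1:
                raise  # give up; the caller must keep or spool the batch
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


# Demo with a fake sender that rejects the first two attempts.
attempts = []


def flaky_send(batch):
    attempts.append(len(batch))
    if len(attempts) < 3:
        raise Rejected("queue full")
    return "acknowledged"


delays = []  # record the waits instead of actually sleeping
result = send_with_backoff(flaky_send, ["doc1", "doc2"], sleep=delays.append)
```

The batch is only dropped if the caller chooses to drop it after the final failure, which is why the loss, if any, happens in the pipeline and not in Elasticsearch.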

Write consistency all means that if not all replica shards are available at the start of an indexing request (up to a certain time window), Elasticsearch does not attempt to index the data at all and fails the request. If the write consistency check is met, the indexing operation is performed on the primary and sent to the replicas. If a replica then fails to acknowledge the indexing request, we fail the replica and indicate in the response to the client that the write was not successful on all replicas. There is no rollback here.

Also, it's important to note that write consistency has been replaced by wait_for_active_shards, which we think clarifies that it is a pre-flight check.
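The behaviour described above can be illustrated with a toy Python model. This is not Elasticsearch code: the shard copies are plain dicts, the `accepts_writes` flag is invented to simulate a replica that fails mid-request, and the primary is assumed to be the first copy and to always succeed. It only shows the order of events: a pre-flight check, then the write, and no rollback if a replica fails.

```python
def index_document(doc, copies, wait_for_active_shards):
    """Toy model of one indexing request; copies[0] is the primary."""
    active = [c for c in copies if c["active"]]

    # Pre-flight check: too few active copies means nothing is indexed.
    if len(active) < wait_for_active_shards:
        return {"indexed": False, "reason": "not enough active shard copies"}

    successful, failed = 0, 0
    for copy in active:
        if copy["accepts_writes"]:
            copy["docs"].append(doc)
            successful += 1
        else:
            copy["active"] = False  # the failing replica is failed
            failed += 1

    # No rollback: the primary keeps the document even if a replica failed.
    return {"indexed": True,
            "_shards": {"total": len(copies),
                        "successful": successful,
                        "failed": failed}}


def make_copy(active=True, accepts_writes=True):
    return {"active": active, "accepts_writes": accepts_writes, "docs": []}


# Case 1: a replica is down before the request starts.
primary_a = make_copy()
refused = index_document({"id": 1}, [primary_a, make_copy(active=False)], 2)

# Case 2: the check passes, then the replica fails during the write.
primary_b = make_copy()
bad_replica = make_copy(accepts_writes=False)
partial = index_document({"id": 1}, [primary_b, bad_replica], 2)
```

In the first case the primary never sees the document. In the second the response reports one failed copy while the primary still holds the document.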