Elasticsearch as a primary database

hossein_ey · May 15, 2017, 9:59am

We have a project that inserts 50,000 rows per second into the database.
Our tables have a schema...
After the inserts, we process the data and produce reports and analysis...
Is using Elasticsearch as a primary database a good solution for our case, or should we use another database and use Elasticsearch just for analysis?
And if I want to use another database alongside Elasticsearch, which database is recommended for use with it?

warkolm · May 15, 2017, 10:17am

There's a lot of "it depends" here.
What sort of data, what sort of analysis?

hossein_ey · May 15, 2017, 10:41am

Our data is network packets from local networks, for example a small office network.
We analyse this network data to detect trends and to monitor network usage.

For now, our analysis consists of aggregations and filters on some fields to identify trends and draw some charts.

warkolm · May 15, 2017, 10:46am

How important is the data?

hossein_ey · May 15, 2017, 10:56am

It is important : )
At insert time, losing 50 or 100 packets per 1,000,000 is acceptable,
but after insert, losing data is not acceptable.

bobby259 · May 15, 2017, 2:40pm

Similar use case. But we never use Elasticsearch as a primary database. Once the data is in our databases (mostly SQL), we transform it and store it on an Elasticsearch cluster for analysis and some ad hoc projects, but we do not use ES as the primary store. That is because our systems were built long ago and they are critical. But if you are building completely new systems, then you have more freedom.

warkolm · May 15, 2017, 9:01pm

Make sure you read https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

hossein_ey · May 16, 2017, 5:31am

I checked this link before asking the question,
but I don't understand what exactly it means...

dadoonet · May 16, 2017, 5:45am

Some customers use Elasticsearch as a primary datastore, some set-up comprehensive back-up solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for so many different use cases which is why we have created this page to make sure you are fully informed when you are architecting your system.

Basically, you should be aware that something can go wrong, but we are trying hard to make sure that never happens.

hossein_ey · May 16, 2017, 6:03am

Is the most important issue losing data at insert time?
Or can something go wrong after the data has been indexed in Elasticsearch?

dadoonet · May 16, 2017, 6:48am

Yes.

Yes. In some extreme conditions the cluster can become RED, which means that some primary shards are no longer available.

hossein_ey · May 16, 2017, 7:03am

So in this situation, is the data completely lost? Can we not restore it?
What is your recommendation? Should we use Elasticsearch as a primary database or use another database? If we should use another, which database works best with Elasticsearch?

dadoonet · May 16, 2017, 7:30am

No. Not completely. Some shards might be missing. Not all.

It depends, but it might be hard to recover from that situation, depending on how the RED state actually occurred.

The one I already pasted:

Some customers use Elasticsearch as a primary datastore, some set-up comprehensive back-up solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for so many different use cases which is why we have created this page to make sure you are fully informed when you are architecting your system.

So it really depends on your case. IMO you have 3 options:

You absolutely don't care about losing part of your data. Let's say "non-critical logs". Then maybe you will lose 1 day of data at some point, but maybe that is not critical for you and it is actually not worth adding another server for storage.
You absolutely care about your data and you want to be able to reindex in all cases. For that you need a datastore. A datastore can be a filesystem where you store JSON, HDFS, and/or a database you prefer and are confident with. About how to inject data into it, you may want to read: https://david.pilato.fr/blog/2015-05-09-advanced-search-for-your-legacy-application/.
You know that you can lose data in some extreme corner cases, but you don't want to pay the price of other servers. Do backups, use Elasticsearch replicas (increase to 2)... But you know the risk.
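
To illustrate the "increase replicas to 2" part of the last option: the replica count is an index setting that can be changed on an existing index. A minimal Python sketch, assuming a cluster at http://localhost:9200 and an index named packets (both hypothetical names chosen for illustration). It only builds the request and does not send it:

```python
import json
from urllib.request import Request


def replica_settings_request(base_url: str, index: str, replicas: int) -> Request:
    """Build (but do not send) a PUT <index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical cluster address and index name.
req = replica_settings_request("http://localhost:9200", "packets", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing req to urllib.request.urlopen would apply the change. Two replicas means three copies of each shard, so the cluster needs at least three data nodes before all copies can be allocated.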

n.maire · May 18, 2017, 8:28am

Hi,

How are you getting your packet information into ES? If it goes through Logstash at some point in your pipeline, you could easily configure another output to a more robust datastore if you're afraid of losing data. If it's just for backup purposes, I'd probably just dump it in a compressed file on my SAN through the File output plugin.

PS: 50K packets per second hardly seems like a small office's activity; that's almost what we get on one of our datacenters for 7K users.

PPS: also, unless you specifically need to store a copy of each and every network packet, I think you should look into using Netflow/IPFIX instead. That would probably be easier on your ES cluster than having all those packets going into it.

Arun_Palanisamy · May 19, 2017, 5:50pm

I would say use Cassandra as the primary database and use Elasticsearch for analysis.

geena.rollins · May 19, 2017, 8:21pm

I've consulted on two projects with 15,000 events per second. Elasticsearch puts indexing requests in the Indexing Request Queue. When that overflows, ES will reject indexing requests and lose data. To prevent data loss it is prudent to use Kafka as a buffer to level out traffic.
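
When Elasticsearch rejects a request because a queue is full, it answers with HTTP status 429, and the sender still holds the batch. A minimal Python sketch of the sender's side, assuming a hypothetical `send` callable that submits a batch and returns the status code (the function name, parameters, and delays are illustrative and not part of any client library):

```python
import time
from typing import Callable, Sequence

TOO_MANY_REQUESTS = 429  # status returned when a request is rejected because a queue is full


def send_with_retry(
    send: Callable[[Sequence[dict]], int],
    batch: Sequence[dict],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one batch, backing off while the server rejects it.

    Returns True once the batch is accepted, False if it was not.
    On False the caller still owns the batch and must keep it somewhere
    durable (for example a buffer such as Kafka) instead of dropping it.
    """
    for attempt in range(max_attempts):
        status = send(batch)
        if status != TOO_MANY_REQUESTS:
            return 200 <= status < 300
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return False
```

Data is only lost if the sender gives up and discards the batch, which is why a durable buffer in front of the indexing step helps at this ingest rate.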

Read the resiliency page in the right context. Elastic is perfectionist and DataStax is cavalier. The lack of file checksums was identified and has now been fixed by Elastic. I asked a DataStax Solutions Architect if Cassandra used file checksums and he said he didn't know.

Elasticsearch uses two-phase commit to update cluster state. Cassandra gossips cluster state among nodes, which makes the eventual consistency model very complicated and in need of repairs. For writing data, the behavior with WRITE_CONSISTENCY=ALL differs between the two. If not all shards can be written, the document-index operation is rolled back by Elasticsearch. Cassandra warns the application to do a rollback.

Unlike Couchbase, Cassandra cannot determine the most recent updates and cannot update a parallel Elasticsearch cluster. Writing a Cassandra secondary-indexing plugin for Elasticsearch is very complicated, and will result in duplicate documents that must be filtered after search. It is almost impossible to use Cassandra with Elasticsearch effectively.

In any system, there is a long multi-minute latency to satisfy a query over 4.2 billion documents. A 7-day query would be 29 billion documents. The only systems capable of handling this use case are Hadoop/Spark and/or Elasticsearch.
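
For a sense of scale, here is a quick check of the volumes involved, assuming the 50,000 rows per second from the original question is sustained around the clock (an assumption; the post does not say how its figures were derived):

```python
EVENTS_PER_SECOND = 50_000        # rate from the original question, assumed constant
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400

per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
per_week = per_day * 7

print(f"{per_day:,} documents per day")    # 4,320,000,000
print(f"{per_week:,} documents per week")  # 30,240,000,000
```

Under that assumption the totals land in the same range as the 4.2 billion and 29 billion quoted above.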

Exact choices depend on more details of the use case.

One architecture is HDFS or S3 as event storage with Spark programs that write results into Elasticsearch for visualization.
Another choice is Elasticsearch only, with Elasticsearch Scroll programs that run as periodic jobs. (See http://www.leapfire.com/elasticsearchjoin.html)
For analysis on time periods of 24 hours or less, consider Spark Streaming for short-term disposable summaries.

For one of my clients, I consulted for just 5 hours and they said I saved them weeks of work in trying to make a good decision. As part of that, I did a one-hour presentation on Time-Series Event Systems.

Consider Kafka, Logstash, Elasticsearch, Hadoop, HDFS, S3, Spark, and Spark Streaming.

...Geena

warkolm · May 19, 2017, 9:20pm

It's up to the client to retry. ES has told it that it cannot accept anything else, so ES is not losing anything.

geena.rollins · May 19, 2017, 9:37pm

Well, ES is not losing the data, the ETL pipeline is losing the data. Kafka with a Logstash consumer is a lossless ETL pipeline.

geena.rollins · May 19, 2017, 10:05pm

Another reason Elasticsearch is a better primary store is that snapshots are consistent at a point in time. This is a natural benefit of Lucene's immutable files. Cassandra backups are eventually consistent and are difficult to restore. I saw co-workers doing a Cassandra restore; it was not a good day for them.

For complete disaster recovery, restore a Snapshot from time T, then have Logstash consume messages after time T from Kafka. This will result in reading some messages twice. If the application assigns document ids (or Logstash computes them from other fields), then indexing requests will just replace a duplicate in Elasticsearch.
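
The idea of replaying messages safely can be sketched in Python. This is only an illustration: a plain dict stands in for the Elasticsearch index, and the field names used to build the id are hypothetical. The point is that an id derived from the event's own fields makes a replayed message overwrite its earlier copy instead of creating a duplicate.

```python
import hashlib
import json


def document_id(event: dict, key_fields: tuple) -> str:
    """Derive a stable id from the fields that identify an event."""
    key = json.dumps([event.get(f) for f in key_fields], default=str)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def index_events(index: dict, events, key_fields: tuple) -> None:
    """Stand-in for indexing: writing an existing id replaces the document."""
    for event in events:
        index[document_id(event, key_fields)] = event


# Hypothetical fields assumed to identify one packet record.
KEY_FIELDS = ("timestamp", "src_ip", "dst_ip", "seq")

events = [
    {"timestamp": "2017-05-19T22:00:00Z", "src_ip": "10.0.0.1",
     "dst_ip": "10.0.0.2", "seq": 1, "bytes": 60},
    {"timestamp": "2017-05-19T22:00:01Z", "src_ip": "10.0.0.1",
     "dst_ip": "10.0.0.2", "seq": 2, "bytes": 1500},
]

index = {}
index_events(index, events, KEY_FIELDS)  # original ingest
index_events(index, events, KEY_FIELDS)  # replay after restoring a snapshot
print(len(index))  # 2
```

This only works if the chosen fields really are unique per event; two distinct events that share all key fields would collapse into one document.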

If a shard goes Red and cannot be restored, this disaster recovery method will work well.

As a distributed primary store for Time-Series Event data, nothing is more solid than Elasticsearch using these techniques.

...Geena

hossein_ey · May 20, 2017, 8:15am

I think Kafka does not support random access or filtering messages by time. How do you handle this issue?
