Elasticsearch as a primary database


(Hossein Eivazy) #1

We have a project that inserts 50,000 rows per second into the database.
Our tables have a schema...
After the inserts, we process the data and produce reports and analysis...
Is it a good solution to use Elasticsearch as the primary database in our case, or should we use another database and use Elasticsearch only for analysis?
And if I want to use another database alongside Elasticsearch, which database is recommended?


(Mark Walkom) #2

There's a lot of "it depends" here.
What sort of data, what sort of analysis?


(Hossein Eivazy) #3

Our data is network packets from local networks, for example a small office network.
We analyse this network data to detect trends and to monitor network usage.

For now, our analysis is aggregations and filters on some fields to identify trends and draw some charts.


(Mark Walkom) #4

How important is the data?


(Hossein Eivazy) #5

It is important : )
At insert time, losing 50 or 100 packets per 1,000,000 is acceptable,
but after insert, losing data is not acceptable.


#6

Similar use case. But we never use Elasticsearch as a primary database. Once the data is in our databases (mostly SQL), we transform it and store it on an Elasticsearch cluster for analysis and some ad hoc projects, but we do not use ES as the primary. That's because our systems were built long ago and they are critical. If you are building completely new systems, though, you have more freedom.


(Mark Walkom) #7

Make sure you read https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html


(Hossein Eivazy) #8

I checked this link before asking the question,
but I don't understand what exactly it means...


(David Pilato) #9

Some customers use Elasticsearch as a primary datastore, some set-up comprehensive back-up solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for so many different use cases which is why we have created this page to make sure you are fully informed when you are architecting your system.

Basically, you should be aware that something can go wrong, but we are trying hard to make sure that never happens.


(Hossein Eivazy) #10

Is the most important issue losing data at insert time?
Or can something go wrong after the data has been indexed in Elasticsearch?


(David Pilato) #11

Yes.

Yes. The cluster can in some extreme conditions become RED which means that some primary shards are not available anymore.


(Hossein Eivazy) #12

So in this situation the data is completely lost? We cannot restore it?
What is your recommendation? Should we use Elasticsearch as the primary database or use another database? If we should use another, which database fits best with Elasticsearch?


(David Pilato) #13

No. Not completely. Some shards might be missing. Not all.

It depends, but it might be hard to recover from that situation, depending on how the RED actually occurred.

The one I already pasted:

Some customers use Elasticsearch as a primary datastore, some set-up comprehensive back-up solutions using features such as our Snapshot and Restore, while others use Elasticsearch in conjunction with a data storage system like Hadoop or even flat files. Elasticsearch can be used for so many different use cases which is why we have created this page to make sure you are fully informed when you are architecting your system.

So it really depends on your case. IMO you have 3 options:

  • You absolutely don't care about losing part of your data. Let's say "non-critical logs". Then maybe you will lose 1 day of data at some point, but maybe that's not critical for you and it's not worth adding another server for storage.
  • You absolutely care about your data and you want to be able to reindex in all cases. For that you need a datastore. A datastore can be a filesystem where you store JSON, HDFS, and/or a database you prefer and are confident with. About how to inject data into it, you may want to read: http://david.pilato.fr/blog/2015/05/09/advanced-search-for-your-legacy-application/.
  • Well. You know that you can lose data in some extreme corner cases, but you don't want to pay the price of other servers. Do backups, use Elasticsearch replicas (increase to 2)... But you know the risk.
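
As a rough illustration of the "increase replicas to 2" part of the third option, here is a minimal Python sketch that builds an update-index-settings request (PUT on the index's `_settings` endpoint). The host `http://localhost:9200` and the index name `packets-2016.01.01` are made up for the example, and the request is only built, not sent.

```python
import json
import urllib.request


def build_replica_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build (but do not send) a request that sets number_of_replicas on one index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and index name, used only for illustration.
req = build_replica_request("http://localhost:9200", "packets-2016.01.01", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually apply the setting: urllib.request.urlopen(req)
```

Replicas only protect against losing a node; they are not a substitute for the backups mentioned in the same bullet.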

(Nicolas Maire) #14

Hi,

How are you getting your packet information into ES ? If you have it go through Logstash at some point in your pipeline you could easily just configure another output to a more robust datastore if you're afraid of losing data. If it's just for backup purposes I'd probably just dump it in a compressed file on my SAN through the File output plugin.

PS : 50K packets per second hardly seems like a small office's activity, that's almost what we get on one of our datacenters for 7K users :smiley:

PPS : also, unless you specifically require to store a copy of each and every network packet, I think you should look into using Netflow/IPFIX instead, that'd probably make it easier on your ES cluster than having all those packets going into it :wink:


(Arun Palanisamy) #15

I would say use Cassandra as the primary database and Elasticsearch for analysis.


(Geena Rollins) #16

I've consulted on two projects with 15,000 events per second. Elasticsearch puts indexing requests in the Indexing Request Queue. When that overflows, ES will reject indexing requests and lose data. To prevent data loss it is prudent to use Kafka as a buffer to level out traffic.

Read the resiliency page in the right context. Elastic is perfectionist and DataStax is cavalier. The lack of file checksums was identified and has now been fixed by Elastic. I asked a DataStax Solutions Architect if Cassandra used file checksums and he said he didn't know.

Elasticsearch uses two-phase commit to update cluster state. Cassandra gossips cluster state among nodes making the eventual consistency model very complicated and needing repairs. For writing data, turning on WRITE_CONSISTENCY=ALL differs. If not all shards can be written, the document-index operation is rolled back by Elasticsearch. Cassandra warns the application to do a roll back.

Unlike Couchbase, Cassandra cannot determine the most recent updates and cannot update a parallel Elasticsearch cluster. Writing a Cassandra secondary-indexing plugin for Elasticsearch is very complicated, and will result in duplicate documents that must be filtered after search. It is almost impossible to use Cassandra with Elasticsearch effectively.

In any system, there is a long multi-minute latency to satisfy a query over 4.2 billion documents. A 7-day query would be 29 billion documents. The only systems capable of handling this use case are Hadoop/Spark and/or Elasticsearch.
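
A quick back-of-the-envelope check of those volumes, assuming they are derived from the 50,000 rows per second in the original question (the post does not say how they were computed). It lands in the same range: roughly 4.3 billion documents per day and roughly 30 billion per week.

```python
# Assumption: sustained rate taken from the original question.
EVENTS_PER_SECOND = 50_000
SECONDS_PER_DAY = 24 * 60 * 60

per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
per_week = per_day * 7

print(f"per day:  {per_day:,}")   # 4,320,000,000
print(f"per week: {per_week:,}")  # 30,240,000,000
```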

Exact choices depend on more details of the use case.

  • One architecture is HDFS or S3 as event storage with Spark programs that write results into Elasticsearch for visualization.
  • Another choice is Elasticsearch only, with Elasticsearch Scroll programs that run on periodic jobs. ( See http://www.leapfire.com/elasticsearchjoin.html )
  • For analysis on time periods for 24 hours or less, consider Spark Streaming for short-term disposable summaries.

For one of my clients, I consulted for just 5 hours and they said I saved them weeks of work in trying to make a good decision. As part of that, I did a one-hour presentation on Time-Series Event Systems.

Consider Kafka, Logstash, Elasticsearch, Hadoop, HDFS, S3, Spark, and Spark Streaming.

...Geena


(Mark Walkom) #17

It's up to the client to retry, ES has told it it cannot accept anything else, so it's not losing anything.
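
A minimal sketch of what "the client retries" can look like, using exponential backoff. The `send` callable and the `Rejected` exception are hypothetical stand-ins for whatever your indexing client does when Elasticsearch rejects a request (for example with an HTTP 429 response); they are not part of any real library.

```python
import time
from typing import Callable, Sequence


class Rejected(Exception):
    """Hypothetical: raised by `send` when the cluster rejects the batch."""


def send_with_retry(
    send: Callable[[Sequence], None],
    batch: Sequence,
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try to send `batch`, backing off exponentially on rejection.

    Returns the number of attempts used; re-raises Rejected if all attempts fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            send(batch)
            return attempt
        except Rejected:
            if attempt == max_attempts:
                raise  # caller must keep the batch (e.g. spool it to disk)
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Demo with a fake sender that rejects the first two attempts.
calls = {"n": 0}

def flaky_send(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise Rejected("indexing queue full")

delays = []
attempts = send_with_retry(flaky_send, ["doc1", "doc2"], sleep=delays.append)
print(attempts, delays)
```

If the retries are exhausted, the data is still in the client's hands, so whether it gets lost depends on what the client does next.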


(Geena Rollins) #18

Well, ES is not losing the data, the ETL pipeline is losing the data. Kafka with a Logstash consumer is a lossless ETL pipeline.


(Geena Rollins) #19

Another reason Elasticsearch is a better primary store is that Snapshots are Consistent at a time point. This is a natural benefit of Lucene's immutable files. Cassandra backups are Eventually Consistent and are difficult to restore. I saw co-workers doing a Cassandra restore; it was not a good day for them.

For complete disaster recovery, restore a Snapshot from time T, then have Logstash consume messages after time T from Kafka. This will result in reading some messages twice. If the application assigns document ids (or Logstash computes them from other fields), then indexing requests will just replace a duplicate in Elasticsearch.

If a shard goes Red and can not be restored, this disaster recovery method will work well.
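
To illustrate why replaying messages twice is harmless when ids are deterministic, here is a small Python sketch. The index is simulated with a plain dict, and the event field names (`timestamp`, `src`, `dst`, `seq`) are assumptions made up for the example; the point is only that the same event always maps to the same id, so a replay overwrites instead of duplicating.

```python
import hashlib
import json


def document_id(event: dict, key_fields=("timestamp", "src", "dst", "seq")) -> str:
    """Derive a stable id from the fields that uniquely identify an event."""
    key = json.dumps([event[f] for f in key_fields], separators=(",", ":"))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def replay(index: dict, events) -> None:
    """Index events by id; an event seen before simply replaces itself."""
    for event in events:
        index[document_id(event)] = event


# Hypothetical events; the second replay stands in for re-reading from Kafka.
events = [
    {"timestamp": "2016-01-01T00:00:00Z", "src": "10.0.0.1", "dst": "10.0.0.2", "seq": 1, "bytes": 60},
    {"timestamp": "2016-01-01T00:00:00Z", "src": "10.0.0.1", "dst": "10.0.0.2", "seq": 2, "bytes": 1500},
]

index = {}
replay(index, events)   # state restored from the snapshot
replay(index, events)   # same messages consumed again after time T
print(len(index))       # still one document per distinct event
```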

As a distributed primary store for Time-Series Event data, nothing is more solid than Elasticsearch using these techniques.

...Geena


(Hossein Eivazy) #20

I think Kafka does not support random access or filtering messages by time. How do you handle this issue?