I'm planning a setup with the following 2 goals:
1. Audit data from multiple (Ruby on Rails) applications needs to be combined and made available for reporting. The audit data does not have a schema, i.e. different keys will be present depending on the source of the audit, and even nested keys can be present, so schema-less storage is required. Data loss for this audit information is not acceptable: for example, reports will be made to get a list of all clients who have downloaded product X via the Rails app where a security issue was detected.
2. In addition, we are also planning to store Rails app journaling information and Rails app and Apache reverse proxy logging, to allow easy searching and report creation. Data loss is acceptable here.
For both goals we do not expect to have real 'big data sets' for the next 5 years.
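To illustrate what I mean by schema-less audit data, here is a minimal Ruby sketch. The two events, their sources and all of their keys and values are made up for illustration; the real audit records will look different. It only shows that two sources can share a few top-level keys while the rest, including nested keys, differs.

```ruby
# Hypothetical audit events from two different Rails apps.
# Every key and value below is invented for illustration only.
download_event = {
  "source"  => "shop_app",
  "action"  => "download",
  "client"  => { "id" => 42, "name" => "Example Corp" },
  "product" => { "name" => "X", "version" => "1.2" }
}

login_event = {
  "source" => "portal_app",
  "action" => "login",
  "user"   => { "id" => 7 },
  "ip"     => "192.0.2.10"
}

# Collect every (possibly nested) key path of an event, e.g. "client.id".
def key_paths(value, prefix = nil)
  return [prefix] unless value.is_a?(Hash)

  value.flat_map do |key, child|
    key_paths(child, prefix ? "#{prefix}.#{key}" : key)
  end
end

# Print the key paths per source to show that they do not line up.
[download_event, login_event].each do |event|
  puts "#{event['source']}: #{key_paths(event).sort.join(', ')}"
end
```

Only `source` and `action` are common to both events here, which is why a fixed table layout does not fit this data well.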
After reading a lot and finding different products, I came to the following 2 solutions:
1. A Rails application on top of PostgreSQL JSONB could solve the audit storage; however, I would then need to build a query REST API similar to what is already available in Elasticsearch, which seems stupid and is not really scalable as the data grows.
2. Existing NoSQL solutions like MongoDB, Splunk and Elasticsearch seem to provide a nice solution for both goals, but apparently introduce possible undetected data loss due to their distributed nature (I added several links on this at the bottom of my email for reference). So using one of them as a primary database seems tricky. Of all those products, Elasticsearch seems to be the best fit for our purpose.
Solution 2 seems to be the best one since future growth is guaranteed (bigger data sets, even better integration of the Rails app using Elasticsearch as a secondary database, or even using Kibana as the search and reporting module within the Rails app). However, this solution is only acceptable if the undetected data-loss part can be controlled. Most or all of the references I found regarding data loss seem to be related to the distributed aspect of Elasticsearch, and apparently a lot of work has been done over the past few years, and is still being done, to solve the data-loss problem. However, https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html seems to indicate that not all issues are solved yet.
Given all this, I have the following questions:
1. Can anyone advise whether an Elasticsearch setup with only one node, using an index with only one primary shard and no replica (backup) shards, can ensure that no data loss will occur? I know it won't be possible to add additional shards later without creating a new index and migrating all data, so adding more nodes later on won't be easy. I'm also aware that a node failure will cause complete unavailability, but at least we would know that when we receive data, it is complete.
2. If the answer to the previous question is 'yes', can I use indexes with multiple primary shards but no replica shards, to allow future growth and reallocation of shards to additional servers, without introducing undetected data loss again?
3. Finally, perhaps more difficult to answer: is it technically even possible to make a distributed solution completely fail-safe regarding data loss? In other words, will Elasticsearch ever be able to protect against undetected data loss? As far as I know, a PostgreSQL installation will never have data loss (at least not undetected data loss). But perhaps this assumption is wrong?
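To make the index layouts from the first two questions concrete, here is a minimal Ruby sketch that builds the request body for creating such an index. `number_of_shards` and `number_of_replicas` are the standard Elasticsearch index settings; the helper name `index_settings` and the shard count of 5 are my own assumptions. It only shows the layout I am asking about and says nothing about whether that layout actually prevents data loss.

```ruby
require "json"

# Build the body of an index-creation request.
# The helper name and the shard counts used below are assumptions.
def index_settings(primary_shards)
  {
    "settings" => {
      "index" => {
        # Chosen when the index is created; as noted in the questions,
        # changing it later means creating a new index and migrating data.
        "number_of_shards" => primary_shards,
        # 0 means no replica ("backup") shards at all.
        "number_of_replicas" => 0
      }
    }
  }
end

single_shard = index_settings(1) # layout from the first question
multi_shard  = index_settings(5) # layout from the second question (5 is arbitrary)

puts JSON.pretty_generate(single_shard)
puts JSON.pretty_generate(multi_shard)
```

Either body would be sent when creating the index, for example with `PUT /audit`, where `audit` is a hypothetical index name.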
Data-loss links for future reference:
- Lucene durability tests: http://blog.mikemccandless.com/2014/04/testing-lucenes-index-durability-after.html
- Resiliency status: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
- nice presentation on data-loss: https://www.elastic.co/videos/improving-elasticsearch-resiliency
- explanation on translog : https://www.elastic.co/guide/en/elasticsearch/guide/current/translog.html
- coping with failure : https://www.elastic.co/guide/en/elasticsearch/guide/current/_coping_with_failure.html
- Jepsen results for Elasticsearch:
  - https://aphyr.com/posts/317-jepsen-elasticsearch (Elasticsearch version 1.1.0)
  - changed info since the report on 1.5.0 regarding fsync of the translog: https://www.elastic.co/guide/en/elasticsearch/guide/current/translog.html
- Elasticsearch as primary database: https://groups.google.com/forum/#!msg/elasticsearch/o0RgAvpSGdU/rHFdyyRWk1QJ