Losing data in Elasticsearch

I have Elasticsearch 2.0.0 set up with the Kafka Elasticsearch connector. I recently discovered I am losing data: I can't see some old data anymore. Data is lost in a FIFO way (the oldest documents disappear first).

At first I thought it was a heap problem and increased my heap size, but it is still happening.

Has anyone experienced this? Any pointers on what to do?

I would appreciate any help.

How do you know it is being lost?

I search for the data by time period and can't find it. I also have a graphical display of the data on a webpage, which is what alerted me to this.

And are you sure it's making it to Kafka and to Elasticsearch?

Yes, the data was saved and I could see it, but after a while it is gone from Elasticsearch.

Have you by any chance got Curator set up in a cron job to delete old indices? Is there anything in the Elasticsearch logs about indices being deleted?

No, this is a new setup; I have not set up Curator yet. I checked the logs but couldn't see anything. I could share my logs if that helps.

I installed Elasticsearch from the source (apt-get).

Can you explain this more, please?

How did you verify the data reached Elasticsearch?



Does this mean that you inspected the data through Kibana? Did you query Elasticsearch through the APIs?

So I am ingesting data from Kafka into Elasticsearch using the Kafka Elasticsearch connector.

I can view the data as it comes in via a graph (aggregated by day, built with D3). After a while I notice that days which previously had data are now empty. So I check using Sense and find that all, or sometimes part, of that data is missing. This is data I had previously verified was there.

Hope this helps.

I used Sense to search within the period where data was missing.
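Roughly a query like this, against the grits index (the date field name and the dates here are placeholders, not my actual mapping):

# Count documents per day in the suspect period
# "@timestamp" and the date range are placeholders -- substitute the real field and dates
GET /grits/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": { "gte": "2016-01-01", "lt": "2016-01-08" }
    }
  },
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "interval": "day" }
    }
  }
}

Days that used to have counts in the graph now come back with zero documents.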

Elasticsearch won't just delete the data, so something else must be requesting it.


Are you using time-based indices? What does GET /_cat/indices show?

health status index                     pri rep docs.count docs.deleted store.size pri.store.size
yellow open   grits                     5   1   23408051      4208025      2.7gb          2.7gb
yellow open   .kibana                   1   1          1            0      2.9kb          2.9kb
yellow open   test-elasticsearch-sink   5   1          0            0       785b           785b
yellow open   grit                      5   1          0            0       785b           785b

grits is the main index

Are you assigning an external ID to the documents you are indexing or are you allowing Elasticsearch to automatically generate an ID?

The Kafka Elasticsearch connector generates the IDs: http://docs.confluent.io/current/connect/connect-elasticsearch/docs/elasticsearch_connector.html
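From what I understand of those docs, the connector builds the document ID from the Kafka coordinates (topic+partition+offset) when key.ignore is enabled; otherwise it uses the record key. Pulling back a single hit and looking at its _id should show which form is in use:

# Fetch any one document and inspect its _id
# (an _id shaped like "mytopic+0+12345" would mean topic+partition+offset is being used)
GET /grits/_search?size=1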

The index stats show that you have deleted documents in your index. This may indicate that you have indexed multiple documents with the same ID, which causes an update (a delete plus the creation of a new document), or that you have deleted documents by some other means, e.g. through the delete-by-query API. Elasticsearch does not delete data by itself.
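A quick way to see the update behaviour in isolation (the test index, type and field names below are made up, not something from your setup):

# Index the same ID twice; the second write is treated as an update,
# so the first version is flagged as deleted until segments merge
PUT /dup-test/logs/1
{ "message": "first version" }

PUT /dup-test/logs/1
{ "message": "second version" }

# After the next refresh, docs.deleted for dup-test should show 1
GET /_cat/indices/dup-test?v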

If you have the ID of a document that you would expect to find in the period for which data is no longer found, you could look it up and see whether it has been updated or deleted.
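For example (substitute the real type and document ID; GET /grits/_mapping will show the type name the connector used):

# "found": false means the document has been deleted;
# "_version" greater than 1 means it has been updated at least once
GET /grits/<type>/<id>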

As you are using such an old version of Elasticsearch, it just struck me that TTL might be enabled, which would cause documents to be deleted automatically. It was listed as deprecated in 2.0, but may still have been available. Get the settings for the index through the get index settings API, and also check the mappings for the index, looking for any TTL-related settings.
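Something along these lines should be enough to check; look for a _ttl section in the mappings and any TTL-related index settings:

# TTL is configured per type in the mappings, e.g. "_ttl": { "enabled": true, ... }
GET /grits/_mapping

# Also check the index settings for anything TTL-related
GET /grits/_settings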