Elasticsearch is recreating indexes after they were deleted

I am trying to sync relational data to Elasticsearch using Logstash. To do that I have created a Logstash pipeline like this:

input {
  jdbc {
    tags => ["vehicles"]
    jdbc_driver_library => "/usr/share/logstash/drivers/ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@oraprod-scan:1521/oraprod"
    jdbc_user => "username"
    jdbc_password => "password"
    jdbc_validate_connection => true
    jdbc_paging_enabled => true
    tracking_column => "unix_ts_in_secs"
    use_column_value => true
    tracking_column_type => "numeric"
    statement_filepath => "/usr/share/logstash/pipeline/vehicles.sql"
  }
}

filter {
  mutate {
    copy => { "id" => "[@metadata][_id]" }
    remove_field => ["id", "@version", "unix_ts_in_secs"]
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => "https://es-master:9200"
    index => "vehicles"
    document_id => "%{[@metadata][_id]}"
  }
}

And here's the query:

    TB_VEICULO.PLACA as placa,
    TB_VEICULO.ANO_EXERCICIO as ano_exercicio,
    TB_VEICULO.ANO_FABRICACAO as ano_fabricacao,
    TB_VEICULO.ANO_MODELO as ano_modelo,
    TB_VEICULO.MODELO as modelo,
    TB_COR_VEICULO.COR as cor,
    (SELECT EXTRACT(DAY FROM(TB_VEICULO.MODIFICATION_TIME - to_timestamp('1970-01-01', 'YYYY-MM-DD'))) * 86400000 + to_number(TO_CHAR(TB_VEICULO.MODIFICATION_TIME, 'SSSSSFF3')) FROM dual) as unix_ts_in_secs
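As a side note: despite the `unix_ts_in_secs` name, that subquery appears to produce epoch *milliseconds* — whole days since 1970-01-01 times 86400000, plus `TO_CHAR(..., 'SSSSSFF3')`, which is milliseconds past midnight. A quick Python sketch of the same arithmetic (assuming the timestamp is UTC):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def oracle_style_epoch_ms(ts: datetime) -> int:
    """Mirror the SQL expression: whole days since 1970-01-01 * 86400000,
    plus milliseconds past midnight (the TO_CHAR(..., 'SSSSSFF3') part)."""
    days = (ts - EPOCH).days
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    ms_past_midnight = (ts - midnight) // timedelta(milliseconds=1)
    return days * 86_400_000 + ms_past_midnight

# The value is epoch milliseconds, not seconds:
print(oracle_style_epoch_ms(datetime(2021, 10, 29, 21, 11, 35, tzinfo=timezone.utc)))
# → 1635541895000
```

If the intent is really seconds, either the expression or the column name may need adjusting; it works either way as long as the tracking column is monotonically increasing.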

But I had to delete this index and recreate it for testing purposes. I deleted it with a simple DELETE:

curl -XDELETE https://es-master:9200/vehicles


After that the index actually disappears from the list of Elasticsearch indexes. But after a few minutes it is recreated! And I have noticed that the index now contains duplicated data.

This is the Elasticsearch log entry from when the index is automatically recreated:

{"type": "server", "timestamp": "2021-10-29T21:11:35,194Z", "level": "INFO", "component": "o.e.c.m.MetadataCreateIndexService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "[vehicles] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]", "cluster.uuid": "xP6fCxy9SwCiwpqOV_X-Hg", "node.id": "pqq7TdWFTG2wtWc7Adh9RA"  }

I have searched for this auto(bulk api) cause and I tried to change the Elasticsearch config to this:

action.auto_create_index: false

Which stops the indexes from being recreated after the DELETE operation.

But then I can't run the Logstash pipeline again, because I get this error message:

[2021-10-29T21:31:46,989][WARN ][logstash.outputs.elasticsearch][main][0de644d05de23d1cacf891dbd64de60312e77612ce6be74a7c4ad47bd744b97c] Could not index event to Elasticsearch. {:status=>404, :action=>["index", {:_id=>"5.3E+2", :_index=>"vehicles", :routing=>nil}, {"ano_fabricacao"=>"2019", "cor"=>"PRETA", "modelo"=>"SENTRA", "ano_exercicio"=>"2019", "@timestamp"=>2021-10-29T21:31:44.504Z, "ano_modelo"=>"2020", "classificacao_cnj"=>"Institucional", "placa"=>"PAL4009", "tags"=>["veículos"]}], :response=>{"index"=>{"_index"=>"vehicles", "_type"=>"_doc", "_id"=>"5.3E+2", "status"=>404, "error"=>{"type"=>"index_not_found_exception", "reason"=>"no such index [vehicles]", "resource.type"=>"index_expression", "resource.id"=>"vehicles", "index_uuid"=>"_na_", "index"=>"vehicles"}}}}

I discovered that this happens because I have set auto_create_index to false.

I am starting out with the ELK stack and I'm very confused about this. I want to be able to create indexes, delete them, and create them again if I have any changes to make. How can I achieve this behavior? Why are the indexes being automatically recreated?

Thanks in advance.

Hi @kamillacrozara Welcome to the community!

There may be a couple of different things going on here. I'm not sure exactly what, but let's take a look.

First, let's get a couple of things out of the way.

When set to true, this setting just means that you can POST a doc to Elasticsearch and, if the index does not already exist, it will be created for you. It does not automatically create an index for no reason; a document must be POSTed. The auto-create creates the index with its mapping when it receives the POST document request. If you are coming from an RDBMS world, think of it as automatically creating the table and then inserting the row. The index gets a schema, settings, etc. created for you. (There is more to this, as the schema can be defined in a template; otherwise the default gets created.) But that is what this setting does when set to true.

When set to false you will need to create the index manually before you can ingest / POST documents to the index.

That is all that setting does.
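As a sketch (using the host from your pipeline above; adjust for your credentials), you could disable auto-creation cluster-wide and then create the index explicitly before ingesting:

```shell
# Disable automatic index creation cluster-wide. Note the setting also
# accepts patterns, e.g. "-vehicles,*" to block auto-creation for one index only.
curl -XPUT "https://es-master:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"action.auto_create_index": "false"}}'

# With auto-create off, the index must exist before Logstash writes to it:
curl -XPUT "https://es-master:9200/vehicles"
```

These commands are illustrative and need a running cluster; creating the index explicitly also lets you control the mapping instead of relying on dynamic defaults.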

Now, what you are describing about the indices reappearing on their own: that should not / does not happen.

Typically what is happening is that something else is still writing to the index (for example, your Logstash had not completely finished, flushed, or shut down). Perhaps another process was still writing some data and that data got pushed, or there was data in the process of being flushed / synced which just showed up (this is unlikely, because that interval is usually set to 1s unless you changed something). Indices don't just magically recreate themselves :slight_smile:

So you need to look closely at that; indices do not just appear by themselves. (There is one minor case when they do, but that involves Index Lifecycle Management, and I am pretty sure you are not using that.)

Also, and I am not saying that you need to change this, but users often struggle with manually defining their own document _id. Perhaps until you get the other items figured out you should let Elasticsearch do that.

If / when you go back to using the vehicle ID as the _id, and you only want 1 record per vehicle, you will need to use doc_as_upsert, see here.

You should also set the action if you are doing this, see here.
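For reference, a sketch of what that output could look like (field names taken from the pipeline above; treat this as illustrative, not a drop-in config):

```
elasticsearch {
  hosts => "https://es-master:9200"
  index => "vehicles"
  document_id => "%{[@metadata][_id]}"
  action => "update"        # update instead of plain index
  doc_as_upsert => true     # insert the doc if the _id does not exist yet
}
```

With this combination, re-running the same query overwrites each document in place instead of stacking duplicates.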

Finally, I am not sure about your topology: if you only have 1 node, and it is called master, and it is both a master and a data node, you are fine... but if you have a dedicated master and a separate data node, you should point Logstash at the data node. In that case the master does not actually service the index request, it just passes the request on... which is less efficient.

Hope this helps a bit

Hey @stephenb! Thank you so much for your thoughtful answer!

Regarding this:

I am trying to follow this guide:

I want to sync the IDs on the database with the IDs on Elasticsearch. I thought this was a good approach to achieve this.

I have only one Logstash node, which ends all its logging as soon as the pipeline finishes. And I have two Elasticsearch nodes, one of which is the master. Do you think that may be the problem?
I am seeing the creation logs I mentioned only on the master node.

I have tried to manually create the index with a POST request and then run the pipeline again, but it is still recreating all the data (and duplicating it). It ran during the night and right now I have 198 duplicates of each row in my database.
I still have no idea why the data is being duplicated, or why it stopped at 198 copies.

Do I have to change some configs?

My logstash.yml looks like this:

http.host: ""
node.name: logstash
path.config: "/usr/share/logstash/pipeline/logstash.conf"

I am checking the Kibana logs right now; maybe it is there that the records are being recreated?

I am not sure where else to look to find out what's going on. :cold_sweat:

Hi @kamillacrozara

Thanks for the further explanation.

Let's look at a couple of things, and perhaps @Badger, who knows a lot about Logstash, might be able to help.

BTW, here is another article that I think others have found helpful.

The first thing that looks odd is the way you are creating the tracking column: in all the examples, tracking_column is an already-existing concrete column in the source database, so perhaps that is the issue.

No, I do not think that is the problem; most likely your master is also a data node, which is fine for a small / test cluster.

That is probably not an issue, and I suspect you perhaps have the index setting for replicas set to 0.

I do not believe this is a Kibana issue.

Note that there are a lot of ifs and maybes in this. It is not intended to diagnose the problem, just to suggest lines of investigation...

Your jdbc input does not have a schedule option. So once the query has finished executing the input has no more work to do. Some inputs will shutdown at that point, resulting in the logstash process exiting. If you are running logstash as a service then the service manager could interpret that as logstash crashing, and it might restart it, which would result in it running the same query again. There may be a limit to the number of times that the service manager will do the restart.

You should check whether $HOME/.logstash_jdbc_last_run (where $HOME is for the account running logstash) contains the value you expect. If not it could be a permissions issue, maybe logstash is getting an exception when it tries to write to it, and that causes the restart. Maybe the value fetched from the database is not what you expect and value.is_a?(Numeric) returns false, so that it is not persisted.

You should definitely read the logstash logs to see if it is restarting.
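One quick way to spot restarts (the log path is an assumption based on a typical container install; adjust for yours) is to count the startup banners:

```shell
# Each "Starting Logstash" banner marks a (re)start of the process;
# more than one in a short window suggests a crash/restart loop.
grep -c "Starting Logstash" /usr/share/logstash/logs/logstash-plain.log
```

If Logstash logs only to stdout in your container, `docker logs <container> | grep -c "Starting Logstash"` would be the equivalent check.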


Thank you very much @stephenb and @Badger.

As @Badger said, Logstash was crashing! I didn't see the indexes being recreated in the logs because when the container crashed it stopped showing the logs (at least with the configuration I am using). Thank you for that hint! After I changed the schedule it stopped crashing.

So I did as @stephenb suggested and tried to change the approach to sync the relational database.

I had to share the logs using a GitHub Gist because Discuss has a character limit.

First I changed the auto_create_index config again:


And now I have this pipeline:

input {
  jdbc {
    tags => ["vehicles"]
    jdbc_driver_library => "/usr/share/logstash/drivers/ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@oraprod-scan:1521/oraprod"
    jdbc_user => "username"
    jdbc_password => "password"
    jdbc_validate_connection => true
    jdbc_paging_enabled => true
    statement_filepath => "/usr/share/logstash/pipeline/vehicles.sql"
    schedule => "* * * * * *"
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => "https://es-master:9200"
    index => "vehicles"
    document_id => "%{id}"
  }
}

First of all I have manually deleted the index and I saw this on the Elasticsearch logs:

{"type": "server", "timestamp": "2021-11-15T13:41:28,221Z", "level": "INFO", "component": "o.e.c.m.MetadataDeleteIndexService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "[vehicles/PSTfPoeOTyC8cVjKbTVnzA] deleting index", "cluster.uuid": "xP6fCxy9SwCiwpqOV_X-Hg", "node.id": "pqq7TdWFTG2wtWc7Adh9RA"  }

And when the pipeline starts I see these logs on the Elasticsearch master node:

It's not only creating the vehicles index but also recreating previously deleted indexes that I had loaded before (that's all the data that I want to sync).

And the Logstash logs keep showing the vehicles queries being executed, because the pipeline is scheduled. I haven't seen any errors in the Logstash logs so far.

And the vehicles that are inserted look like this:

One has the right id (350):

placa:JKO4691 @version:1 ano_fabricacao:2012 ano_exercicio:2017 @timestamp:Nov 15, 2021 @ 10:52:27.164 cor:PRETA modelo:FLUENCE classificacao_cnj:Institucional id:350 tags:veículos ano_modelo:2013 _id:350 _type:_doc _index:veiculos _score:0

But the others have the same data, with different, auto-generated ids.

placa:JKO4691 classificacao_cnj:Institucional id:350 @version:1 type:veiculo @timestamp:Nov 15, 2021 @ 10:46:40.117 cor:PRETA ano_exercicio:2017 ano_modelo:2013 ano_fabricacao:2012 tags:veículos modelo:FLUENCE _id:CdfZI30BW8o-ZwjUmofx _type:_doc _index:veiculos _score:0
placa:JKO4691 @version:1 id:350 type:veiculo ano_modelo:2013 ano_fabricacao:2012 modelo:FLUENCE classificacao_cnj:Institucional tags:veículos cor:PRETA @timestamp:Nov 15, 2021 @ 10:53:43.117 ano_exercicio:2017 _id:2tfgI30BW8o-ZwjUCfft _type:_doc _index:veiculos _score:0
placa:JKO4691 @version:1 ano_fabricacao:2012 ano_exercicio:2017 @timestamp:Nov 15, 2021 @ 11:00:52.548 cor:PRETA modelo:FLUENCE classificacao_cnj:Institucional id:350 tags:veículos ano_modelo:2013 _id:350 _type:_doc _index:veiculos _score:0

And it keeps inserting and duplicating data (except for the ids).
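To measure the duplication, I ran a terms aggregation that only returns ids appearing more than once (a sketch; it assumes the id field is mapped with a keyword sub-field):

```shell
curl -s -XGET "https://es-master:9200/vehicles/_search" \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "duplicate_ids": {
      "terms": { "field": "id.keyword", "min_doc_count": 2 }
    }
  }
}'
```

Each bucket's doc_count shows how many copies of that database row exist in the index.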

I have watched for a while and the logs stay the same in both Elasticsearch and Logstash.

The .logstash_jdbc_last_run file keeps changing while the Logstash container is running:

sh-4.2$ tail ./.logstash_jdbc_last_run
--- 2021-11-15 14:12:56.275203000 Z
sh-4.2$ tail ./.logstash_jdbc_last_run
--- 2021-11-15 14:13:04.094678000 Z
sh-4.2$ tail ./.logstash_jdbc_last_run
--- 2021-11-15 14:13:05.305984000 Z
sh-4.2$ tail ./.logstash_jdbc_last_run
--- 2021-11-15 14:13:07.118336000 Z
sh-4.2$ tail ./.logstash_jdbc_last_run
--- 2021-11-15 14:13:09.220938000 Z
sh-4.2$ tail ./.logstash_jdbc_last_run
--- 2021-11-15 14:13:10.163945000 Z
sh-4.2$ tail ./.logstash_jdbc_last_run
--- 2021-11-15 14:13:11.071724000 Z
sh-4.2$ tail ./.logstash_jdbc_last_run
--- 2021-11-15 14:13:15.298145000 Z

Then I stopped the Logstash container and deleted all the indexes to see what would happen:

And then, after a while, it recreated all the indexes again and kept duplicating data.

Here are the Elasticsearch master logs:

So I guess it's not a Logstash issue, since the container was stopped.

After that the logs stayed the same until I deleted the indexes. The Elasticsearch logs didn't change even after I restarted the Logstash container.

Let me know if I can give you more info about this.

Thank you very much for your help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.