SELECT
    TB_VEICULO.PK_VEICULO as id,
    TB_VEICULO.PLACA as placa,
    TB_VEICULO.ANO_EXERCICIO as ano_exercicio,
    TB_VEICULO.ANO_FABRICACAO as ano_fabricacao,
    TB_VEICULO.ANO_MODELO as ano_modelo,
    TB_VEICULO.MODELO as modelo,
    TB_CLASSIFICACAO_CNJ.CLASSIFICACAO as classificacao_cnj,
    TB_COR_VEICULO.COR as cor,
    -- Tracking column: whole days since the epoch in ms (86400000 ms/day) plus
    -- 'SSSSSFF3' (seconds of day concatenated with milliseconds), i.e. milliseconds
    -- since the Unix epoch, despite the _in_secs alias
    (SELECT EXTRACT(DAY FROM (TB_VEICULO.MODIFICATION_TIME - to_timestamp('1970-01-01', 'YYYY-MM-DD'))) * 86400000
            + to_number(TO_CHAR(TB_VEICULO.MODIFICATION_TIME, 'SSSSSFF3'))
     FROM dual) as unix_ts_in_secs
FROM
    TB_VEICULO
INNER JOIN TB_CLASSIFICACAO_CNJ
    ON TB_VEICULO.ID_CLASSIFICACAO_CNJ = TB_CLASSIFICACAO_CNJ.PK_CLASSIFICACAO_CNJ
INNER JOIN TB_COR_VEICULO
    ON TB_VEICULO.COR_ID = TB_COR_VEICULO.PK_COR_VEICULO
WHERE TB_VEICULO.ATIVO = 1
    -- Only rows modified since the last run; :sql_last_value is supplied by the jdbc input
    AND (MODIFICATION_TIME < CURRENT_TIMESTAMP
         AND (SELECT EXTRACT(DAY FROM (TB_VEICULO.MODIFICATION_TIME - to_timestamp('1970-01-01', 'YYYY-MM-DD'))) * 86400000
                     + to_number(TO_CHAR(TB_VEICULO.MODIFICATION_TIME, 'SSSSSFF3'))
              FROM dual) > :sql_last_value)
But I had to delete this index and recreate it for testing purposes. I deleted it with a simple DELETE of the index, along these lines:
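DELETE /vehicles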
After that, the index actually disappears from the Elasticsearch index list. But after a few minutes it is recreated! And I have noticed that the index now contains duplicated data.
Here is the Elasticsearch log from when the index is automatically recreated:
Which stops the indices from being recreated after the DELETE operation.
But then I can't run the Logstash pipeline again, because I get this error message:
[2021-10-29T21:31:46,989][WARN ][logstash.outputs.elasticsearch][main][0de644d05de23d1cacf891dbd64de60312e77612ce6be74a7c4ad47bd744b97c] Could not index event to Elasticsearch. {:status=>404, :action=>["index", {:_id=>"5.3E+2", :_index=>"vehicles", :routing=>nil}, {"ano_fabricacao"=>"2019", "cor"=>"PRETA", "modelo"=>"SENTRA", "ano_exercicio"=>"2019", "@timestamp"=>2021-10-29T21:31:44.504Z, "ano_modelo"=>"2020", "classificacao_cnj"=>"Institucional", "placa"=>"PAL4009", "tags"=>["veículos"]}], :response=>{"index"=>{"_index"=>"vehicles", "_type"=>"_doc", "_id"=>"5.3E+2", "status"=>404, "error"=>{"type"=>"index_not_found_exception", "reason"=>"no such index [vehicles]", "resource.type"=>"index_expression", "resource.id"=>"vehicles", "index_uuid"=>"_na_", "index"=>"vehicles"}}}}
I discovered that this is because I had set auto_create_index to false.
I am just getting started with the ELK stack and I'm very confused by this. I want to be able to create indices, delete them, and create them again whenever I have changes to make. How can I achieve this behavior? Why are the indices being automatically recreated?
There may be a couple of different things going on here; I'm not sure exactly what, but let's take a look.
First, let's get a couple of things out of the way.
When set to true, this setting just means that you can POST a document to Elasticsearch and, if the index does not already exist, it will be created for you. It does not automatically create an index for no reason; a document must be POSTed. The auto-create creates the index, with its mapping, when it receives the POST document request. If you are coming from an RDBMS world, think of it as automatically creating the table and then inserting the row. The index gets a schema, settings, etc. created for you. (There is more to this, as the schema can be defined in a template; otherwise the default will be used.) But that is what this setting does when set to true.
When set to false, you will need to create the index manually before you can ingest / POST documents to it.
That is all that setting does.
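For example, the two modes look roughly like this in the Kibana Dev Tools console (the mapping here is only illustrative, not your actual schema):

# Disable automatic index creation cluster-wide
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "false"
  }
}

# With auto-creation off, the index must be created manually first...
PUT /vehicles
{
  "mappings": {
    "properties": {
      "placa": { "type": "keyword" }
    }
  }
}

# ...and only then can documents be POSTed into it
POST /vehicles/_doc
{
  "placa": "PAL4009"
}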
Now, what you are describing about the indices reappearing on their own: that should not / does not happen.
Typically what is happening is that something else is still writing to the index (for example, your Logstash had not completely finished, flushed, or shut down). Perhaps another process was still writing some data and that data got pushed, or there was data in the process of being flushed / synced that just showed up (this is unlikely, because that process usually runs every 1s unless you changed something). Indices don't just magically recreate themselves.
So you need to look closely at that; indices do not just appear by themselves. (There is one minor case when they do, but that involves Index Lifecycle Management, and I am pretty sure you are not using that.)
Also, and I am not saying that you need to change this, but users often struggle with manually defining their own document _id. Perhaps until you get the other items figured out, you should let Elasticsearch generate it.
If / when you go back to using the vehicle ID, and you only want 1 record per vehicle, you will need to use doc_as_upsert; see here.
You should also set the action if you are doing this; see here.
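A minimal sketch of what those two settings look like in the elasticsearch output (the host is a placeholder, and id is the column selected in your SQL query):

output {
  elasticsearch {
    hosts         => ["http://localhost:9200"]
    index         => "vehicles"
    document_id   => "%{id}"       # use the database primary key as the document _id
    action        => "update"      # update the existing document instead of indexing a duplicate
    doc_as_upsert => true          # create the document if it does not exist yet
  }
}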
Finally, I am not sure of your topology: if you only have 1 node, it is called master, and it is both a master and a data node, then you are fine... but if you have a dedicated master and a different data node, you should direct Logstash to the data node, because in that case the master does not actually service the index request, it just passes the request on... it's less efficient.
Hey @stephenb! Thank you so much for your thoughtful answer!
Regarding this:
I am trying to follow this guide:
I want to keep the IDs in the database in sync with the IDs in Elasticsearch. I thought this was a good approach to achieve that.
I have only one Logstash node, and its logs end as soon as the pipeline finishes. And I have two Elasticsearch nodes; one is the master. Do you think that may be the problem?
I am seeing the creation logs that I mentioned only on the master node.
I have tried to manually create the index with a POST request and then run the pipeline again, but it is still recreating all the data (and duplicating it). It was running during the night, and right now I have 198 duplicates of each row in my database.
I still have no idea why the data keeps being duplicated, or why it stopped at 198 copies.
Let's look at a couple of things, and perhaps @Badger, who knows a lot about Logstash, might be able to help.
BTW, here is another article that I think others have found helpful.
The first thing that looks odd is the way you are creating the tracking column: in all the examples, the tracking_column is an already-existing concrete column in the source database. Perhaps that is the issue.
No, I do not think that is the problem. Most likely your master is also a data node, which is fine for a small / test cluster.
That is probably not an issue, and I suspect you perhaps have the index replicas setting set to 0.
Note that there are a lot of ifs and maybes in this. It is not intended to diagnose the problem, just to suggest lines of investigation...
Your jdbc input does not have a schedule option, so once the query has finished executing the input has no more work to do. Some inputs will shut down at that point, resulting in the logstash process exiting. If you are running logstash as a service, the service manager could interpret that as logstash crashing, and it might restart it, which would result in it running the same query again. There may be a limit to the number of times that the service manager will do the restart.
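A sketch of a scheduled jdbc input, assuming Oracle (the driver path, connection string, and credentials here are placeholders, not from your post):

input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/ojdbc8.jar"          # placeholder path
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@dbhost:1521/ORCL"   # placeholder
    jdbc_user => "user"
    jdbc_password => "password"
    schedule => "* * * * *"                # run the query every minute instead of once
    statement_filepath => "/etc/logstash/vehicles.sql"               # the SELECT shown above
    use_column_value => true
    tracking_column => "unix_ts_in_secs"   # the computed tracking column from the query
    tracking_column_type => "numeric"
  }
}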
You should check whether $HOME/.logstash_jdbc_last_run (where $HOME is for the account running logstash) contains the value you expect. If not, it could be a permissions issue: maybe logstash is getting an exception when it tries to write to it, and that causes the restart. Or maybe the value fetched from the database is not what you expect, so value.is_a?(Numeric) returns false and the value is not persisted.
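For example, as the account that runs logstash:

cat $HOME/.logstash_jdbc_last_run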
You should definitely read the logstash logs to see if it is restarting.
As @Badger said, Logstash was crashing! I didn't see the indexes being recreated in the logs because when the container crashed it stopped showing the logs (at least with the configuration I am using). Thank you for that hint! After I set a schedule it stopped crashing.
So I did as @stephenb suggested and tried to change the approach to syncing the relational database.
I had to share the logs using a GitHub Gist because Discuss has a character limit.
First I changed the auto_create_index config again, re-enabling automatic index creation:
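Assuming the cluster settings API (the same key can also be set in elasticsearch.yml):

PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "true"
  }
}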
And when the pipeline starts I see these logs on the Elasticsearch master node:
It's not only creating the vehicles index but also recreating previously deleted indices that I had ingested before (that's all the data that I want to sync).
And the Logstash logs still show the vehicles query being executed, because it's scheduled to run every minute. I haven't seen any errors in the Logstash logs so far.
And the vehicle documents that are inserted look like this:
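The _source has the same shape as the document in the error log earlier in the thread, for example:

{
  "ano_fabricacao": "2019",
  "cor": "PRETA",
  "modelo": "SENTRA",
  "ano_exercicio": "2019",
  "@timestamp": "2021-10-29T21:31:44.504Z",
  "ano_modelo": "2020",
  "classificacao_cnj": "Institucional",
  "placa": "PAL4009",
  "tags": ["veículos"]
}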