Data mismatch in elasticsearch when loading data from database using logstash

balumurari1 · December 19, 2018, 12:46pm

Hello,
I have loaded 6lakhs+ of data from database into elasticsearch using logstash.
Observed the count of the number of rows in database is equal to the number of records in elasticsearch(as highlighted in yellow color below ). But observes some data is missing when validating group by query within database and with elasticsearch query(as observed from DB Count & ES(field count)).

Is there is any chance of mismatch occuring with the data when loading data from database into elasticsearch using logstash.

Ex:-
DBQuery:
select customer_group_name,count() FROM TEST group by customer_group_name order by count() desc

Elasticsearch Query:
    GET index_name/_search
    {
       "size":0,
       "aggs":{
          "group_by_customer_group_name":{
             
             "terms":{
               "size":1000,
                "field":"customer_group_name.keyword"
             }
          }
       }
    }

Please find the difference that occured as shown below,
MisMatchedDataCount

Please help me. Waiting for your response...,
Thanks inadvance.

balumurari1 · December 20, 2018, 11:18am

The input code used is as follows,

input {

jdbc {
jdbc_driver_library => "XXXXXX/Downloads/sqljdbc4-2.0.jar"
jdbc_driver_class => "XXXXXX"
jdbc_connection_string => "jdbc:sqlserver://XXXXXX;user=XXXXXX;password=yyyyyy;database=ZZZZZZZZZ"
jdbc_user => "XXXXXX"
jdbc_password => "yyyyyy"
statement => "SELECT company ,name , customer_group_name  from TEST"
jdbc_paging_enabled => "true"
jdbc_page_size => "50000"
}
}

 output{
	 
	   elasticsearch { codec => json hosts => ["localhost:9200"] index => "index_name" document_type => "_doc"  } 
       stdout { codec => rubydebug }
}

Waiting for your response,

thanks inadvance.

Igor_Motov · December 20, 2018, 2:40pm

What do you get when you do match query on customer_group_name.keyword with one of the terms?

Can you identify 8 extra records in X5 and and post them here together with mapping and the original record in SQL?

balumurari1 · December 20, 2018, 2:45pm

Thanks a lot for the reply,

From the above image, 3rd and 4th columns represents the data when we execute the query.

1st and 2nd columns represents the data when we exeute the DBQuery as shown above.

Waiting for your response.
Thanks in advance..,

Igor_Motov · December 20, 2018, 2:55pm

What do you get when you execute this query?

GET index_name/_search
{
  "size": 0,
  "query": {
    "match": {
      "customer_group_name.keyword": "X5"
    }
  }
}

If you get 1048, can you identify 8 extra records and and post one of them here together with mapping and the corresponding original record in SQL?

yaauie · December 20, 2018, 4:01pm

Without an ORDER BY clause in your SQL, it is possible that your database is returning non-deterministic windows during pagination (returning some rows in multiple windows, while omitting other rows entirely).

Igor_Motov · December 20, 2018, 5:00pm

Yes that makes sense, and with auto id generation, the same record will appear as 2 different records in elasticsearch. Good to know. So the solution for that would be to add ORDER BY something to

statement => "SELECT company ,name , customer_group_name  from TEST"

balumurari1 · December 21, 2018, 12:56am

Thanks a lot @Igor_Motov , @yaauie for your response,
So if I add order by clause in my SQL statement(used in logstash file) for any particular column, will it eliminate my issue.

Waiting for your response
Thanks in advance

balumurari1 · December 21, 2018, 7:34am

Hello @Igor_Motov, @yaauie, thanks a lot for your support.
I want to inform that none of the columns data present in the database is unique.

As per your suggestion, I have modified the query as shown below,

statement => "SELECT company ,name , customer_group_name  from TEST ORDER BY customer_group_name desc"

i am getting the error as shown below,

Exception when executing JDBC query {:exception=>#<Sequel::DatabaseError:
Java::ComMicrosoftSqlserverJdbc::SQLServerException: The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.>}

So, i have modified the query as,

statement => "SELECT TOP 638858 company ,name , customer_group_name  from TEST ORDER BY customer_group_name desc"

the data has been loaded into ES correctly.

I request you to please check the query and give me if i am moving in the wrong path.

waiting for your response.

Thanks inadvance.

yaauie · December 21, 2018, 12:24pm

If you ORDER BY one field, and that field is not unique, you are not guaranteeing order; you need to define an ORDER BY clause that guarantees order of the entire dataset.

balumurari1 · December 21, 2018, 12:38pm

thank you @yaauie for your feedback,

Right now,we have the same scenario like suppose we have table used in production which doesn't have any field unique, then how can we accomplish this issue to load into elasticsearch. is it possible or not.

waiting for your response. Thanks inadvance.

yaauie · December 21, 2018, 4:07pm

If no field is guaranteed to be unique, then the only way to effectively guarantee ordering is to specify all fields in the ORDER BY clause.

SELECT TOP 638858 company, name, customer_group_name  from TEST ORDER BY company, name, customer_group_name DESC

balumurari1 · December 22, 2018, 2:36am

Thank you @yaauie for your guidance in resolving this issue. Cheers!!

I have learned that
Logstash seems to use the last received record to determine the value of the tracking column. As Oracle does not guarantee the sort order of the query, the last record received is not necessarily the one with the latest ID. Thus records could be selected multiple times.

We solved this by adding "ORDER BY ID" to the SQL statement to guarantee the last received record contains the latest value in the tracking column.

balumurari1 · January 4, 2019, 6:28am

hi @yaauie, I have 28 columns in a table. i have kept order by for 3 columns and the issue repeats again with data mismatch.

do i need to order by all the 28 columns. I am still facing this issue and it is effecting me a lot.
waiting for your suggestion.

Regards,
Balu

yaauie · January 4, 2019, 8:00am

Does this table not have a primary key (even a compound primary key)? If not, then as I stated before, the only way to guarantee order is to specify all selected columns in your ORDER BY clause.

balumurari1 · January 4, 2019, 8:12am

Thanks for the reply @yaauie

yes for the 28 columns there is no primary key.
i will try by keeping all selected columns in order by clause.

balumurari1 · January 4, 2019, 9:28am

Hi @yaauie,

I have tried by specifying all the 28 columns in ORDER BY clause but observes data mismatch is occurring again.
Can I add a column to SQL statement to get row number and use it to order by rownum.

can you please suggest me to eliminate the issue.

Waiting for your response,
Thanks inadvance.

balumurari1 · January 7, 2019, 7:32am

Hi @yaauie, @Igor_Motov

I have updated the table with a new column row_number and updated that column data with the rownumber sequentially.

So, i can just say

SELECT TOP 638858 company, name, customer_group_name,row_number   from TEST ORDER BY row_number  DESC

It had run successfully. Thanks a lot for your help.

system · February 4, 2019, 7:46am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Data showing wrong when searched Elasticsearch	9	583	September 28, 2017
Log stash not loading exact number of records in elasticsearch and on every hit results are changing Logstash	2	479	February 17, 2021
ELK 6.1.1 indices document count mismatch Elasticsearch	1	473	March 6, 2018
Elastic document count mismatch - SQL to ES Elasticsearch	4	1412	December 13, 2017
Logstash to elasticsearch Logstash	4	372	May 18, 2018

Data mismatch in elasticsearch when loading data from database using logstash

Related topics