Data mismatch in elasticsearch when loading data from database using logstash

Hello,
I have loaded 6lakhs+ of data from database into elasticsearch using logstash.
Observed the count of the number of rows in database is equal to the number of records in elasticsearch(as highlighted in yellow color below ). But observes some data is missing when validating group by query within database and with elasticsearch query(as observed from DB Count & ES(field count)).

Is there is any chance of mismatch occuring with the data when loading data from database into elasticsearch using logstash.

Ex:-
DBQuery:
select customer_group_name,count() FROM TEST group by customer_group_name order by count() desc

Elasticsearch Query:
    GET index_name/_search
    {
       "size":0,
       "aggs":{
          "group_by_customer_group_name":{
             
             "terms":{
               "size":1000,
                "field":"customer_group_name.keyword"
             }
          }
       }
    }

Please find the difference that occured as shown below,
MisMatchedDataCount

Please help me. Waiting for your response...,
Thanks inadvance.

The input code used is as follows,

input {

jdbc {
jdbc_driver_library => "XXXXXX/Downloads/sqljdbc4-2.0.jar"
jdbc_driver_class => "XXXXXX"
jdbc_connection_string => "jdbc:sqlserver://XXXXXX;user=XXXXXX;password=yyyyyy;database=ZZZZZZZZZ"
jdbc_user => "XXXXXX"
jdbc_password => "yyyyyy"
statement => "SELECT company ,name , customer_group_name  from TEST"
jdbc_paging_enabled => "true"
jdbc_page_size => "50000"
}
}

 output{
	 
	   elasticsearch { codec => json hosts => ["localhost:9200"] index => "index_name" document_type => "_doc"  } 
       stdout { codec => rubydebug }
}

Waiting for your response,

thanks inadvance.

What do you get when you do match query on customer_group_name.keyword with one of the terms?

Can you identify 8 extra records in X5 and and post them here together with mapping and the original record in SQL?

Thanks a lot for the reply,

From the above image, 3rd and 4th columns represents the data when we execute the query.

1st and 2nd columns represents the data when we exeute the DBQuery as shown above.

Waiting for your response.
Thanks in advance..,

What do you get when you execute this query?

GET index_name/_search
{
  "size": 0,
  "query": {
    "match": {
      "customer_group_name.keyword": "X5"
    }
  }
}

If you get 1048, can you identify 8 extra records and and post one of them here together with mapping and the corresponding original record in SQL?

Without an ORDER BY clause in your SQL, it is possible that your database is returning non-deterministic windows during pagination (returning some rows in multiple windows, while omitting other rows entirely).

2 Likes

Yes that makes sense, and with auto id generation, the same record will appear as 2 different records in elasticsearch. Good to know. So the solution for that would be to add ORDER BY something to

statement => "SELECT company ,name , customer_group_name  from TEST"
1 Like

Thanks a lot @Igor_Motov , @yaauie for your response,
So if I add order by clause in my SQL statement(used in logstash file) for any particular column, will it eliminate my issue.

Waiting for your response
Thanks in advance

Hello @Igor_Motov, @yaauie, thanks a lot for your support.
I want to inform that none of the columns data present in the database is unique.

As per your suggestion, I have modified the query as shown below,

statement => "SELECT company ,name , customer_group_name  from TEST ORDER BY customer_group_name desc"

i am getting the error as shown below,

Exception when executing JDBC query {:exception=>#<Sequel::DatabaseError:
Java::ComMicrosoftSqlserverJdbc::SQLServerException: The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.>}

So, i have modified the query as,

statement => "SELECT TOP 638858 company ,name , customer_group_name  from TEST ORDER BY customer_group_name desc"

the data has been loaded into ES correctly.

I request you to please check the query and give me if i am moving in the wrong path.

waiting for your response.

Thanks inadvance.

If you ORDER BY one field, and that field is not unique, you are not guaranteeing order; you need to define an ORDER BY clause that guarantees order of the entire dataset.

1 Like

thank you @yaauie for your feedback,

Right now,we have the same scenario like suppose we have table used in production which doesn't have any field unique, then how can we accomplish this issue to load into elasticsearch. is it possible or not.

waiting for your response. Thanks inadvance.

If no field is guaranteed to be unique, then the only way to effectively guarantee ordering is to specify all fields in the ORDER BY clause.

SELECT TOP 638858 company, name, customer_group_name  from TEST ORDER BY company, name, customer_group_name DESC
1 Like

Thank you @yaauie for your guidance in resolving this issue. Cheers!!

I have learned that
Logstash seems to use the last received record to determine the value of the tracking column. As Oracle does not guarantee the sort order of the query, the last record received is not necessarily the one with the latest ID. Thus records could be selected multiple times.

We solved this by adding "ORDER BY ID" to the SQL statement to guarantee the last received record contains the latest value in the tracking column.
:point_up_2:
:handshake:

hi @yaauie, I have 28 columns in a table. i have kept order by for 3 columns and the issue repeats again with data mismatch.

do i need to order by all the 28 columns. I am still facing this issue and it is effecting me a lot.
waiting for your suggestion.

Regards,
Balu

Does this table not have a primary key (even a compound primary key)? If not, then as I stated before, the only way to guarantee order is to specify all selected columns in your ORDER BY clause.

1 Like

Thanks for the reply @yaauie

yes for the 28 columns there is no primary key.
i will try by keeping all selected columns in order by clause.

Hi @yaauie,

I have tried by specifying all the 28 columns in ORDER BY clause but observes data mismatch is occurring again.
Can I add a column to SQL statement to get row number and use it to order by rownum.

can you please suggest me to eliminate the issue.

Waiting for your response,
Thanks inadvance.

Hi @yaauie, @Igor_Motov

I have updated the table with a new column row_number and updated that column data with the rownumber sequentially.

So, i can just say

SELECT TOP 638858 company, name, customer_group_name,row_number   from TEST ORDER BY row_number  DESC

It had run successfully. Thanks a lot for your help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.