Hello,
I have loaded 6lakhs+ of data from database into elasticsearch using logstash.
Observed the count of the number of rows in database is equal to the number of records in elasticsearch(as highlighted in yellow color below ). But observes some data is missing when validating group by query within database and with elasticsearch query(as observed from DB Count & ES(field count)).
Is there is any chance of mismatch occuring with the data when loading data from database into elasticsearch using logstash.
Ex:-
DBQuery:
select customer_group_name,count() FROM TEST group by customer_group_name order by count() desc
Without an ORDER BY clause in your SQL, it is possible that your database is returning non-deterministic windows during pagination (returning some rows in multiple windows, while omitting other rows entirely).
Yes that makes sense, and with auto id generation, the same record will appear as 2 different records in elasticsearch. Good to know. So the solution for that would be to add ORDER BY something to
statement => "SELECT company ,name , customer_group_name from TEST"
Thanks a lot @Igor_Motov , @yaauie for your response,
So if I add order by clause in my SQL statement(used in logstash file) for any particular column, will it eliminate my issue.
Hello @Igor_Motov, @yaauie, thanks a lot for your support.
I want to inform that none of the columns data present in the database is unique.
As per your suggestion, I have modified the query as shown below,
statement => "SELECT company ,name , customer_group_name from TEST ORDER BY customer_group_name desc"
i am getting the error as shown below,
Exception when executing JDBC query {:exception=>#<Sequel::DatabaseError:
Java::ComMicrosoftSqlserverJdbc::SQLServerException: The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.>}
So, i have modified the query as,
statement => "SELECT TOP 638858 company ,name , customer_group_name from TEST ORDER BY customer_group_name desc"
the data has been loaded into ES correctly.
I request you to please check the query and give me if i am moving in the wrong path.
If you ORDER BY one field, and that field is not unique, you are not guaranteeing order; you need to define an ORDER BY clause that guarantees order of the entire dataset.
Right now,we have the same scenario like suppose we have table used in production which doesn't have any field unique, then how can we accomplish this issue to load into elasticsearch. is it possible or not.
Thank you @yaauie for your guidance in resolving this issue. Cheers!!
I have learned that
Logstash seems to use the last received record to determine the value of the tracking column. As Oracle does not guarantee the sort order of the query, the last record received is not necessarily the one with the latest ID. Thus records could be selected multiple times.
We solved this by adding "ORDER BY ID" to the SQL statement to guarantee the last received record contains the latest value in the tracking column.
Does this table not have a primary key (even a compound primary key)? If not, then as I stated before, the only way to guarantee order is to specify all selected columns in your ORDER BY clause.
I have tried by specifying all the 28 columns in ORDER BY clause but observes data mismatch is occurring again.
Can I add a column to SQL statement to get row number and use it to order by rownum.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.