How to get all the outputs in one run using sql api?

zatom · June 16, 2019, 4:39pm

Using the SQL API in ES, the Kibana console only displays 1000 rows of output, but my results run into the millions. So far, the only way I can see the other batches is with the cursor ID, which takes a long time and is not convenient. Are there any tricks I am not aware of? Can you get all the outputs, no matter how large, without harming the cluster? Thank you!

Andrei_Stefan · June 17, 2019, 5:11pm

There is no practical use for getting all the documents (millions of them) from Elasticsearch in one go on a display page. For one, the amount of data transported over the wire, which the browser then has to deal with, is huge; secondly, ES will probably crash. But first and foremost, a human cannot realistically deal with several million documents.

Using the cursor is the way to go. If you want to have a look at "some" documents, refine your search to show you only those documents you might be interested in.
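
Cursor paging can be scripted so that nobody has to click through batches by hand. The sketch below is only an illustration, not an official client: it assumes the 7.x endpoint `/_sql?format=json` (6.x releases expose it as `/_xpack/sql`), a cluster reachable without authentication, and the helper names `http_post` and `sql_pages` are made up for this example.

```python
import json
import urllib.request
from typing import Callable, Iterator, List


def http_post(url: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response (standard library only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def sql_pages(
    base_url: str,
    query: str,
    fetch_size: int = 1000,
    post: Callable[[str, dict], dict] = http_post,
) -> Iterator[List[list]]:
    """Yield one list of rows per page, following the cursor until it runs out."""
    # Assumption: 7.x style endpoint; adjust the path for older releases.
    url = base_url.rstrip("/") + "/_sql?format=json"
    page = post(url, {"query": query, "fetch_size": fetch_size})
    while True:
        yield page.get("rows", [])
        cursor = page.get("cursor")
        if not cursor:
            # The last page carries no cursor, so there is nothing left to fetch.
            break
        page = post(url, {"cursor": cursor})
```

Each page is processed and dropped before the next one is requested, so memory use on the client stays bounded by `fetch_size` rather than by the total number of rows.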

zatom · June 17, 2019, 6:35pm

@Andrei_Stefan But I need to use the outputs somewhere outside of Elasticsearch (data integration software) that can handle a JSON query and where I can copy the data into CSV or any other format.
So far, due to the size limitation of 10,000, I am not able to work with all the outputs in my data integration software. I want to see all my duplicates and delete them. What is the best route for that?
Thank you!

Andrei_Stefan · June 18, 2019, 12:29pm

For the first question, the answer is always a set of queries: either using cursors in ES SQL or using Scroll from Elasticsearch. There are also tools out there that export data from Elasticsearch in one format or another, but they always use the same approach as above.
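
As a rough sketch of the Scroll route, the export to CSV could look like the following. The transport is left to the caller: `post(path, body)` is a hypothetical callable that sends a JSON body to the cluster and returns the decoded response, and the function names, index name, and field list are illustrative only.

```python
import csv
from typing import Callable, Iterable, Iterator, List


def scroll_hits(
    post: Callable[[str, dict], dict],
    index: str,
    query: dict,
    size: int = 1000,
    keep_alive: str = "1m",
) -> Iterator[dict]:
    """Yield every hit of a search by following the scroll ID page by page."""
    page = post(f"/{index}/_search?scroll={keep_alive}", {"size": size, "query": query})
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Ask for the next batch with the scroll ID from the previous response.
        page = post(
            "/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )


def write_csv(hits: Iterable[dict], fields: List[str], out) -> int:
    """Write one CSV row per hit (document ID plus selected fields); return the row count."""
    writer = csv.writer(out)
    writer.writerow(["_id", *fields])
    count = 0
    for hit in hits:
        source = hit.get("_source", {})
        writer.writerow([hit["_id"], *(source.get(name, "") for name in fields)])
        count += 1
    return count
```

A call such as `write_csv(scroll_hits(post, "my_index", {"match_all": {}}), ["name"], open("out.csv", "w", newline=""))` would stream the whole index to a file. The sketch does not clear the scroll context when it finishes; it simply lets it expire after `keep_alive`.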

Regarding your duplicates, what makes two documents equal in your case (one field is the same in both documents, multiple fields are the same, etc.)?
Depending on your answer and on how many documents are duplicated, there may be ways to find them without exporting them to a different tool.

zatom · June 18, 2019, 5:27pm

@Andrei_Stefan As I understand it, all the fields in the documents are the same, meaning there are exact copies of the original documents. But I found that Elasticsearch IDs like "_id" : "DIkra2sBDJMGugEzLaDq" are unique even when a document is an exact copy of another one. So for 5 copies of the same document, there are 5 different unique IDs.

Each document has a field whose value is supposed to be unique. We want to delete duplicates on the basis of that field value. Right now, we can count how many times each field value occurs throughout the index using an aggregation. We are not sure how to go from there to deleting the duplicates.
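
One possible way to go from counts to deletable IDs, sketched under assumptions: a `terms` aggregation with `min_doc_count: 2` keeps only the values that occur more than once, and a `top_hits` sub-aggregation returns the `_id` of each copy. The field name `record_key` is hypothetical and is assumed to be a keyword field; the function names and the `max_buckets` and `max_copies` limits are also made up for this example.

```python
from typing import List


def duplicates_body(field: str, max_buckets: int = 1000, max_copies: int = 10) -> dict:
    """Build a search body that lists values occurring at least twice, with the IDs of their copies."""
    return {
        "size": 0,  # only the aggregation is of interest, not the search hits
        "aggs": {
            "duplicates": {
                "terms": {"field": field, "min_doc_count": 2, "size": max_buckets},
                "aggs": {
                    # _source is switched off because only the _id of each copy is needed.
                    "docs": {"top_hits": {"size": max_copies, "_source": False}}
                },
            }
        },
    }


def ids_to_delete(response: dict) -> List[str]:
    """Keep the first copy in every bucket and return the IDs of all the others."""
    surplus: List[str] = []
    for bucket in response["aggregations"]["duplicates"]["buckets"]:
        copies = [hit["_id"] for hit in bucket["docs"]["hits"]["hits"]]
        surplus.extend(copies[1:])
    return surplus


# Hypothetical field name; send this body to the index's _search endpoint.
body = duplicates_body("record_key")
```

With millions of distinct values a single request will not cover everything, so the search would have to be repeated after each round of deletions until no buckets come back.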

Andrei_Stefan · June 18, 2019, 5:52pm

Are there so many duplicates that you cannot do this manually, or with a small script in your language of choice, given that you have the IDs of the duplicated documents?

For example, something like this: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/

zatom · June 18, 2019, 6:59pm

@Andrei_Stefan We tried deleting documents with delete_by_query, but that is not practical for millions of documents. Is there any way to delete a list of IDs through delete_by_query?

For example, is something like this possible?
POST /type_1/_delete_by_query
{
  "query": {
    "match": {
      "_id": ["DIkra2sBDJMGugEzLaDq", "asdfds", ...]
    }
  }
}
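
For reference, a list of document IDs is usually expressed with an `ids` query rather than `match`. The sketch below only builds the request bodies, in batches, so that each one can be sent to `POST /type_1/_delete_by_query`; the function names and the batch size of 1000 are arbitrary choices for this example, not documented limits.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split a sequence into consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def delete_by_ids_bodies(ids: Sequence[str], batch_size: int = 1000) -> Iterator[dict]:
    """Yield one _delete_by_query body per batch, each matching documents by _id."""
    for batch in chunked(ids, batch_size):
        yield {"query": {"ids": {"values": batch}}}


# Example with the two IDs from the question above.
bodies = list(delete_by_ids_bodies(["DIkra2sBDJMGugEzLaDq", "asdfds"], batch_size=1))
```

Batching keeps every request small, and a failed batch can be retried without redoing the ones that already went through.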

system · July 16, 2019, 6:59pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
