Using the SQL API in ES, the Kibana console only displays 1000 results. But my results number in the millions. So far, I am only able to see further batches of results using the cursor ID, but that takes a long time and is not convenient. Are there any tricks I am not aware of? Is there a way to get all the results, no matter how many there are (without harming the cluster)? Thank you!
There is no practical use in getting all the documents (millions of them) from Elasticsearch in one go onto a display page. For one, the amount of data transported over the wire, and which the browser then needs to handle, is huge; secondly, ES would probably crash. But first and foremost, a human cannot realistically deal with several million documents.
Using the cursor is the way to go. If you want to have a look at "some" documents, refine your search to show you only those documents you might be interested in.
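For reference, a minimal sketch of the cursor workflow in the Kibana console (my_index and the fetch size are placeholders):

POST /_sql?format=json
{
  "query": "SELECT * FROM my_index",
  "fetch_size": 1000
}

The response carries a "cursor" string; passing it back retrieves the next page, until a response no longer returns a cursor:

POST /_sql?format=json
{
  "cursor": "<cursor value from the previous response>"
}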
@Andrei_Stefan But I need to use the results outside of Elasticsearch, in a data integration tool that can handle JSON queries and where I can export the data to CSV or any other format.
So far, due to the 10,000-result size limit, I am not able to work with all the results in my data integration software. I want to find all my duplicates and delete them. What is the best route for that?
Thank you!
For the first question, the answer is always a series of queries: either using cursors in ES SQL or using the Scroll API in Elasticsearch. There are also tools out there that export data from Elasticsearch in one format or another, but they all use the same approach under the hood.
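As an illustration, a minimal sketch of the Scroll approach (index name, page size, and keep-alive are placeholders):

POST /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

Each response returns a _scroll_id; feeding it back retrieves the next batch, until no more hits come back:

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}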
Regarding your duplicates, how can two documents be equal in your case? (Is one field the same in both documents, are multiple fields the same, etc.?)
Depending on your answer and on how many documents are duplicated, there can be ways to find them without needing to export them to a different tool.
@Andrei_Stefan All the fields in the documents are the same, meaning the duplicates are exact copies of the original documents. But I found that the Elasticsearch IDs, like "_id" : "DIkra2sBDJMGugEzLaDq", are unique even when the document is an exact copy of another. So for 5 copies of the same document, there are 5 different unique IDs.
We have a unique field value for each document, and we want to delete duplicates on the basis of that field value. Right now, we can count how many times each field value occurs throughout the index using an aggregation. We are not sure how to go from there to actually deleting the duplicates.
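For illustration, a minimal sketch of such a counting aggregation (unique_field is a placeholder name; a min_doc_count of 2 keeps only values that occur more than once):

POST /my_index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_values": {
      "terms": {
        "field": "unique_field",
        "min_doc_count": 2,
        "size": 100
      }
    }
  }
}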
Are there so many duplicates that you cannot do this manually, or with a small script in your language/scripting language of choice, once you have the IDs of the duplicated documents?
For example, something like this: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/
@Andrei_Stefan We tried deleting documents with _delete_by_query, but that is not practical for millions of documents. Is there any way to delete a list of IDs through _delete_by_query?
For example, is something like this possible?
POST /type_1/_delete_by_query
{
  "query": {
    "match": {
      "_id": ["DIkra2sBDJMGugEzLaDq", "asdfds", ... etc.]
    }
  }
}
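For what it's worth, a minimal sketch of one supported way to target a list of document IDs uses the ids query rather than match (the second ID below is the placeholder from the snippet above):

POST /type_1/_delete_by_query
{
  "query": {
    "ids": {
      "values": ["DIkra2sBDJMGugEzLaDq", "asdfds"]
    }
  }
}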