Java API: How to sort scrolled response by document ID


(Michael Celaya) #1

Hi all,

I'm currently developing (library version 2.3.3) a data validator that checks that the data in our database matches the documents in ES (every field of each document) over a given time window. The window could be 10 seconds or it could be hours; the number of documents to validate could reach a few million.

To do this, I query the database ordered by our internal ID (which is the same as the document ID in ES), query ES with the same sort criteria, and then loop through the DB result set and the ES results, matching each DB column value to the corresponding field in the ES document.
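The lockstep walk over two identically sorted result sets described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `SortedIdDiff`, the method `diff`, and the sample IDs are hypothetical, both sides are assumed to be sorted as strings, and the field-by-field comparison is reduced to a comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedIdDiff {

    /**
     * Walks two ID lists that are sorted in the same (String) order and
     * reports IDs that appear on only one side. Hypothetical helper, not
     * part of any Elasticsearch or JDBC API.
     */
    public static List<String> diff(List<String> dbIds, List<String> esIds) {
        List<String> problems = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < dbIds.size() && j < esIds.size()) {
            int cmp = dbIds.get(i).compareTo(esIds.get(j));
            if (cmp == 0) {
                // Same ID on both sides: the column-vs-field comparison would go here.
                i++;
                j++;
            } else if (cmp < 0) {
                // The DB row sorts before the current ES hit, so ES has no such document.
                problems.add("missing in ES: " + dbIds.get(i));
                i++;
            } else {
                // The ES hit sorts before the current DB row, so the DB has no such row.
                problems.add("missing in DB: " + esIds.get(j));
                j++;
            }
        }
        // Whatever is left on either side has no counterpart.
        while (i < dbIds.size()) {
            problems.add("missing in ES: " + dbIds.get(i++));
        }
        while (j < esIds.size()) {
            problems.add("missing in DB: " + esIds.get(j++));
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> db = Arrays.asList("a1", "a2", "a4");
        List<String> es = Arrays.asList("a1", "a3", "a4");
        System.out.println(diff(db, es)); // [missing in ES: a2, missing in DB: a3]
    }
}
```

The walk is only correct if both inputs really arrive in the same order; if either side is ordered differently, every mismatch in order shows up as a spurious "missing" entry.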

My problem is that, for some reason, sorting by _id is not working, and the order of the hits in the scroll response seems arbitrary.

Here is the relevant code (some parts omitted for clarity):

[...]
private static SearchResponse      scrollResponse;  
private static SearchHit[]         esResults;
[...]
QueryBuilder ESQuery =  QueryBuilders.rangeQuery("dashboardTimestamp")
                                                    .from(tsFrom)
                                                    .to(tsTo)
                                                    .includeLower(true)
                                                    .includeUpper(false);

FieldSortBuilder sorter = SortBuilders.fieldSort("_id").order(SortOrder.ASC);

scrollResponse = client.prepareSearch().setQuery(ESQuery)
                                       .setScroll(new TimeValue(60000))
                                       .addField("_id")
                                       .addSort(sorter)
                                       .setSize(1000)
                                       .setExplain(true).execute()
                                       .actionGet();
[...]
// printing function
while (scrollResponse.getHits().getHits().length > 0) {
    SearchHit[] results = scrollResponse.getHits().getHits();
    for (SearchHit hit : results) {
        System.out.println("ID: " + hit.getId());
    }
    scrollResponse = client.prepareSearchScroll(scrollResponse.getScrollId())
                           .setScroll(new TimeValue(60000))
                           .execute()
                           .actionGet();
}
[...]

This is the latest version of the code (addField is messing up the results, but I had to try it). However, I have tried several different ways of preparing the search and none have worked so far:

  • Not using addField("_id")
  • Directly sorting with addSort("_id", SortOrder.ASC) instead of creating the sort object separately
  • Using a ScoreSortBuilder instead of FieldSortBuilder

I'm sure I'm missing something, but I haven't found any hints as to why this might not be working. I've read that sorting is not enabled for scan searches, but I'm not using those either... Could anybody help me with this, please?

Thank you very much in advance,
Michael.


(Michael Celaya) #2

Could anyone point me in the right direction / help me, please?


(Luca Cavanna) #3

Hi,
sorting by _id doesn't do anything because the _id field is not indexed. Try sorting by _uid instead. That is another metadata field, which combines the type and the id (in the form type#id). You are scrolling over a single index, single type I assume? In that case sorting by _uid should give you what you need.
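One thing to keep in mind when lining this up with a database query: _uid values are strings, so the resulting order is lexicographic rather than numeric. The thread does not say whether the internal IDs are numeric, so the sketch below assumes they are, purely for illustration; the class name `UidOrderDemo` and the sample values are hypothetical. It shows the order a string sort produces for numeric-looking IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UidOrderDemo {

    /**
     * Returns a copy of the IDs sorted as strings, which is how a sort on a
     * string field orders them. Hypothetical helper for illustration only.
     */
    public static List<String> stringOrder(List<String> ids) {
        List<String> copy = new ArrayList<>(ids);
        Collections.sort(copy); // natural String order, i.e. lexicographic
        return copy;
    }

    public static void main(String[] args) {
        // Assumed numeric-looking IDs; numerically this would be 2, 9, 10, 100.
        System.out.println(stringOrder(Arrays.asList("9", "10", "100", "2")));
        // prints [10, 100, 2, 9]
    }
}
```

If the IDs are numeric in the database, the database query would probably need to order by the ID cast to text (with a matching collation) so that both sides arrive in the same sequence.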

It would be more performant to use the scan search_type if you want to scroll over a large number of documents, but as you said that one doesn't support sorting.

Cheers
Luca


(Michael Celaya) #4

Hi Luca!

Thank you very much for your help; that worked like a charm!

Yes, for the moment we just have one type of document, and regarding the indexes, I can definitely design the validation to go through one index at a time!

Cheers,
Michael.

PS: Sorry for the late reply; we had technical issues with ES and I couldn't test it until today.

