Most efficient way to count entries with specific fields

Jeremy_LHEZ · June 16, 2020, 9:40pm

Hello everyone,

I am currently working on a program in which I need to count the entries in an index that contain certain fields.
I work in Java with a RestHighLevelClient. I use PutStoredScriptRequest to add my script and SearchTemplateRequest to execute it. My query is in JSON and looks like this:

{"query":{"bool":{"should":[{"exists":{"field":"word1"}},{"exists":{"field":"word2"}},...]}}}

I get the results I want using response.getHits().getTotalHits(), but I wonder whether there is a more efficient way to achieve that. I work on a very large index (over 50000 entries) and need the best possible performance.

Jeremy_LHEZ · June 19, 2020, 7:05am

For later readers: I found a way to improve performance, simply by switching to the low-level client.
It offers the same capabilities for my purposes, while being much faster on large indices.

dadoonet · June 19, 2020, 7:29am

Could you share the code you wrote for both?

Jeremy_LHEZ · June 25, 2020, 7:40am

Sorry for the late answer, I forgot to check the forum after I found my solution.
My new implementation is based on a simple method to perform queries in Java:

    public String queryES(String queryType, String endpoint, String body) {
        Request request = new Request(queryType, endpoint);
        request.setJsonEntity(body);
        try {
            Response response = restClient.performRequest(request);
            return EntityUtils.toString(response.getEntity());
        } catch (IOException e) {
            e.printStackTrace();
        }

        return null;
    }

I found this information in the official documentation here (I'm using Elasticsearch version 6.5.0).

The old version simply used a RestHighLevelClient and SearchTemplateRequest for querying. I thought it was required because it was already in some classes when I started modifying existing code. Once again, documentation about high level client methods can be found here.

It's a bit vague, but I have a lot of different methods and can't really post them in detail. I hope it will help; I learned everything I know about Elasticsearch from scratch using the documentation. I can post more details or answer questions if someone needs them.

dadoonet · June 25, 2020, 8:07am

Thanks for sharing your code. But I'm still concerned by this:

If a future reader comes to this page, they will think that the HLRestClient is slow. Period.
Which is not the case IMHO. It's obviously a bit slower than the LLRestClient because it has to parse the JSON response to create Java beans. If you don't do that with the HLRestClient, you will probably parse the response yourself in your code, I guess, unless you just send it back to the interface as is.

So I'd like to understand why it was slow in your case. Do you mind sharing what you now have in body and what the HLRestClient code looked like?
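To illustrate the parsing point above: with the low-level client, a method like the `queryES` posted earlier returns the raw JSON string, so extracting the number is left to the caller. Below is a minimal sketch, assuming the body is sent to the `_count` endpoint (which the thread does not show being used) and that the response has the usual `{"count":N,"_shards":{...}}` shape. The class and method names are hypothetical, and the regex is only a shortcut; a real JSON parser would be more robust.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the "count" value out of a raw _count response string.
class CountResponseParser {
    // Matches the first "count": <digits> pair in the response body.
    private static final Pattern COUNT = Pattern.compile("\"count\"\\s*:\\s*(\\d+)");

    static OptionalLong parseCount(String json) {
        if (json == null) {
            return OptionalLong.empty(); // e.g. queryES returned null after an IOException
        }
        Matcher m = COUNT.matcher(json);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Sample response body written by hand for illustration, not real output.
        String sample = "{\"count\":42,\"_shards\":{\"total\":5,\"successful\":5,\"skipped\":0,\"failed\":0}}";
        System.out.println(parseCount(sample).getAsLong());
    }
}
```

With the method from the thread, usage might look like `parseCount(queryES("GET", "/my-index/_count", body))`, where `my-index` is a placeholder index name.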

Jeremy_LHEZ · June 25, 2020, 9:10am

I cannot show the algorithm, it is the property of my company and should remain private.
It is a search engine that looks for documents containing specific keywords. There are several queries to identify the keywords given by a user and their synonyms. The body looks like this:

{
   "query":{
      "bool":{
         "should":[
            {
               "exists":{
                  "field":"content.keyword"
               }
            },
            {
               "exists":{
                  "field":"content.synonym1"
               }
            },
            {
               "exists":{
                  "field":"content.synonym2"
               }
            }
         ]
      }
   }
}

As I mentioned, it is a big index with over 13000 documents, and since the user request itself can be pretty long, it requires several queries. The performance gain was approximately two seconds for the most complex tests I ran.

The previous requests were put on the server before their execution, using two methods:

    // `query` is assumed to be a field holding the JSON source of the stored script.
    public static boolean putStoredScript(String scriptId) {
        PutStoredScriptRequest req = new PutStoredScriptRequest();
        req.id(scriptId);
        req.content(new BytesArray(query), XContentType.JSON);
        try {
            AcknowledgedResponse putStoredScriptResponse = restHighLevelClient.putScript(req, RequestOptions.DEFAULT);
            return putStoredScriptResponse.isAcknowledged();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return false;
    }

    public static SearchResponse runQueryFromScript(String index, String scriptId, Map<String, Object> params) {
        SearchTemplateRequest request = new SearchTemplateRequest();
        request.setRequest(new SearchRequest(index));
        request.setScriptType(ScriptType.STORED);
        request.setScript(scriptId);
        request.setScriptParams(params);
        request.setProfile(true); // profiling is enabled for every search
        try {
            SearchTemplateResponse response = restHighLevelClient.searchTemplate(request, RequestOptions.DEFAULT);
            return response.getResponse();
        } catch (IOException e) {
            e.printStackTrace();
            // Return null instead of dereferencing a null response.
            return null;
        }
    }

Those were already implemented; I only tried alternatives.

dadoonet · June 25, 2020, 9:30am

But the Java code you have is totally different from the query you are running with the LLClient.

The first one is using a script and a search template.
The second one is using a bool query.

This is like comparing apples and oranges.

I'm pretty sure that if you implement the same query you showed as JSON but with the HLClient, you probably won't see the difference.
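One way to check this is to time the same query through both clients under identical conditions. The following is a minimal sketch of such a timing helper, not a rigorous benchmark: the class name, method name, and the warm-up and run counts are all hypothetical, and the `Supplier` stands in for whichever client call is being measured. Note also that the older code shown above calls `setProfile(true)`, so a like-for-like comparison would need the same setting on both sides.

```java
import java.util.function.Supplier;

// Hypothetical helper: averages wall-clock time of a call after a few warm-up runs.
class QueryTimer {
    static <T> double averageMillis(Supplier<T> call, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        // Warm-up calls are executed but not timed (JIT, connection setup, caches).
        for (int i = 0; i < warmups; i++) {
            call.get();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            call.get();
        }
        // Convert total nanoseconds to average milliseconds per call.
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the low-level or high-level client call.
        double avg = averageMillis(() -> String.valueOf(Math.sqrt(12345.0)), 5, 20);
        System.out.println("average ms per call: " + avg);
    }
}
```

The same `Supplier` shape could wrap either the `queryES` call or the high-level client call, so that both are measured with the same query body and the same number of runs.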

Jeremy_LHEZ · June 25, 2020, 9:49am

The first query used to be a template when only keywords were searched. After synonyms were added, that was no longer possible, because the number of variables varies from one query to another. It is still possible to execute the query through the template mechanism without any variables, but I didn't know this would affect performance, since there are no parameters.

I will give it a try, thank you for the clarification. I also find the low-level client nicer to use: I end up with less code overall (perhaps because I was using the previous methods the wrong way), and I am more familiar with queries in the mustache format.
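Because the number of synonym fields changes per request, the body can be assembled in code rather than stored as a fixed template. Here is a minimal sketch of that idea using only the standard library; the class and method names are hypothetical, and the escaping handles only backslashes and double quotes, which is assumed to be enough for field names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds a bool/should body with one exists clause per field.
class ExistsQueryBuilder {
    static String buildBody(List<String> fields) {
        String clauses = fields.stream()
                .map(f -> "{\"exists\":{\"field\":\"" + escape(f) + "\"}}")
                .collect(Collectors.joining(","));
        return "{\"query\":{\"bool\":{\"should\":[" + clauses + "]}}}";
    }

    // Escapes backslashes and double quotes only; not a full JSON string encoder.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Field names taken from the example body posted in this thread.
        List<String> fields = Arrays.asList("content.keyword", "content.synonym1", "content.synonym2");
        System.out.println(buildBody(fields));
    }
}
```

The resulting string has the same shape as the JSON body shown in the thread and could be passed as the `body` argument of `queryES`.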

system · July 23, 2020, 9:49am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
