Most efficient way to count entries with specific fields

Hello everyone,

I am currently working on a program and I need to count the number of entries in an index containing some specific fields.
I work in Java with a RestHighLevelClient, and I use PutStoredScriptRequest to store my script and SearchTemplateRequest to execute it. My query is in JSON and looks like this:

{"query":{"bool":{"should":[{"exists":{"field":"word1"}},{"exists":{"field":"word2"}},...]}}}

I get the results I want using response.getHits().getTotalHits(), but I wonder whether there is a more efficient way to achieve this. I work on a very large index (over 50000 entries) and need the best possible performance.

For later readers: I found a way to improve performance, simply by switching to the low-level client.
It covers everything I need, while being much faster on large indices in my tests.

Could you share the code you wrote for both?

Sorry for the late answer, I forgot to check the forum after I found my solution.
My new implementation is based on a simple method to perform queries in Java:

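    // Imports needed: java.io.IOException, org.apache.http.util.EntityUtils,
    // org.elasticsearch.client.Request, org.elasticsearch.client.Response

    // Sends a request through the low-level client (restClient) and returns the raw
    // JSON response body, or null if the request fails.
    // queryType is the HTTP method, e.g. "GET" or "POST".
    public String queryES(String queryType, String endpoint, String body) {
        Request request = new Request(queryType, endpoint);
        request.setJsonEntity(body);
        try {
            Response response = restClient.performRequest(request);
            return EntityUtils.toString(response.getEntity());
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }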

I found this information in the official documentation here (I'm using Elasticsearch version 6.5.0).

The previous version simply used a RestHighLevelClient and SearchTemplateRequest for querying. I thought it was required because it was already used in some classes when I started modifying the existing code. Once again, documentation about the high-level client methods can be found here.

It's a bit vague, but I have a lot of different methods and can't really post them in detail. I hope it helps; I learnt everything I know about Elasticsearch from scratch using the documentation. I can post more details or answer questions if someone needs it.
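
For readers who only need the number of matching documents, one option not discussed in this thread is Elasticsearch's _count API, which returns a count without building a hit list. The sketch below is meant to be used with a helper like queryES above. The names CountSketch, countEndpoint and parseCount are hypothetical, and the regex extraction is a shortcut that assumes the usual {"count":N,"_shards":{...}} response shape; a real JSON parser would be more robust.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread: small utilities around the _count API.
final class CountSketch {

    // Matches the top-level "count" value in a _count response.
    private static final Pattern COUNT = Pattern.compile("\"count\"\\s*:\\s*(\\d+)");

    // Builds the endpoint path for the _count API of one index.
    static String countEndpoint(String index) {
        return "/" + index + "/_count";
    }

    // Extracts the count from the raw JSON response body.
    // Assumes the response contains a numeric "count" field.
    static long parseCount(String json) {
        Matcher m = COUNT.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no \"count\" field in response");
        }
        return Long.parseLong(m.group(1));
    }
}
```

With the helper from this post, a call might look like queryES("POST", CountSketch.countEndpoint("my-index"), body), where "my-index" is a placeholder and body contains only the "query" object, since _count does not accept search options such as "size".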

Thanks for sharing your code. But I'm still concerned by this:

If a future reader comes to this page, they will think that the HLRestClient is slow. Period.
Which is not the case IMHO. It's obviously a bit slower than the LLRestClient because it has to parse the JSON response to create Java beans. If you don't do that with the HLRestClient, you will probably parse the response yourself in your code, I guess, unless you just send it back to the interface as is.

So I'd like to understand why it was slow in your case. Do you mind sharing what you now have in body and what the HLRestClient code looked like?

I cannot show the algorithm; it is the property of my company and must remain private.
It is a search engine that looks for documents containing specific keywords. Several queries identify the keywords given by a user and their synonyms. The body looks like this:

{
   "query":{
      "bool":{
         "should":[
            {
               "exists":{
                  "field":"content.keyword"
               }
            },
            {
               "exists":{
                  "field":"content.synonym1"
               }
            },
            {
               "exists":{
                  "field":"content.synonym2"
               }
            }
         ]
      }
   }
}

As I mentioned, it is a big index with over 13000 documents, and since the user request itself can be pretty long, it requires several queries. The performance gain was approximately two seconds for the most complex tests I ran.

The previous requests were stored on the server before execution, using two methods:

    // Stores the template on the server under scriptId.
    // `query` is presumably a static field holding the template source (not shown in this post).
    public static boolean putStoredScript(String scriptId) {
        PutStoredScriptRequest req = new PutStoredScriptRequest();
        req.id(scriptId);
        req.content(new BytesArray(query), XContentType.JSON);
        try {
            AcknowledgedResponse putStoredScriptResponse = restHighLevelClient.putScript(req, RequestOptions.DEFAULT);
            return putStoredScriptResponse.isAcknowledged();
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    // Runs the stored template against the index and returns the search response,
    // or null if the request fails.
    public static SearchResponse runQueryFromScript(String index, String scriptId, Map<String, Object> params) {
        SearchTemplateRequest request = new SearchTemplateRequest();
        request.setRequest(new SearchRequest(index));
        request.setScriptType(ScriptType.STORED);
        request.setScript(scriptId);
        request.setScriptParams(params);
        // Profiling adds overhead to the search; enable it only while debugging.
        request.setProfile(true);
        try {
            SearchTemplateResponse response = restHighLevelClient.searchTemplate(request, RequestOptions.DEFAULT);
            return response.getResponse();
        } catch (IOException e) {
            e.printStackTrace();
            // Returning here avoids dereferencing a null response after a failed request.
            return null;
        }
    }

Those were already implemented; I only tried alternatives.

But the Java code you have is totally different from the query you are running with the LLClient.

The first one is using a script and a search template.
The second one is using a bool query.

This is like comparing oranges and apples.

I'm pretty sure that if you implement the same query you showed as JSON but with the HLClient, you probably won't see a difference.
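
One way to check this is to time both implementations on the same query, with a few warm-up runs, and compare medians rather than single measurements. A rough sketch using only the JDK; TimingSketch and medianNanos are hypothetical names, and the Runnable is meant to wrap whichever client call is under test.

```java
import java.util.Arrays;

// Hypothetical helper, not from the thread: times a task repeatedly
// and reports the median duration in nanoseconds.
final class TimingSketch {

    static long medianNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        // Warm-up runs are executed but not measured (JIT, connection setup, caches).
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        // For an even number of runs this picks the upper of the two middle samples.
        return samples[measuredRuns / 2];
    }
}
```

For the comparison to be fair, both Runnables should send the same query to the same index, for example one wrapping the low-level call and one wrapping the high-level call, each built from the same bool query.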

The first query used to be a template when only keywords were searched. After the addition of synonyms this was no longer possible, because the number of variables would vary from one query to another. However, it is still possible to execute the query without any variable (there is just no parameter), but I didn't know it would impact performance (since there is no parameter).

I will give it a try, thank you for the clarification. I also find the low-level client nicer to use: I end up with less code overall (perhaps because I was using the previous methods the wrong way), and I am more familiar with queries in the mustache format.
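
Since the number of fields varies from one request to another, the body can also be assembled in code instead of being stored as a template. A minimal sketch using only the JDK; ExistsQuerySketch, quote and buildBody are hypothetical names, and the "size": 0 entry is an assumption that only the total is needed from a _search call, not the hits themselves.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not from the thread: builds a bool/should body of
// "exists" clauses for however many fields a given request needs.
final class ExistsQuerySketch {

    // Escapes backslashes and double quotes so a field name cannot break the JSON.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String buildBody(List<String> fields) {
        if (fields.isEmpty()) {
            // A bool query with no clauses would match every document.
            throw new IllegalArgumentException("at least one field is required");
        }
        String clauses = fields.stream()
                .map(f -> "{\"exists\":{\"field\":" + quote(f) + "}}")
                .collect(Collectors.joining(","));
        // "size":0 (an assumption) asks _search to skip returning hits.
        return "{\"size\":0,\"query\":{\"bool\":{\"should\":[" + clauses + "]}}}";
    }
}
```

Calling ExistsQuerySketch.buildBody(List.of("content.keyword", "content.synonym1")) yields a body with two exists clauses, which can then be sent to the _search endpoint with either client.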
