Query's getTermsEnum not executed

I've implemented a custom Elasticsearch Java plugin built around a CustomScoreQuery instance. The purpose of the plugin is to find documents whose long field value lies within a given Hamming distance of the value provided in the query. The CustomScoreQuery instance is in turn instantiated with a MultiTermQuery instance; the MultiTermQuery subclass matches documents using a long field. Interestingly, the protected TermsEnum getTermsEnum(Terms terms, AttributeSource atts) throws IOException method is never executed. On the other hand, the terms iterator is returned when the field is mapped as text. In addition, the method is executed when using Solr, which leads me to assume that the problem might be in the index document type mapping. The method is paramount to searching the index because it implements the logic that classifies documents as hits or not based on the given criteria.

Based on this, I would appreciate it if anyone could explain why the method is never executed when the field type is long. The field has been set to not_analyzed and index=true.

Below you may find fragments of the source code.

public final class SimilarityCustomScoreQuery extends CustomScoreQuery {

    private final String queryField;
    private final Inference.Response response;
    private final String scoreField;

    public SimilarityCustomScoreQuery(String queryField, String scoreField, Inference.Response response, int maxDistance) {
        super(new SimilarityQuery(queryField, response, maxDistance));
        this.queryField = queryField;
        this.scoreField = scoreField;
        this.response = response;
    }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) throws IOException {
        return new SimilarityCustomScoreProvider(context, this.scoreField, this.response);
    }
}

The SimilarityQuery implementation is as follows:

public final class SimilarityQuery extends MultiTermQuery {

    interface Params {
        String MAX_DISTANCE = "max_distance";
    }

    /**
     * The default maximum Hamming distance.
     */
    public static final int MAX_DISTANCE_DEFAULT = 13;

    /**
     * The maximum Hamming distance at which a document is still
     * accepted as a search hit.
     */
    private final int maxDistance;

    private final long value;

    public SimilarityQuery(String field, Inference.Response response, Integer maxDistance) {
        super(field);
        this.maxDistance = Objects.isNull(maxDistance) ? MAX_DISTANCE_DEFAULT : maxDistance;
        this.value = SimilarityQuery.value(response);
    }

    // NOTE: getTermsEnum is never executed when the field type in the index mapping is long.
    @Override
    protected TermsEnum getTermsEnum(Terms terms, AttributeSource atts) throws IOException {
        return new SimilarityTermsEnum(terms.iterator(), this.value, this.maxDistance);
    }

    @Override
    public String toString(String field) {
        return String.format("%s:%d", this.field, this.value);
    }

    @Override
    public int hashCode() {
        // super.hashCode() already incorporates the field, so only the
        // query-specific state is added here.
        final int prime = 31;
        int hashCode = prime * super.hashCode() + this.maxDistance;
        return prime * hashCode + Long.hashCode(this.value);
    }

    @Override
    public boolean equals(Object obj) {
        // Keep equals consistent with hashCode; Lucene's query caching relies on both.
        if (!super.equals(obj)) {
            return false;
        }
        SimilarityQuery other = (SimilarityQuery) obj;
        return this.maxDistance == other.maxDistance && this.value == other.value;
    }
}
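
The SimilarityTermsEnum referenced above is not part of the posted fragments. For context, here is a minimal sketch of what such a filter could look like, assuming the field's terms can be parsed back into longs (as they could be for a keyword field holding decimal strings; as the reply below explains, a point-encoded long field exposes no terms at all). The class body is illustrative, not the original implementation:

import org.apache.lucene.index.FilteredTermsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class SimilarityTermsEnum extends FilteredTermsEnum {

    private final long value;
    private final int maxDistance;

    SimilarityTermsEnum(TermsEnum tenum, long value, int maxDistance) {
        super(tenum);
        this.value = value;
        this.maxDistance = maxDistance;
        setInitialSeekTerm(new BytesRef()); // start at the first term of the field
    }

    @Override
    protected AcceptStatus accept(BytesRef term) {
        // Assumes every term parses as a long; this would not hold for arbitrary text.
        long candidate = Long.parseLong(term.utf8ToString());
        // Hamming distance of two longs: population count of their XOR.
        int distance = Long.bitCount(candidate ^ this.value);
        return distance <= this.maxDistance ? AcceptStatus.YES : AcceptStatus.NO;
    }
}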

The long type in Elasticsearch is mapped to the Point datatype in Lucene. This datatype does not create terms and cannot be used in a MultiTermQuery. The fact that it works on Solr is linked to the fact that the version of Solr you use still maps numbers to terms. This changed in Elasticsearch starting with v5 and should also be true in Solr as of the latest version (v7).
If you want to do Hamming distance on strings, you should define your field as a keyword; in that case the MultiTermQuery will be able to access the TermsEnum for the field. For numbers, Hamming distance is not applicable and you should use a range query instead.
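
To see why the method is never called for a long field, the difference can be reproduced with plain Lucene. A minimal sketch using Lucene 6/7-era APIs (field names are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.store.RAMDirectory;

public class PointsVersusTermsDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new LongPoint("hash", 4660L));                       // point-encoded long
            doc.add(new StringField("hash_kw", "4660", Field.Store.NO)); // keyword-style term
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            LeafReader leaf = reader.leaves().get(0).reader();
            // Points create no postings, so there is no Terms instance and
            // MultiTermQuery.getTermsEnum is never reached for this field.
            System.out.println(leaf.terms("hash"));    // prints: null
            System.out.println(leaf.terms("hash_kw")); // prints a Terms instance
        }
    }
}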

Thanks for the reply. In this case, Hamming distance refers to bit distance and is calculated using a bitwise XOR. What query class should I implement in order to achieve this?
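
For concreteness, this computation is a one-liner in Java:

// Hamming distance between two 64-bit values: the number of differing bits,
// i.e. the population count of their bitwise XOR.
static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
}
// e.g. hammingDistance(0b1011L, 0b1110L) == 2 (bits 0 and 2 differ)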

You can look at PointRangeQuery, which is the abstract implementation for range queries on points. Note that you'll have to enumerate all points (numbers in your case) and test them all to get the matching candidates, which can be slow if you have a lot of indexed long values.
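
To make that concrete, here is a minimal sketch of such an exhaustive point scan, assuming the Lucene 7-style PointValues API; the field name, the helper class, and the FixedBitSet collection are illustrative, not from the thread:

import java.io.IOException;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.util.FixedBitSet;

final class HammingPointScan {
    /** Collects the docIDs whose indexed long point lies within maxDistance of value. */
    static FixedBitSet matchingDocs(LeafReader reader, String field,
                                    long value, int maxDistance) throws IOException {
        FixedBitSet matching = new FixedBitSet(reader.maxDoc());
        PointValues points = reader.getPointValues(field);
        if (points == null) {
            return matching; // no point values indexed for this field in this segment
        }
        points.intersect(new PointValues.IntersectVisitor() {
            @Override
            public void visit(int docID) {
                // Only called when a whole cell is known to match; compare() below
                // never returns CELL_INSIDE_QUERY, so this is never reached.
            }

            @Override
            public void visit(int docID, byte[] packedValue) {
                long candidate = LongPoint.decodeDimension(packedValue, 0);
                if (Long.bitCount(candidate ^ value) <= maxDistance) {
                    matching.set(docID);
                }
            }

            @Override
            public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
                // Hamming distance gives no useful ordering over the BKD tree's value
                // ranges, so every cell must be crossed and every value tested.
                return PointValues.Relation.CELL_CROSSES_QUERY;
            }
        });
        return matching;
    }
}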

Thanks for the clarification. How would you then go about calculating custom scores by implementing the Scorer interface? The custom query is expected to return a custom score based on a multi-valued long point field.

Sorry, I don't understand the question. Why are you using a CustomScoreQuery? It seems unrelated to your problem.
You only need to extend Query or PointRangeQuery and implement the logic that you want on this point field.
PointRangeQuery is a good starting point because it shows how you can iterate the values of a point field.

I apologize for the misunderstanding. The CustomScoreQuery was used before; now that I've subclassed PointRangeQuery, I'm not using the CustomScoreQuery any longer.

PointRangeQuery uses the ConstantScoreScorer, whereas I need a custom Scorer in order to compute the score for each matching document using another long point field. That is, one long point field is used for querying, while scores need to be calculated from a second long point field of each matching document.

OK, I understand. Then you need to write your own Scorer that takes the distance into account. As I said before, though, this query will likely be very slow, since it needs to iterate all values even when they are indexed as points. You should maybe try another approach that does not incur such a cost.

Thanks for the clarification. Is the performance impact due to using multi-valued long points, or would any data type have the same impact? For example, the current Scorer implementation retrieves a Document instance and its associated IndexableField, which is then used for scoring. This value is a multi-valued point, i.e. an array of longs. Would using another data type, such as keyword, boost performance because Elasticsearch treats arrays as individual values there, or not?

Is the performance impact due to using multi-valued long points, or would any data type have the same impact?

Any data type would be slow if you need to iterate all possible values to find matching candidates.

Regarding the Scorer implementation, you should not retrieve a Document; that is far too costly to do for every document, so you should rely on the indexed field or on doc_values. You can check SortedNumericDocValuesField.newSlowRangeQuery for an example of a query that retrieves numeric values from doc_values to match specific documents.
This discussion is more about Lucene than Elasticsearch, so you'd get better advice by asking on the Lucene mailing list instead:
java-user@lucene.apache.org
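
Picking up the doc_values suggestion, here is a minimal sketch of how a per-document score could be computed from a second multi-valued long field, assuming the Lucene 7-style iterator API; the field name and the scoring formula are illustrative, not from the thread:

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedNumericDocValues;

final class DocValuesScoring {
    /** Scores a document by the smallest Hamming distance among its values. */
    static float scoreFor(LeafReaderContext context, int docId,
                          long queryValue) throws IOException {
        SortedNumericDocValues dv =
                DocValues.getSortedNumeric(context.reader(), "score_field");
        if (!dv.advanceExact(docId)) {
            return 0f; // the document has no values for this field
        }
        int best = Long.SIZE; // worst possible Hamming distance for a long
        for (int i = 0; i < dv.docValueCount(); i++) {
            best = Math.min(best, Long.bitCount(dv.nextValue() ^ queryValue));
        }
        return 1f / (1f + best); // smaller distance => higher score
    }
}

In a real Scorer the SortedNumericDocValues instance would be obtained once per segment rather than once per document; the helper above only illustrates the doc_values access pattern.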
