Search By Document Length

krmathieu · October 26, 2016, 2:10pm

I have a field called document.content that contains the full text of each document I am indexing, and I want to return only those records where document.content is longer than a certain length. How would I go about doing that? I came across some examples using scripts but didn't really understand the format of the query. Here is one example I saw; I was expecting to be able to write something more like document.content.length() > 50:

{"query": {"constant_score": {"filter": {"script": {"script": "doc['text'].values.size() > 50"}}}}}

In the example above, I am not sure what that doc array reference is about or what values refers to. I also gather that there may be some issues with using scripts for something like this (aside from the security concerns), so any suggestions on how to do this in the most performant way would be appreciated. Thanks.

Kevin

Camilo_Sierra · October 26, 2016, 2:20pm

It will be easier and faster if you store the length of your text in the same document, in a new field "length_doc" (an int value), and then filter in your query with a range on "length_doc" > 50.
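A minimal sketch of what that range filter could look like, written as a Python dict that serializes to the request body. The field name length_doc comes from the suggestion above; the helper name is only illustrative, and the constant_score wrapper is borrowed from the example in the first post rather than being the only way to do it.

```python
import json


def length_range_query(min_length, field="length_doc"):
    """Build a search body matching documents whose stored length is > min_length.

    `field` is assumed to be an integer field written at index time.
    """
    return {
        "query": {
            "constant_score": {
                "filter": {
                    # "gt" is strictly greater than; use "gte" to include the bound
                    "range": {field: {"gt": min_length}}
                }
            }
        }
    }


range_body = length_range_query(50)
print(json.dumps(range_body, indent=2))
```

The printed JSON can be sent as the body of a _search request. Since the length is a plain numeric field, no script runs at query time.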

I hope it helps.

krmathieu · October 26, 2016, 2:32pm

Thanks for the suggestion. We have 1.8 million docs in our cluster and it took quite a while to get them in because of all the nested objects and we don't want to have to reprocess them if we can avoid it. Is there a way to add the field to the index and then dynamically update all length fields with a script call or something?

Camilo_Sierra · October 26, 2016, 2:43pm

You have the reindex API: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-reindex.html#_reindex_to_change_the_name_of_a_field . You can supply a script there and create a new field holding the length of the content.
I know it is not the easiest way, but your script runs only once (at reindex time) rather than in every search query.
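As a rough sketch, the reindex request body could be built like this in Python. Several things here are assumptions, not taken from the thread: the index names docs_v1 and docs_v2 are hypothetical, the script assumes document is a single object in _source (not an array of nested objects) and that content is always present, and inline Groovy scripting has to be enabled on the cluster. The "inline" key follows the 2.x docs linked above; later versions name it differently.

```python
import json


def reindex_with_length(source_index, dest_index):
    """Build a _reindex body that adds a length_doc field to every document."""
    # Groovy script, run once per document while it is copied to dest_index.
    # Assumes _source.document.content is a string on every document.
    script = "ctx._source.length_doc = ctx._source.document.content.length()"
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {"inline": script},
    }


# Index names are placeholders for illustration only.
reindex_body = reindex_with_length("docs_v1", "docs_v2")
print(json.dumps(reindex_body, indent=2))
```

The resulting JSON would be POSTed to _reindex. It is probably worth trying it against a small test index first to confirm the script handles your document structure.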

krmathieu · October 26, 2016, 3:33pm

Ok, thanks for the additional info, we will check it out.

xavierfacq · October 27, 2016, 7:12am

Hello,

I think you can do it using a script.

Create a simple file in the script directory, ordered_by_length.groovy, containing:

        _source.your_field_name.length()

In your query you can use it:

    {
        "query": {
            "match_all": {}
        },
        "sort": {
            "_script": {
                "type": "number",
                "script": {
                    "lang": "groovy",
                    "file": "ordered_by_length"
                },
                "order": "asc"
            }
        },
        "from": 0,
        "size": 10,
        "aggs": {}
    }

It seems to work, but I don't think it will be very fast... If you need to add a limit, it's also possible
to pass the value as a param to the script.
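A sketch of how passing the limit as a param might look, built as a Python dict. This assumes a second, hypothetical script file longer_than.groovy containing `_source.your_field_name.length() > min_length`, and that _source is readable from a filter script in your version the same way it is in the sort script above. The parameter name min_length is also made up for the example.

```python
import json


def script_length_filter(min_length, script_file="longer_than"):
    """Build a search body that filters with a file-based Groovy script.

    The script file is assumed to compare the field length to `min_length`.
    """
    return {
        "query": {
            "constant_score": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "groovy",
                            "file": script_file,
                            # Values listed here are visible to the script by name
                            "params": {"min_length": min_length},
                        }
                    }
                }
            }
        }
    }


filter_body = script_length_filter(50)
print(json.dumps(filter_body, indent=2))
```

Because the script reads _source for every candidate document, this is likely to be the slowest of the options in this thread.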

Bye,
Xavier

krmathieu · October 28, 2016, 3:37pm

Thanks for the suggestion, we will check it out.

Kevin
