Search By Document Length


(Kevin M.) #1

I have a field called document.content that contains the overall text of the document that I am indexing and I want to return only those records where document.content is greater than a certain length. How would I go about doing that? I did come across some examples using scripts but really didn't understand the format of the query. Here is one example I saw and I was expecting to be able to do something more like document.content.length() > 50:

{"query": {"constant_score": {"filter": {"script": {"script": "doc['text'].values.size() > 50"}}}}}

In this case above, I am not sure what that doc array reference is about or what the values is referring to. Also, I gather that there may be some issues using scripts in general (aside from the security concerns) for something like this so any suggestions on how to do this in the most performant way would be appreciated. Thanks.

Kevin


(Camilo Sierra) #2

It will be easier and faster if you store the length of your text in the same document, in a new field "length_doc" (an int value), and then filter in your query with a range on "length_doc" > 50!
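
Something like this could build the search body. It is only a rough sketch: it assumes the new field is named "length_doc" and holds the character count, and the helper name `min_length_query` is made up for illustration. It just builds the JSON and does not talk to a cluster.

```python
import json


def min_length_query(min_length, field="length_doc"):
    """Build a search body matching docs whose stored length is > min_length.

    `field` is the (assumed) name of the precomputed integer length field.
    """
    return {
        "query": {
            "constant_score": {
                # A range filter on a numeric field; no script is involved.
                "filter": {"range": {field: {"gt": min_length}}}
            }
        }
    }


body = min_length_query(50)
print(json.dumps(body, indent=2))
```

Because the filter is an ordinary range on a numeric field, nothing has to be computed per document at search time.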

I hope it helps.


(Kevin M.) #3

Thanks for the suggestion. We have 1.8 million docs in our cluster and it took quite a while to get them in because of all the nested objects and we don't want to have to reprocess them if we can avoid it. Is there a way to add the field to the index and then dynamically update all length fields with a script call or something?


(Camilo Sierra) #4

You have the reindex API https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-reindex.html#_reindex_to_change_the_name_of_a_field ; you can supply a script there and create a new field holding the content length!
I know it is not the easiest way, but the script runs only once (at reindex time) and not on every search query!
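
A rough sketch of what the reindex request body might look like, following the 2.3 docs linked above. The index names "docs_v1" and "docs_v2" are hypothetical, and the script assumes document.content is a plain string that is present in every document (a missing value would make the script fail). Inline scripting would also need to be enabled on the cluster. The body would be POSTed to the _reindex endpoint.

```python
import json


def reindex_with_length(source_index, dest_index):
    """Build a _reindex body that adds a length_doc field to every document."""
    # Groovy script, executed once per document while it is copied.
    # The field path document.content is an assumption taken from this thread.
    script = "ctx._source.length_doc = ctx._source.document.content.length()"
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {"inline": script},
    }


body = reindex_with_length("docs_v1", "docs_v2")
print(json.dumps(body, indent=2))
```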


(Kevin M.) #5

Ok, thanks for the additional info, we will check it out.


(Xavier Facq) #6

Hello,

I think you can do it using a script.

Create a simple file in the script directory, ordered_by_length.groovy, containing:

        _source.your_field_name.length()

In your query you can use it:

    {
        "query": {
            "match_all": {}
        },
        "sort": {
            "_script": {
                "type": "number",
                "script": {
                    "lang": "groovy",
                    "file": "ordered_by_length"
                },
                "order": "asc"
            }
        },
        "from": 0,
        "size": 10,
        "aggs": {}
    }

It seems to work, but I think it will not be very fast... If you need to add a limit, it's also possible
to pass the value as a param to the script.
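
For the limit, a sketch of how the param could be passed (not tested against a cluster, and assuming the 2.x script query syntax). It supposes a second, hypothetical file longer_than.groovy in the script directory containing `_source.your_field_name.length() > min_length`; the param name `min_length` is also just an example.

```python
import json


def longer_than_query(min_length, script_file="longer_than"):
    """Build a search body that filters with a file script and one param."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "groovy",
                            # Name of the .groovy file, without the extension.
                            "file": script_file,
                            # Exposed to the script as the variable min_length.
                            "params": {"min_length": min_length},
                        }
                    }
                }
            }
        }
    }


body = longer_than_query(50)
print(json.dumps(body, indent=2))
```

The script still runs for every candidate document, so it will be slower than a range filter on a stored length.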

Bye,
Xavier


(Kevin M.) #7

Thanks for the suggestion, we will check it out.

Kevin

