Slow handling of documents with large text in a field

Hi,
We have performance problems when handling documents with large content in a single field.

We have an index with a mapping like this:

"mappings" : {
  "properties" : {
    "name" : {
      "type" : "text"
    },
    "searchContentHTML" : {
      "type" : "text"
    }
  }
}
The problem is that searchContentHTML can contain quite long text (several MB) that we basically use for full-text search only. (It is hardly ever useful to return to clients.)
When the text is about 14 MB long, the get-by-id query (with the _source_excludes=searchContentHTML parameter) takes about 100 ms, and when the text is small it takes 30 ms.

That means it takes three times longer simply to get one field when another field is long!

Are there any good practices for handling such documents with Elasticsearch?

Did you look at this?

You can set

"_source": false

and then just select the fields you want to return

"fields": [ "name"]

GET my-index-000001/_search
{
  "query": {
....
  },
  "fields": [ "name"]
  "_source": false
}

That may be faster....

Thanks for reply @stephenb
I've tried

POST quickassets/_search
{
  "query": {
    "match": {
      "_id": "-YFy-n8BGruhv47CIXSL"
    }
  }, 
  "fields": [ "name"],
  "_source": false
}

And the response took 90-110ms, so the same time :frowning:

Currently I see only one option: move the searchContentHTML field to another index and handle full-text search a different way - by searching multiple indices.
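Just to illustrate the idea (a rough sketch; quickassets_content is a hypothetical index name): keep name in the main index, put searchContentHTML in a second index under the same _id, and query both indices together. Fields that do not exist in a given index are simply ignored in the fields response.

PUT quickassets_content
{
  "mappings": {
    "properties": {
      "searchContentHTML": { "type": "text" }
    }
  }
}

GET quickassets,quickassets_content/_search
{
  "query": {
    "match": { "searchContentHTML": "some words" }
  },
  "_source": false,
  "fields": [ "name" ]
}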

Hmmm interesting I would have expected that to be much faster...

When you used _source ... you used excludes. Could you just try

"_source": "name",

Sure
with

POST quickassets/_search
{
  "query": {
    "match": {
      "_id": "-YFy-n8BGruhv47CIXSL"
    }
  }, 
  "fields": [ "name"],
  "_source": "name"
}

I'm getting the same times. It looks like ES has trouble parsing such huge documents..
BTW, here is the response I can see in Dev Tools in Kibana (for the long document):

{
  "took" : 83,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "quickassets",
        "_type" : "_doc",
        "_id" : "-YFy-n8BGruhv47CIXSL",
        "_score" : 1.0,
        "_source" : {
          "name" : "long_text"
        },
        "fields" : {
          "name" : [
            "long_text"
          ]
        }
      }
    ]
  }
}

And for the short one

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "quickassets",
        "_type" : "_doc",
        "_id" : "sOjP_n8BLbh09t6Ko7bv",
        "_score" : 1.0,
        "_source" : {
          "name" : "short_text"
        },
        "fields" : {
          "name" : [
            "short_text"
          ]
        }
      }
    ]
  }
}

When you just did this... was it still slow?
Avoiding _source altogether

When I do this, it takes ages in Kibana - it returns the whole document, including the searchContentHTML field. I tried it with the short document:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "quickassets",
        "_type" : "_doc",
        "_id" : "sOjP_n8BLbh09t6Ko7bv",
        "_score" : 1.0,
        "_source" : {
          "searchContentHTML" : "Deutsches Ipsum Dolor deserunt dissentias zu spät et. Tollit argumentum ius an. Kartoffelkopf lobortis elaboraret per ne, nam Schnaps probatus pertinax, impetus eripuit aliquando Guten Tag sea. Diam scripserit no vis, Hockenheim meis suscipit ea. Eam ea Freude schöner Götterfunken eleifend, ad blandit voluptatibus sed, Zauberer eius consul sanctus vix. Cu Freude schöner Götterfunken legimus veritus vim",
          "name" : "short_text"
        },
        "fields" : {
          "name" : [
            "short_text"
          ]
        }
      }
    ]
  }
}

Apologies, I left the

"_source": false

out of that query... typo.

Should have been

POST quickassets/_search
{
  "query": {
    "match": {
      "_id": "-YFy-n8BGruhv47CIXSL"
    }
  }, 
  "fields": [ "name"],
  "_source": false
}

I would think this would be the fastest option

It makes no difference:

{
  "took" : 90,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "quickassets",
        "_type" : "_doc",
        "_id" : "-YFy-n8BGruhv47CIXSL",
        "_score" : 1.0,
        "fields" : {
          "name" : [
            "long_text"
          ]
        }
      }
    ]
  }
}

:frowning:

Have you force-merged the index?
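For reference, a force merge down to a single segment can be triggered like this (best done on an index that is no longer receiving writes):

POST quickassets/_forcemerge?max_num_segments=1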

Have you run the query profiler to see what's taking so long?
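For reference, profiling is enabled by adding "profile": true to the search body, e.g.:

POST quickassets/_search
{
  "profile": true,
  "query": {
    "match": {
      "_id": "-YFy-n8BGruhv47CIXSL"
    }
  }
}

(Note that, depending on the version, the profile output may only cover the query phase and not the fetch phase, which is where retrieving a large _source would show up.)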

Ohh also why are you not using a term filter instead of a match?

With a filter no scoring is performed...

POST filebeat-7.15.2-2022.04.20-000142/_search
{
  "_source" : false,
  "fields": [
    "host.name"
  ], 
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "_id": "Fh89R4ABxTyfpfWFmza8"
          }
        }
      ]
    }
  }
}

Yes, with no impact on performance

Well, in fact in production I'm using a get-by-id query, something like this, but it takes approximately the same time:
GET quickassets/_doc/-YFy-n8BGruhv47CIXSL?_source_excludes=searchContentHTML or
GET quickassets/_doc/-YFy-n8BGruhv47CIXSL?_source_includes=name
This led me to the conclusion that the problem is not in the query itself, but somewhere else.
I tried the profiler, but there is nothing interesting. Most of the time is spent in
build_scorer: 23.2µs (48.5%), which makes no sense, as the query takes much longer.

As I wrote, I'm using get document by ID..
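(For the record, one mapping-level option that might fit this use case - sketched here with a hypothetical new index name, since _source settings cannot be changed on an existing index - is to exclude the big field from the stored _source, so it is indexed for search but never stored or returned. The trade-off is that the field's content is then unavailable for reindexing, updates, or returning to clients.)

PUT quickassets_v2
{
  "mappings": {
    "_source": {
      "excludes": [ "searchContentHTML" ]
    },
    "properties": {
      "name": { "type": "text" },
      "searchContentHTML": { "type": "text" }
    }
  }
}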

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.