The problem is that the searchContentHTML field can contain quite a long text (several MB) that we basically use for fulltext search only (it is hardly useful to return to clients).
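For context, here is a minimal sketch of what I assume the mapping looks like (only the field names come from our index; everything else is simplified):

    PUT quickassets
    {
      "mappings": {
        "properties": {
          "name":              { "type": "text" },
          "searchContentHTML": { "type": "text" }
        }
      }
    }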
When the text is about 14MB long, the getById query (with the _source_excludes=searchContentHTML parameter) takes about 100ms, and when the text is small it takes about 30ms.
That means it takes roughly three times longer to fetch a single small field just because another field is long!
Are there any good practices for handling such documents with Elasticsearch?
Currently I see only one option: move the searchContentHTML field to a separate index and handle fulltext search differently, by searching multiple indices.
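A rough sketch of what that split could look like (quickassets_content is a made-up name, and I'm assuming the text never needs to be returned, so _source can be disabled in the content index; note that disabling _source also rules out update and reindex on those documents):

    PUT quickassets_content
    {
      "mappings": {
        "_source": { "enabled": false },
        "properties": {
          "searchContentHTML": { "type": "text" }
        }
      }
    }

    GET quickassets,quickassets_content/_search
    {
      "query": {
        "multi_match": {
          "query": "some search words",
          "fields": [ "name", "searchContentHTML" ]
        }
      }
    }

Indexing the content document with the same _id as the main document would make it easy to join the hits from both indices by _id on the client side, and the slow getById would only ever touch the small documents in quickassets.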
I'm getting the same times. It looks like ES has trouble parsing such huge documents.
BTW, here is the response I can see in DevTools in Kibana (for the long document):
Well, in fact in production I'm using a getById query, something like this:

    GET quickassets/_doc/-YFy-n8BGruhv47CIXSL?_source_excludes=searchContentHTML
    GET quickassets/_doc/-YFy-n8BGruhv47CIXSL?_source_includes=name

But it takes approximately the same time either way.
It led me to the conclusion that the problem is not in the query itself, but somewhere else.
I tried the profiler, but there is nothing interesting there. Most of the time is spent in build_scorer (23.2µs, 48.5%), which makes no sense given that the query takes much longer overall.
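For reference, the profiled search was along these lines (a sketch; the exact query I profiled is an assumption):

    GET quickassets/_search
    {
      "profile": true,
      "query": {
        "ids": { "values": [ "-YFy-n8BGruhv47CIXSL" ] }
      }
    }

One thing worth noting: depending on the version, the profile output may only cover the query phase (build_scorer, next_doc, etc.) and not the fetch phase, which is where the _source is actually loaded and parsed. That would explain why the profiled 23.2µs accounts for almost none of the observed latency.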