Very bad performance with large text field

At one of my customer projects we work with documents containing a very large text field (the content of eBooks).
We saw that queries slow down by more than 100x when such documents are queried, even if we use source filtering and exclude this field from the query! The only solution we found is to exclude the text field from the _source at index time:

PUT my_index
{
  "mappings": {
    "_default_": {
      "_source": {
        "excludes": [
          "mylargeTextField"
        ]
      }
    }
  }
}

I found the following blog post that explains the issue in detail:

What I don't understand:
Why is the search so slow even when using source filtering in the query? There should be no need to fetch, retrieve and merge the excluded field. I was expecting that with something like

GET /_search
{
    "_source": {
        "excludes": [ "mylargeTextField" ]
    },
    "query" : {
        "term" : { "otherField" : "something" }
    }
}

the large text field wouldn't impact performance at all and would simply be ignored for this specific query?

Currently we are also using Elasticsearch as a datastore. If such large documents slow things down this much, Elasticsearch doesn't seem to be a perfect datastore (in contrast to MongoDB)?

(We are using the latest ES 5.4.x)

Anyone from Elastic?

There are two potential reasons:

  • CPU overhead: the JSON parser still needs to skip over the large text field in order to exclude it from the _source, which is linear in the size of your JSON doc (see the workaround sketch just below this list).
  • Disk overhead: those large fields make the index larger, so the filesystem cache can only hold a smaller fraction of the total index size.
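
One way to take the large _source out of the fetch path is to mark the small fields you actually return as stored and request them via stored_fields instead of source filtering; this avoids parsing the large _source JSON at fetch time, although it does not shrink the index, so the disk-overhead point still applies. A minimal sketch (the type name my_type, the keyword type and "store": true on otherField are assumptions, not from the thread):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "otherField":       { "type": "keyword", "store": true },
        "mylargeTextField": { "type": "text" }
      }
    }
  }
}

GET my_index/_search
{
  "stored_fields": [ "otherField" ],
  "query": {
    "term": { "otherField": "something" }
  }
}

When only stored_fields are requested, the _source is not returned for the hits, so the large JSON document never has to be parsed during the fetch phase.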

Thanks @jpountz!
It doesn't seem to be the JSON parsing.
If we use the search without any search terms it's fast as hell (2 ms):

GET index_with_large_text/_search

If we use a simple search term, we see the performance problem (152 ms):

GET index_with_large_text/_search?q=any_field:something

So JSON parsing doesn't seem to be the problem. It must be something with the "disk overhead" during the search or merge phase, right? Do you think increasing RAM/heap size can fix the problem?

How large is your index (the size of the data dir) and how much memory do you give to the filesystem cache?

  • The "data dir" is about 490MB.
  • Disk caching is on 10G (page cache via free -mh)
  • Java Heap: "heap_max": "30.7gb" / "heap_used": "2.1gb"

Looking at these numbers, it doesn't seem to be a memory/caching issue.
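
For reference, numbers like these can be gathered with the cat APIs plus standard OS tools (a sketch; the index name is taken from the earlier examples):

GET _cat/indices/index_with_large_text?v&h=index,docs.count,store.size
GET _cat/nodes?v&h=name,heap.current,heap.max,ram.current,ram.max

Running free -mh on the Elasticsearch host then shows how much memory is actually left for the page cache.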

We tried the query on a cluster of 4 nodes (same parameters as above) and on a single node with just one shard.
The bad performance remains the same in both setups.

The content of the large text field is around 700 KB for a single document.

Any suggestions on what we can test or do? Is it possible that Elasticsearch is simply not practically usable with such large text fields in the _source object?

@jpountz Do you have additional ideas?

This query executes scoring on any_field. Maybe you see the effect because of missing or poor stop word analysis, or because of a large number of segments.
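
If a large number of segments is the suspect, the segment count can be checked and, on an index that is no longer being written to, reduced (a sketch using the index name from the earlier examples):

GET _cat/segments/index_with_large_text?v

POST index_with_large_text/_forcemerge?max_num_segments=1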

Can you confirm you are not using really large size values? Also, when you say 100x slower, what is the order of magnitude of the response times we are talking about? Is it 100% reproducible?

When indexed without the large text field in _source, the "took" time is around 1-2 ms.
With this field in _source: 90-120 ms (around 100x slower).
Yes, it's always reproducible.

For our tests we are using the default size of 10 hits.
With size=1 it's much faster; a size of 10000 slows things down even more.
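
A quick way to confirm that loading the large _source during the fetch phase is the bottleneck (a diagnostic sketch, not a fix) is to disable source fetching for the same query and compare the timings:

GET index_with_large_text/_search
{
  "size": 10,
  "_source": false,
  "query": {
    "query_string": { "query": "any_field:something" }
  }
}

If this runs fast while the version that returns _source is slow, the time is going into reading and filtering the large stored _source rather than into the query itself.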

Currently we are thinking about not storing those large texts in Elasticsearch and using MongoDB for this instead. But then we would lose highlighting and some nice-to-have functions like reindexing and updates within Elasticsearch.
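
A possible middle ground (a sketch, not verified against this dataset): exclude the large field from _source but mark it as stored, so highlighting can still work off the stored field while the returned _source stays small:

PUT my_index
{
  "mappings": {
    "_default_": {
      "_source": {
        "excludes": [ "mylargeTextField" ]
      },
      "properties": {
        "mylargeTextField": { "type": "text", "store": true }
      }
    }
  }
}

GET my_index/_search
{
  "query": { "match": { "mylargeTextField": "something" } },
  "highlight": {
    "fields": { "mylargeTextField": {} }
  }
}

Reindex and update still rely on the _source, though, so the excluded field would be lost on those operations; that limitation remains.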

It means Elasticsearch is taking about 100ms to do the source filtering for only 10 documents, which is puzzling.
