At one of my customer projects we work with documents that contain a very large text field (the content of eBooks).
We saw that queries slow down by more than 100x when such documents are queried, even if we use a source filter and exclude this field from the query! The only solution we found is to exclude the text field from the _source at index time:
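At index time this looks roughly like the following (a minimal sketch; the index name and the field name book_text are placeholders, the actual mapping is of course more complete):

PUT index_without_large_text
{
  "mappings": {
    "_source": {
      "excludes": ["book_text"]
    },
    "properties": {
      "book_text": { "type": "text" }
    }
  }
}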
I found the following blog post that explains the issue in detail:
What I don't understand:
Why is the search so slow even when using source filtering in the query? There should be no need to fetch, retrieve and merge the excluded fields. I was expecting that when using source filtering in the request, something like the example below, the large text field wouldn't impact the performance of this specific query at all and would simply be ignored.
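(Sketch of what I mean; book_text stands in for the actual large field and the query is only an example.)

GET index_with_large_text/_search
{
  "_source": {
    "excludes": ["book_text"]
  },
  "query": {
    "query_string": { "query": "any_field:something" }
  }
}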
Currently we are also using Elasticsearch as a datastore. If such large documents slow things down, Elasticsearch doesn't seem to be an ideal datastore (in contrast to MongoDB)?
CPU overhead: the JSON parser still needs to skip over the large text field in order to exclude it from the _source, which is linear in the size of your JSON doc.
Disk overhead: those large fields make the index larger, so the filesystem cache can only hold a smaller fraction of the total index size.
Thanks @jpountz!
It doesn't seem to be the JSON parsing.
If we use the search without any search terms, it's fast as hell (2 ms):
GET index_with_large_text/_search
If we use a simple search term, there is the performance problem (152 ms):
GET index_with_large_text/_search?q=any_field:something
So JSON parsing doesn't seem to be the problem. It must be something with the "disk overhead" during the search or merge phase, right? Do you think increasing the RAM/heap size can fix the problem?
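One idea to narrow it down (just a sketch, we haven't tried it yet): run the same query with _source disabled entirely and see if the slowdown disappears. If it does, the time goes into loading and parsing the large _source documents during the fetch phase, not into the query itself.

GET index_with_large_text/_search
{
  "_source": false,
  "query": {
    "query_string": { "query": "any_field:something" }
  }
}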
Can you confirm you are not using really large size values? Also, when you say 100x slower, what order of magnitude of response times are we talking about? Is it 100% reproducible?
When indexed without the large text field in _source, the "took" time is around 1-2 ms.
With this field in _source: 90-120 ms (around 100x slower).
Yes, it's always reproducible.
For our tests we are using the default size of 10 hits to be returned (see the example below).
If we use size=1 it's much faster; a size of 10000 slows things down much more.
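For reference, this is how size is set in the test requests (same placeholder query as above). Since every returned hit has its _source loaded and parsed, a larger size multiplies the per-document overhead:

GET index_with_large_text/_search
{
  "size": 10,
  "query": {
    "query_string": { "query": "any_field:something" }
  }
}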
Currently we are thinking about not storing those large texts in Elasticsearch and using MongoDB for this instead. But we would lose highlighting and some nice-to-have features like reindexing and updates within Elasticsearch.