I've run into an issue where searching a relatively small data set is hitting some pretty slow performance. We're running queries for text matches like "Person Name" against a dataset of around 8 GB. The catch is that the data is ingested documents such as Word, PDF, and Excel files, and some of the content fields contain 16+ million characters. Does anybody have advice on how to handle fields with such a large character count?
Any help or advice is much appreciated!
Thanks,
Jason
Hi Jason, thanks for posting your question! Text fields are tokenized by default, which optimizes search performance to the point where 16M characters in a text field shouldn't be a problem on its own.
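If you want to sanity-check how your content is being tokenized, the `_analyze` API is a quick way to see it. A minimal sketch, assuming the `standard` analyzer since I haven't seen your mapping yet:

```
// Placeholder text; swap in the analyzer from your actual mapping if it's custom
POST _analyze
{
  "analyzer": "standard",
  "text": "Person Name appears somewhere in this document"
}
```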
Would you mind sharing one of the queries that's slow so we can get a better idea of what's happening? Could you please also share your mapping for the index you're searching?
Thanks Jason! I believe the highlighter is the culprit here. Highlighting large fields always incurs a performance cost, because the highlighter has to load the entire text, analyze it, and search it at query time.
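For reference, the slow pattern typically looks something like this. This is a sketch with a placeholder index and field name (`my-index`, `content`), not your actual query:

```
GET my-index/_search
{
  "query": {
    "match_phrase": { "content": "Person Name" }
  },
  "highlight": {
    "fields": {
      // Without term vectors or indexed offsets, the highlighter must
      // re-analyze the entire multi-million-character field per hit
      "content": {}
    }
  }
}
```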
Set term_vector to with_positions_offsets in the mapping. This requires reindexing, and it will also increase the size of the index significantly. See the links above for more info on this option.
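In the mapping it would look roughly like this (index and field names are placeholders, and you'd reindex your data into the new index afterwards):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        // Stores term vectors with positions and offsets so the
        // highlighter doesn't have to re-analyze the field at query time
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
```

With term vectors stored, highlighting no longer needs to re-analyze the whole field per hit, which is where the speedup comes from; the trade-off is the larger index mentioned above.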