Speeding up _updates on docs with many small fields and 1 large field

Hi,

Representative index mapping:

PUT test_index
{
  "mappings": {
    "_source": {
      "excludes": [
        "bigtext"
      ]
    },
    "properties": {
      "bigtext": {
        "type": "text"
      },
      "orgids": {
        "type": "integer"
      },
      "folder": {
        "type": "keyword"
      },
      "docname": {
        "type": "keyword"
      },
      "docgroupid": {
        "type": "keyword"
      }
    }
  }
}

A representative search query is below.

The "filter" is a complex "bool" query built from our organization hierarchy and the roles each user holds on those organizations; each role grants different folder permissions. Document groups sit under certain organizations, and users can also be granted direct access to document group folders.

GET /test_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "folder": {
                    "value": "A"
                  }
                }
              },
              {
                "bool": {
                  "must": [
                    {
                      "term": {
                        "folder": {
                          "value": "B"
                        }
                      }
                    },
                    {
                      "term": {
                        "orgids": {
                          "value": 111
                        }
                      }
                    }
                  ]
                }
              },
              {
                "bool": {
                  "must": [
                    {
                      "term": {
                        "folder": {
                          "value": "C"
                        }
                      }
                    },
                    {
                      "term": {
                        "orgids": {
                          "value": 123
                        }
                      }
                    }
                  ]
                }
              },
              {
                "bool": {
                  "must": [
                    {
                      "term": {
                        "folder": {
                          "value": "B"
                        }
                      }
                    },
                    {
                      "terms": {
                        "docgroupid": [
                          "XXX",
                          "YYY"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ],
      "must": [
        {
          "simple_query_string": {
            "query": "***USER_SEARCH_INPUT***",
            "fields": [
              "bigtext"
            ]
          }
        }
      ]
    }
  }
}

The problem is that the _update_by_query command below is extremely slow. The "orgids" have to be updated whenever a document group is moved to another organization. The "bigtext" never changes, but _update_by_query has to read the _source of every document in the document group and re-index each document's "bigtext".

In my single-node test, a document group with 75 documents (10 of them large, averaging 8 million characters each) takes 90 seconds to complete the _update_by_query.

POST /test_index/_update_by_query
{
  "script": {
    "source": "ctx._source.orgids=[ 843, 43, 974 ]",
    "lang": "painless"
  },
  "query": {
    "term": {
      "docgroupid": "ZZZ"
    }
  }
}

Questions: Is there any way to avoid this unnecessary reading and re-indexing of "bigtext"? I also need highlighting on "bigtext", so any proposed solution needs to support that.

Is there some other technique like multiple indices or child documents that would allow fast updating of "orgids"?

I tried increasing the refresh_interval, but the _update_by_query is still slow.

I also tried excluding "bigtext" from _source, as you can see in the representative mapping. This sped up the _update_by_query drastically, but after the update "bigtext" is gone and can no longer be searched, because the document is re-indexed from a _source that does not contain it.

Thanks,
Jeff

If the additional information is related to permissions and is used for filtering, one way to organize this might be a parent-child relationship in which the large document is the parent and the privileges are its children. That way you can add, delete, or update the children separately, without touching the parent. One downside is that it makes querying more complicated.
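
A minimal sketch of what that could look like, using the "join" field type. The index name "test_index_pc", the join field name "doc_relation", the relation names "document" and "access", and the document IDs are all made up for illustration; only the field names from the original mapping are real. The idea is that "bigtext" lives on the parent and "orgids" on a small child, so the update only re-indexes the children.

```console
# Assumed names: test_index_pc, doc_relation, document/access, doc-1.
PUT test_index_pc
{
  "mappings": {
    "properties": {
      "bigtext":    { "type": "text" },
      "orgids":     { "type": "integer" },
      "folder":     { "type": "keyword" },
      "docname":    { "type": "keyword" },
      "docgroupid": { "type": "keyword" },
      "doc_relation": {
        "type": "join",
        "relations": { "document": "access" }
      }
    }
  }
}

# Parent: the large document, indexed once and not touched by permission changes.
PUT test_index_pc/_doc/doc-1
{
  "bigtext": "(large text here)",
  "folder": "B",
  "docname": "example.txt",
  "docgroupid": "ZZZ",
  "doc_relation": "document"
}

# Child: only the permission data. It must be routed to the parent's shard.
PUT test_index_pc/_doc/doc-1-access?routing=doc-1
{
  "orgids": [111, 222],
  "doc_relation": { "name": "access", "parent": "doc-1" }
}

# One clause of the permission filter ("folder" B and "orgids" 111), rewritten
# so that the "orgids" check runs against the child. Hits are parents, so
# "bigtext" can still be highlighted.
GET test_index_pc/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "folder": "B" } },
        {
          "has_child": {
            "type": "access",
            "query": { "term": { "orgids": 111 } }
          }
        }
      ],
      "must": [
        {
          "simple_query_string": {
            "query": "***USER_SEARCH_INPUT***",
            "fields": ["bigtext"]
          }
        }
      ]
    }
  },
  "highlight": { "fields": { "bigtext": {} } }
}

# Moving a document group: select the children whose parent is in the group.
POST test_index_pc/_update_by_query
{
  "script": {
    "source": "ctx._source.orgids = [843, 43, 974]",
    "lang": "painless"
  },
  "query": {
    "has_parent": {
      "parent_type": "document",
      "query": { "term": { "docgroupid": "ZZZ" } }
    }
  }
}
```

The trade-off is the one mentioned above: every permission clause that checks "orgids" needs a "has_child" wrapper, and join queries add some search-time overhead, so it is worth measuring search latency on realistic data before committing to this layout.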
