What's the best way to improve performance with thousands of filters?

The situation is about document-level security. A sample document looks like the following:

{
          "create_time": 1500000000,
          "title": "xxxxxxxxxxxxxxxxxxx",
          "access_group": ["g1", "g2716", "g3018"]
}

Say we have ten thousand groups.

It's easy for a super-admin (no group filter) or a normal user (a few group filters) to search documents, but for some special users with access to thousands of groups, search performance declines significantly.

Are there any suggestions to improve performance in this situation? Thanks for your help!

To be clear, a sample query looks like:

{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "access_group": [
            "g1",
            ....
            "g10000"
          ]
        }
      },
      "should": [
        ...
      ]
    }
  }
}
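To make the scaling concrete, here is a minimal Python sketch of how a request body like the one above is built per user (`build_search_body` is a hypothetical helper, not part of any client library). The `terms` array grows linearly with the user's group count, which is exactly what hurts the users with thousands of groups:

```python
def build_search_body(user_groups, should_clauses=None):
    """Build the bool query shown above: a terms filter over the
    user's groups plus optional should clauses. The terms array
    grows linearly with the number of groups the user belongs to."""
    return {
        "query": {
            "bool": {
                "filter": {"terms": {"access_group": user_groups}},
                "should": should_clauses or [],
            }
        }
    }

body = build_search_body(["g1", "g2716", "g3018"])
```

A super-admin skips the filter entirely; a user in 10,000 groups ships a 10,000-element `terms` array with every search.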

Hi @Morriaty

One suggestion is to use a second index if you expect to have thousands of records and frequent changes.

You keep your index as it is, with the create_time and title fields, but you remove access_group:

{
          "id": 123456,
          "create_time": 1500000000,
          "title": "xxxxxxxxxxxxxxxxxxx",
}

You save the user–group relations in a different index.

Something like a relational database, but without relation constraints.

access_group_index/doc/1
{
       "user_id": 123456,
       "group": "g1"
}

access_group_index/doc/2
{
       "user_id": 123456,
       "group": "g2716"
}
etc...

Merit: you can list all the groups with pagination, you can search them more easily, and it will be faster (depending on your request).
Demerit: you may need to make two requests, one to look up the groups and one to get the details of the documents, depending on the context.
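The two requests mentioned in the demerit can be sketched like this (Python; only the request bodies are built, and the client calls are omitted — `groups_query` and `documents_query` are hypothetical helpers, with field and index names taken from the examples above):

```python
def groups_query(user_id):
    """Request 1: fetch the user's relation documents from
    access_group_index (you would paginate this in real use)."""
    return {"query": {"term": {"user_id": user_id}}}

def documents_query(group_hits, should_clauses=None):
    """Request 2: filter documents by the groups returned by
    request 1 (group_hits are the raw hit dicts from the
    first response)."""
    groups = [hit["_source"]["group"] for hit in group_hits]
    return {
        "query": {
            "bool": {
                "filter": {"terms": {"access_group": groups}},
                "should": should_clauses or [],
            }
        }
    }
```

Note that the second request still carries one term per group, so this pattern mainly helps with listing and maintaining memberships rather than shrinking the filter itself.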

You can also duplicate your data: keep the list in the access_group field exactly as you have now, and maintain the relation list in parallel for the other kind of search. But you need to be careful, as you have to maintain two indices. It can work, depending on your constraints and your code.

I use this approach to manage tags in blogs and so far I haven't had problems.

What is the size of your data set? How many users do you have? How many distinct groups are there? How frequently do you update or change group membership? Which version are you on?

Sorry, I don't understand how two indices can help. Doesn't it still have to perform thousands of group filters against access_group_index?

Hi, here are the details:

  • document size: 1 billion
  • users: 200,000
  • groups: 1,500,000
  • update frequency: group membership changes are not frequent. No exact statistics, but I could say it is no more than 10 TPS.
  • ES version: 5.3.0
  • hardware: three master nodes with 8 cores and 16 GB memory, with 8 GB assigned to the JVM. Nine data nodes with 16 cores and 64 GB memory, with 32 GB assigned to the JVM. No SSD.

I was thinking about an alternative way to implement the logic by moving a lot of the work to indexing time rather than search time, but I do not think it will work at that scale. I am also not aware of any way to improve the performance of terms queries with a large number of terms, so I will need to leave this for someone else.

Maybe there is something that can be done by reorganizing how your data is indexed, though. How many indices and shards is the data spread across? How many queries are you serving per second? Do all queries always address all indices?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.