ILM document count inconsistent rollover

I am testing a simple rollover-on-document-count policy and am getting inconsistent results.

2 CPUs, 8 GB RAM, 1.9 GB heap, 2 nodes (one hot, one warm), 0 replicas, running on a virtual machine.
Note: I just started looking into cluster management and understand that I am not running my cluster using best practices. I feel this will work for my general tests, but if this is the problem, then obviously I will have to adjust.

Policy:

PUT _ilm/policy/hot-warm-delete
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_docs": 5
          },
          "set_priority": {
            "priority": 50
          }
        }
      }
    }
  }
}

For testing, I adjusted the poll interval:
"indices.lifecycle.poll_interval": "1s"

When I first start testing, everything seems to work great. I send 20 documents with a simple Kibana POST and everything works as intended: new indices are created and each contains 5-6 docs. I am assuming there will be some margin of error given how fast I am sending them and the 1 s poll interval.

The problem: when I stop to cat the indices and then return to send another 20 documents, the rollover seems to freeze. All the documents get indexed into the same index. I have tried adding ?refresh to the request, and that seems to fix the problem, but it also slows down the process. Since I will eventually be adding millions of documents, the extra time will start to matter. The logs show the following:

[elasticsearch.server][INFO] moving index [count-test-2020.04.21-000027] from [{"phase":"hot","action":"unfollow","name":"wait-for-yellow-step"}] to [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}] in policy [hot-warm-delete]

But I am not seeing any errors. I hope this gives enough information to troubleshoot; of course, please ask for more details. Thanks in advance for the time and help.
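
For completeness, the per-index ILM state could also be inspected with the ILM explain API; a sketch of the request against the count-test-* pattern (output omitted here):

GET count-test-*/_ilm/explain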

Would it be possible to share the output of GET _cat/indices?s=index and the full policy?
Also what is the name of the rollover alias?


GET _cat/indices?s=index&v

health status index                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .apm-agent-configuration         SzbE_fa8QbC4gMPZVDE8tA   1   1          0            0       566b           283b
green  open   .kibana_1                        UCzOfktnRYq60gQFJHuY6w   1   1         82           22    763.4kb        394.7kb
green  open   .kibana_task_manager_1           fqg2_pSaTt2f_MUXoixTZQ   1   1          2            2     41.9kb           21kb
green  open   .monitoring-es-7-2020.04.16      EwLofvWxSbGJAj1wOlGPOg   1   1     162520        94158    183.6mb         91.7mb
green  open   .monitoring-es-7-2020.04.17      82DdY4yYSqiy3bigzahG8w   1   1     173883       275586    213.8mb        106.9mb
green  open   .monitoring-es-7-2020.04.18      8PrQTEUDQ7GGhCqepX7MmQ   1   1     168733       267464    206.6mb        103.3mb
green  open   .monitoring-es-7-2020.04.19      rOzlRcetT1Cthg8u02-HGA   1   1     176696       280048    214.1mb          107mb
green  open   .monitoring-es-7-2020.04.20      80JBwxqKQHmRgTctAHS_mA   1   1     176965            0    182.4mb         91.2mb
green  open   .monitoring-es-7-2020.04.21      4KXWLJvGTvuo8LLnx63Yqw   1   1     195144       295283    249.6mb        124.6mb
green  open   .monitoring-es-7-2020.04.22      itsoLTKhRlmkToOuV1BVQw   1   1     255922        73102    523.5mb        261.3mb
green  open   .monitoring-kibana-7-2020.04.16  95pKz_EyTryfhrlPNnnANw   1   1       5871            0      2.8mb          1.4mb
green  open   .monitoring-kibana-7-2020.04.17  aiQNHvXqSUm4FAHHyTL1pQ   1   1       5992            0      2.5mb          1.3mb
green  open   .monitoring-kibana-7-2020.04.18  SNdIWScuT1-Dlg9laY8RQw   1   1       5798            0      2.4mb          1.2mb
green  open   .monitoring-kibana-7-2020.04.19  waQvq35ySMm0J2lQeiUb3w   1   1       6090            0      2.4mb          1.2mb
green  open   .monitoring-kibana-7-2020.04.20  NH6kocfATLWdsre0g6pJFA   1   1       5465            0      2.3mb          1.1mb
green  open   .monitoring-kibana-7-2020.04.21  OyZbgy7CQF-7OladEyNfRw   1   1       5998            0      2.6mb          1.3mb
green  open   .monitoring-kibana-7-2020.04.22  oaE07g5fQmCg6wzjbWKmcQ   1   1       6680            0        3mb          1.4mb
green  open   count-test-2020.04.21-000001     7LrOIkvgTIC07DV4QdBngw   1   0          5            0     13.6kb         13.6kb
green  open   count-test-2020.04.21-000002     KCV69x8UQOKiZKf3-QOW8w   1   0          6            0     20.5kb         20.5kb
green  open   count-test-2020.04.21-000003     eN6gfjRwSc-MMKV1_AfLBQ   1   0          7            0     21.3kb         21.3kb
green  open   count-test-2020.04.21-000004     0VEgWFLcQBOA69ikV3inSw   1   0          5            0     11.3kb         11.3kb
green  open   count-test-2020.04.21-000005     _BHnmp5YSPqszKpT-qwLNw   1   0         27            0     22.2kb         22.2kb
green  open   count-test-2020.04.21-000006     M74fkyTRTvSsMEYDV3PZfQ   1   0          0            0       230b           230b
green  open   file_path                        bxFDIsDZT0WK4qo3Yse4Sw   1   1    1079172            0    640.9mb        320.4mb
green  open   file_path_timeseries             njKvoThHQzm7_mZagBSlKw   1   1         13            0     23.2kb         11.6kb
green  open   filebeat-7.6.0-2020.04.10-000001 woUZY8W0TYO_B7UULxWoSw   1   1      97801            0     49.3mb         24.7mb
green  open   ilm-history-1-000001             5fnOgHAbRPObfqgJ2q0d_g   1   1       3305            0      1.2mb        655.2kb
green  open   rdbms_sync_idx                   eWlE2mqlS7qkribdbGcXHg   1   1    1491000            2    508.1mb          254mb
green  open   rdbms_url_sync_idx               AHLvIEJ_STuZoWK9_OftDQ   1   1          1            0      8.7kb          4.3kb

My policy only rolls over to a new index, with no warm, cold, or delete phase. So I think this is the full policy:

PUT _ilm/policy/hot-warm-delete
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_docs": 5
          },
          "set_priority": {
            "priority": 50
          }
        }
      }
    }
  }
}
PUT _template/hot-warm-delete-temp
{
  "index_patterns": ["count-test-*"], 
  "settings": {
    "index.lifecycle.name": "hot-warm-delete", 
    "index.lifecycle.rollover_alias": "count-test-alias",
    "index.routing.allocation.require.box_type": "hot",
    "number_of_replicas" : 0
  }
}

Initializing the process with a PUT:

PUT count-test-2020.04.21-000001
{
  "aliases": {
    "count-test-alias":{
      "is_write_index": true 
    }
  }
} 

Adding some sample docs (showing just one as an example):

POST count-test-alias/_doc
{
  "name": "count-test-alias test 1"
}

Something I just noticed: when executing GET _cat/indices?s=index&v after the first run, I can see everything worked and the indices contain ~5 docs each. When I come back and try to add some more, that is when the rollover seems to freeze. It just indexes all docs into the newest index, and when I try GET _cat/indices?s=index&v again, the new index where all the documents were stored shows 0 documents. If I run:

POST count-test-alias/_doc?refresh
{
  "name": "count-test-alias test 1"
}

and then GET _cat/indices?s=index&v again, it shows the correct number of documents and that there is a new index ready for indexing.

Yes, the checks on the index size or the number of documents are based on the data available for search, meaning the segments open for searching after a refresh.

Under normal conditions and using defaults, an index is refreshed every second, except if it becomes search idle (see the docs).
Even when search idle, Elasticsearch will under some conditions trigger a refresh behind the scenes (e.g. if the in-memory indexing buffer grows beyond a certain size).

Your test case is a bit "extreme", as the ILM poll interval you've set fires potentially more frequently than the refresh interval of the index.

If you trigger search requests (GET count-test-alias/_search) while indexing new documents, you'll notice that the refresh occurs and the number of documents shows up correctly.

The other approach would be to set the refresh interval on the index to 30s and the ILM poll interval to 1m.
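
A minimal sketch of that approach, assuming the settings are applied to the existing count-test-* indices via the index settings API and to the cluster via a transient cluster setting (adding refresh_interval to the index template would also cover future rollover indices):

PUT count-test-*/_settings
{
  "index.refresh_interval": "30s"
}

PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": "1m"
  }
}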


Thanks for the clarification and advice! So some of my options are: handle some of the logic on the client side, where I could send a GET request to "force" a refresh in between/during indexing; bulk/individually insert docs with a refresh query parameter (thinking this might be my first approach, see the sketch below); or adjust the refresh interval and ILM poll interval and lower my max_docs to account for how many docs could be inserted in the time between an index refresh and the ILM poll. If there is something I missed please let me know, but otherwise, thanks again for your time and help/advice!
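
For the bulk option, a sketch of what I have in mind, assuming the bulk API with the refresh=wait_for query parameter (the documents are just placeholders):

POST count-test-alias/_bulk?refresh=wait_for
{ "index": {} }
{ "name": "count-test-alias bulk test 1" }
{ "index": {} }
{ "name": "count-test-alias bulk test 2" }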


Well, under normal conditions you should not change anything.

ILM works with the default values.

You've "detected" this behavior only because you set the ILM poll interval to an extremely low value, and the max_docs in the ILM policy doesn't make sense in real-life scenarios.

Do not call refresh explicitly, as it's really bad for both performance and disk. Let Elasticsearch handle the refresh, or alternatively set the refresh_interval explicitly to 30s or 60s.

My suggestion to send search requests while indexing was just to show the behavior of search-idle indices.
In real life, users navigating in Kibana and searching the data will indirectly keep the index from being search idle, so refreshes will be triggered.

ILM is triggered every 10 minutes by default, and with the default configuration indices will most probably trigger a refresh well before those 10 minutes have passed.


Just to clarify ... in real development or production, I could do something like set max_docs to ~1 million, max_size to ~10 GB, and max_age to whatever makes sense for the index?

Everything else makes perfect sense. Thanks again!

ILM has been designed to keep indices (and consequently shards) at about the same size and to avoid rotating indices by a time-based criterion alone.

A typical issue without ILM rollover is that an index (and consequently its shards) can become huge because of an unexpected rise in the number of documents being sent. A common case is ingesting logs when someone switches the log level to DEBUG, or an e-commerce shop seeing a huge increase in customers visiting the site.

Imagine you want shards of about 30gb each and at most one new index per week.

You can configure the index template to use a single primary shard and set max_size to 30gb and max_age to 7d.

Or you could have 2 primary shards and set max_size to 60gb.
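
A minimal sketch of the first variant, with hypothetical policy, template, pattern, and alias names (weekly-30gb, weekly-logs-*, weekly-logs-alias):

PUT _ilm/policy/weekly-30gb
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "30gb",
            "max_age": "7d"
          }
        }
      }
    }
  }
}

PUT _template/weekly-30gb-temp
{
  "index_patterns": ["weekly-logs-*"],
  "settings": {
    "number_of_shards": 1,
    "index.lifecycle.name": "weekly-30gb",
    "index.lifecycle.rollover_alias": "weekly-logs-alias"
  }
}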

Yes, in general the idea is to enforce a rule.
Keeping the number of shards / indices in a cluster under control is important.

For a few tips on how many shards you might have, please check How many shards should I have in my Elasticsearch cluster? | Elastic Blog

Those numbers are just "references". You should benchmark your cluster to see what is the best shard size to cope with your indexing rate and search performance.

