How to reindex a single index into several time-indexed indexes

My company is running Elasticsearch 6.5.1 on Elastic Cloud. Over the last couple of months our memory pressure maxed out our 2GB nodes, so we resized to 4GB. Very recently memory pressure has maxed out again at 4GB, and we're going to need to resize once more.

We are mostly tracking time-series data, and our workload is almost entirely aggregations for Kibana charts with very little searching. From my research so far, it seems that time-based indices would be much better at constraining memory usage for aggregation workloads. It also looks like we can reduce each index from 5 primary shards / 1 replica to 1 primary shard / 1 replica.

My plan is to configure newly ingested data to use time-based indices with 1 shard / 1 replica (1S1R), and then to reindex our existing data into time-based indices configured the same way. My goal is to stabilize memory pressure and avoid having to buy larger and larger nodes.
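For the new data, the rough shape I have in mind is an index template that applies 1S1R to anything matching a date-suffixed pattern. This is just a sketch; app-events is a placeholder for our actual index naming:

PUT _template/app-events
{
  "index_patterns": ["app-events-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

Any new index matching the pattern would then pick up the 1-shard / 1-replica settings automatically.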

I'm a relative beginner at managing an ELK stack, so firstly it would be great to hear whether this plan makes sense, and secondly what steps I need to take to reindex this data. I've never performed any reindexing, and I haven't been able to find good step-by-step examples of how to approach a task like this.
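From the reindex docs, the closest I've come to a plan is one _reindex call per destination index, with a date range matching whatever period that index covers. Completely untested, and @timestamp plus the index names are placeholders for whatever we actually have:

POST _reindex
{
  "source": {
    "index": "app-events",
    "query": {
      "range": {
        "@timestamp": {
          "gte": "2018-01-01",
          "lt": "2018-02-01"
        }
      }
    }
  },
  "dest": {
    "index": "app-events-2018.01"
  }
}

I have no idea whether looping over calls like that is the sensible way to do it, hence the question.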

Thanks in advance. If any more information would be helpful I'll try my best to provide it.

How many indices do you have in the cluster? How many shards? What is the average size?

We have 1180 shards (average size ~47mb) and 290 indexes (average size ~143mb).

These numbers are based on my latest output from:
GET _cat/shards?v
GET _cat/indices?v
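To see which indexes dominate, I've also been sorting the same output by on-disk size (assuming I'm reading the _cat parameters correctly):

# largest indices first, sizes in MB
GET _cat/indices?v&s=store.size:desc&bytes=mb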

That is a lot of shards for a small cluster. Please read this blog post for some practical guidance.

How many indices were you going to reindex this into?

That is a lot of shards for a small cluster.

I imagine that's true. We only started using Elasticsearch to track business metrics a few months ago and just got up and running with the default settings, so clearly our config isn't optimal.

I'll be reading the article you linked over my lunch break. Thanks for posting it.

How many indices were you going to reindex this into?

My plan was to use daily indexes so that our indexing stays granular enough as we scale up operations, but that might be overdoing it. We have data for basically every day going back to the beginning of 2018, and about 25 monolithic indexes contain most of the data for that entire range. So if my math is right, that translates into something like 530 days * 25 indexes ≈ 13,300 indexes.

My expectation is that most of these indexes/shards will not have to be used most of the time since aggregations are generally for the current month (with some exceptions), but I'm sure there are other considerations I'm not aware of.

As explained in the blog post, having lots of small indices and shards like you are suggesting will cause performance problems and waste resources. Having 13,300 indices in such a small cluster sounds like a very, very bad idea.

I think you should go the other way and reduce the number of shards you currently have in the cluster. If you are going to use time based indices, go for monthly rather than daily ones with a single primary shard.
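If you do go the reindex route, one way to fan a monolithic index out into monthly indices in a single call is to set the destination index from a script. A rough, untested sketch, assuming your documents carry an ISO-formatted @timestamp string in _source and using placeholder index names:

POST _reindex
{
  "source": {
    "index": "app-events"
  },
  "dest": {
    "index": "app-events-fallback"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'app-events-' + ctx._source['@timestamp'].substring(0, 7)"
  }
}

dest.index is required by the API, but the script overrides the destination per document, so a document stamped 2018-03-14 would land in app-events-2018-03, and each monthly index can pick up a single-shard template when it is created.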

That's why I wanted to run this by someone.

What are my alternatives? What can I do to prevent heap utilization from getting so large, or is that just unavoidable? Is this just a matter of tuning the period over which I'm time-indexing data, or am I looking at this the wrong way?

Use fewer, larger shards and have a look at this webinar as well.
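For the indices you already have, the shrink API is one way to get to fewer, larger shards without a full reindex, e.g. taking a 5-shard index down to 1. The index has to be made read-only first, and a copy of every shard has to sit on one node (which an index with 1 replica on a two-data-node cluster already satisfies). Index names below are examples only:

# stop writes so the index can be shrunk
PUT some-index/_settings
{
  "index.blocks.write": true
}

# copy the 5 primaries into a single-shard index
POST some-index/_shrink/some-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}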

Thanks for the resources. I'll have to spend some time reading these and reformulate my plan.

The trouble is that after this discussion, I don't feel like I even have a vague direction for actually solving my problem. It's unclear to me whether any of the methods discussed will have a positive impact, and I feel as though I'm back at the drawing board.

For all I know, time-indexing over indexes that are mostly in the hundred-ish megabyte range is only going to make things worse.

The webinar discusses ways to reduce heap usage, while the blog post primarily deals with sharding best practices.

Well, I'll just spend some time reading. Thanks for taking the time.

Okay, so I've had a chance to do some reading and to watch the webinar, and I definitely have a better grasp of the factors that matter for my use case. However, after checking the monitoring and cluster stats, I'm confused about where all my memory is going.

As for what will likely improve my situation:

  1. It's clear that my indexes are over-sharded. Reducing the number is going to help at least a bit.
  2. I have several indexes with high-cardinality keyword fields (our internal UUIDs for documents) that could be remapped to something leaner to avoid wasting memory (one idea is sketched after this list).
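As a concrete sketch of point 2 (just one option I'm considering, not something from the webinar): if we never actually need to search on those UUID fields, they could stay keyword but with indexing turned off, so they stop contributing terms to the inverted index while aggregations keep working off doc values. Extending the kind of template I sketched in my first post, with doc and internal_uuid as placeholder names:

PUT _template/app-events
{
  "index_patterns": ["app-events-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "doc": {
      "properties": {
        "internal_uuid": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}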

But as for my questions... Based on the advanced panel in node monitoring, heap usage is being reported at 3gb out of 4gb, but all of the graphs for Lucene memory usage are at least an order of magnitude smaller. Where is all of this heap usage coming from if it isn't being reported?

...continued because I'm too long-winded...

Similarly, if I check my cluster stats with GET /_cluster/stats?human&pretty, the memory usage reported under indices.fielddata, indices.query_cache and indices.segments doesn't even break 1gb when I total it all up.

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "",
  "cluster_uuid" : "",
  "timestamp" : 1560963842218,
  "status" : "green",
  "indices" : {
    "count" : 398,
    "shards" : {
      "total" : 2900,
      "primaries" : 1450,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 10,
          "avg" : 7.28643216080402
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 3.64321608040201
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 138518010,
      "deleted" : 70416
    },
    "store" : {
      "size" : "85.5gb",
      "size_in_bytes" : 91821712421
    },
    "fielddata" : {
      "memory_size" : "8.7mb",
      "memory_size_in_bytes" : 9199404,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "55.2mb",
      "memory_size_in_bytes" : 57936447,
      "total_count" : 666875,
      "hit_count" : 110716,
      "miss_count" : 556159,
      "cache_size" : 3867,
      "cache_count" : 6026,
      "evictions" : 2159
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 17365,
      "memory" : "318.1mb",
      "memory_in_bytes" : 333577036,
      "terms_memory" : "214.8mb",
      "terms_memory_in_bytes" : 225249046,
      "stored_fields_memory" : "53.5mb",
      "stored_fields_memory_in_bytes" : 56119088,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "3.7mb",
      "norms_memory_in_bytes" : 3980288,
      "points_memory" : "16.4mb",
      "points_memory_in_bytes" : 17212474,
      "doc_values_memory" : "29.5mb",
      "doc_values_memory_in_bytes" : 31016140,
      "index_writer_memory" : "12.9mb",
      "index_writer_memory_in_bytes" : 13540456,
      "version_map_memory" : "3.7mb",
      "version_map_memory_in_bytes" : 3905041,
      "fixed_bit_set" : "3.7mb",
      "fixed_bit_set_memory_in_bytes" : 3909600,
      "max_unsafe_auto_id_timestamp" : 1560902407122,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "data" : 2,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 2
    },
    "versions" : [
      "6.5.1"
    ],
    "os" : {
      "available_processors" : 68,
      "allocated_processors" : 6,
      "names" : [
        {
          "name" : "Linux",
          "count" : 3
        }
      ],
      "mem" : {
        "total" : "510.1gb",
        "total_in_bytes" : 547729575936,
        "free" : "4.6gb",
        "free_in_bytes" : 4971696128,
        "used" : "505.4gb",
        "used_in_bytes" : 542757879808,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 5
      },
      "open_file_descriptors" : {
        "min" : 335,
        "max" : 5792,
        "avg" : 3959
      }
    },
    "jvm" : {
      "max_uptime" : "39.7d",
      "max_uptime_in_millis" : 3432942854,
      "versions" : [
        {
          "version" : "1.8.0_144",
          "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version" : "25.144-b01",
          "vm_vendor" : "Oracle Corporation",
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used" : "5.7gb",
        "heap_used_in_bytes" : 6200750368,
        "heap_max" : "8.5gb",
        "heap_max_in_bytes" : 9181462528
      },
      "threads" : 498
    },
    "fs" : {
      "total" : "516gb",
      "total_in_bytes" : 554050781184,
      "free" : "426.5gb",
      "free_in_bytes" : 458021961728,
      "available" : "426.5gb",
      "available_in_bytes" : 458021961728
    },
    "plugins" : [
      {
        "name" : "ingest-user-agent",
        "version" : "6.5.1",
        "elasticsearch_version" : "6.5.1",
        "java_version" : "1.8",
        "description" : "Ingest processor that extracts information from a user agent",
        "classname" : "org.elasticsearch.ingest.useragent.IngestUserAgentPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "ingest-geoip",
        "version" : "6.5.1",
        "elasticsearch_version" : "6.5.1",
        "java_version" : "1.8",
        "description" : "Ingest processor that uses looksup geo data based on ip adresses using the Maxmind geo database",
        "classname" : "org.elasticsearch.ingest.geoip.IngestGeoIpPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "repository-s3",
        "version" : "6.5.1",
        "elasticsearch_version" : "6.5.1",
        "java_version" : "1.8",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "found-elasticsearch",
        "version" : "6.5.1",
        "elasticsearch_version" : "6.5.1",
        "java_version" : "1.8",
        "description" : "Elasticsearch plugin for Found",
        "classname" : "org.elasticsearch.plugin.found.FoundPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : { },
      "http_types" : { }
    }
  }
}
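(Adding those up from the response above: 8.7mb of fielddata + 55.2mb of query cache + 318.1mb of segment memory comes to roughly 382mb, against the 5.7gb of heap the jvm section reports in use across the cluster.)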

All of my largest indexes on disk are for monitoring (1.5gb each across 3 indexes), filebeat (400-700mb each across 60 indexes) and metricbeat (50-70mb each across 70 indexes). The application event data I'm indexing is tiny in comparison; the 2 largest indexes for my app data are only about 70mb each.

I've gained some new tools to understand the issues I'm facing, but I still don't really understand the best way forward.

Any ideas? Thanks again.
