How to choose/change the maximum size of a segment?

Hello all:
I found that the maximum segment size in ES 5.0 / Lucene 6 is 5GB.
Since my index is more than 500GB, I have 100 segments now.
I think this is slowing down my searches.

But I'm not sure whether I should change this size, or how to change it.
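
If I understand the docs correctly, the 5GB cap comes from the merge policy's maximum merged segment size, which is exposed as an index setting. Here is a rough, untested sketch of how I imagine changing it (the index name, the 20gb value, and the localhost endpoint are just placeholders):

```python
# Untested sketch: raise the merge policy's maximum merged segment size.
# "my_index", "20gb", and localhost:9200 are placeholders, not real values.
import requests

resp = requests.put(
    "http://localhost:9200/my_index/_settings",
    json={
        # Lucene's TieredMergePolicy caps merged segments at 5gb by default;
        # Elasticsearch exposes that cap as this index-level setting.
        "index.merge.policy.max_merged_segment": "20gb"
    },
)
resp.raise_for_status()
print(resp.json())
```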

Additionally, if I call forcemerge on the index, all the .cfs files get merged into non-compound segments (.doc, .fdt, etc.).

So should I call forcemerge periodically?
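
For context, the forcemerge call I have in mind is roughly this (index name and segment target are placeholders, not values from a real cluster):

```python
# Placeholder sketch of the forcemerge call I have in mind; "my_index" and
# the segment target are examples only.
import requests

resp = requests.post(
    "http://localhost:9200/my_index/_forcemerge",
    params={"max_num_segments": 1},  # merge each shard down to a single segment
)
resp.raise_for_status()
print(resp.json())
```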

Thanks

How many shards does the 500GB index have?

My index is 2.6TB in total, with each shard around 500GB.
About 250GB of data is added to the index each month.
I stored the original HTML in the index (not indexed), which now seems to have been a bad choice.
The HTML takes up about 1.5TB.

As each search is executed single-threaded against each shard (although multiple shards are queried in parallel), the shard size will have an impact on query performance. As you have large amounts of data that is not indexed, you may be able to get away with larger shards than we usually recommend. In order to ensure recovery in a cluster does not get bogged down by overly large shards, we usually recommend keeping shards below 50GB. This naturally is less relevant if you just run on a single node.

In order to be able to scale out as data grows further, it would help to have a larger number of smaller shards, so I would recommend reindexing in order to reduce the shard size.
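
If you decide to reindex, the rough shape of it would be something like the sketch below. Index names, the shard count, and the localhost endpoint are placeholders, and on an index of this size you would want to run it as a background task:

```python
# Rough sketch only: reindex into a new index with more primary shards.
# Index names, the shard count, and localhost:9200 are placeholders.
import requests

base = "http://localhost:9200"

# Create the target index with more primary shards than the current index.
requests.put(
    base + "/my_index_v2",
    json={"settings": {"index.number_of_shards": 20}},
).raise_for_status()

# Copy all documents across; wait_for_completion=false runs the reindex as a
# background task, which is advisable for an index of this size.
requests.post(
    base + "/_reindex",
    params={"wait_for_completion": "false"},
    json={"source": {"index": "my_index"}, "dest": {"index": "my_index_v2"}},
).raise_for_status()
```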

Forcemerging is an expensive operation, so it is hard for me to tell whether it would be worthwhile doing periodically for your use case.

In my opinion, increasing the shard count may solve the problem for a while, but it is not a final solution.
For example, if I increase the shard count to 10, what should I do when my index size reaches 5TB?

To sum up, my problem is this:

  1. I stored a huge, non-indexed field in ES, so my index is about twice as large as it would otherwise be.
  2. Lucene uses compound files (.cfs) to store segments, and they grow rapidly because of point 1.
  3. The maximum size of such a .cfs is 5GB, so the number of segments also increases rapidly (a quick way to check this is sketched after this list).
  4. Search speed slows down because there are so many segments.
  5. Some of my programs need to search the whole index 2000 times per minute; because each search is slow, my machines get extremely busy and things keep getting worse.
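
For reference, a quick way to check the segment count and the compound flag per shard would be something like this (the index name and endpoint are placeholders):

```python
# Placeholder sketch: list segment count and the compound flag for each shard.
# "my_index" and localhost:9200 stand in for the real index and cluster.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/segments/my_index",
    params={"v": "true", "h": "shard,segment,size,compound"},
)
resp.raise_for_status()
# One row per segment; the last column shows whether it is a compound (.cfs) segment.
print(resp.text)
```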

I think the best solution would be a merge policy that works like this:

  1. Merge .cfs segments as the default policy does.
  2. Automatically merge .cfs files created more than one day (or some other period) ago into non-compound segments.
    But is that even possible...?

I would also suggest warning users not to put large, non-indexed data into the cluster.

Anyway, I hope forcemerging turns out to be worthwhile...

It is strange that I did not see this problem in my previous cluster. So maybe forcemerging is worthwhile, because I called optimize every midnight on that 1.3 cluster.

Are you updating documents in the index or just indexing new entries? If you are not updating documents, you may be able to switch to time based indices, e.g. monthly, which will make it easier to manage shard size and adjust to changing volumes over time.
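
As a rough illustration (index names, the alias name, and the endpoint are all just examples), the monthly pattern could look like this:

```python
# Rough illustration only: monthly indices behind a single write alias.
# Index names, the alias name, and localhost:9200 are example values.
import requests

base = "http://localhost:9200"

# Create the new month's index (in practice an index template would set
# mappings and shard count automatically).
requests.put(base + "/pages-2017.03").raise_for_status()

# Swing the write alias from last month's index to the new one; older monthly
# indices stay searchable via a wildcard such as pages-* or a search alias.
requests.post(
    base + "/_aliases",
    json={
        "actions": [
            {"remove": {"index": "pages-2017.02", "alias": "pages-current"}},
            {"add": {"index": "pages-2017.03", "alias": "pages-current"}},
        ]
    },
).raise_for_status()
```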

Do you mean creating one index each month and pointing the same alias at the newest one?
I only update documents within an hour or so of indexing them, so it should be possible for me to use such indices.

I'm just wondering about the number of indices. If I create a new index each month, it will still be costly after two years.
So in that case I would need another mechanism to merge old indices together.
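
What I have in mind is something like reindexing the old monthly indices into a single yearly index (the names and shard count below are placeholders, and I have not tried this):

```python
# Untested sketch: roll two old monthly indices into one yearly index.
# All index names, the shard count, and localhost:9200 are placeholders.
import requests

base = "http://localhost:9200"

# Create the consolidated yearly index with a small number of primary shards.
requests.put(
    base + "/pages-2016",
    json={"settings": {"index.number_of_shards": 2}},
).raise_for_status()

# Copy the monthly indices into it; once verified, the monthlies can be deleted.
requests.post(
    base + "/_reindex",
    json={
        "source": {"index": ["pages-2016.01", "pages-2016.02"]},
        "dest": {"index": "pages-2016"},
    },
).raise_for_status()
```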

It seems to be the best choice. Thanks.
