Is there a plan for lucene to expand max doc size per index?

makeyang · November 2, 2016, 9:39am

The initial design limit is 2^32. But as data is growing fast and getting bigger, is there a plan to raise it to 2^64 or something larger than the current value?

warkolm · November 2, 2016, 9:41am

It's actually a per shard limit, so you can have lots more than 2^32 per index

But, @mikemccand, any thoughts on this?

Christian_Dahlqvist · November 2, 2016, 9:58am

As Mark pointed out, the limit is per shard, not per index. Because queries are executed in parallel across shards but are single-threaded within each shard, query performance tends to depend on shard size. Having more than 2^32 documents in a shard would probably result in very large shards, which can also cause problems during recovery, when shards need to be moved around in the cluster.

How many documents do you currently have in your shards? How large are your shards?

makeyang · November 2, 2016, 10:15am

The total document count is about 14,000,000,000.
Actually, I shard the index based on a certain key, and some shards will go over 2^32.
In this case, more shards would mitigate the problem rather than resolve it.
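
A minimal Python sketch of why adding shards does not resolve the problem when routing is keyed: every document with the same key lands on the same shard, so one oversized key overflows a shard regardless of the shard count. The key names and counts are hypothetical, `zlib.crc32` is only a stand-in for whatever routing hash is really used, and the limit is taken from Lucene's `IndexWriter.MAX_DOCS` (a signed Java int minus 128).

```python
import zlib

# Assumption: per-shard ceiling as defined by Lucene's IndexWriter.MAX_DOCS.
LUCENE_MAX_DOCS = 2**31 - 1 - 128


def docs_per_shard(docs_by_key, num_shards):
    """Sum document counts per shard when all docs of a key share one shard."""
    totals = [0] * num_shards
    for key, count in docs_by_key.items():
        # Stand-in routing function: hash the key, take it modulo the shard count.
        shard = zlib.crc32(key.encode("utf-8")) % num_shards
        totals[shard] += count
    return totals


def oversized_shards(docs_by_key, num_shards, limit=LUCENE_MAX_DOCS):
    """Return the shard numbers whose document count exceeds the limit."""
    totals = docs_per_shard(docs_by_key, num_shards)
    return [i for i, total in enumerate(totals) if total > limit]


if __name__ == "__main__":
    # Hypothetical key distribution: one very large key, one tiny key.
    docs_by_key = {"big-customer": 5_000_000_000, "small-customer": 10}
    for num_shards in (1, 8, 64):
        print(num_shards, oversized_shards(docs_by_key, num_shards))
```

With this routing, the large key always produces one oversized shard, whether there are 1, 8, or 64 shards.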

warkolm · November 2, 2016, 10:28pm

It sounds a bit like you are working against ES here, why not let it shard things itself?

makeyang · November 3, 2016, 2:58am

It's bound to business logic, man.

warkolm · November 3, 2016, 5:27am

That doesn't really explain it though.

jprante · November 3, 2016, 9:29am

If you use a key to break a single index into shards, and those shards become too large in terms of document count, consider using the key as an index name and providing an entire index per key. Small indices can get a small shard count, while large indices can get a higher shard count.
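
As a rough illustration of the index-per-key idea, the sketch below picks a shard count for each key's index from its document count. The target of one billion documents per shard and the key names are assumptions for the example, not a recommendation from this thread.

```python
def shards_for_index(doc_count, max_docs_per_shard):
    """Smallest shard count that keeps each shard at or under the target."""
    if max_docs_per_shard <= 0:
        raise ValueError("max_docs_per_shard must be positive")
    # Integer ceiling division; every index gets at least one shard.
    return max(1, -(-doc_count // max_docs_per_shard))


def plan_indices(docs_by_key, max_docs_per_shard):
    """Map each key (used as an index name) to a shard count."""
    return {
        key: shards_for_index(count, max_docs_per_shard)
        for key, count in docs_by_key.items()
    }


if __name__ == "__main__":
    # Hypothetical per-key document counts.
    docs_by_key = {"key-a": 5_000_000_000, "key-b": 40_000_000}
    print(plan_indices(docs_by_key, max_docs_per_shard=1_000_000_000))
```

This assumes documents spread evenly across the shards inside each index, which holds when the default routing within that index is left alone.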

makeyang · November 3, 2016, 9:41am

Sure, there are solutions to resolve my issue.
But I wonder why expanding the max doc size in Lucene isn't an option for you guys?

jprante · November 3, 2016, 2:09pm

The status of Lucene as of Elasticsearch 2.x is:

http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/codecs/lucene50/package-summary.html#Limitations

Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit.
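
For readers unfamiliar with the VInt idea mentioned in that quote, here is a small Python sketch of a variable-length integer encoding in the same style: seven payload bits per byte, low-order group first, with the high bit set on every byte except the last. The function names are made up for illustration, and this is not Lucene's own code. Python integers are unbounded, so the sketch shows why such an encoding has no built-in limit.

```python
def encode_vint(value):
    """Encode a non-negative integer as a variable-length byte string."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while value > 0x7F:
        # Emit the low 7 bits and set the continuation bit.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # Final byte has the continuation bit clear.
    return bytes(out)


def decode_vint(data):
    """Decode the first variable-length integer found in data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated input")


if __name__ == "__main__":
    for n in (127, 128, 2**32, 2**40):
        encoded = encode_vint(n)
        print(n, encoded.hex(), decode_vint(encoded))
```

Small numbers take one byte, and a value such as 2^32, which does not fit in a 32-bit field, takes five.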

For ES 5, which uses Lucene 6.2, the API did not change much:

http://lucene.apache.org/core/6_2_0/core/org/apache/lucene/index/SegmentInfo.html

maxDoc() still returns a Java int. Since Java's int is signed, the usable range tops out at 2^31 - 1 rather than the 2^32 a uint32 would allow.
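
To make the integer arithmetic concrete, here is a small Python sketch. It relies on two facts: a Java int is a signed 32-bit value, so its largest value is 2^31 - 1, and Lucene's `IndexWriter.MAX_DOCS` is defined as `Integer.MAX_VALUE - 128`. The helper `to_java_int` is a hypothetical name used only to mimic 32-bit wraparound.

```python
JAVA_INT_MAX = 2**31 - 1              # Integer.MAX_VALUE in Java
UINT32_MAX = 2**32 - 1                # what an unsigned 32-bit field could hold
LUCENE_MAX_DOCS = JAVA_INT_MAX - 128  # IndexWriter.MAX_DOCS


def to_java_int(n):
    """Wrap an integer the way a 32-bit signed Java int would overflow."""
    return (n + 2**31) % 2**32 - 2**31


def fits_in_one_shard(doc_count):
    """True if a single Lucene index (one shard) could hold this many docs."""
    return 0 <= doc_count <= LUCENE_MAX_DOCS


if __name__ == "__main__":
    print(LUCENE_MAX_DOCS)                 # 2147483519
    print(to_java_int(JAVA_INT_MAX + 1))   # -2147483648
    print(fits_in_one_shard(14_000_000_000))
```

Going by these numbers, the per-shard ceiling is roughly 2.1 billion documents, about half of 2^32.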

You could change the Lucene 6.2 source code, rewrite SegmentInfo, and recompile a custom Lucene together with a recompiled ES 5 on top of it.

The Lucene developers noted that they intend to encode document numbers as VInt rather than int32, so I see no reason not to do this.

makeyang · November 4, 2016, 2:39am

thanks man. this is exactly what I want.
