Is there a plan for Lucene to expand the max doc count per index?


(Makeyang) #1

The initial design limit is 2^32. But as data keeps growing fast, is there a plan to raise it to 2^64, or at least something bigger than the current value?


(Mark Walkom) #2

It's actually a per shard limit, so you can have lots more than 2^32 per index :slight_smile:

But, @mikemccand, any thoughts on this?


(Christian Dahlqvist) #3

The limit is, as Mark pointed out, per shard and not per index. As queries are executed in parallel across shards, but are single-threaded within each shard, query performance tends to depend on the shard size. Having more than 2^32 documents in a shard would probably result in very large shards, which can also cause problems during recovery, when shards need to be moved around in the cluster.

How many documents do you currently have in your shards? How large are your shards?


(Makeyang) #4

The total doc count is about 14,000,000,000.
Actually, I shard the index based on a certain key, and some shards will go over 2^32.
In this case, more shards would mitigate the problem rather than resolve it.
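
A rough sketch of the arithmetic behind these numbers: the smallest shard count that keeps every shard within the per-shard document limit, assuming documents are spread evenly (which key-based sharding does not guarantee). The function name `min_shards` is made up for illustration. The limit is passed in as a parameter because this thread quotes 2^32, while Lucene's `IndexWriter.MAX_DOCS` constant is, to the best of my knowledge, `Integer.MAX_VALUE - 128` (roughly 2^31).

```python
# Per-shard document limits.
# 2**32 is the figure quoted in this thread.
THREAD_LIMIT = 2**32
# Assumption: Lucene's IndexWriter.MAX_DOCS is Integer.MAX_VALUE - 128.
LUCENE_MAX_DOCS = (2**31 - 1) - 128


def min_shards(total_docs: int, per_shard_limit: int) -> int:
    """Smallest shard count such that an even split stays within the limit."""
    if total_docs < 0 or per_shard_limit <= 0:
        raise ValueError("total_docs must be >= 0 and per_shard_limit > 0")
    # Integer ceiling division; an index always has at least one shard.
    return max(1, -(-total_docs // per_shard_limit))


if __name__ == "__main__":
    total = 14_000_000_000
    print(min_shards(total, THREAD_LIMIT))
    print(min_shards(total, LUCENE_MAX_DOCS))
```

This is only a lower bound. With a skewed key, a single shard can still exceed the limit no matter how many shards the index has, which is the situation described above.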


(Mark Walkom) #5

It sounds a bit like you are working against ES here, why not let it shard things itself?


(Makeyang) #6

It's bound to business logic, man.


(Mark Walkom) #7

That doesn't really explain it though.


(Jörg Prante) #8

If you have a key that breaks a single index down into shards that become too large in terms of document count, consider using the key as an index name and providing an entire index per key. Small indices can get a small shard count, while large indices can get a higher shard count.
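
A minimal sketch of the index-per-key idea, under stated assumptions: the helper names (`index_name_for`, `shard_count_for`, `plan`), the `docs` prefix, and the target of 500 million documents per shard are all hypothetical and not taken from Elasticsearch or from this thread. It only computes a plan (index name to shard count); creating the indices is left out.

```python
# Assumed target, chosen only for illustration; tune it for your own cluster.
TARGET_DOCS_PER_SHARD = 500_000_000


def index_name_for(key: str, prefix: str = "docs") -> str:
    """Map a business key to its own index name (index names must be lowercase)."""
    return f"{prefix}-{key.strip().lower()}"


def shard_count_for(expected_docs: int) -> int:
    """Shard count that keeps an even split at or below the target size."""
    # Integer ceiling division; every index needs at least one shard.
    return max(1, -(-expected_docs // TARGET_DOCS_PER_SHARD))


def plan(expected_docs_by_key: dict[str, int]) -> dict[str, int]:
    """Return {index name: shard count} for each business key."""
    return {
        index_name_for(key): shard_count_for(count)
        for key, count in expected_docs_by_key.items()
    }


if __name__ == "__main__":
    print(plan({"CustomerA": 6_000_000_000, "customerb": 1_000}))
```

The point of the design is that within each per-key index, documents are distributed across shards by Elasticsearch rather than by the business key, so a large key gets more shards instead of one oversized shard.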


(Makeyang) #9

Sure, there are solutions to resolve my issue.
But I wonder why expanding the max doc count for Lucene isn't an option for you guys?


(Jörg Prante) #10

The status of Lucene as of Elasticsearch 2.x is:

http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/codecs/lucene50/package-summary.html#Limitations

Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit.

For ES 5, which uses Lucene 6.2, the API did not change much:

http://lucene.apache.org/core/6_2_0/core/org/apache/lucene/index/SegmentInfo.html

maxDoc() still returns a Java int, which is to be interpreted as a uint32.

You could change the Lucene 6.2 source code, rewrite SegmentInfo, and recompile a custom Lucene together with a recompiled ES 5 on top of it.

The Lucene developers noted that they intend to encode document numbers as a VInt rather than an int32, so I see no reason not to do this.


(Makeyang) #11

Thanks man. This is exactly what I wanted.

