A custom Codec for an Updateable Field


(aditya tripathi) #1

Hi,
I wanted to know if anyone has tried providing an updateable field using the Codecs available from Lucene 4.0 onwards, or if anyone has used https://issues.apache.org/jira/browse/LUCENE-5189 for an updateable field based on numeric DocValues.

An approach that uses a postingsFormat to write the field's postings to a key-value store is possible, but it eventually runs into problems.

An outline of this approach follows. (I also came across this page recently and found the same approach described there: http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec)

  • Use Lucene's default Codec, but with a custom PerFieldPostingsFormat for the updateable field. Let's call this custom postingsFormat UpdatePostingsFormat, and the corresponding fieldsConsumer/fieldsProducer UpdateConsumer/UpdateProducer.

-The UpdateConsumers write directly to the key-value store, with segment_field_term as the key. They do not buffer anything. The reason is that Lucene's indexing chain invokes these PerFieldPostingsFormats only at flush() or merge(). Since a segment is flushed in both cases, we thought it appropriate for the UpdateConsumer to write directly to the key-value store.

-Provide your own merge function in the UpdateTermsConsumer; this is basically a document renumbering map.
Take care of removing deleted docs here. This does not handle the case where all the docs in a segment are deleted and the segment itself is dropped at the next commit, because the merge function can only work with what is available in the MergeState passed to these custom consumers.

-With this in place, a partial update can directly update the key-value store.
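
As a rough illustration of the steps above, here is a minimal sketch of the key-value side only. It assumes an in-memory map in place of the real store, and every name in it (UpdateStoreSketch, writePostings, update, mergeTerm, the docMaps layout) is hypothetical; none of it is Lucene API. In the real setup the UpdateConsumer and UpdateTermsConsumer would call something of this shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the key-value side of the approach; none of this is Lucene API. */
public class UpdateStoreSketch {
    // In-memory map standing in for the external key-value store: key -> ascending doc IDs.
    private final Map<String, List<Integer>> store = new HashMap<>();

    /** Builds the segment_field_term key. A real store would need an unambiguous separator. */
    static String key(String segment, String field, String term) {
        return segment + "_" + field + "_" + term;
    }

    /** What an UpdateConsumer would do at flush: write the postings straight through. */
    public void writePostings(String segment, String field, String term, List<Integer> docIds) {
        List<Integer> sorted = new ArrayList<>(docIds);
        Collections.sort(sorted);
        store.put(key(segment, field, term), sorted);
    }

    /** Returns a copy of the postings for a term, or an empty list if the key is absent. */
    public List<Integer> postings(String segment, String field, String term) {
        List<Integer> docs = store.get(key(segment, field, term));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }

    /** Partial update: move one document from oldTerm to newTerm inside a flushed segment. */
    public void update(String segment, String field, int docId, String oldTerm, String newTerm) {
        List<Integer> oldDocs = postings(segment, field, oldTerm);
        oldDocs.remove(Integer.valueOf(docId)); // remove by value, not by index
        writePostings(segment, field, oldTerm, oldDocs);

        List<Integer> newDocs = postings(segment, field, newTerm);
        if (!newDocs.contains(docId)) {
            newDocs.add(docId);
        }
        writePostings(segment, field, newTerm, newDocs);
    }

    /**
     * Merge of one term: renumber doc IDs through a per-segment doc map and drop deleted docs.
     * docMaps.get(segment)[oldDocId] is the doc ID in the merged segment, or -1 if deleted.
     */
    public void mergeTerm(String field, String term, List<String> sourceSegments,
                          Map<String, int[]> docMaps, String mergedSegment) {
        List<Integer> merged = new ArrayList<>();
        for (String segment : sourceSegments) {
            int[] docMap = docMaps.get(segment);
            for (int oldDocId : postings(segment, field, term)) {
                int newDocId = docMap[oldDocId];
                if (newDocId >= 0) {
                    merged.add(newDocId);
                }
            }
            store.remove(key(segment, field, term)); // source segment goes away after the merge
        }
        writePostings(mergedSegment, field, term, merged);
    }
}
```

Note that mergeTerm in this sketch writes the merged state through immediately and removes the source keys, which is the behaviour that causes trouble when the merged segment is only checkpointed and not yet committed.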

This approach has the following problems:
a) Merge is a problem because the merged segment is only checkpointed, not committed. If the custom consumers (UpdateConsumer/UpdateTermsConsumer/UpdatePostingsConsumer) write the new merged state directly, the committed data of the updateable fields becomes inconsistent with that of the other fields, and this leads to many problems, including search failures.

  • We tackled this problem by holding the new merged state in an in-memory structure in the UpdateConsumers. Since Lucene only invokes these consumers at flush time, we wrote out this in-memory merge info at the next flush. However, there is a problem here as well: Lucene can commit its checkpointed merged segment when there are no docs to flush. In that case, the checkpointed merged segment gets committed without these custom consumers being invoked.

b) If the main document and its partial update arrive in the same segment, the update cannot be applied. This is because the updateable field for the main document has not been written anywhere yet; by Lucene's design, fields are written (consumed) only at flush time.

c) As mentioned above, there is the case where all documents in a segment get deleted and that segment is dropped before any merge takes place. The dropped segment is not visible to these custom consumers, and this problem cannot be solved by implementing codecs alone.

d) LiveDocs issues: since liveDocs are worked out of the BufferedDeletesStream after every commit, and liveDocs info is not passed in the SegmentWriteState that is handed to these custom consumers, we can't sync up with liveDocs at every commit but only at merge time.
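
To make problem (a) and its workaround concrete, here is a minimal sketch of the staging idea. It assumes a plain in-memory map for both the committed and the staged state, and all names (StagedMergeSketch, stageMerged, publishOnFlush) are hypothetical; nothing here is Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical staging buffer for merged state; names are illustrative, not Lucene API. */
public class StagedMergeSketch {
    // State that searches against the key-value store can see.
    private final Map<String, String> visible = new HashMap<>();
    // Merged state held in memory until the consumers are invoked again.
    private final Map<String, String> staged = new HashMap<>();

    /** Called from the merge path: keep the merged postings in memory instead of writing them. */
    public void stageMerged(String key, String postings) {
        staged.put(key, postings);
    }

    /** Called from the flush hook, the only other point at which the consumers run. */
    public void publishOnFlush() {
        visible.putAll(staged);
        staged.clear();
    }

    /** What a search sees for a key, or null if nothing has been published for it. */
    public String read(String key) {
        return visible.get(key);
    }

    /** Number of merged entries still waiting for a flush. */
    public int stagedCount() {
        return staged.size();
    }
}
```

If the merged segment is committed while stagedCount() is still non-zero, read() returns null for the merged segment's keys even though the segment is already part of the commit. That is the gap described in (a): nothing in the codec path calls publishOnFlush() when a commit happens without any docs to flush.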

There may be more issues, but we have encountered mostly these.

Wanted to check if anyone has tried this or can offer any comments.

Currently, I am trying to augment this solution with a SegmentInfosFormat, a LiveDocsFormat and a DirectoryWrapper, but I do not know if I will be able to solve these problems even with that.

Thanks for your patience in reading this long post.

-Aditya.

