I'm indexing some documents that occasionally have really large tokens (greater than 300,000 characters). I don't care about searching against these large tokens, but my use of a pattern-replace char filter makes them take more than 30 minutes to be indexed.
I've tried adding a length token filter, but it looks like that gets applied after the char_filter. Is there any way to prevent the char_filter from analyzing these super long tokens?
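For context, an analyzer along these lines reproduces the setup described (the index name, pattern, and length cutoff here are placeholders, not the actual config):

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_replace": {
          "type": "pattern_replace",
          "pattern": "example-pattern",
          "replacement": ""
        }
      },
      "filter": {
        "my_length": {
          "type": "length",
          "max": 255
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": ["my_replace"],
          "filter": ["my_length"]
        }
      }
    }
  }
}
```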
The reason the token filter does not help you here is that it runs after the char filter. The char filter is executed first, then the tokenizer, then the token filters.
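That ordering is why the length filter can't save you any work: by the time it drops a huge token, the pattern-replace char filter has already scanned it. A minimal Python sketch of the pipeline order (plain functions standing in for the real analysis components, not Elasticsearch code) makes this concrete:

```python
import re

def analyze(text, char_filter, tokenizer, token_filters):
    # 1. Char filters transform the raw character stream first,
    #    so they pay the cost for every character, huge tokens included.
    text = char_filter(text)
    # 2. The tokenizer splits the already-filtered text into tokens.
    tokens = tokenizer(text)
    # 3. Token filters (like `length`) run last, on tokens the
    #    char filter has already processed.
    for f in token_filters:
        tokens = f(tokens)
    return tokens

# Stand-in for a pattern_replace char filter: regex over the whole input.
pattern_replace = lambda s: re.sub(r"foo", "bar", s)
whitespace_tokenizer = lambda s: s.split()
# Stand-in for a length token filter: drops long tokens, but only
# AFTER the char filter has already run over them.
length_filter = lambda toks: [t for t in toks if len(t) <= 5]

print(analyze("foo fooooooooo baz",
              pattern_replace, whitespace_tokenizer, [length_filter]))
# → ['bar', 'baz']  (the long token is dropped, but only after the regex ran on it)
```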
I checked the Lucene source briefly, but I could not find anything regarding a size limitation in char filters (note: I'm not a Lucene expert).