I'm indexing some documents that occasionally have really large tokens (greater than 300,000 characters). I don't care about searching against these large tokens, but my use of a pattern-replace char filter makes them take more than 30 minutes to be indexed.
I've tried adding a length token filter, but it looks like that gets applied after the char_filter. Is there any way to prevent the char_filter from analyzing these super long tokens?
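For context, an analyzer along these lines reproduces the setup described (the index name, pattern, and length cutoff here are placeholders, not the actual config):

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_replace": {
          "type": "pattern_replace",
          "pattern": "example-pattern",
          "replacement": ""
        }
      },
      "filter": {
        "my_length": {
          "type": "length",
          "max": 255
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": ["my_replace"],
          "filter": ["my_length"]
        }
      }
    }
  }
}
```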
The reason the token filter does not help you here is that it runs after the char filter. The char filter is executed first, then the tokenizer, then the token filters.
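That ordering is why the length filter can't save you any work: by the time it drops a huge token, the pattern-replace char filter has already scanned it. A minimal Python sketch of the pipeline order (plain functions standing in for the real analysis components, not Elasticsearch code) makes this concrete:

```python
import re

def analyze(text, char_filter, tokenizer, token_filters):
    # 1. Char filters transform the raw character stream first,
    #    so they pay the cost for every character, huge tokens included.
    text = char_filter(text)
    # 2. The tokenizer splits the already-filtered text into tokens.
    tokens = tokenizer(text)
    # 3. Token filters (like `length`) run last, on tokens the
    #    char filter has already processed.
    for f in token_filters:
        tokens = f(tokens)
    return tokens

# Stand-in for a pattern_replace char filter: regex over the whole input.
pattern_replace = lambda s: re.sub(r"foo", "bar", s)
whitespace_tokenizer = lambda s: s.split()
# Stand-in for a length token filter: drops long tokens, but only
# AFTER the char filter has already run over them.
length_filter = lambda toks: [t for t in toks if len(t) <= 5]

print(analyze("foo fooooooooo baz",
              pattern_replace, whitespace_tokenizer, [length_filter]))
# → ['bar', 'baz']  (the long token is dropped, but only after the regex ran on it)
```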
I checked the Lucene source briefly, but I could not find anything regarding a size limitation in char filters (note: I'm not a Lucene expert).