I'm trying to check whether a certain attachment has already been indexed, to avoid reindexing that document. The problem is that for longer text fields, the term filter isn't working. I didn't find any mention of this limitation in the documentation. For the same document,
But I'd probably try something else. I'd compute a signature for your file and store that, instead of searching for a term which can be very, very big. I don't think it's a good idea to index the base64 content.
I don't think it's a good idea to store it in Elasticsearch.
In the FSCrawler project, I compute such a signature for every file I send to Elasticsearch, and I only compare signatures.
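To illustrate the signature idea, here is a minimal sketch in Python. It assumes SHA-256 as the signature, which is my choice for the example and not necessarily what FSCrawler uses; `file_signature` and `needs_indexing` are hypothetical helper names, and the set of known signatures stands in for whatever lookup you run against your index.

```python
import hashlib
from pathlib import Path


def file_signature(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file's raw bytes.

    Assumption: SHA-256 is used only as an example signature.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large attachments are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_indexing(path, known_signatures):
    """Return True when the file's signature is not among the known ones.

    `known_signatures` is a stand-in for a lookup against the index.
    """
    return file_signature(path) not in known_signatures
```

The digest is a short fixed-length string (64 hex characters for SHA-256), so it could be stored in a non-analyzed field and matched exactly, instead of matching on the full base64 content.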
Thanks a lot. That base64 value is actually always going to be 404 characters long. But it's probably a bad idea nevertheless. I'll be mindful of it when implementing it. For now, I was just doing a POC.