Hello Elasticsearch Community,
We index the content of some files, using Apache Tika to extract it.
What I'm worried about is that some of the documents contain "junk"
content, like a lot of numbers in Excel. In such a case we'll pollute the
index with many tokens that won't be useful at all, as nobody will
search for them. It's a similar thing if someone pastes binary data into a text
Is there a good way (in ES or externally) to detect if content may be
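
To make the question concrete, here is a rough sketch of what an external check could look like, run on the text Tika returns before it is sent to Elasticsearch. It is only an illustration: the function names, the "letters only, at least two characters" definition of a word, and the 0.5 threshold are all assumptions made for this example, not anything provided by Elasticsearch or Tika.

```python
import re

# Hypothetical heuristic: names and thresholds are illustrative assumptions.
TOKEN_RE = re.compile(r"\S+")             # whitespace-separated tokens
WORD_RE = re.compile(r"[^\W\d_]{2,}")     # letters only, at least 2 characters
PUNCTUATION = ".,;:!?()\"'"


def wordlike_ratio(text: str) -> float:
    """Return the fraction of tokens that look like ordinary words."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0  # empty text has no useful tokens
    wordlike = sum(
        1 for token in tokens if WORD_RE.fullmatch(token.strip(PUNCTUATION))
    )
    return wordlike / len(tokens)


def looks_like_junk(text: str, min_ratio: float = 0.5) -> bool:
    """Flag text whose word-like ratio falls below the assumed threshold."""
    return wordlike_ratio(text) < min_ratio


if __name__ == "__main__":
    print(looks_like_junk("12.5 33.1 0.004 9981 77 12 0x3F"))
    print(looks_like_junk("We index the content of some files."))
```

A ratio like this would need tuning against real documents, since a spreadsheet with legitimate part numbers or a text in a language without whitespace word boundaries could be flagged by mistake.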