Junk content detection

Hello Elasticsearch Community,

We index the content of some files, using Apache Tika to extract it.
What I'm worried about is that some of the documents contain "junk"
content, for example an Excel sheet full of numbers. In such a case we
pollute the index with many tokens that are of no use at all, since
nobody will ever search for them. The same problem arises if someone
pastes binary data into a text file.

Is there a good way (in ES or externally) to detect whether content is
likely to be "junk"?

Thanks,
Igor


Possibly useful:


We'd love to offer a junk detector within Tika; see TIKA-1443. However, the challenge of making something that works across all languages, file formats, and genres is daunting.

In our soon-to-be-released tika-eval module, we've added "common words" lists built from the top 20k most frequent words per language in Wikipedia dumps. If you divide the number of common words found in a document by the number of words containing alphabetic/ideographic characters, that ratio offers some insight into how junky the text is.
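As a rough illustration, here is a minimal sketch of that ratio in Python. It is not the tika-eval implementation (which is Java); the tiny inline word set and the 0.1 cut-off are arbitrary placeholders for a real top-20k list and a tuned threshold.

import re

def common_word_ratio(text, common_words):
    """Fraction of letter-only tokens that appear in the common-words set."""
    # Keep only tokens made entirely of letters (Unicode-aware), so digit runs are ignored.
    tokens = re.findall(r"[^\W\d_]+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in common_words)
    return hits / len(tokens)

if __name__ == "__main__":
    # In practice, load the per-language top-20k list; this tiny set is a stand-in.
    common = {"the", "and", "of", "quarterly", "revenue", "totals", "margin"}
    sample = "Quarterly revenue totals: 4711 9342 0.87 1.02 xQz9 fj2k"
    ratio = common_word_ratio(sample, common)
    # A low ratio (say below 0.1, an arbitrary starting point) hints at junk.
    print(f"common-word ratio: {ratio:.2f}")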

So, junk is hard.

Numbers should be easy. PatternReplaceFilter, perhaps?
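For the numbers, one way to wire that up is a custom analyzer with a pattern_replace token filter (the Elasticsearch wrapper around Lucene's PatternReplaceFilter). Below is a sketch using the Python client; it assumes a recent elasticsearch-py release and a local node, and the index, analyzer, and filter names are made up. Tokens containing no letters at all (pure numbers, separators) are emptied by the pattern and then removed by a length filter.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

settings = {
    "analysis": {
        "filter": {
            # Empty out any token that contains no letters (pure numbers, punctuation runs).
            "strip_letterless": {
                "type": "pattern_replace",
                "pattern": r"^[^\p{L}]+$",
                "replacement": ""
            },
            # Then drop the now-empty tokens.
            "drop_empty": {"type": "length", "min": 1}
        },
        "analyzer": {
            "file_content": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "strip_letterless", "drop_empty"]
            }
        }
    }
}

mappings = {
    "properties": {
        "content": {"type": "text", "analyzer": "file_content"}
    }
}

es.indices.create(index="documents", settings=settings, mappings=mappings)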

Hi Tim,

Thanks for the answer. It's been two years since the original question and I've moved on to a different opportunity, but the project I had in mind at the time is up and running, so I'll forward TIKA-1443 and your message to them.

Thanks,
Igor