Junk content detection

Hello Elasticsearch Community,

We index the content of some files, using Apache Tika to extract it.
What I'm worried about is that some of the documents contain "junk"
content, for example an Excel sheet full of numbers. In such a case we
pollute the index with many tokens that are of no use at all, since
nobody will ever search for them. The same problem arises if someone
pastes binary data into a text file.

Is there a good way (in ES or externally) to detect whether content is
likely to be "junk"?

Thanks,
Igor


Possibly useful:


We'd love to offer a junk detector within Tika; see TIKA-1443. However, the challenge of making something that works across all languages, file formats, and genres is daunting.

In our soon-to-be-released tika-eval module, we've added "common words" lists built from the top 20k most frequent words per language in Wikipedia dumps. If you divide the number of common words found in a document by the number of words containing alphabetic/ideographic characters, that ratio offers some insight into how junky the text is.
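As a rough illustration, here is a minimal sketch of that ratio in Python. It is not the tika-eval implementation (which is Java); the tiny inline word set and the 0.1 cut-off are arbitrary placeholders for a real top-20k list and a tuned threshold.

import re

def common_word_ratio(text, common_words):
    """Fraction of letter-only tokens that appear in the common-words set."""
    # Keep only tokens made entirely of letters (Unicode-aware), so digit runs are ignored.
    tokens = re.findall(r"[^\W\d_]+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in common_words)
    return hits / len(tokens)

if __name__ == "__main__":
    # In practice, load the per-language top-20k list; this tiny set is a stand-in.
    common = {"the", "and", "of", "quarterly", "revenue", "totals", "margin"}
    sample = "Quarterly revenue totals: 4711 9342 0.87 1.02 xQz9 fj2k"
    ratio = common_word_ratio(sample, common)
    # A low ratio (say below 0.1, an arbitrary starting point) hints at junk.
    print(f"common-word ratio: {ratio:.2f}")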

So, junk is hard.

Numbers should be easy. PatternReplaceFilter, perhaps?
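For the numbers, one way to wire that up is a custom analyzer with a pattern_replace token filter (the Elasticsearch wrapper around Lucene's PatternReplaceFilter). Below is a sketch using the Python client; it assumes a recent elasticsearch-py release and a local node, and the index, analyzer, and filter names are made up. Tokens containing no letters at all (pure numbers, separators) are emptied by the pattern and then removed by a length filter.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

settings = {
    "analysis": {
        "filter": {
            # Empty out any token that contains no letters (pure numbers, punctuation runs).
            "strip_letterless": {
                "type": "pattern_replace",
                "pattern": r"^[^\p{L}]+$",
                "replacement": ""
            },
            # Then drop the now-empty tokens.
            "drop_empty": {"type": "length", "min": 1}
        },
        "analyzer": {
            "file_content": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "strip_letterless", "drop_empty"]
            }
        }
    }
}

mappings = {
    "properties": {
        "content": {"type": "text", "analyzer": "file_content"}
    }
}

es.indices.create(index="documents", settings=settings, mappings=mappings)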

Hi Tim,

Thanks for the answer. It's been two years since the original question and I've moved on to a different opportunity, but the project I had in mind at the time is up and running, so I'll forward TIKA-1443 and your message to them.

Thanks,
Igor