You may enjoy(?) the lists of obscene words I've gathered here:
http://www.infochimps.com/datasets/list-of-dirty-obscene-banned-and-otherwise-unacceptable-words
I believe this kind of thing -- for which regexps and lookup tables fall
short, yet for which proper NLP is too much work -- is perfect for the
percolate feature.
As you index each document, percolate it against rule sets as complex or
as simple-term-matchy as you like, and tag matching documents with a
"probably offensive" flag. You can then exclude flagged documents
altogether, or let visitors opt in or out of seeing them.
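
To make the tag-at-index-time idea concrete, here is a minimal sketch in plain Python. It does not call the Elasticsearch percolate API; the rule names, the placeholder words, and the "probably_offensive" and "matched_rules" field names are all assumptions made up for illustration.

```python
import re
from typing import Callable, Dict, List

# A rule is any predicate over the document text. In a real setup these would
# be registered queries; here they are plain functions (hypothetical names).
Rule = Callable[[str], bool]


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


RULES: Dict[str, Rule] = {
    # Simple term match against a placeholder word.
    "simple_term": lambda text: "badword1" in tokens(text),
    # Slightly more complex rule: two placeholder words in the same document.
    "term_pair": lambda text: {"badword2", "badword3"} <= set(tokens(text)),
}


def matching_rules(text: str) -> List[str]:
    """Return the names of all rules that match the text, sorted."""
    return sorted(name for name, rule in RULES.items() if rule(text))


def tag_document(doc: dict) -> dict:
    """Return a copy of the document with the flag and matched rule names."""
    matches = matching_rules(doc["text"])
    return {**doc, "probably_offensive": bool(matches), "matched_rules": matches}


if __name__ == "__main__":
    print(tag_document({"text": "this is a badword1 text"}))
    print(tag_document({"text": "this is a harmless text"}))
```

Keeping the matched rule names next to the flag makes it easier to see later why a document was tagged.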
Flip
On Mar 14, 2013, at 9:32 AM, Michael Sick <michae...@serenesoftware.com>
wrote:
Also you should consider the impacts of false positives on your system.
Take the following phrase from The Hobbit - "The faggots are reeking".
Perhaps the elves are homophobic but research shows that they are just
admiring burning wood.
Google Books:
http://books.google.com/books?id=hFfhrCWiLSMC&pg=PA48&lpg=PA48#v=onepage&q&f=false
Since analysis for context and sentiment is difficult, you might set up a
review system in which the words you are trying to exclude change a state
field, something like: censorStatus=ok,review,notOk. On most reads you
retrieve only posts with the "ok" value, while some stewards review the
posts that require it and either allow or disallow them. Without knowing
the context of your system, I'm not sure how likely it is that you need to
care, but if you do, you'll find that being "smart" about the exclusions
can be a pain.
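
A rough sketch of that review workflow in plain Python, with no Elasticsearch calls. The censorStatus field and its three values come from the suggestion above; the watch list, the function names, and the dict-based post shape are assumptions for illustration only.

```python
import re
from typing import Iterable, List, Set

# Placeholder watch list (hypothetical); a real one would come from your own data.
WATCH_WORDS: Set[str] = {"badword1", "badword2"}


def initial_status(text: str) -> str:
    """Return "review" when a watched word appears, otherwise "ok"."""
    found = set(re.findall(r"\w+", text.lower())) & WATCH_WORDS
    return "review" if found else "ok"


def new_post(text: str) -> dict:
    """Build a post with its censorStatus set at write time."""
    return {"text": text, "censorStatus": initial_status(text)}


def steward_decision(post: dict, allow: bool) -> dict:
    """A steward resolves a post under review to "ok" or "notOk"."""
    if post["censorStatus"] != "review":
        return post  # nothing to decide
    return {**post, "censorStatus": "ok" if allow else "notOk"}


def visible_posts(posts: Iterable[dict]) -> List[dict]:
    """Most reads only retrieve posts whose censorStatus is "ok"."""
    return [p for p in posts if p["censorStatus"] == "ok"]


if __name__ == "__main__":
    posts = [new_post("a harmless post"), new_post("a badword1 post")]
    print([p["censorStatus"] for p in posts])
    posts[1] = steward_decision(posts[1], allow=True)  # a false positive, allowed
    print(len(visible_posts(posts)))
```

A word match only moves a post to "review", so a false positive costs a steward's time instead of silently hiding the post.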
On Thu, Mar 14, 2013 at 9:21 AM, David G Ortega g.orteg...@gmail.com wrote:
When you set up a multi_field and send a document with that field name,
ES internally creates the multi_field using the mapping definition. What
is going to happen is this:
You send:
{text: "this is a badword1 text"}
In ES:
{text.text: [this, is, a, badword1, text]}
{text.isBad: [this, is, a, bad, text]}
Obviously "bad" is far too generic a word, so it is better to use something
like tagFlagged instead of "bad", as in the mapping. In another example,
with tagFlagged, this is what happens:
You send:
{text: "this is a badword1, badword2, badword3 text"}
In ES:
{text.text: [this, is, a, badword1, badword2, badword3, text]}
{text.isBad: [this, is, a, tagFlagged, text]} (lowercase, unique)
Since you are filtering in the search to exclude documents that have the
term tagFlagged in text.isBad, no flagged posts are going to appear.
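
The behaviour described above can be imitated in plain Python to check the expected tokens. This is only a simulation of the two sub-field analyzers, not Elasticsearch code or the actual mapping; the bad-word set and the function names are assumptions.

```python
import re
from typing import Dict, Iterable, List, Set

BAD_WORDS: Set[str] = {"badword1", "badword2", "badword3"}  # placeholder list
FLAG_TOKEN = "tagFlagged"


def analyze_text(value: str) -> List[str]:
    """Stand-in for the text.text analyzer: lowercase word tokens."""
    return re.findall(r"\w+", value.lower())


def analyze_is_bad(value: str) -> List[str]:
    """Stand-in for the text.isBad analyzer.

    Lowercases, maps every bad word to the single flag token, then keeps
    only the first occurrence of each token (the "unique" step).
    """
    seen: Set[str] = set()
    result: List[str] = []
    for token in analyze_text(value):
        token = FLAG_TOKEN if token in BAD_WORDS else token
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


def index_document(doc: Dict[str, str]) -> Dict[str, List[str]]:
    """Produce both sub-fields from the single text field that was sent."""
    return {
        "text.text": analyze_text(doc["text"]),
        "text.isBad": analyze_is_bad(doc["text"]),
    }


def search_unflagged(indexed: Iterable[Dict[str, List[str]]]) -> List[Dict[str, List[str]]]:
    """Keep only documents whose text.isBad lacks the flag token."""
    return [d for d in indexed if FLAG_TOKEN not in d["text.isBad"]]


if __name__ == "__main__":
    doc = index_document({"text": "this is a badword1, badword2, badword3 text"})
    print(doc["text.text"])
    print(doc["text.isBad"])
```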
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.