Incorrect query matches using NOT

Hi all,

I have an indexing problem I can't seem to figure out. Say I'm looking for
the keyword "UPS" in data from Twitter. There are many matching documents:

{
"query" : { "query_string" : {"query" : "interaction.content:'ups'"} },
"size": 0
}

Returns: 133025 results found.

When I do the inverse, and look for content not mentioning the keyword
"UPS", I get even more:

{
"query" : { "query_string" : {"query" : "NOT interaction.content:'ups'"} },
"size": 0
}

Returns: 272485 results found.
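
One way to sanity-check the two counts: if the NOT query really is the complement of the first query, the two hit counts should add up to whatever a match_all query reports for the same index. Below is a minimal Python sketch of that check. The helper names are made up for illustration, and the total document count is not given in this thread, so it would have to come from running the match_all body against the same index.

```python
import json

FIELD_QUERY = "interaction.content:'ups'"  # query string copied from the post


def count_body(query_string):
    """Build a count-only search body (size 0) for a query_string query."""
    return {"query": {"query_string": {"query": query_string}}, "size": 0}


def match_all_body():
    """Count-only body that matches every document in the index."""
    return {"query": {"match_all": {}}, "size": 0}


def is_complement(positive_hits, negative_hits, total_hits):
    """True when the positive and NOT counts together cover the whole index."""
    return positive_hits + negative_hits == total_hits


if __name__ == "__main__":
    # The three bodies to send to the _search endpoint.
    print(json.dumps(count_body(FIELD_QUERY)))
    print(json.dumps(count_body("NOT " + FIELD_QUERY)))
    print(json.dumps(match_all_body()))
    # Sum of the two counts reported in the post; match_all should agree.
    print(133025 + 272485)
```

If the match_all total differs from that sum, the NOT query is not behaving as a plain complement and the query parsing deserves a closer look before the analyzers do.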

However, when I take a closer look at the results supposedly not containing
the keyword "UPS", I find:

["content"]=> "Twitter meet ups are probably the gayest stupidest things
ever"

["content"]=> "aiiii beaaa dejameeee jopetaaaas!!!!! Ups ostia pa mi:o
(cara whats app) JAJAJA tiamo"

["content"]=> ""Im Fahrzeug für Zustellung" ist UPS's Art dir zu sagen: "du
hast was vor? Vergiss es!""

["content"]=> "Bon UPS tu te dépêches!!! Je dois sortir -_-»"

["content"]=> "Huh, iOS recognises UPS tracking numbers and gives an option
to track on long press? That's cool."

Clearly, these all contain the keyword "UPS", but for some reason haven't
been indexed with that token. My first guess would be that my analyzers are
causing this, but in that case the results correctly matching the keyword
(in the same languages, using the same analyzers) would have been affected
as well. I also have the keyword "UPS" in my list of protected words, so it
should not be stemmed or changed during indexing.

Any ideas on what could be happening here?

Thanks,

Anton.

--

What's the curl command to create your index and the interaction.content
mapping?

On Sunday, December 2, 2012 6:09:18 AM UTC-7, Anton wrote:

--

This is the mapping for interaction.content:

curl -XPUT http://localhost:9200/media/raw/_mapping -d '
{
"raw": {
"properties": {
"interaction": {
"dynamic": "true",
"properties": {
"content": {
"type": "multi_field",
"index_name": "customtag",
"index": "analyzed",
"index_analyzer": "index_analyzer",
"search_analyzer": "search_analyzer",
"store": "yes",
"fields": {
"content": {
"type": "string",
"index": "analyzed"
},
"original": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
'

In addition to the index_analyzer, I also specify an analyzer per language
(using _analyzer) when I add the content. For English this would be
analyzer_en.

curl -XPUT http://localhost:9200/fedex_socmedia/ -d '
{
"settings" : {
"index": {
"analysis": {
"analyzer": {
"index_analyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "keyword_marker", "standard",
"asciifolding", "my_delimiter"],
"char_filter": "html_strip"
},
"search_analyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "keyword_marker", "standard", "stop",
"asciifolding", "snowball"]
},
"analyzer_en": {
"type": "snowball",
"language": "English",
"stopwords_path": "/opt/elasticsearch/stopwords_en.txt"
}
},
"filter": {
"my_delimiter": {
"type": "word_delimiter",
"generate_word_parts": true,
"catenate_words": true,
"catenate_numbers": true,
"preserve_original": true,
"split_on_numerics": true,
"stem_english_possessive": true
},
"keyword_marker": {
"type": "keyword_marker",
"keywords_path": "/opt/elasticsearch/protectedwords.txt"
}
}
}
}
}
}
'

I don't have "ups" in my list of stopwords, but I do have it in my list of
protected words.

And this is an example of an added article:

{
"interaction": {
"content": "Twitter meet ups are probably the gayest stupidest things ever"
},
"_analyzer": "analyzer_en"
}
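
Since three analyzers are in play here (index_analyzer, search_analyzer and the per-document analyzer_en), it may help to compare the tokens each one produces for the same text. The sketch below only builds the request URLs. It assumes the _analyze endpoint accepts `analyzer` and `text` as URL parameters, which was the case in Elasticsearch releases of this era (newer versions expect a JSON body instead), and it assumes the index is the fedex_socmedia one from the settings command. The function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # host used in the curl commands above
INDEX = "fedex_socmedia"            # assumption: index from the settings command
ANALYZERS = ["index_analyzer", "search_analyzer", "analyzer_en"]


def analyze_url(analyzer, text, base_url=BASE_URL, index=INDEX):
    """Return an _analyze URL that runs `text` through the named analyzer."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return "{}/{}/_analyze?{}".format(base_url, index, query)


if __name__ == "__main__":
    sample = "Twitter meet ups"
    for name in ANALYZERS:
        # Fetch each URL (for example with curl) and compare the token lists.
        print(analyze_url(name, sample))
```

If the token emitted for "ups" differs between the analyzer used at index time and the one used at search time, that would explain documents that visibly contain the word but do not match the query.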

Anton.

On Monday, December 3, 2012 4:41:53 AM UTC+1, smonasco wrote:

What's the curl command to create your index and the interaction.content
mapping?

--

Lucene, and therefore ElasticSearch, does not handle NOT queries when they
are the only term:

http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#NOT

Not sure if ElasticSearch does anything different to help support this use
case. One workaround in Lucene is to explicitly ask for everything in a
query with the NOT query:

*:* AND NOT interaction.content:'ups'

Of course, this query is highly inefficient. It is easier to rethink the
query and avoid relying on a lone NOT clause.
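
As a rough illustration of the workaround, the sketch below builds two request bodies: one that spells out "match everything, then exclude the term" in query_string syntax (`*:*` being the Lucene match-all expression), and one that expresses the same idea as a bool query with match_all in `must` and the term in `must_not`. The function names are made up for this example, and whether the two forms perform differently on a given cluster is not something this thread establishes.

```python
import json


def negated_query_string(field, term):
    """query_string form: match all documents, then exclude the term."""
    return {
        "query": {
            "query_string": {"query": "*:* AND NOT {}:{}".format(field, term)}
        },
        "size": 0,
    }


def negated_bool(field, term):
    """bool form: match_all in must, the unwanted term in must_not."""
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "must_not": {
                    "query_string": {"query": "{}:{}".format(field, term)}
                },
            }
        },
        "size": 0,
    }


if __name__ == "__main__":
    print(json.dumps(negated_query_string("interaction.content", "ups")))
    print(json.dumps(negated_bool("interaction.content", "ups")))
```

Both bodies keep "size": 0 so that only the hit count comes back, matching the queries in the original post.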

Cheers,

Ivan

--