Incorrect query matches using NOT

Anton · December 2, 2012, 1:09pm

Hi all,

I have an indexing problem I can't seem to figure out. Say I'm looking for
the keyword "UPS" in data from Twitter. There are plenty of matches:

{
"query" : { "query_string" : {"query" : "interaction.content:'ups'"} },
"size": 0
}

Returns: 133025 results found.

When I do the inverse, and look for content not mentioning the keyword
"UPS", I get even more:

{
"query" : { "query_string" : {"query" : "NOT interaction.content:'ups'"} },
"size": 0
}

Returns: 272485 results found.

However, when I take a closer look at the results supposedly not containing
the keyword "UPS", I find:

["content"]=> "Twitter meet ups are probably the gayest stupidest things
ever"

["content"]=> "aiiii beaaa dejameeee jopetaaaas!!!!! Ups ostia pa mi:o
(cara whats app) JAJAJA tiamo"

["content"]=> ""Im Fahrzeug für Zustellung" ist UPS's Art dir zu sagen: "du
hast was vor? Vergiss es!""

["content"]=> "Bon UPS tu te dépêches!!! Je dois sortir -_-»"

["content"]=> "Huh, iOS recognises UPS tracking numbers and gives an option
to track on long press? That's cool."

Clearly, these all contain the keyword "UPS", but for some reason haven't
been indexed with that token. My first guess would be that my analyzers are
causing this, but in that case the results correctly matching the keyword
(in the same languages, using the same analyzers) would have been affected
as well. I also have the keyword "UPS" in my list of protected words, so it
should not be stemmed or changed during indexing.

Any ideas on what could be happening here?

Thanks,

Anton.

--

smonasco_2 · December 3, 2012, 3:41am

What's the curl command to create your index and the interaction.content
mapping?

--

Anton · December 4, 2012, 7:09am

This is the mapping for interaction.content:

curl -XPUT http://localhost:9200/media/raw/_mapping -d '
{
  "raw": {
    "properties": {
      "interaction": {
        "dynamic": "true",
        "properties": {
          "content": {
            "type": "multi_field",
            "index_name": "customtag",
            "index": "analyzed",
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
            "store": "yes",
            "fields": {
              "content": {
                "type": "string",
                "index": "analyzed"
              },
              "original": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          }
        }
      }
    }
  }
}
'

In addition to the index_analyzer, I also specify an analyzer per language
(using _analyzer) when I add the content. For English this would be
analyzer_en.

curl -XPUT http://localhost:9200/fedex_socmedia/ -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "index_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "keyword_marker", "standard", "asciifolding", "my_delimiter"],
            "char_filter": "html_strip"
          },
          "search_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "keyword_marker", "standard", "stop", "asciifolding", "snowball"]
          },
          "analyzer_en": {
            "type": "snowball",
            "language": "English",
            "stopwords_path": "/opt/elasticsearch/stopwords_en.txt"
          }
        },
        "filter": {
          "my_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": true,
            "catenate_words": true,
            "catenate_numbers": true,
            "preserve_original": true,
            "split_on_numerics": true,
            "stem_english_possessive": true
          },
          "keyword_marker": {
            "type": "keyword_marker",
            "keywords_path": "/opt/elasticsearch/protectedwords.txt"
          }
        }
      }
    }
  }
}
'

I don't have "ups" in my list of stopwords, but I do have it in my list of
protected words.

And this is an example of an added article:

{
  "interaction": {
    "content": "Twitter meet ups are probably the gayest stupidest things ever"
  },
  "_analyzer": "analyzer_en"
}
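
One way to see which tokens each analyzer actually produces for a sample like the one above would be the _analyze API. The following is only a minimal sketch: it builds the request URLs without sending them, the helper name analyze_url is made up for illustration, and it assumes the query-parameter form (analyzer and text in the URL) that older Elasticsearch releases accepted as far as I know. Newer releases expect a JSON body instead, so check the documentation for your version.

```python
from urllib.parse import urlencode


def analyze_url(host, index, analyzer, text):
    """Build an _analyze URL asking `index` to tokenize `text` with `analyzer`.

    Hypothetical helper; it only formats the URL and sends nothing.
    """
    params = urlencode({"analyzer": analyzer, "text": text})
    return "{}/{}/_analyze?{}".format(host, index, params)


if __name__ == "__main__":
    # Index and analyzer names are taken from the settings posted above.
    sample = "Twitter meet ups are probably"
    for name in ("analyzer_en", "index_analyzer", "search_analyzer"):
        print(analyze_url("http://localhost:9200", "fedex_socmedia", name, sample))
```

Comparing the token lists returned for analyzer_en (used at index time through _analyzer) and for search_analyzer (used at query time) should show whether "ups" comes out as the same term on both sides.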

Anton.

--

Ivan · December 4, 2012, 5:49pm

Lucene, and therefore Elasticsearch, does not handle NOT queries when they
are the only term:

http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#NOT

Not sure if Elasticsearch does anything different to help support this use
case. One workaround in Lucene is to explicitly ask for everything in a
query with the NOT query:

*:* AND NOT interaction.content:'ups'

Of course, this query is highly inefficient. It is easier to rethink the
query and avoid relying on a lone NOT clause.
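
As a rough sketch of that workaround, the request bodies could be built like this. The first helper prefixes the negated clause with the match-all expression *:*; the second expresses the same idea as a bool query with match_all under must and the original clause under must_not. The helper names are made up for illustration, and neither body has been run against the index from this thread.

```python
import json

# Field clause copied from the original post, quotes included.
FIELD_QUERY = "interaction.content:'ups'"


def negated_query_string(field_query):
    """Match everything (*:*) and then subtract documents matching field_query."""
    return {
        "query": {"query_string": {"query": "*:* AND NOT " + field_query}},
        "size": 0,
    }


def negated_bool(field_query):
    """Same intent as a bool query: match_all as must, the clause as must_not."""
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "must_not": {"query_string": {"query": field_query}},
            }
        },
        "size": 0,
    }


if __name__ == "__main__":
    print(json.dumps(negated_query_string(FIELD_QUERY), indent=2))
    print(json.dumps(negated_bool(FIELD_QUERY), indent=2))
```

Either body can be sent with curl -d in place of the original NOT query. If the hit count still includes documents that visibly contain "ups", the negation is not the culprit and the analyzers are the next thing to look at.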

Cheers,

Ivan

--
