This is the mapping for interaction.content:
curl -XPUT http://localhost:9200/media/raw/_mapping -d '
{
  "raw": {
    "properties": {
      "interaction": {
        "dynamic": "true",
        "properties": {
          "content": {
            "type": "multi_field",
            "index_name": "customtag",
            "index": "analyzed",
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
            "store": "yes",
            "fields": {
              "content": {
                "type": "string",
                "index": "analyzed"
              },
              "original": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          }
        }
      }
    }
  }
}
'
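As a rough sketch (Python toy code, not Elasticsearch internals) of what the two sub-fields above give you: the analyzed "content" field is broken into tokens, while "original" ("index": "not_analyzed") is kept as a single exact token.

```python
# Toy contrast of the two multi_field sub-fields above.
# analyzed_tokens stands in for an analyzed field (lowercased, tokenized);
# not_analyzed_tokens stands in for index: not_analyzed (one exact token).
def analyzed_tokens(text):
    return text.lower().split()

def not_analyzed_tokens(text):
    return [text]

text = "Twitter meet UPS"
print(analyzed_tokens(text))      # ['twitter', 'meet', 'ups']
print(not_analyzed_tokens(text))  # ['Twitter meet UPS']
```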
In addition to the index_analyzer, I also specify a per-language analyzer
(via the _analyzer field) when I index a document. For English this is
analyzer_en.
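The key point is that two different analyzer chains can emit different tokens for the same text. Here is a toy Python sketch of that (not Elasticsearch code: the plural-stripping rule is a crude stand-in for the English snowball stemmer, and the real chains are the ones in the index settings below).

```python
# Toy illustration: the same text yields different tokens under a
# non-stemming chain vs. a stemming one. The plural-stripping rule is a
# hypothetical stand-in for the English snowball stemmer.

def no_stem_analyzer(text):
    # stand-in for index_analyzer: lowercase + whitespace split, no stemming
    # (keyword_marker would additionally protect listed words from stemming)
    return text.lower().split()

def stemming_analyzer(text):
    # stand-in for analyzer_en: lowercase, split, strip a plural "s"
    tokens = []
    for tok in text.lower().split():
        if tok.endswith("s") and len(tok) > 2:
            tok = tok[:-1]
        tokens.append(tok)
    return tokens

print(no_stem_analyzer("Twitter meet ups"))   # ['twitter', 'meet', 'ups']
print(stemming_analyzer("Twitter meet ups"))  # ['twitter', 'meet', 'up']
```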
curl -XPUT http://localhost:9200/fedex_socmedia/ -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "index_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "keyword_marker", "standard", "asciifolding", "my_delimiter"],
            "char_filter": "html_strip"
          },
          "search_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "keyword_marker", "standard", "stop", "asciifolding", "snowball"]
          },
          "analyzer_en": {
            "type": "snowball",
            "language": "English",
            "stopwords_path": "/opt/elasticsearch/stopwords_en.txt"
          }
        },
        "filter": {
          "my_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": true,
            "catenate_words": true,
            "catenate_numbers": true,
            "preserve_original": true,
            "split_on_numerics": true,
            "stem_english_possessive": true
          },
          "keyword_marker": {
            "type": "keyword_marker",
            "keywords_path": "/opt/elasticsearch/protectedwords.txt"
          }
        }
      }
    }
  }
}
'
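As a very rough sketch of what the my_delimiter settings above do to a possessive token like "ups's" (a hypothetical simplification in Python; the real word_delimiter filter has many more rules):

```python
# Hypothetical simplification of the my_delimiter (word_delimiter) config:
# preserve_original keeps the input token, stem_english_possessive drops
# the trailing "'s", and generate_word_parts splits on the apostrophe.
def my_delimiter(token):
    variants = {token}               # preserve_original: true
    if token.endswith("'s"):
        token = token[:-2]           # stem_english_possessive: true
    variants.update(p for p in token.split("'") if p)  # generate_word_parts
    return sorted(variants)

print(my_delimiter("ups's"))  # ["ups", "ups's"]
```

So a possessive form like the one in the German tweet below still produces an "ups" token under this chain.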
I don't have "ups" in my list of stopwords, but I do have it in my list of
protected words.
And this is an example of an added article:
{
  "interaction": {
    "content": "Twitter meet ups are probably the gayest stupidest things ever"
  },
  "_analyzer": "analyzer_en"
}
Anton.
On Monday, December 3, 2012 4:41:53 AM UTC+1, smonasco wrote:
What's the curl command to create your index and the interaction.content
mapping?
On Sunday, December 2, 2012 6:09:18 AM UTC-7, Anton wrote:
Hi all,
I have an indexing problem I can't seem to figure out. Say I'm looking
for the keyword "UPS" in data from Twitter. There are plenty of matches:
{
  "query": { "query_string": { "query": "interaction.content:'ups'" } },
  "size": 0
}
Returns: 133025 results found.
When I do the inverse, and look for content not mentioning the keyword
"UPS", I get even more:
{
  "query": { "query_string": { "query": "NOT interaction.content:'ups'" } },
  "size": 0
}
Returns: 272485 results found.
However, when I take a closer look at the results supposedly not
containing the keyword "UPS", I find:
["content"]=> "Twitter meet ups are probably the gayest stupidest things
ever"
["content"]=> "aiiii beaaa dejameeee jopetaaaas!!!!! Ups ostia pa mi:o
(cara whats app) JAJAJA tiamo"
["content"]=> ""Im Fahrzeug für Zustellung" ist UPS's Art dir zu sagen:
"du hast was vor? Vergiss es!""
["content"]=> "Bon UPS tu te dépêches!!! Je dois sortir -_-»"
["content"]=> "Huh, iOS recognises UPS tracking numbers and gives an
option to track on long press? That?s cool."
Clearly, these all contain the keyword "UPS", but for some reason haven't
been indexed with that token. My first guess would be that my analyzers are
causing this, but in that case the results correctly matching the keyword
(in the same languages, using the same analyzers) would have been affected
as well. I also have the keyword "UPS" in my list of protected words, so it
should not be stemmed or changed during indexing.
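To make the failure mode concrete, here is a toy inverted index in Python (a sketch, not Elasticsearch internals): a document whose text was analyzed to the token "up" has no posting under "ups", so a term lookup for "ups" misses it and its negation matches it.

```python
# Toy inverted index: two documents with the same text, analyzed by
# different (hypothetical) chains into different tokens.
docs = {
    1: ["twitter", "meet", "ups"],  # non-stemming chain kept "ups"
    2: ["twitter", "meet", "up"],   # stemming chain reduced it to "up"
}

# Build token -> set of doc ids.
index = {}
for doc_id, tokens in docs.items():
    for t in tokens:
        index.setdefault(t, set()).add(doc_id)

matches = index.get("ups", set())      # docs a term query for "ups" finds
not_matches = set(docs) - matches      # docs the NOT query returns

print(matches)      # {1}
print(not_matches)  # {2}
```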
Any ideas on what could be happening here?
Thanks,
Anton.