Hi Ivan,
thanks again. I tried that and found a reasonable combination.
Nevertheless, when I now use the Analyze API against an index that has said
analyzer defined via the template, it doesn't seem to be applied.
This is the complete template:
{
"template": "bogstash-*",
"settings": {
"index.number_of_replicas": 0,
"analysis": {
"analyzer": {
"msg_excp_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filters": ["word_delimiter",
"lowercase",
"asciifolding",
"shingle",
"standard"]
}
},
"filters": {
"my_word_delimiter": {
"type": "word_delimiter",
"preserve_original": "true"
},
"my_asciifolding": {
"type": "asciifolding",
"preserve_original": true
}
}
}
},
"mappings": {
"default": {
"properties": {
"@excp": {
"type": "string",
"index": "analyzed",
"analyzer": "msg_excp_analyzer"
},
"@msg": {
"type": "string",
"index": "analyzed",
"analyzer": "msg_excp_analyzer"
}
}
}
}
}
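For reference, two details in this template look like they could prevent the custom analyzer from ever being registered: Elasticsearch expects the key `filter` (not `filters`), both for the filter list inside a custom analyzer and for the section that defines custom filters, and the analyzer lists the built-in `word_delimiter`/`asciifolding` filters instead of the `my_word_delimiter`/`my_asciifolding` ones defined below it. A corrected sketch of the `analysis` block, assuming that intent (the `standard` token filter is dropped here since it does not change the tokens):

```json
"analysis": {
  "analyzer": {
    "msg_excp_analyzer": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": ["my_word_delimiter", "lowercase", "my_asciifolding", "shingle"]
    }
  },
  "filter": {
    "my_word_delimiter": {
      "type": "word_delimiter",
      "preserve_original": true
    },
    "my_asciifolding": {
      "type": "asciifolding",
      "preserve_original": true
    }
  }
}
```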
I create the index bogstash-1.
Now I test the following:
curl -XGET 'localhost:9200/bogstash-1/_analyze?analyzer=msg_excp_analyzer&pretty=1' -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
and it returns:
{
"tokens" : [ {
"token" : "Service=MyMDB.onMessage",
"start_offset" : 0,
"end_offset" : 23,
"type" : "word",
"position" : 1
}, {
"token" : "appId=cs",
"start_offset" : 24,
"end_offset" : 32,
"type" : "word",
"position" : 2
}, {
"token" : "Times=Me:22/Total:22",
"start_offset" : 33,
"end_offset" : 53,
"type" : "word",
"position" : 3
}, {
"token" : "(updated",
"start_offset" : 54,
"end_offset" : 62,
"type" : "word",
"position" : 4
}, {
"token" : "attributes=gps_lng:",
"start_offset" : 63,
"end_offset" : 82,
"type" : "word",
"position" : 5
}, {
"token" : "183731222/",
"start_offset" : 83,
"end_offset" : 93,
"type" : "word",
"position" : 6
}, {
"token" : "gps_lat:",
"start_offset" : 94,
"end_offset" : 102,
"type" : "word",
"position" : 7
}, {
"token" : "289309222/",
"start_offset" : 103,
"end_offset" : 113,
"type" : "word",
"position" : 8
}, {
"token" : ")",
"start_offset" : 114,
"end_offset" : 115,
"type" : "word",
"position" : 9
} ]
}
That is the output of the standard analyzer.
Passing the tokenizer and filters to the Analyze API directly works fine:
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,word_delimiter,shingle,asciifolding,standard&pretty=1' -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
This results in:
{
"tokens" : [ {
"token" : "service",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 1
}, {
"token" : "service mymdb",
"start_offset" : 0,
"end_offset" : 13,
"type" : "shingle",
"position" : 1
}, {
"token" : "mymdb",
"start_offset" : 8,
"end_offset" : 13,
"type" : "word",
"position" : 2
}, {
"token" : "mymdb onmessage",
"start_offset" : 8,
"end_offset" : 23,
"type" : "shingle",
"position" : 2
}, {
"token" : "onmessage",
"start_offset" : 14,
"end_offset" : 23,
"type" : "word",
"position" : 3
}, {
"token" : "onmessage appid",
"start_offset" : 14,
"end_offset" : 29,
"type" : "shingle",
"position" : 3
}, {
"token" : "appid",
"start_offset" : 24,
"end_offset" : 29,
"type" : "word",
"position" : 4
}, {
"token" : "appid cs",
"start_offset" : 24,
"end_offset" : 32,
"type" : "shingle",
"position" : 4
}, {
"token" : "cs",
"start_offset" : 30,
"end_offset" : 32,
"type" : "word",
"position" : 5
}, {
"token" : "cs times",
"start_offset" : 30,
"end_offset" : 38,
"type" : "shingle",
"position" : 5
}, {
"token" : "times",
"start_offset" : 33,
"end_offset" : 38,
"type" : "word",
"position" : 6
}, {
"token" : "times me",
"start_offset" : 33,
"end_offset" : 41,
"type" : "shingle",
"position" : 6
}, {
"token" : "me",
"start_offset" : 39,
"end_offset" : 41,
"type" : "word",
"position" : 7
}, {
"token" : "me 22",
"start_offset" : 39,
"end_offset" : 44,
"type" : "shingle",
"position" : 7
}, {
"token" : "22",
"start_offset" : 42,
"end_offset" : 44,
"type" : "word",
"position" : 8
}, {
"token" : "22 total",
"start_offset" : 42,
"end_offset" : 50,
"type" : "shingle",
"position" : 8
}, {
"token" : "total",
"start_offset" : 45,
"end_offset" : 50,
"type" : "word",
"position" : 9
}, {
"token" : "total 22",
"start_offset" : 45,
"end_offset" : 53,
"type" : "shingle",
"position" : 9
}, {
"token" : "22",
"start_offset" : 51,
"end_offset" : 53,
"type" : "word",
"position" : 10
}, {
"token" : "22 updated",
"start_offset" : 51,
"end_offset" : 62,
"type" : "shingle",
"position" : 10
}, {
"token" : "updated",
"start_offset" : 55,
"end_offset" : 62,
"type" : "word",
"position" : 11
}, {
"token" : "updated attributes",
"start_offset" : 55,
"end_offset" : 73,
"type" : "shingle",
"position" : 11
}, {
"token" : "attributes",
"start_offset" : 63,
"end_offset" : 73,
"type" : "word",
"position" : 12
}, {
"token" : "attributes gps",
"start_offset" : 63,
"end_offset" : 77,
"type" : "shingle",
"position" : 12
}, {
"token" : "gps",
"start_offset" : 74,
"end_offset" : 77,
"type" : "word",
"position" : 13
}, {
"token" : "gps lng",
"start_offset" : 74,
"end_offset" : 81,
"type" : "shingle",
"position" : 13
}, {
"token" : "lng",
"start_offset" : 78,
"end_offset" : 81,
"type" : "word",
"position" : 14
}, {
"token" : "lng 183731222",
"start_offset" : 78,
"end_offset" : 92,
"type" : "shingle",
"position" : 14
}, {
"token" : "183731222",
"start_offset" : 83,
"end_offset" : 92,
"type" : "word",
"position" : 15
}, {
"token" : "183731222 gps",
"start_offset" : 83,
"end_offset" : 97,
"type" : "shingle",
"position" : 15
}, {
"token" : "gps",
"start_offset" : 94,
"end_offset" : 97,
"type" : "word",
"position" : 16
}, {
"token" : "gps lat",
"start_offset" : 94,
"end_offset" : 101,
"type" : "shingle",
"position" : 16
}, {
"token" : "lat",
"start_offset" : 98,
"end_offset" : 101,
"type" : "word",
"position" : 17
}, {
"token" : "lat 289309222",
"start_offset" : 98,
"end_offset" : 112,
"type" : "shingle",
"position" : 17
}, {
"token" : "289309222",
"start_offset" : 103,
"end_offset" : 112,
"type" : "word",
"position" : 18
} ]
}
So it seems the template is not used?! Are there any obvious reasons or mistakes?
Thx,
Marc
On Thursday, August 28, 2014 6:17:08 PM UTC+2, Ivan Brusic wrote:
Use the Analyze API to view what tokens are being generated. Keep it
simple at first (maybe remove shingles) and build up as you encounter more
edge cases. What kind of query are you using?
--
Ivan
On Thu, Aug 28, 2014 at 2:05 AM, Marc <mn.o...@googlemail.com> wrote:
Hi Ivan,
thanks for the help. Now it almost works...
I have used the following:
"analysis": {
"analyzer": {
"msg_excp_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filters": ["split-up",
"lowercase",
"shingle",
"ascii-folding"]
}
},
"filter": {
"split-up": {
"type": "word_delimiter",
"preserve_original": "true",
"catenate_all": "true",
"type_table": {
"$": "DIGIT",
"%": "DIGIT",
".": "DIGIT",
",": "DIGIT",
":": "DIGIT",
"/": "DIGIT",
"\": "DIGIT",
"=": "DIGIT",
"&": "DIGIT",
"(": "DIGIT",
")": "DIGIT",
"<": "DIGIT",
">": "DIGIT",
"\U+000A": "DIGIT"
}
},
"ascii-folding": {
"type": "asciifolding",
"preserve_original": true
}
}
}
If the above is wrong or not reasonable, please feel free to criticize!
Now the only thing that does not work is searching for subwords of
concatenations with ".".
Given the log Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22
(updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ ), I cannot
search for MyMDB or onMessage; only MyMDB.onMessage will work.
Any more ideas?
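One possible explanation for the MyMDB.onMessage case: the `split-up` filter above maps "." to DIGIT in its `type_table`, which stops `word_delimiter` from treating the dot as a plain delimiter, so the token may never be split cleanly into MyMDB and onMessage. Note also that the reference documentation gives `type_table` as an array of `char => TYPE` strings rather than an object. A sketch with the "." entry removed, so the dot splits subwords while `preserve_original` keeps the full token searchable:

```json
"split-up": {
  "type": "word_delimiter",
  "preserve_original": true,
  "catenate_all": true,
  "type_table": ["$ => DIGIT", "% => DIGIT", ", => DIGIT", ": => DIGIT",
                 "/ => DIGIT", "= => DIGIT", "& => DIGIT", "( => DIGIT",
                 ") => DIGIT", "< => DIGIT", "> => DIGIT"]
}
```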
Cheers,
Marc
On Wednesday, August 27, 2014 9:20:49 AM UTC+2, Ivan Brusic wrote:
Off the top of my head, I would use a custom analyzer with a whitespace
tokenizer and a word delimiter filter (preserving the original tokens as
well). Perhaps a shingle filter to create bigrams. Or better yet a pattern
tokenizer with spaces and parenthesis.
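That last suggestion could look roughly like the sketch below, assuming a `pattern` tokenizer that splits on runs of whitespace and parentheses (the analyzer and tokenizer names here are made up for illustration):

```json
"analysis": {
  "analyzer": {
    "msg_analyzer": {
      "type": "custom",
      "tokenizer": "space_paren_tokenizer",
      "filter": ["lowercase"]
    }
  },
  "tokenizer": {
    "space_paren_tokenizer": {
      "type": "pattern",
      "pattern": "[\\s()]+"
    }
  }
}
```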
Cheers,
Ivan
On Tue, Aug 26, 2014 at 11:57 PM, Marc mn.o...@googlemail.com wrote:
Hi,
I have quite a simple scenario that has already given me a headache for
quite a while.
I have one field which is quite big and full of special characters like
(, ), =, :, ", ', as well as digits and text.
Example:
"msg" : "Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22
(updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )"
I essentially want to be able to search these things using text,
wildcards, etc.
So far I have tried not analyzing the content and using the wildcard
search, and it doesn't work very well.
Using different tokenizers with the query_string query also only works
to a certain degree.
For example, I want to be able to search for the following expressions:
Service
MyMDB
onMessage
MyMDB.onMessage
appId=cs AND Times=Me:22
and other possible permutations.
What is a correct setup?! I simply can't find a solution...
PS: the data is imported into Elasticsearch using Logstash. We access
the data using the Java API (all software latest versions).
Cheers,
Marc
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/ada9c759-41e0-46ad-9941-3a0f2fb7c122%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.