EL setup for fulltext search


(Marc-2) #1

Hi,

I have quite a simple scenario that has been giving me a headache for quite a
while.
I have one field which is quite big and full of special characters like
(, ), =, :, ", and ', plus digits and text.
Example:
"msg" : "Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated
attributes=gps_lng: 183731222/ gps_lat: 289309222/ )"
I essentially want to be able to search these things using text, wildcards,
etc.
So far I have tried not analyzing the content and using the wildcard search,
and it doesn't work very well.
Using different tokenizers with the query_string query also only works to a
certain degree.
For example, I want to be able to search for the following expressions:
Service
MyMDB
onMessage
MyMDB.onMessage
appId=cs AND Times=Me:22

and other possible permutations.
What is a correct setup?! I simply can't find a solution...

P.S.: the data is imported into Elasticsearch using Logstash. We access the
data using the Java API (all software on the latest versions).

Cheers,
Marc

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ada9c759-41e0-46ad-9941-3a0f2fb7c122%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2

Off the top of my head, I would use a custom analyzer with a whitespace
tokenizer and a word delimiter filter (preserving the original tokens as
well). Perhaps a shingle filter to create bigrams. Or better yet, a pattern
tokenizer that splits on spaces and parentheses.
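As an illustration only, a minimal settings sketch along those lines (the analyzer and filter names are invented, and the options would need tuning against real data):

```json
{
  "analysis": {
    "analyzer": {
      "log_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": ["my_word_delimiter", "lowercase", "my_shingles"]
      }
    },
    "filter": {
      "my_word_delimiter": {
        "type": "word_delimiter",
        "preserve_original": true,
        "catenate_words": true
      },
      "my_shingles": {
        "type": "shingle",
        "min_shingle_size": 2,
        "max_shingle_size": 2
      }
    }
  }
}
```

With preserve_original enabled, a token like Service=MyMDB.onMessage is kept whole while the word delimiter also emits the subwords, so both the exact string and its parts become searchable.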

Cheers,

Ivan



(Marc-2) #3

Hi Ivan,

Thanks for the help. Now it almost works... :wink:
I have used the following:
"analysis": {
  "analyzer": {
    "msg_excp_analyzer": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filters": ["split-up", "lowercase", "shingle", "ascii-folding"]
    }
  },
  "filter": {
    "split-up": {
      "type": "word_delimiter",
      "preserve_original": "true",
      "catenate_all": "true",
      "type_table": {
        "$": "DIGIT", "%": "DIGIT", ".": "DIGIT", ",": "DIGIT",
        ":": "DIGIT", "/": "DIGIT", "\\": "DIGIT", "=": "DIGIT",
        "&": "DIGIT", "(": "DIGIT", ")": "DIGIT", "<": "DIGIT",
        ">": "DIGIT", "\n": "DIGIT"
      }
    },
    "ascii-folding": {
      "type": "asciifolding",
      "preserve_original": true
    }
  }
}
If the above is wrong or not reasonable, please feel free to criticize!

Now the only thing that does not work is searching for subwords of
concatenations with ".".
Given the log line Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22
(updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ ), I cannot
search for MyMDB or onMessage; only MyMDB.onMessage works.
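(A possible tweak, sketched here and untested: map "." to SUBWORD_DELIM instead of DIGIT, so the word_delimiter filter treats it as a split point and emits MyMDB and onMessage as separate subwords. Note that the stock type_table format is an array of "char => TYPE" strings rather than a map:)

```json
"split-up": {
  "type": "word_delimiter",
  "preserve_original": "true",
  "catenate_all": "true",
  "type_table": [". => SUBWORD_DELIM"]
}
```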

Any more ideas?

Cheers,
Marc



(Ivan Brusic) #4

Use the Analyze API to view which tokens are being generated. Keep it simple
at first (maybe remove the shingles) and build up as you encounter more
edge cases. What kind of query are you using?
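For example (the index name here is hypothetical; the analyzer name comes from the settings earlier in the thread):

```shell
curl -XGET 'localhost:9200/my-index/_analyze?analyzer=msg_excp_analyzer&pretty=1' \
  -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22'
```

This shows exactly which tokens the named analyzer produces for a sample message, which makes it much easier to see which filter in the chain is (or is not) doing its job.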

--
Ivan



(Marc-2) #5

Hi Ivan,

Thanks again. I experimented as you suggested and found a reasonable
combination. Nevertheless, when I now use the Analyze API with an index that
has said analyzer defined via a template, the analyzer does not seem to be
applied:

This is the complete template:
{
  "template": "bogstash-*",
  "settings": {
    "index.number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "msg_excp_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filters": ["word_delimiter", "lowercase", "asciifolding", "shingle", "standard"]
        }
      },
      "filters": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "preserve_original": "true"
        },
        "my_asciifolding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "properties": {
        "@excp": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "msg_excp_analyzer"
        },
        "@msg": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "msg_excp_analyzer"
        }
      }
    }
  }
}
I create the index bogstash-1.
Now I test the following:

curl -XGET 'localhost:9200/bogstash-1/_analyze?analyzer=msg_excp_analyzer&pretty=1' \
  -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
and it returns:
{
"tokens" : [ {
"token" : "Service=MyMDB.onMessage",
"start_offset" : 0,
"end_offset" : 23,
"type" : "word",
"position" : 1
}, {
"token" : "appId=cs",
"start_offset" : 24,
"end_offset" : 32,
"type" : "word",
"position" : 2
}, {
"token" : "Times=Me:22/Total:22",
"start_offset" : 33,
"end_offset" : 53,
"type" : "word",
"position" : 3
}, {
"token" : "(updated",
"start_offset" : 54,
"end_offset" : 62,
"type" : "word",
"position" : 4
}, {
"token" : "attributes=gps_lng:",
"start_offset" : 63,
"end_offset" : 82,
"type" : "word",
"position" : 5
}, {
"token" : "183731222/",
"start_offset" : 83,
"end_offset" : 93,
"type" : "word",
"position" : 6
}, {
"token" : "gps_lat:",
"start_offset" : 94,
"end_offset" : 102,
"type" : "word",
"position" : 7
}, {
"token" : "289309222/",
"start_offset" : 103,
"end_offset" : 113,
"type" : "word",
"position" : 8
}, {
"token" : ")",
"start_offset" : 114,
"end_offset" : 115,
"type" : "word",
"position" : 9
} ]
}
This is the output of a standard analyzer.
Specifying the tokenizer and filters directly in the Analyze API works fine:

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,word_delimiter,shingle,asciifolding,standard&pretty=1' \
  -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
This results in:
{
"tokens" : [ {
"token" : "service",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 1
}, {
"token" : "service mymdb",
"start_offset" : 0,
"end_offset" : 13,
"type" : "shingle",
"position" : 1
}, {
"token" : "mymdb",
"start_offset" : 8,
"end_offset" : 13,
"type" : "word",
"position" : 2
}, {
"token" : "mymdb onmessage",
"start_offset" : 8,
"end_offset" : 23,
"type" : "shingle",
"position" : 2
}, {
"token" : "onmessage",
"start_offset" : 14,
"end_offset" : 23,
"type" : "word",
"position" : 3
}, {
"token" : "onmessage appid",
"start_offset" : 14,
"end_offset" : 29,
"type" : "shingle",
"position" : 3
}, {
"token" : "appid",
"start_offset" : 24,
"end_offset" : 29,
"type" : "word",
"position" : 4
}, {
"token" : "appid cs",
"start_offset" : 24,
"end_offset" : 32,
"type" : "shingle",
"position" : 4
}, {
"token" : "cs",
"start_offset" : 30,
"end_offset" : 32,
"type" : "word",
"position" : 5
}, {
"token" : "cs times",
"start_offset" : 30,
"end_offset" : 38,
"type" : "shingle",
"position" : 5
}, {
"token" : "times",
"start_offset" : 33,
"end_offset" : 38,
"type" : "word",
"position" : 6
}, {
"token" : "times me",
"start_offset" : 33,
"end_offset" : 41,
"type" : "shingle",
"position" : 6
}, {
"token" : "me",
"start_offset" : 39,
"end_offset" : 41,
"type" : "word",
"position" : 7
}, {
"token" : "me 22",
"start_offset" : 39,
"end_offset" : 44,
"type" : "shingle",
"position" : 7
}, {
"token" : "22",
"start_offset" : 42,
"end_offset" : 44,
"type" : "word",
"position" : 8
}, {
"token" : "22 total",
"start_offset" : 42,
"end_offset" : 50,
"type" : "shingle",
"position" : 8
}, {
"token" : "total",
"start_offset" : 45,
"end_offset" : 50,
"type" : "word",
"position" : 9
}, {
"token" : "total 22",
"start_offset" : 45,
"end_offset" : 53,
"type" : "shingle",
"position" : 9
}, {
"token" : "22",
"start_offset" : 51,
"end_offset" : 53,
"type" : "word",
"position" : 10
}, {
"token" : "22 updated",
"start_offset" : 51,
"end_offset" : 62,
"type" : "shingle",
"position" : 10
}, {
"token" : "updated",
"start_offset" : 55,
"end_offset" : 62,
"type" : "word",
"position" : 11
}, {
"token" : "updated attributes",
"start_offset" : 55,
"end_offset" : 73,
"type" : "shingle",
"position" : 11
}, {
"token" : "attributes",
"start_offset" : 63,
"end_offset" : 73,
"type" : "word",
"position" : 12
}, {
"token" : "attributes gps",
"start_offset" : 63,
"end_offset" : 77,
"type" : "shingle",
"position" : 12
}, {
"token" : "gps",
"start_offset" : 74,
"end_offset" : 77,
"type" : "word",
"position" : 13
}, {
"token" : "gps lng",
"start_offset" : 74,
"end_offset" : 81,
"type" : "shingle",
"position" : 13
}, {
"token" : "lng",
"start_offset" : 78,
"end_offset" : 81,
"type" : "word",
"position" : 14
}, {
"token" : "lng 183731222",
"start_offset" : 78,
"end_offset" : 92,
"type" : "shingle",
"position" : 14
}, {
"token" : "183731222",
"start_offset" : 83,
"end_offset" : 92,
"type" : "word",
"position" : 15
}, {
"token" : "183731222 gps",
"start_offset" : 83,
"end_offset" : 97,
"type" : "shingle",
"position" : 15
}, {
"token" : "gps",
"start_offset" : 94,
"end_offset" : 97,
"type" : "word",
"position" : 16
}, {
"token" : "gps lat",
"start_offset" : 94,
"end_offset" : 101,
"type" : "shingle",
"position" : 16
}, {
"token" : "lat",
"start_offset" : 98,
"end_offset" : 101,
"type" : "word",
"position" : 17
}, {
"token" : "lat 289309222",
"start_offset" : 98,
"end_offset" : 112,
"type" : "shingle",
"position" : 17
}, {
"token" : "289309222",
"start_offset" : 103,
"end_offset" : 112,
"type" : "word",
"position" : 18
} ]
}

So it seems the template is not being used?! Any obvious mistakes?

Thanks,
Marc



(Ivan Brusic) #6

That output does not look like something generated by the standard analyzer,
since it contains uppercase letters and various non-word characters such as
'='.

Your two analysis requests will differ since the second one contains the
default word_delimiter filter instead of your custom my_word_delimiter.
What you are trying to achieve is somewhat difficult, but you can get there
if you keep on tweaking. :slight_smile: Try using a pattern tokenizer instead of the
whitespace tokenizer if you want more control over word boundaries.
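For instance, a pattern tokenizer could be declared like this (the tokenizer name and regex are illustrative; the pattern is a Java regex matching the separator characters to split on):

```json
{
  "analysis": {
    "tokenizer": {
      "msg_tokenizer": {
        "type": "pattern",
        "pattern": "[\\s()=/:]+"
      }
    },
    "analyzer": {
      "msg_excp_analyzer": {
        "type": "custom",
        "tokenizer": "msg_tokenizer",
        "filter": ["lowercase"]
      }
    }
  }
}
```

Adding '.' to the character class would also split MyMDB.onMessage into MyMDB and onMessage, at the cost of no longer indexing the dotted form as a single token.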

--
Ivan



(Marc-2) #7

Hi Ivan,

Using a test index and the analyze API, I was now able to create a config
which is fine for me... theoretically.
{
    "template": "logstash-*",
    "settings": {
        "analysis": {
            "filter": {
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "preserve_original": "true"
                }
            },
            "analyzer": {
                "b2v_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["standard",
                        "lowercase",
                        "stop",
                        "my_word_delimiter",
                        "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "_default_": {
            "properties": {
                "excp": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "b2v_analyzer"
                },
                "msg": {
                    "type": "string",
                    "index": "not_analyzed",
                    "analyzer": "b2v_analyzer"
                }
            }
        }
    }
}
The problem now is, as soon as I activate this for the two fields and have
a new logstash index created I cannot use a simpleQueryString query to
retrieve any results.
It won't find anything via the REST api. Using the standard logstash
template and mapping it works fine.
Have you observed anything similar?

Thx
Marc

On Friday, August 29, 2014 6:49:41 PM UTC+2, Ivan Brusic wrote:

That output does not look like something generated from the standard
analyzer since it contains uppercase letters and various non-word
characters such as '='.

Your two analysis requests will differ since the second one contains the
default word_delimiter filter instead of your custom my_word_delimiter.
What you are trying to achieve is somewhat difficult, but you can get there
if you keep on tweaking. :) Try using a pattern tokenizer instead of the
whitespace tokenizer if you want more control over word boundaries.

--
Ivan

On Fri, Aug 29, 2014 at 1:48 AM, Marc <mn.o...@googlemail.com
<javascript:>> wrote:

Hi Ivan,

thanks again. I have tried so and found a reasonable combination.
Nevertheless, when I now try to use the analyze api with an index that has
the said analyzer defined via template it doesn't seem to apply:

This is the complete template:
{
"template": "bogstash-*",
"settings": {
"index.number_of_replicas": 0,
"analysis": {
"analyzer": {
"msg_excp_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filters": ["word_delimiter",
"lowercase",
"asciifolding",
"shingle",
"standard"]
}
},
"filters": {
"my_word_delimiter": {
"type": "word_delimiter",
"preserve_original": "true"
},
"my_asciifolding": {
"type": "asciifolding",
"preserve_original": true
}
}
}
},
"mappings": {
"default": {
"properties": {
"@excp": {
"type": "string",
"index": "analyzed",
"analyzer": "msg_excp_analyzer"
},
"@msg": {
"type": "string",
"index": "analyzed",
"analyzer": "msg_excp_analyzer"
}
}
}
}
}
I create the index bogstash-1.
Now I test the following:
curl -XGET
'localhost:9200/bogstash-1/_analyze?analyzer=msg_excp_analyzer&pretty=1' -d
'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated
attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
and it returns:
{
"tokens" : [ {
"token" : "Service=MyMDB.onMessage",
"start_offset" : 0,
"end_offset" : 23,
"type" : "word",
"position" : 1
}, {
"token" : "appId=cs",
"start_offset" : 24,
"end_offset" : 32,
"type" : "word",
"position" : 2
}, {
"token" : "Times=Me:22/Total:22",
"start_offset" : 33,
"end_offset" : 53,
"type" : "word",
"position" : 3
}, {
"token" : "(updated",
"start_offset" : 54,
"end_offset" : 62,
"type" : "word",
"position" : 4
}, {
"token" : "attributes=gps_lng:",
"start_offset" : 63,
"end_offset" : 82,
"type" : "word",
"position" : 5
}, {
"token" : "183731222/",
"start_offset" : 83,
"end_offset" : 93,
"type" : "word",
"position" : 6
}, {
"token" : "gps_lat:",
"start_offset" : 94,
"end_offset" : 102,
"type" : "word",
"position" : 7
}, {
"token" : "289309222/",
"start_offset" : 103,
"end_offset" : 113,
"type" : "word",
"position" : 8
}, {
"token" : ")",
"start_offset" : 114,
"end_offset" : 115,
"type" : "word",
"position" : 9
} ]
}
Which is the output of a standard analyzer.
Giving the tokenizer and filters in the analyze API directly works fine:
curl -XGET
'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,word_delimiter,shingle,asciifolding,standard&pretty=1'
-d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated
attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
This results in:
{
"tokens" : [ {
"token" : "service",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 1
}, {
"token" : "service mymdb",
"start_offset" : 0,
"end_offset" : 13,
"type" : "shingle",
"position" : 1
}, {
"token" : "mymdb",
"start_offset" : 8,
"end_offset" : 13,
"type" : "word",
"position" : 2
}, {
"token" : "mymdb onmessage",
"start_offset" : 8,
"end_offset" : 23,
"type" : "shingle",
"position" : 2
}, {
"token" : "onmessage",
"start_offset" : 14,
"end_offset" : 23,
"type" : "word",
"position" : 3
}, {
"token" : "onmessage appid",
"start_offset" : 14,
"end_offset" : 29,
"type" : "shingle",
"position" : 3
}, {
"token" : "appid",
"start_offset" : 24,
"end_offset" : 29,
"type" : "word",
"position" : 4
}, {
"token" : "appid cs",
"start_offset" : 24,
"end_offset" : 32,
"type" : "shingle",
"position" : 4
}, {
"token" : "cs",
"start_offset" : 30,
"end_offset" : 32,
"type" : "word",
"position" : 5
}, {
"token" : "cs times",
"start_offset" : 30,
"end_offset" : 38,
"type" : "shingle",
"position" : 5
}, {
"token" : "times",
"start_offset" : 33,
"end_offset" : <span style="color:#06

...





(Marc-2) #9

Hi Ivan,

Using a test index and the analyze API, I was now able to create a config
which is fine for me... theoretically.
{
    "template": "logstash-*",
    "settings": {
        "analysis": {
            "filter": {
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "preserve_original": "true"
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["standard",
                        "lowercase",
                        "stop",
                        "my_word_delimiter",
                        "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "_default_": {
            "properties": {
                "excp": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "my_analyzer"
                },
                "msg": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
}
The problem now is, as soon as I activate this for the two fields and have
a new logstash index created I cannot use a simpleQueryString query to
retrieve any results.
It won't find anything via the REST api. Using the standard logstash
template and mapping it works fine.
Have you observed anything similar?

Thx
Marc



(Ivan Brusic) #10

Hard to say without looking at your query, but perhaps you are experiencing
query parser issues. The query string query uses the standard query parser,
which may not tokenize terms the way your custom tokenizer does.
Try using match queries, which do not use the query parser, to see if that
"fixes" the problem. Of course, you will not have the query syntax at your
disposal, but you can find workarounds.
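For example, the same search expressed both ways (a sketch only; the field name msg is taken from the template earlier in the thread):

```python
import json

# A query_string query runs the input through the query-string parser
# before analysis; a match query analyzes "MyMDB.onMessage" directly
# with the msg field's configured analyzer.
query_string = {
    "query": {
        "query_string": {
            "default_field": "msg",
            "query": "MyMDB.onMessage"
        }
    }
}
match = {
    "query": {
        "match": {
            "msg": "MyMDB.onMessage"
        }
    }
}
print(json.dumps(match, indent=2))
```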

--
Ivan



(Marc-2) #11

Hi Ivan,

I have resolved the problem. It works fine now. The template was wrong. The
simpleQueryString works fine now too.
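For reference, a simple_query_string request body of the kind described (an illustrative sketch; the thread does not show the actual final query, and msg is the field from the template above):

```python
import json

# Hypothetical body: with default_operator "and", the query
# "appId=cs Times=Me:22" requires both terms to match.
body = {
    "query": {
        "simple_query_string": {
            "query": "appId=cs Times=Me:22",
            "fields": ["msg"],
            "default_operator": "and"
        }
    }
}
print(json.dumps(body, indent=2))
```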

Cheers,
Marc



(system) #12