Can anyone please explain the mapping in Elasticsearch?


(dark_shadow) #1

Hi,

I have an existing mapping, but I don't fully understand it.

curl -XPUT 'http://localhost:9200/auto_index/' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "analysis" : {
        "analyzer" : {
          "str_search_analyzer" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase","asciifolding","suggestions_shingle","edgengram"]
          },
          "str_index_analyzer" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase","asciifolding","suggestions_shingle","edgengram"]
          }
        },
        "filter" : {
          "suggestions_shingle": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 5
          },
          "edgengram" : {
            "type" : "edgeNGram",
            "min_gram" : 2,
            "max_gram" : 30,
            "side" : "front"
          }
        }
      },
      "similarity" : {
        "index": {
          "type": "default"
        },
        "search": {
          "type": "default"
        }
      }
    }
  }
}'

curl -XPUT 'localhost:9200/auto_index/autocomplete/_mapping' -d '{
  "autocomplete": {
    "_boost" : {
      "name" : "po",
      "null_value" : 4.0
    },
    "properties": {
      "ad": {
        "type": "string",
        "search_analyzer" : "str_search_analyzer",
        "index_analyzer" : "str_index_analyzer",
        "omit_norms": "true",
        "similarity": "index"
      },
      "category": {
        "type": "string",
        "include_in_all" : false
      },
      "cn": {
        "type": "string",
        "search_analyzer" : "str_search_analyzer",
        "index_analyzer" : "str_index_analyzer",
        "omit_norms": "true",
        "similarity": "index"
      },
      "ctype": {
        "type": "string",
        "search_analyzer" : "keyword",
        "index_analyzer" : "keyword",
        "omit_norms": "true",
        "similarity": "index"
      },
      "eid": {
        "type": "string",
        "include_in_all" : false
      },
      "st": {
        "type": "string",
        "search_analyzer" : "str_search_analyzer",
        "index_analyzer" : "str_index_analyzer",
        "omit_norms": "true",
        "similarity": "index"
      },
      "co": {
        "type": "string",
        "include_in_all" : false
      },
      "st": {
        "type": "string",
        "search_analyzer" : "str_search_analyzer",
        "index_analyzer" : "str_index_analyzer",
        "omit_norms": "true",
        "similarity": "index"
      },
      "co": {
        "type": "string",
        "search_analyzer" : "str_search_analyzer",
        "index_analyzer" : "str_index_analyzer",
        "omit_norms": "true",
        "similarity": "index"
      },
      "po": {
        "type": "double",
        "boost": 4.0
      },
      "en": {
        "type": "boolean"
      },
      "_oid": {
        "type": "long"
      },
      "text": {
        "type": "string",
        "search_analyzer" : "str_search_analyzer",
        "index_analyzer" : "str_index_analyzer",
        "omit_norms": "true",
        "similarity": "index"
      },
      "url": {
        "type": "string"
      }
    }
  }
}'

I don't understand exactly how the index and search analyzers work.
Let's say I have the following documents:

{text: hotels in hosur}
{text: hotels in innsburg}
{text: hotels in mp}
{text: hotels in ink}
{text: hotels in ranchi}

Now if I query for 'hotels in', I get "hotels in hosur" at the top. Is it
because that document has more occurrences of "ho"?

Can anyone please explain, with a sample sentence, exactly how my
query string gets analyzed?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b02e1c23-18d2-4adc-b50b-8eecc00fac22%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #2

Coder,

The best way to understand what an analyzer is doing is by using the
_analyze api. For example if you do something like this:

curl -XGET 'http://localhost:9200/auto_index/_analyze?analyzer=str_search_analyzer&pretty&text=hotels%20in%20hosur'

It will tell you how that text is analyzed. In your mapping, the analyzer
applies the suggestions_shingle and edgengram filters. suggestions_shingle is a
shingle token filter
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html),
so for example:

"hotels in hosur" becomes "hotels", "in", "hosur", "hotels in", "hotels in hosur", "in hosur"

Then edgengram, an edge ngram token filter
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html),
is applied to each of those tokens, so for example:

"hotels" becomes "ho", "hot", "hote", etc...
"in" becomes "in"
"hosur" becomes "ho", "hos", "hosu", etc...
etc...
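
To make the two steps above concrete, here is a rough pure-Python imitation of the filter chain. It is only a sketch, not Lucene's implementation: shingles, edge_ngrams and analyze are hypothetical helper names, whitespace splitting stands in for the standard tokenizer, asciifolding is skipped, and the real token order and positions are not reproduced. The defaults mirror the settings posted in this thread (shingle sizes 2 to 5, grams of 2 to 30 characters).

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Return word n-grams (shingles), each joined by a single space."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])  # keep the original single token
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # not enough tokens left for a shingle of this size
            out.append(" ".join(tokens[start:start + size]))
    return out


def edge_ngrams(token, min_gram=2, max_gram=30):
    """Return the front prefixes of one token, min_gram..max_gram characters long."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Assumed chain: lowercase + whitespace split, then shingles, then edge n-grams."""
    tokens = text.lower().split()
    terms = []
    for tok in shingles(tokens):
        terms.extend(edge_ngrams(tok))
    return terms


terms = analyze("hotels in hosur")
print(shingles("hotels in hosur".split()))
# ['hotels', 'hotels in', 'hotels in hosur', 'in', 'in hosur', 'hosur']
print(Counter(terms)["ho"])  # 4: one "ho" for each shingle that starts with "ho"
```

The authoritative answer for a real index is still whatever the _analyze call returns; the sketch only helps to see why the same prefix shows up several times.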



(dark_shadow) #3

Binh,

I have a question about the above explanation. You mentioned that after
suggestions_shingle, "hotels in hosur" becomes "hotels", "in", "hosur",
"hotels in", "hotels in hosur", "in hosur". Shouldn't "hotels", "in" and
"hosur" be removed, since my min_shingle_size is 2? Or do the original tokens
always stay?

Also, after the edgengram filter I'm getting "ho" many times. Will my final
output contain only one "ho", or several? I thought Elasticsearch applies the
"unique" token filter by default after every filter; please correct me if I'm
wrong. How can I use the unique token filter to remove the repeated tokens
after the final processing step? I tried adding "unique" after "edgengram",
but it is not working.



(Binh Ly) #4

Coder,

  1. For the shingle token filter, there is a property called output_unigrams
    which is true by default. It outputs the single tokens ("hotels", "in",
    etc.) in addition to the shingle tokens. If you set it to false, they
    are removed.

  2. Correct, it will output "ho" many times, because an edge ngram is applied
    to each token that comes out of the shingle filter. I'm not familiar
    with the unique token filter, but I will check and get back to you.
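
Both points can be checked with a small pure-Python sketch. It is not Lucene's code: shingles and edge_ngrams are hypothetical stand-ins for the two filters, and unique is a hypothetical order-preserving de-duplication helper, which is roughly what a unique token filter placed last in the filter list is meant to do.

```python
def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Word n-grams joined by spaces; single tokens are kept only if output_unigrams."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break
            out.append(" ".join(tokens[start:start + size]))
    return out


def edge_ngrams(token, min_gram=2, max_gram=30):
    """Front prefixes of a token, from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def unique(terms):
    """Drop repeated terms while keeping the first occurrence of each."""
    seen = set()
    out = []
    for term in terms:
        if term not in seen:
            seen.add(term)
            out.append(term)
    return out


tokens = "hotels in hosur".split()

# Point 1: with output_unigrams disabled only the multi-word shingles remain.
no_unigrams = shingles(tokens, output_unigrams=False)
print(no_unigrams)  # ['hotels in', 'hotels in hosur', 'in hosur']

# Point 2: every shingle is edge-n-grammed separately, so "ho" still repeats.
grams = [gram for shingle in no_unigrams for gram in edge_ngrams(shingle)]
print(grams.count("ho"))          # 2
print(unique(grams).count("ho"))  # 1
```

Note that turning off output_unigrams alone does not remove the repeats; only a de-duplication step at the end of the chain does.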



(dark_shadow) #5

Binh,

I have one more question, about how index-time tokens are matched against
query-time tokens. Can you please tell me how Elasticsearch does this
matching? Does it match only the token text, or does it also match other
information such as start_offset, end_offset, type and position?

Thanks



(dark_shadow) #6

Also, can any conditional token filtering be done in Elasticsearch? Let's
say I want some of my documents to be analyzed with a certain set of token
filters, and others to be indexed using a different set of token filters.

Thanks



(Binh Ly) #7

Coder,

In general, full-text searches match tokens. If you're doing phrase or span
queries, it will also look at positions. If you're doing range queries,
like date-range, or numeric range, then it will also look at the actual
types.
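
As an illustration of the difference between matching tokens and matching positions, here is a toy positional index in Python. It is a simplified model under stated assumptions, not how Lucene stores or scores anything: build_index, match_any and match_phrase are hypothetical helpers, the text is split on whitespace only, and document 2 is a made-up example added for contrast.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a toy positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_any(index, query):
    """Token-only match: documents containing at least one query term."""
    hits = set()
    for term in query.lower().split():
        hits.update(index.get(term, {}))
    return hits


def match_phrase(index, query):
    """Phrase-style match: the query terms must sit at consecutive positions."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set()
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + i in index.get(term, {}).get(doc_id, [])
                   for i, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "hotels in hosur",
    2: "hosur has hotels",   # hypothetical document, not from the thread
    3: "hotels in ranchi",
}
index = build_index(docs)

print(sorted(match_any(index, "hotels hosur")))  # [1, 2, 3]
print(sorted(match_phrase(index, "hotels in")))  # [1, 3]
print(sorted(match_phrase(index, "in hosur")))   # [1]
```

Offsets play no part in either function here; only the term text and, for the phrase case, the position are compared.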



(Binh Ly) #8

Analysis is done on a per-field basis. I do not believe you can
conditionally change the analyzer for a single field depending on the
nature of your data. However, you may be interested in the multi_field type,
which allows you to define multiple analyzers for a single field.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
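
For reference, a multi_field mapping body could be assembled roughly like this. It is a sketch that assumes an Elasticsearch release from the era of this thread, where the multi_field type is available. The helper multi_field_mapping and the sub-field name "exact" are hypothetical; the field name "text", the type name "autocomplete" and the analyzer names come from the mapping posted earlier in the thread. The script only prints JSON and does not contact a cluster.

```python
import json


def multi_field_mapping(field, default_def, extra_fields):
    """Build a multi_field definition; the default sub-field shares the field's name."""
    fields = {field: default_def}
    fields.update(extra_fields)
    return {field: {"type": "multi_field", "fields": fields}}


properties = multi_field_mapping(
    "text",
    {
        "type": "string",
        "index_analyzer": "str_index_analyzer",
        "search_analyzer": "str_search_analyzer",
    },
    # Hypothetical second view of the same value, analyzed as one keyword token.
    {"exact": {"type": "string", "analyzer": "keyword"}},
)

body = {"autocomplete": {"properties": properties}}
print(json.dumps(body, indent=2))
```

With such a mapping the autocomplete-style view would be queried as "text" and the keyword view as "text.exact".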



(dark_shadow) #9

Binh,

I want to use conditional analysis for a whole set of documents, not for a
single field in those documents. That is, I want to analyze my type A
documents with one set of tokenizers and filters, and the rest of the
documents with a different set. Is there any way I can do that?

Thanks



(Binh Ly) #10

You'll probably want two different types with different mappings. So for
example some documents will go under "http://localhost:9200/yourdata/typea"
and other documents will go under "http://localhost:9200/yourdata/typeb".
Each one of these types can have its own independent mapping. You can query
all your documents using "http://localhost:9200/yourdata/_search".
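
A sketch of what the two per-type mapping calls might look like, generated from Python so the JSON stays well formed. Assumptions: the field name "text" and the analyzer choices are illustrative only, mapping_command is a hypothetical helper, and the custom analyzers would already have to be defined in the settings of the "yourdata" index. The script prints curl commands and does not contact a cluster.

```python
import json
import shlex

BASE = "http://localhost:9200/yourdata"


def mapping_command(type_name, index_analyzer, search_analyzer):
    """Return a curl command that would PUT a one-field mapping for one type."""
    body = {
        type_name: {
            "properties": {
                "text": {
                    "type": "string",
                    "index_analyzer": index_analyzer,
                    "search_analyzer": search_analyzer,
                }
            }
        }
    }
    url = f"{BASE}/{type_name}/_mapping"
    return "curl -XPUT " + shlex.quote(url) + " -d " + shlex.quote(json.dumps(body))


commands = [
    # typea: the autocomplete-style analyzers from earlier in the thread
    mapping_command("typea", "str_index_analyzer", "str_search_analyzer"),
    # typeb: the built-in standard analyzer
    mapping_command("typeb", "standard", "standard"),
]
for command in commands:
    print(command)
```

One caution: when both types use the same field name with different analyzers, a search across the whole index may apply only one of the search analyzers, so giving the fields distinct names can be the safer choice.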


