Using shingle

Petr_Jansky · February 20, 2015, 2:29pm

Hi there,

I've tried to use the shingle filter to get bigrams and trigrams:

curl -X POST 'localhost:9200/idnes/' -d '{
"settings" : {
"analysis" : {
"filter": {
"czech_stop": {
"type": "stop",
"stopwords": "czech",
"ignore_case": "true",
"remove_trailing": "false"
},
"czech_stop_ngram": {
"type": "stop",
"stopwords" : ["a", "i", "k", "o", "s", "u", "v", "z", "do",
"co", "by", "do", "je", "mu", "mi", "mě", "mně", "mne", "na", "ne", "ní",
"si", "se", "ta", "to", "té", "ti", "ty", "už", "ve", "za", "že", "aby",
"ani", "ale", "byl", "jak", "jen", "jde", "kdo", "kdy", "kde", "něm",
"nich", "něj", "než", "pro", "tak", "ten", "tam", "tady", "těch", "jsou",
"jsem", "není", "nyní", "nimi", "jako", "jaká", "jaké", "jaká", "právě",
"který", "která", "které", "jeho", "její", "nebo", "jako", "toho", "kdyby",
"takový", "taková", "takové", "czech" ],
"ignore_case": "true",
"remove_trailing": "false"
},
"czech_keywords": {
"type": "keyword_marker",
"keywords": ["že"]
},
"czech_stemmer": {
"type": "stemmer",
"language": "czech"
},
"shingle2_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
},
"shingle3_filter": {
"type": "shingle",
"min_shingle_size": 3,
"max_shingle_size": 3,
"output_unigrams": false
}
},
"analyzer": {
....
"shingle2s_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "czech_stop_ngram",
"shingle2_filter"]
},
"shingle3s_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["czech_stop_ngram", "shingle3_filter" ]
}
}
}
},

"mappings" : {
"article" : {
"_id" : {
"path" : "reference"
},

"properties" : {
    .....
    "content2"   : { "type":"string", "analyzer": "shingle2_analyzer"},
    "content3"   : { "type":"string", "analyzer": "shingle3_analyzer"},
    "content4"   : { "type":"string", "analyzer": "shingle2s_analyzer"},
    "content5"   : { "type":"string", "analyzer": "shingle3s_analyzer"},
    ......

If I test my analyzers by calling:

curl -X GET 'localhost:9200/idnes/_analyze?analyzer=shingle3s_analyzer&pretty' -d 'a e
i o u s k z na ke ze nad pod za před Norská strana zatím dostatečně
nevyhodnotila, jak citlivou otázkou je pro Česko případ synů Evy
Michalákové. Tak popisuje současnou situaci premiér Bohuslav Sobotka. Ten
již dostal odpověď na dopis od premiérky Norska Erny Solbergové. S obecnými
odpověďmi není spokojen a zvažuje do Norska další psaní.' | grep "token"

It works fine. The results contain only trigrams:
"tokens" : [ {
"token" : "_ e ",
"token" : "e _ ",
"token" : " _ Norská",
"token" : " Norská ",
"token" : "Norská _ zatím",
"token" : " zatím dostatečně",
"token" : "zatím dostatečně nevyhodnotila",
"token" : "dostatečně nevyhodnotila ",
"token" : "nevyhodnotila _ citlivou",
"token" : " citlivou otázkou",
"token" : "citlivou otázkou _",
"token" : "otázkou _ _",
....
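
The "_" entries in this listing look like the shingle filter's filler token: the stop filter removes a word but leaves its position empty, and the shingle filter fills that gap when it builds a shingle. A minimal Python sketch of that behaviour follows. It is a simplified model, not Lucene's implementation, and the function name `shingles` and the `None`-for-gap convention are assumptions made for illustration; the real filter may differ in edge cases such as long runs of removed words.

```python
def shingles(positions, size, filler="_"):
    """Build fixed-size shingles over token positions.

    `positions` holds one entry per position; None marks a position whose
    token was removed by a stop filter.
    """
    out = []
    for start in range(len(positions) - size + 1):
        window = positions[start:start + size]
        if all(tok is None for tok in window):
            continue  # skip windows made only of gaps
        out.append(" ".join(filler if tok is None else tok for tok in window))
    return out


# "jak" is in the stopword list above, so its position is an empty gap (None).
stream = ["zatím", "dostatečně", "nevyhodnotila", None, "citlivou", "otázkou"]
trigrams = shingles(stream, 3)
for t in trigrams:
    print(t)
```

Under this model, the removed "jak" shows up as "_" in every trigram that spans its position, for example "nevyhodnotila _ citlivou".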

But there is an issue when I use it on indexed data:
POST idnes/_search?pretty=true
{
"query": {
"match": {
"content_type": "Article"
}
},
"facets" : {
"tag" : {
"terms" : {
"fields" : ["content5"],
"size" : 20
}
}
}
}

The response also contains unigrams:
"facets": {
"tag": {
"_type": "terms",
"missing": 452,
"total": 926077,
"other": 762645,
"terms": [
{
"term": "a",
"count": 18150
},
{
"term": "to",
"count": 17131
},
{
"term": "je",
"count": 14090
},
{
"term": "se",
"count": 13621
},
{
"term": "na",
"count": 12285
},
......
{
"term": "korun _ ",
"count": 551
},
{
"term": " _ případě",
"count": 499
},
{
"term": "zobrazení videa musíte",
"count": 449
}
.....

Why does this happen?
Is there any way to omit the "_" left by removed stopwords, other than the approach described at http://www.elasticsearch.org/blog/searching-with-shingles/,
which doesn't work with Lucene 4.4+?

Thanks
Petr

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0d2aa0fb-2a12-404d-bdf4-bb09b970cb5c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Petr_Jansky · March 17, 2015, 2:00pm

No one?

Petr
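
One option that may help with the "_" question: the shingle token filter documents a `filler_token` parameter, which defaults to an underscore. The sketch below builds a variant of the `shingle3_filter` settings from the first post with an empty filler and prints the JSON body. It assumes the Elasticsearch version in use supports `filler_token`, which is not confirmed anywhere in this thread.

```python
import json

# Variant of the thread's shingle3_filter with an empty filler token.
shingle3_filter = {
    "type": "shingle",
    "min_shingle_size": 3,
    "max_shingle_size": 3,
    "output_unigrams": False,
    # Assumption: the running Elasticsearch version supports this parameter.
    "filler_token": "",
}

settings = {
    "settings": {"analysis": {"filter": {"shingle3_filter": shingle3_filter}}}
}

# ensure_ascii=False keeps any Czech characters readable in the printed body.
body = json.dumps(settings, ensure_ascii=False, indent=2)
print(body)
```

An empty filler removes the underscore but not the separators around it, so a shingle that spans a removed stopword would still contain a doubled space. The printed body covers only the filter definition and would have to be merged into the full index settings.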

