Search query doesn't find the token

I have the sentence "Is with any job to take courses" in the title field. If I query with the token "cours", nothing matches. I have to query with the token "courses", and then it works. I use the English analyzer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

Take a look:
GET enarticles/_analyze
{
  "analyzer" : "englishtoken",
  "text" : "courses"
}
Output:
{
  "tokens" : [
    {
      "token" : "cours",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

GET enarticles/_search
{
  "query": {
    "match" : {
      "title" : {
        "query" : "cours"
      }
    }
  }
}
Output:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
Nothing is matched!

I want to use the analyzer's token output to search the database whenever a token exists, so this has to work 100% of the time. Any ideas?

Please share your full setup, as it's not clear here what the englishtoken analyzer consists of in terms of tokenizer and token filters. This will prevent guessing on our side 🙂

This is the full analyzer for enarticles!
$params = [
    'index' => 'enarticles',
    'body' => [
        'settings' => [
            'number_of_shards' => 3,
            'number_of_replicas' => 2,
            'analysis' => [
                'filter' => [
                    'english_stop' => [
                        'type' => 'stop',
                        'stopwords' => '_english_'
                    ],
                    'english_keywords' => [
                        'type' => 'keyword_marker',
                        'keywords' => ['example']
                    ],
                    'english_stemmer' => [
                        'type' => 'stemmer',
                        'language' => 'english'
                    ],
                    'english_possessive_stemmer' => [
                        'type' => 'stemmer',
                        'language' => 'possessive_english'
                    ]
                ],
                'analyzer' => [
                    'englishtoken' => [
                        'tokenizer' => 'standard',
                        'filter' => [
                            'english_stop',
                            'english_keywords',
                            'english_stemmer',
                            'english_possessive_stemmer'
                        ]
                    ]
                ]
            ]
        ],
        'mappings' => [
            'properties' => [
                'etext' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'gtext' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'html' => [
                    'type' => 'text',
                    'analyzer' => 'standard'
                ],
                'title' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'keywords' => [
                    'type' => 'text',
                    'analyzer' => 'standard'
                ],
                'url' => [
                    'type' => 'text',
                    'analyzer' => 'standard'
                ],
                'date' => [
                    'type' => 'date'
                ],
                'category' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'author' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'sentiment' => [
                    'type' => 'float'
                ],
                'ranking' => [
                    'type' => 'float'
                ]
            ]
        ]
    ]
];
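For completeness, this $params array would normally be applied with the official elasticsearch-php client roughly like this (a minimal sketch; the host is a placeholder and I'm assuming the client was installed via Composer):

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Placeholder host; point this at your own cluster.
$client = ClientBuilder::create()
    ->setHosts(['localhost:9200'])
    ->build();

// $params is the index definition shown above.
$response = $client->indices()->create($params);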

Just quickly chiming in that what you're running into is the challenge of how stemming treats plurals. The term courses, as you've demonstrated, emits the token cours.

But then, the token that's emitted for the term cours is actually cour, because the stemmer sees the term as a plural rather than as the stem of course.
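You can confirm this by running the same _analyze call against the term cours; the output below is roughly what I'd expect (offsets assumed), but run it against your own index to verify:

GET enarticles/_analyze
{
  "analyzer" : "englishtoken",
  "text" : "cours"
}
Expected output (approximately):
{
  "tokens" : [
    {
      "token" : "cour",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}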

There's a GitHub issue open with a great discussion and analysis of the topic.

I'm not certain what the prescribed approach is to work around this; query fuzziness comes to mind, but applied systematically to your queries it could increase the number of false positives you retrieve.
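For illustration, a fuzzy match on the title field would look something like this (just a sketch; "AUTO" is the usual starting point, and whether the extra matches are acceptable is for you to judge):

GET enarticles/_search
{
  "query": {
    "match" : {
      "title" : {
        "query" : "cours",
        "fuzziness" : "AUTO"
      }
    }
  }
}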

Solved. Thank you so much!
