Search query doesn't find the token

I have the sentence "Is with any job to take courses" in the title field. If I query with the token "cours", nothing matches. I have to query with the token "courses", and then it works. I use the English analyzer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

Take a look:
GET enarticles/_analyze
{
  "analyzer" : "englishtoken",
  "text" : "courses"
}
Output:
{
  "tokens" : [
    {
      "token" : "cours",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

GET enarticles/_search
{
  "query": {
    "match" : {
      "title" : {
        "query" : "cours"
      }
    }
  }
}
Output:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
Nothing is matched!

I want to use the analyzer's token output to search the database whenever a token exists, so this has to work 100% of the time. Any ideas?

Please share your full setup, as it's not clear here what the englishtoken analyzer consists of in terms of tokenizer and token filters. This will prevent guessing on our side 🙂

This is the full analyzer for enarticles!
$params = [
    'index' => 'enarticles',
    'body' => [
        'settings' => [
            'number_of_shards' => 3,
            'number_of_replicas' => 2,
            'analysis' => [
                'filter' => [
                    'english_stop' => [
                        'type' => 'stop',
                        'stopwords' => '_english_'
                    ],
                    'english_keywords' => [
                        'type' => 'keyword_marker',
                        'keywords' => ['example']
                    ],
                    'english_stemmer' => [
                        'type' => 'stemmer',
                        'language' => 'english'
                    ],
                    'english_possessive_stemmer' => [
                        'type' => 'stemmer',
                        'language' => 'possessive_english'
                    ]
                ],
                'analyzer' => [
                    'englishtoken' => [
                        'tokenizer' => 'standard',
                        'filter' => [
                            'english_stop',
                            'english_keywords',
                            'english_stemmer',
                            'english_possessive_stemmer'
                        ]
                    ]
                ]
            ]
        ],
        'mappings' => [
            'properties' => [
                'etext' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'gtext' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'html' => [
                    'type' => 'text',
                    'analyzer' => 'standard'
                ],
                'title' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'keywords' => [
                    'type' => 'text',
                    'analyzer' => 'standard'
                ],
                'url' => [
                    'type' => 'text',
                    'analyzer' => 'standard'
                ],
                'date' => [
                    'type' => 'date'
                ],
                'category' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'author' => [
                    'type' => 'text',
                    'analyzer' => 'englishtoken'
                ],
                'sentiment' => [
                    'type' => 'float'
                ],
                'ranking' => [
                    'type' => 'float'
                ]
            ]
        ]
    ]
];
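For completeness, this $params array would normally be applied with the official elasticsearch-php client roughly like this (a minimal sketch; the host is a placeholder and I'm assuming the client was installed via Composer):

<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Placeholder host; point this at your own cluster.
$client = ClientBuilder::create()
    ->setHosts(['localhost:9200'])
    ->build();

// $params is the index definition shown above.
$response = $client->indices()->create($params);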

Just quickly chiming in that what you're running into is the challenge of how stemming treats plurals. The term courses, as you've demonstrated, emits the token cours.

But then, the token that's emitted for the term cours is actually cour, because the stemmer sees the term as a plural rather than as the stem of course.
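You can confirm this by running the same _analyze call against the term cours; the output below is roughly what I'd expect (offsets assumed), but run it against your own index to verify:

GET enarticles/_analyze
{
  "analyzer" : "englishtoken",
  "text" : "cours"
}
Expected output (approximately):
{
  "tokens" : [
    {
      "token" : "cour",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}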

There's a GitHub issue open with a great discussion and analysis of the topic.

I'm not certain what the prescribed approach is to work around this; query fuzziness comes to mind, but applied systematically to your queries it could increase the number of false positives you retrieve.
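For illustration, a fuzzy match on the title field would look something like this (just a sketch; "AUTO" is the usual starting point, and whether the extra matches are acceptable is for you to judge):

GET enarticles/_search
{
  "query": {
    "match" : {
      "title" : {
        "query" : "cours",
        "fuzziness" : "AUTO"
      }
    }
  }
}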

Solved. Thank you so much!
