What happens stemming-wise in this situation?

mrodent · April 6, 2024, 5:42pm

I have implemented a simple module that parses the .docx (Word) files in a directory tree, splits each one into multiple overlapping 10-line "Lucene documents", and enters them into an index using the _bulk endpoint. For now, searching is done with a simple query.

I have used stemmed fields a little in the past, but I'm trying to understand more about how they work. Suppose I now add the following mapping. (NB it is added to a new, as yet empty index, which I have just created with a name based on a particular regex pattern after examining the currently active index used by an alias.)

mappings = {
    'properties': {
        'text_content': {
            'type': 'text',
            'term_vector': 'with_positions_offsets',
            'fields': {
                'stemmed': {
                    'type': 'text',
                    'analyzer': 'english',
                    'term_vector': 'with_positions_offsets',
                }
            }
        },
    }
}
headers = {'Content-type': 'application/json'}
success, deliverable = process_json_request(f'{ES_URL}/{new_index_name}/_mapping', 'put', data=json.dumps(mappings), headers=headers)

My code for running the queries is currently pretty much the simplest one I could possibly devise:

simple_query = {
    'query': {
        'simple_query_string': {'query': query_string},
    },
}
headers = { 'Content-type': 'application/json' }
success, deliverable = utilities.process_json_request(f'{ES_URL}/{ALIAS_NAME}/_search', data=json.dumps(simple_query), headers=headers)

What I'm trying to understand is this: will this type of query make use of the stemmed field, or of the unstemmed text?
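
One way to take the guesswork out of this is to name the target field explicitly. The sketch below assumes the mapping above and relies on the `fields` parameter of `simple_query_string`. As far as I can tell from the docs, when `fields` is omitted the query falls back to the `index.query.default_field` setting, whose default `*` covers every eligible field, so both `text_content` and `text_content.stemmed` would be searched, each with its own analyzer. `build_simple_query` is a hypothetical helper name, not part of the module described here.

```python
import json

def build_simple_query(query_string, fields=None):
    """Build a simple_query_string search body, optionally restricted to given fields."""
    inner = {'query': query_string}
    if fields:
        # e.g. ['text_content.stemmed'] to search only the stemmed sub-field
        inner['fields'] = list(fields)
    return {'query': {'simple_query_string': inner}}

# Same query string, aimed at one field at a time
stemmed_only = build_simple_query('linux sets', fields=['text_content.stemmed'])
unstemmed_only = build_simple_query('linux sets', fields=['text_content'])
print(json.dumps(stemmed_only, indent=2))
```

Running the same query string against each field separately should make it easier to see which field is responsible for which hits.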

The funny thing is, if I do a (random) query string such as "linux set", it seems to be delivering results in which stemming has indeed been applied, e.g. where I can see a word such as "setting" or "sets" is included in the delivered Lucene documents.

But then I do another query string search such as "linux sets" ... and it returns another set of results, apparently also with stemming. And another set of results when I do "linux setting"... etc.

I would have thought that, if the query string is stemmed before the search happens, all three queries would return pretty much the same results: I would expect each of them to be reduced to the stemmed form "linux set".
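
Whether the three query strings really reduce to the same terms can be checked with the `_analyze` API, which accepts a `field` (so that the analyzer is taken from that field's mapping) and a `text`. The following is a minimal sketch that only builds the requests. `build_analyze_request` is a hypothetical helper, and the URL and index name in the usage lines are placeholders rather than values from my setup.

```python
import json

def build_analyze_request(es_url, index_name, field, text):
    """Return (url, json_body) for an _analyze call using the analyzer mapped to `field`."""
    url = f'{es_url}/{index_name}/_analyze'
    body = {'field': field, 'text': text}
    return url, json.dumps(body)

# Placeholder URL and index name, for illustration only
for query_string in ('linux set', 'linux sets', 'linux setting'):
    url, body = build_analyze_request(
        'http://localhost:9200', 'my_index', 'text_content.stemmed', query_string)
    print(url, body)
```

Sending each body as a POST, once for `text_content.stemmed` and once for `text_content`, and comparing the `tokens` in the responses should show whether the stemmed forms coincide while the unstemmed ones differ.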

In previous experiments I have moved straight on to applying a particular analyzer to my query string explicitly ... but I'm trying to understand better what exactly happens in this situation. Maybe it's explained somewhere in the docs?
