I have implemented a simple module that parses the .docx (Word) files in a directory tree, splits each one into multiple overlapping 10-line "Lucene documents", and enters them into an index using the _bulk endpoint. Currently, a simple query search is then run against that index.
I have used stemmed fields a little in the past, but I'm trying to understand more about how things actually work. Suppose I now add a mapping as follows (NB this is added to a new, as yet empty index, which I've just created with a name based on a particular regex pattern, after examining the current active index used by an alias):
mappings = {
    'properties': {
        'text_content': {
            'type': 'text',
            'term_vector': 'with_positions_offsets',
            'fields': {
                'stemmed': {
                    'type': 'text',
                    'analyzer': 'english',
                    'term_vector': 'with_positions_offsets',
                }
            }
        },
    }
}
headers = {'Content-type': 'application/json'}
success, deliverable = process_json_request(
    f'{ES_URL}/{new_index_name}/_mapping', 'put',
    data=json.dumps(mappings), headers=headers)
My code for running the queries is currently about the simplest I could devise:
simple_query = {
    'query': {
        'simple_query_string': {
            'query': query_string
        }
    }
}
headers = {'Content-type': 'application/json'}
success, deliverable = utilities.process_json_request(
    f'{ES_URL}/{ALIAS_NAME}/_search',
    data=json.dumps(simple_query), headers=headers)
... I'm trying to understand: with this current type of query, will it make use of the stemmed field, or the unstemmed text?
The funny thing is, if I do a (random) query string such as "linux set", it seems to be delivering results in which stemming has indeed been applied, e.g. where I can see a word such as "setting" or "sets" is included in the delivered Lucene documents.
But then I do another query string search such as "linux sets" ... and it returns another set of results, apparently also with stemming. And another set of results when I do "linux setting"... etc.
I would have thought that, if the query string is stemmed before the search happens, the results would be identical for all three: I would expect each of these searches to be run as (stemmed) "linux set".
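One way to check that expectation directly is to ask Elasticsearch how each field's analyzer tokenizes the three query strings, using the _analyze API with a "field" parameter. The following is only a minimal sketch: the URL http://localhost:9200 and the index name my_index are placeholders I've made up for illustration, and build_analyze_request and fetch_tokens are hypothetical helpers, not part of my module. Nothing is sent over the network unless fetch_tokens is called.

```python
import json
import urllib.request

ES_URL = 'http://localhost:9200'   # placeholder, adjust to your cluster
INDEX_NAME = 'my_index'            # placeholder, adjust to your index or alias


def build_analyze_request(es_url, index_name, field, text):
    """Return the (url, body) pair for an _analyze call using a field's analyzer."""
    url = f'{es_url}/{index_name}/_analyze'
    body = {'field': field, 'text': text}
    return url, body


def fetch_tokens(es_url, index_name, field, text):
    """Send the _analyze request and return the list of token strings."""
    url, body = build_analyze_request(es_url, index_name, field, text)
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token['token'] for token in payload['tokens']]


# Build (but do not send) one request per field and query variant.
planned_requests = [
    build_analyze_request(ES_URL, INDEX_NAME, field, text)
    for field in ('text_content', 'text_content.stemmed')
    for text in ('linux set', 'linux sets', 'linux setting')
]
for url, body in planned_requests:
    print(url, json.dumps(body))
```

Calling fetch_tokens(ES_URL, INDEX_NAME, 'text_content.stemmed', 'linux sets') against a live index would then show the tokens that the stemmed sub-field actually produces, which can be compared with those of the plain text_content field.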
In previous experiments I have moved straight on to applying a particular analyzer to my query string explicitly ... but I'm trying to understand better what exactly happens in this situation. Maybe it's explained somewhere in the docs?
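For reference, the explicit approach I mention might look roughly like the sketch below: the same simple_query_string query, but with the "fields" parameter (and optionally the "analyzer" parameter) spelled out, so that the stemmed and unstemmed fields can be queried separately and the results compared. build_field_query is a hypothetical helper name. My understanding from the documentation, which I have not verified against my own index, is that when "fields" is omitted the query falls back to the index.query.default_field setting, whose default is "*".

```python
import json


def build_field_query(query_string, fields, analyzer=None):
    """Build a simple_query_string body restricted to the given fields.

    If analyzer is given, it is passed as the query-time analyzer;
    otherwise each field's own search analyzer applies.
    """
    inner = {
        'query': query_string,
        'fields': list(fields),
    }
    if analyzer is not None:
        inner['analyzer'] = analyzer
    return {'query': {'simple_query_string': inner}}


# The same query string, pinned to one field at a time.
unstemmed_query = build_field_query('linux sets', ['text_content'])
stemmed_query = build_field_query('linux sets', ['text_content.stemmed'])

# The variant from my earlier experiments: analyzer applied explicitly.
explicit_query = build_field_query(
    'linux sets', ['text_content.stemmed'], analyzer='english')

print(json.dumps(stemmed_query, indent=2))
```

Each of these bodies can be posted to the _search endpoint in the same way as simple_query above. If "linux set", "linux sets" and "linux setting" return the same hits against text_content.stemmed but different hits against text_content, that would suggest the differences I'm seeing come from the unstemmed field also taking part in the default query.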