I'm using App Search to implement a search that includes ICD-10 codes.
One of our users has reported some slightly strange results. When they send the query S33 to the App Search engine, most of the results look correct, but we are seeing some outliers. Is there documentation I can review, or can someone explain the reason for this set of results?
"id": "S33", "score": 11.731796
"id": "S33.1", "score": 5.699594
"id": "S06.33", "score": 5.699594 - This is the odd result, mainly because its score is equal to that of the result above
"id": "S33.5", "score": 5.6950865
"id": "S38", "score": 3.7110178
My guess is that the default tokenization is struggling to chop up those IDs into tokens the way you're expecting, and so typo-tolerance is taking over, meaning that where in the ID the difference exists matters less than the number of differences. I am surprised that the score is the exact same, though that may be influenced by other fields, if there's more than just the "id" field in your document set.
My suggestion would be to add an extra field to your dataset, such as "id_prefix", that contains only id.split('.')[0] (everything before the period, if there is a period), and to use weights and/or boosts to rank that field higher than the "id" field. This should help order results whose prefix matches the query ahead of results that only match later in the code.
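As a rough sketch of that suggestion, the snippet below derives the extra field before indexing and builds a search request body that weights it above the original field. The helper names (`add_id_prefix`, `build_search_body`), the weight values, and the assumption that the code lives in a field called "id" are illustrative only; adjust them to your schema.

```python
def add_id_prefix(doc, code_field="id"):
    """Return a copy of doc with an extra "id_prefix" field.

    The prefix is everything before the first period, or the whole
    code when there is no period. code_field is an assumption; use
    whichever field holds the ICD-10 code in your documents.
    """
    enriched = dict(doc)
    enriched["id_prefix"] = doc[code_field].split(".")[0]
    return enriched


def build_search_body(query, code_field="id"):
    """Build a search request body that favors prefix matches.

    The weights are placeholders chosen for illustration, not
    recommended values.
    """
    return {
        "query": query,
        "search_fields": {
            "id_prefix": {"weight": 10},
            code_field: {"weight": 1},
        },
    }


docs = [{"id": "S33"}, {"id": "S33.1"}, {"id": "S06.33"}, {"id": "S33.5"}, {"id": "S38"}]
enriched = [add_id_prefix(d) for d in docs]
body = build_search_body("S33")
```

With this in place, S33, S33.1 and S33.5 all carry the prefix "S33", while S06.33 carries "S06", so a query for S33 has a field on which the outlier no longer matches exactly.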
The precision is currently set to 5 for that engine. I have tried it higher and lower, and the same thing still happens. For reference, these are the searchable fields from the documents in that engine:
icdcode - text (searchable, retrievable)
name - text (searchable, retrievable)
section - text (searchable, retrievable)
defaultbodysystems - text (searchable, retrievable)
supertopicterms - text (Array of text, searchable, retrievable)
I updated the engine with a lowercase ICD code field, and that helps a little. I've been digging into the settings and found split_on_numerics: true in the engine's analysis.filter.delimiter section. If this were set to false, would it improve my results, and if so, how would I change this value on my Elastic Cloud instance?
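To illustrate why that setting might matter, here is a deliberately simplified model of splitting on letter/number boundaries. It is not the real Lucene word delimiter filter (it ignores its other options and the rest of the analysis chain), and `split_tokens` is a hypothetical helper written only for this example.

```python
import re


def split_tokens(text, split_on_numerics=True):
    """Very rough approximation of delimiter-style tokenization.

    Lowercases, splits on non-alphanumeric characters, and optionally
    splits each piece again where letters and digits meet.
    """
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", text.lower()):
        if not part:
            continue
        if split_on_numerics:
            # "s33" becomes ["s", "33"]
            tokens.extend(re.findall(r"[a-z]+|[0-9]+", part))
        else:
            tokens.append(part)
    return tokens


def query_fully_matches(query, code, split_on_numerics=True):
    """True when every query token also appears in the code's tokens."""
    q = set(split_tokens(query, split_on_numerics))
    c = set(split_tokens(code, split_on_numerics))
    return q <= c


for code in ["S33", "S33.1", "S06.33", "S33.5", "S38"]:
    print(code, split_tokens(code), split_tokens(code, split_on_numerics=False))
```

Under this model, with splitting enabled the query S33 becomes the tokens "s" and "33", both of which also appear in S06.33, which would be consistent with it scoring like S33.1. With splitting disabled the query stays as the single token "s33", which S06.33 does not contain. Whether the real engine behaves this way, and whether the setting can be changed, is something the sketch does not establish.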