Analyzer for treating "." differently in different fields

Hi There,

We have an analyzer that leaves "." as is, which is what we want for numeric values (e.g. 1.4.5). However, when the same analyzer is applied to other fields (e.g. Abstract), we lose matching documents whenever the Abstract contains words joined by "." with no space (e.g. allergy.Studies). Because allergy.Studies gets indexed with the "." kept, a query_string search for Abstract:"allergy" or Abstract:"studies" no longer matches. What analyzer can be used to leave "." as is in numeric fields and other valid fields (e.g. author name, URL, ...), but replace "." with a space in fields like Abstract?
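For illustration, this is what happens when the "." is preserved; the built-in whitespace analyzer is used here only as a stand-in for an analyzer that keeps the dot, checked with the _analyze API:

```
# An analyzer that keeps the "." returns the whole string as one token
POST _analyze
{
  "analyzer": "whitespace",
  "text": "allergy.Studies"
}
```

The response contains the single token allergy.Studies, which is why neither Abstract:"allergy" nor Abstract:"studies" matches.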

Thanks

Hi,

Not sure if I understand you correctly, but even if you have one analyzer specified in the settings for all fields, you can still assign a specific one to certain fields (doc).
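A minimal sketch of what that could look like (the index name my-index, the Version field, and the analyzer names are just placeholders for your own setup): the Abstract field gets an analyzer whose char_filter replaces every "." with a space, while another field keeps an analyzer that leaves the "." alone.

```
# Abstract turns "." into a space; Version keeps dots as is
PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dot_to_space": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      },
      "analyzer": {
        "abstract_analyzer": {
          "type": "custom",
          "char_filter": [ "dot_to_space" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "Abstract": { "type": "text", "analyzer": "abstract_analyzer" },
      "Version":  { "type": "text", "analyzer": "whitespace" }
    }
  }
}
```

With this, allergy.Studies in Abstract is indexed as allergy and studies, while 1.4.5 in Version stays a single token.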

The same field could contain any combination of these, e.g.:
Abstract can contain a numeric value such as 10.3
Abstract can contain a URL value such as www.example.com
Abstract can contain regular text such as allergy.Studies (notice there is no space after the ".")

For the Abstract field, how can we index 10.3 as is and a URL as is, but tokenize allergy.Studies into allergy, studies?

I see, thank you for clearing that up! I guess you can't change the pipeline to create different fields for different types (e.g., in Logstash). If I had this issue, I'd try to pre-process the values and assign them to different fields.
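Just a rough sketch of that idea, using an Elasticsearch ingest pipeline instead of Logstash (the pipeline name and the Abstract_numeric field are made up, and the grok pattern only covers simple numbers like 10.3): if the whole Abstract value is numeric, copy it into its own field, which can then keep a dot-preserving analyzer.

```
# Copy purely numeric Abstract values into a separate field
PUT _ingest/pipeline/abstract_preprocess
{
  "description": "Copy purely numeric Abstract values into Abstract_numeric",
  "processors": [
    {
      "grok": {
        "field": "Abstract",
        "patterns": [ "^%{NUMBER:Abstract_numeric}$" ],
        "ignore_failure": true
      }
    }
  ]
}
```

Whether that split makes sense of course depends on what the values actually look like.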

Not sure if it's the best way, but you could write a complex regex pattern tokenizer. However, in your case URLs like "order.pizza" are valid and shouldn't be tokenized, so you'll have a hard time excluding URLs from the tokenizer. Also, I don't know much about the possible inputs, so this might not help, sorry.
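A sketch of that regex idea, using a pattern_replace char filter rather than a custom tokenizer (index and analyzer names are placeholders): the "." is replaced with a space only when it is not surrounded by digits, so 10.3 and 1.4.5 survive and allergy.Studies is split, but, as said above, URLs like www.example.com get split as well.

```
# Replace "." with a space unless it sits between digits
PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dot_unless_between_digits": {
          "type": "pattern_replace",
          "pattern": "(?<!\\d)\\.(?!\\d)",
          "replacement": " "
        }
      },
      "analyzer": {
        "abstract_analyzer": {
          "type": "custom",
          "char_filter": [ "dot_unless_between_digits" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "Abstract": { "type": "text", "analyzer": "abstract_analyzer" }
    }
  }
}
```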

Thank you YvorL!
