We need to classify every document before recording it in our DB. We chose Elasticsearch's synonym functionality for this, building a dictionary of keywords associated with each classification.
The idea is to get a high-probability classification. The rules look something like this:
girl, female, woman, women, wife => women
lipstick, mascara => women, beauty
flats => women, shoes
accessory => electronics, women, men, kids
....
etc.
The question is how to set up the same kind of rule for multi-word phrases whose words may be plural, i.e.:
apple accessory => electronics, gadgets
http://localhost:9200/classifier/_analyze?analyzer=synonymclassification&text=apple accessory
The example above returns the correct result, with Electronics and Gadgets as the 2 matched tokens. However, a similar lookup for
http://localhost:9200/classifier/_analyze?analyzer=synonymclassification&text=apple accessories
returns APPLE and ACCESSORY as 2 tokens, so the phrase no longer matches the rule in our synonym dictionary.
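To reproduce the two lookups side by side, something like the following Python sketch can be used. It assumes the URL-parameter form of `_analyze` shown above (newer Elasticsearch versions may expect a JSON body instead) and the usual `{"tokens": [{"token": ...}]}` response shape. The helper names `analyze_url`, `analyze` and `compare` are made up for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the examples above.
BASE = "http://localhost:9200/classifier/_analyze"


def analyze_url(text, analyzer="synonymclassification"):
    """Build the _analyze URL with the query string properly encoded."""
    return BASE + "?" + urlencode({"analyzer": analyzer, "text": text})


def analyze(text):
    """Return the token strings the analyzer emits for `text`."""
    with urlopen(analyze_url(text)) as resp:
        body = json.load(resp)
    return [t["token"] for t in body["tokens"]]


def compare(phrases=("apple accessory", "apple accessories")):
    """Print the tokens for each phrase (needs a running cluster)."""
    for phrase in phrases:
        print(phrase, "->", analyze(phrase))
```

Calling `compare()` against a running node prints the tokens for the singular and plural phrase, which makes the difference easy to see.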
Since we need the highest possible classification accuracy, we use kstem as one of the filters in our custom analyzer, together with the following:
"analysis": {
  "analyzer": {
    "synonymclassification": {
      "type": "custom",
      "char_filter": "html_strip",
      "filter": [
        "apostrophe",
        "standard",
        "elision",
        "asciifolding",
        "lowercase",
        "stop",
        "length",
        "my_stemmer",
        "synonym"
      ],
      "tokenizer": "standard"
    },
...
"keywords": {
  "type": "string",
  "analyzer": "language",
  "fields": {
    "my_synonym": {
      "type": "string",
      "analyzer": "synonymclassification"
    },
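To make the desired behavior concrete, here is a toy Python model of the "stem, then apply synonyms" step. It is only an illustration: the `stem` function is a naive plural stripper, not kstem, and the longest-match lookup is not Elasticsearch's synonym filter. The rules are copied from the examples above, and all function names are hypothetical.

```python
# Toy model only: NOT kstem and NOT Elasticsearch's synonym filter.

def stem(token):
    """Naive plural stripper standing in for a real stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"          # accessories -> accessory
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]                # flats -> flat
    return token


RAW_RULES = {
    "apple accessory": ["electronics", "gadgets"],
    "accessory": ["electronics", "women", "men", "kids"],
    "flats": ["women", "shoes"],
}

# Stem the left-hand side of every rule so it is compared in the same
# form as the stemmed input tokens.
RULES = {tuple(stem(w) for w in lhs.split()): rhs
         for lhs, rhs in RAW_RULES.items()}
MAX_LEN = max(len(key) for key in RULES)


def classify(text):
    """Lowercase, stem, then replace the longest matching phrase."""
    tokens = [stem(t) for t in text.lower().split()]
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in RULES:
                out.extend(RULES[key])
                i += n
                break
        else:
            out.append(tokens[i])        # no rule matched this token
            i += 1
    return out
```

In this toy version, `classify("apple accessories")` and `classify("apple accessory")` both give `["electronics", "gadgets"]`, which is the result we would like from the real analyzer.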
- Does this appear to be a good approach to the problem?
- Do we need to add or drop some of the filters to achieve the desired behavior, or do we instead need to list every variation of each search phrase ("search phrases", "search phrase") in our synonym dictionary?
- Is it too dangerous (resource intensive) to use http://localhost:9200/classifier/_analyze?... for these lookups at a very high rate, anywhere from 100 to 500 requests per second, during the original data discovery and classification?
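On the request-rate point, one option that might reduce the load is a client-side cache, so that repeated keyword strings do not hit `_analyze` again. The sketch below assumes that many documents share the same keyword text and that the analyzer output for a given string does not change during a run. `make_cached_classifier` and `analyze_fn` are hypothetical names, and `analyze_fn` stands for whatever function performs the actual HTTP lookup.

```python
from functools import lru_cache


def make_cached_classifier(analyze_fn, maxsize=100_000):
    """Wrap a lookup function so repeated inputs skip the remote call."""

    @lru_cache(maxsize=maxsize)
    def cached(normalized_text):
        # Store a tuple so the cached value cannot be mutated by callers.
        return tuple(analyze_fn(normalized_text))

    def classify(text):
        # Only whitespace is normalized here; case is left alone because
        # some filters in the chain may be case-sensitive.
        return list(cached(" ".join(text.split())))

    return classify
```

Whether this helps depends on how often the same text recurs in the data. With mostly unique inputs the cache would add memory use without saving many requests.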
Thank you for any help and pointers ...