Search keyword classification problem

Dear guys, I have an index of ~10M products with a bunch of fields (mpn, gtin, family, series, brandName, categoryName, summary).

I'm trying to find the answer to the following question: what is the best way to parse and classify a user's search keyword? I don't want to search for "Dolce & Gabana" in summary; it must be matched against brandName.

For example, how do I find brand="My Intelligent Dogs" in the keyword "My Intelligent Dogs hut"? Or brand="HP", category="Notebook" in the keyword "HP Green Notebook"?
It looks like a classic ML task to me.
Please share your ideas: what's the best solution to start with?

Hi Dmitry,
I've been thinking about these sorts of problems lately.
Generally the approach is "facet snapping" - the automated application of structured filters given an unstructured piece of query text.
Facet snapping would be a query pre-processing pipeline armed with a set of rules - "if you see X in the incoming query, apply filter Y and remove X from the query string". These rules can be generated from your content but you would have to review them.
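To make the "if you see X, apply filter Y and remove X" idea concrete, here is a minimal sketch in Python. The rule table, the `snap_facets` function name and the output shape are all made up for illustration; in practice the rules would be generated from the catalog and reviewed, and the resulting filters would be turned into whatever filter clauses your search engine uses.

```python
import re

# Hypothetical, hand-written rules: lowercased phrase -> (field, value).
# In a real pipeline these would be generated from the catalog and reviewed.
RULES = {
    "my intelligent dogs": ("brandName", "My Intelligent Dogs"),
    "hp": ("brandName", "HP"),
    "notebook": ("categoryName", "Notebook"),
}


def snap_facets(query, rules=RULES):
    """Return (filters, remaining_text) for a raw query string.

    Longer phrases are tried first so a multi-word brand wins over a
    shorter phrase, and only one value is snapped per field.
    """
    remaining = query.lower()
    filters = {}
    for phrase in sorted(rules, key=len, reverse=True):
        field, value = rules[phrase]
        if field in filters:
            continue
        # Match whole words only, so "hp" does not fire inside "shp".
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, remaining):
            filters[field] = value
            remaining = re.sub(pattern, " ", remaining, count=1)
    # Collapse the whitespace left behind by removed phrases.
    return filters, " ".join(remaining.split())
```

With these rules, "My Intelligent Dogs hut" yields a brandName filter plus the leftover text "hut", and "HP Green Notebook" yields both a brandName and a categoryName filter plus the leftover text "green", which can then be searched in the free-text fields such as summary.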
Let's take your brand name example - this script can examine which structured brand names are also mentioned in the unstructured text of product names in your product catalog. We can see there are issues with some brand names because they appear in the names of products whose brand is something different. This can occur for the following reasons:

  1. The brand name is ambiguous, e.g. "Jigsaw" is a clothing brand but also describes many things in the toy department, and "MAC" is both makeup and a computer.
  2. The brand name is licensed for many different products made by other brands, e.g. Apple iPhone cases or Disney lunchboxes.

You could automatically generate facet snapping rules where my script shows brand names are reliably used (they only ever appear in product names/descriptions that are also tagged with that brand). The more ambiguous brand names will need to be human-reviewed for effectiveness in facet snapping.
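The reliability check described above can be sketched in a few lines of plain Python. This is not the script referred to in this thread, just an illustration of the idea over an in-memory sample; the `name` field, the sample products and the function names are assumptions. It compares every brand against every product name, so it only suits a sample of the catalog, not the full 10M documents.

```python
import re
from collections import defaultdict

# Made-up sample; "name" is an assumed free-text field holding the product name.
SAMPLE = [
    {"brandName": "HP", "name": "HP Pavilion Notebook"},
    {"brandName": "Apple", "name": "Apple iPhone 12"},
    {"brandName": "CaseCo", "name": "Apple iPhone case"},
    {"brandName": "Jigsaw", "name": "Jigsaw wool coat"},
    {"brandName": "ToyCo", "name": "Wooden jigsaw puzzle"},
]


def brand_reliability(products):
    """Map each mentioned brand to the share of mentions tagged with that brand."""
    brands = {p["brandName"] for p in products}
    patterns = {
        b: re.compile(r"(?<!\w)" + re.escape(b.lower()) + r"(?!\w)")
        for b in brands
    }
    mentions = defaultdict(int)  # product names containing the brand phrase
    agreeing = defaultdict(int)  # ...where the product is tagged with that brand
    for p in products:
        name = p["name"].lower()
        for brand, pattern in patterns.items():
            if pattern.search(name):
                mentions[brand] += 1
                if p["brandName"] == brand:
                    agreeing[brand] += 1
    return {b: agreeing[b] / mentions[b] for b in mentions}


def reliable_brands(products, threshold=1.0):
    """Brands whose name mentions agree with the brand tag at least `threshold` of the time."""
    scores = brand_reliability(products)
    return sorted(b for b, score in scores.items() if score >= threshold)
```

On the sample, "HP" comes out as reliable, while "Apple" and "Jigsaw" each score 0.5 and would go to human review. Brands that are never mentioned in any product name get no score at all, so they are neither accepted nor rejected automatically.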

Thank you very much Mark,

Fantastic script, simple and obvious. It has to be adapted a bit for my nested supplierLocalNames. Anyway, it put some ideas into my head.
I think I got your point. I did some investigation, and it looks like it really is necessary to human-review brands like "ME", "WE" and others.
Earlier, it caused problems when I tried to auto-filter on such brands.

FYI - I made a simple update to the script to export the set of brand names that are reliable phrases (i.e. not too ambiguous) so that they can be used as query-rewriting rules.
