I need a little help on how to approach a problem I'm facing right now.
Imagine I have a "categorized-products" index where I store a bunch of products, each one containing a description and a manually selected category.
My hypothesis is that I could extract the most significant terms across the descriptions for each category, and that these would help me assign a category to a product for which no category has been manually defined.
So I started by creating an aggregation that buckets by "category", and for each bucket I ran a "significant_text" aggregation:
That's cool. It seems that Elasticsearch managed to select the most significant terms that appear across the product descriptions for each category.
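A request of the kind described above could be sketched roughly as follows. This is only an illustration, not the exact request that was run: the field names `category.keyword` and `description`, and the aggregation names `by_category` and `keywords`, are assumptions.

```python
import json


def significant_terms_per_category(category_field="category.keyword",
                                   text_field="description",
                                   max_categories=50):
    """Build a search body: one bucket per category, each holding the
    significant terms found in that category's descriptions."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "aggs": {
            "by_category": {  # hypothetical aggregation name
                "terms": {"field": category_field, "size": max_categories},
                "aggs": {
                    "keywords": {  # hypothetical aggregation name
                        "significant_text": {
                            "field": text_field,
                            # skip near-duplicate passages so boilerplate
                            # text does not dominate the statistics
                            "filter_duplicate_text": True,
                        }
                    }
                },
            }
        },
    }


body = significant_terms_per_category()
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the "categorized-products" index.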
My questions now are:
1) Use the selected terms to infer a category
Is there a way I could write a query that uses this output as input for a second query? I'm looking for a way to infer a category for an uncategorized product by making use of the already categorized products...
The only solution I could come up with was something like this:
Run the above query
Create a new "categories" index, which would play the role of the "model". This index would contain the category name and a field with all the significant terms concatenated
For each uncategorized product I would then run a More Like This query comparing its "description" field against the "concatenated" field of the "categories" index.
Does that make sense?
2) Find out in how many categories the selected terms appear
So I got my most significant terms by category, but I want to find out, for each one of them, in how many categories it appears. It's like the opposite query, meaning I want to bucket first by the terms of the descriptions (which I cannot do, since it is not a fielddata field) and then bucket by category...
With this I could perhaps find out which terms appear in only one category and consider them to be really relevant for defining that category. (I don't think this is really true, but I was asked to try it out)
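One way to approximate this without bucketing on the text field is to invert the output of the first aggregation on the client side. The sketch below assumes the significant terms have already been collected into a plain mapping of category to terms; the sample data is made up. Note that it only counts categories where a term was returned as significant, not every category whose descriptions happen to contain the term.

```python
from collections import defaultdict


def categories_per_term(terms_by_category):
    """Invert {category: [terms]} into {term: {categories}}."""
    inverted = defaultdict(set)
    for category, terms in terms_by_category.items():
        for term in terms:
            inverted[term].add(category)
    return dict(inverted)


def exclusive_terms(terms_by_category):
    """Return {term: category} for terms significant in exactly one category."""
    return {
        term: next(iter(categories))
        for term, categories in categories_per_term(terms_by_category).items()
        if len(categories) == 1
    }


# Made-up example data, standing in for the aggregation output.
sample = {
    "shoes": ["leather", "sole", "cotton"],
    "shirts": ["cotton", "sleeve"],
}

for term, categories in sorted(categories_per_term(sample).items()):
    print(term, len(categories), sorted(categories))
```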
There's a pattern that I use often called "like this, but not this" to find incorrect or missing categories.
You've done step 1, which is to find the terms that are characteristic of each category.keyword value.
Step 2 is to then use these discovered terms in a bool expression that is something like this:
This finds all the docs that look like category X but are not tagged as category X, in relevance-ranked order.
Note - two-word shingles also work great with the significant_text agg - they hold more meaning than single words.
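The "like this, but not this" expression described above might be sketched as below. This is not the exact query from the original post: the field names are assumptions, and plain `match` clauses are just one reasonable way to express the "like" part.

```python
import json


def like_this_but_not_this(category, terms,
                           text_field="description",
                           category_field="category.keyword"):
    """Find docs whose text resembles a category but that are not tagged with it."""
    return {
        "query": {
            "bool": {
                # "like this": each discovered term contributes to the score
                "should": [{"match": {text_field: term}} for term in terms],
                "minimum_should_match": 1,
                # "but not this": drop docs already tagged with the category
                "must_not": [{"term": {category_field: category}}],
            }
        }
    }


# Made-up category and terms, for illustration only.
query = like_this_but_not_this("shoes", ["leather", "sole", "laces"])
print(json.dumps(query, indent=2))
```

The more of the `should` terms a document matches, the higher it ranks, so the top hits are the strongest candidates for a missing or incorrect category.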
The adjacency_matrix aggregation is also good for understanding how significant keywords cluster together and perhaps which categories they are affiliated with.
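As a rough sketch of that idea, each keyword can become a named filter in an `adjacency_matrix` aggregation, with a `terms` sub-aggregation showing which categories each keyword (or pair of keywords) lands in. The field names and the aggregation names `keyword_overlap` and `categories` are assumptions.

```python
import json


def keyword_adjacency(terms, text_field="description",
                      category_field="category.keyword"):
    """Build a search body that counts docs per keyword and per keyword pair."""
    return {
        "size": 0,
        "aggs": {
            "keyword_overlap": {  # hypothetical aggregation name
                "adjacency_matrix": {
                    # one named filter per keyword; buckets come back for
                    # single filters and for intersecting pairs of filters
                    "filters": {
                        term: {"match": {text_field: term}} for term in terms
                    }
                },
                "aggs": {
                    "categories": {"terms": {"field": category_field}}
                },
            }
        },
    }


# Made-up keywords, for illustration only.
matrix_body = keyword_adjacency(["leather", "sole", "cotton"])
print(json.dumps(matrix_body, indent=2))
```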
I found your answer really useful for the problem of wrongly categorized products (which I also need to address).
Regarding the bool query, do you think there's any way to "insert" my first aggregation query inside the should part of the bool query to make it only one step? Otherwise I guess the only way is to keep these two steps: 1) run the aggregation query to get the terms, 2) use the terms to manually/programmatically write the should part of the bool query.
Another thing: as I understand it, your approach would help me check whether an already categorized product is (or is not) in the correct category.
But my first problem is to infer a category based only on the product's description and the category "model" (an index with the result of the aggregation query, or something similar to that).
So in the end I would like to run a single query against this model and get back a list of categories ranked by relevance (the more terms in the product's description that are significant to a specific category, the more relevant that category is).
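For the two-step variant, the glue between the steps can be small. The sketch below pulls the terms out of a response shaped like the output of a `terms` aggregation with a nested `significant_text` aggregation; the aggregation names `by_category` and `keywords` and the sample response are assumptions.

```python
def terms_by_category(response, outer="by_category", inner="keywords"):
    """Extract {category: [significant terms]} from an aggregation response."""
    buckets = response["aggregations"][outer]["buckets"]
    return {
        bucket["key"]: [term["key"] for term in bucket[inner]["buckets"]]
        for bucket in buckets
    }


# Made-up response fragment with the usual bucket layout.
sample_response = {
    "aggregations": {
        "by_category": {
            "buckets": [
                {"key": "shoes", "doc_count": 120, "keywords": {"buckets": [
                    {"key": "leather", "doc_count": 40, "score": 0.9},
                    {"key": "sole", "doc_count": 35, "score": 0.7},
                ]}},
                {"key": "shirts", "doc_count": 80, "keywords": {"buckets": [
                    {"key": "sleeve", "doc_count": 30, "score": 0.8},
                ]}},
            ]
        }
    }
}

extracted = terms_by_category(sample_response)
print(extracted)
```

Each list of terms can then be turned into the should clauses of a bool query for its category.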
The danger with creating an index that contains only the significant category keywords, and then searching it, is that the stats that help with relevance ranking will be screwed up. Each term will likely occur close to once in the index (they are, by design, descriptive of only one category). The significance scores will be lost.
Perhaps a simpler one-pass approach is to take a new unclassified document and use the text description as a "more like this" query on the text of existing classified docs. Use the sampler aggregation to look at only the strongest N matches (say, 100) and under that use a significant_terms agg on the category field. This will help balance the fact that popular categories are likely to produce more matches and tune into the "uncommonly common" classification in the result set's top matches.
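A minimal sketch of that one-pass request, assuming the fields are called `description` and `category.keyword` and using made-up aggregation names. Note that the sampler's `shard_size` applies per shard, so it only equals the overall sample size on a single-shard index.

```python
import json


def classify_by_similarity(description,
                           text_field="description",
                           category_field="category.keyword",
                           sample_size=100):
    """Build a search body that suggests categories for an unclassified text."""
    return {
        "size": 0,
        "query": {
            "more_like_this": {
                "fields": [text_field],
                "like": description,
                # short descriptions rarely repeat a word, so accept terms
                # that occur only once in the input text
                "min_term_freq": 1,
            }
        },
        "aggs": {
            "top_matches": {  # hypothetical aggregation name
                # restrict the analysis to the best-scoring matches
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "likely_categories": {  # hypothetical aggregation name
                        "significant_terms": {"field": category_field}
                    }
                },
            }
        },
    }


# Made-up product description, for illustration only.
request = classify_by_similarity("brown leather boots with rubber sole")
print(json.dumps(request, indent=2))
```

The buckets under `likely_categories` would then be the candidate categories, ordered by significance score rather than by raw match count.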