How should you handle the tokenization and indexing of phrases such as 'gluten-free', 'no gluten', and 'without dairy'? Presumably the method chosen also has consequences for the querying side?
What you're bringing up here is one of the classic problems of natural language processing. Negation has many variants, and those variants are language-dependent (and sometimes region-dependent). "Gluten-free" and "no gluten" are the two examples you give, but "not gluten-free" is also a reasonable permutation that acts as an allowable double negative, while "not for gluten-sensitive consumers" means the opposite. Unfortunately, there is no simple answer to this type of problem.

The most comprehensive approach is to write language processors (per language, and potentially per region) that decompose text into some kind of parse tree. If you dive deeply into this, you'll quickly end up in computational linguistics territory, where people discuss how "not" heads a NegP. That isn't inherently bad, but it's a level of depth many people asking this type of question aren't ready to take on.
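Short of a full parse tree, a common middle ground is to normalize known surface patterns into structured facets at index time, so that "gluten-free", "no gluten", and "not gluten-free" all map to an explicit (ingredient, contains?) pair that can be queried consistently. Here is a minimal sketch of that idea; the patterns, ingredient list, and function names are illustrative assumptions, not an exhaustive or production-ready grammar:

```python
import re

# Illustrative ingredient vocabulary (an assumption for this sketch).
INGREDIENTS = {"gluten", "dairy", "nuts"}

# Ordered patterns: the more specific double-negation form must be
# tried first, so "not gluten-free" is not misread as "gluten-free".
PATTERNS = [
    (re.compile(r"\bnot\s+(\w+)-free\b"), True),         # "not gluten-free" -> contains it
    (re.compile(r"\b(\w+)-free\b"), False),              # "gluten-free" -> does not contain it
    (re.compile(r"\b(?:no|without)\s+(\w+)\b"), False),  # "no gluten", "without dairy"
    (re.compile(r"\bcontains\s+(\w+)\b"), True),         # "contains nuts"
]

def extract_facets(text):
    """Return a {ingredient: contains?} mapping for patterns found in text."""
    facets = {}
    remaining = text.lower()
    for pattern, contains in PATTERNS:
        for match in pattern.finditer(remaining):
            ingredient = match.group(1)
            if ingredient in INGREDIENTS and ingredient not in facets:
                facets[ingredient] = contains
        # Blank out consumed matches so a later, less specific pattern
        # cannot re-match part of an already-handled phrase.
        remaining = pattern.sub(" ", remaining)
    return facets
```

At query time you then match on the facet rather than the raw text, which sidesteps the tokenization question entirely for the patterns you recognize. The obvious limitation is the one described above: every unanticipated phrasing ("not for gluten-sensitive consumers") silently falls through.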
Realistically, this type of parsing gets very complex very quickly. Many practitioners avoid that complexity and assume a user will read the text rather than trying to interpret it computationally for them. In many cases it is easier, and actually less error-prone, to ask the text submitter for this information explicitly.
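Asking the submitter explicitly usually means capturing the claims as structured fields alongside the free text, so nothing has to be inferred from phrasing at all. A minimal sketch, with hypothetical field names chosen for this example:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical submission record: dietary claims are declared directly
# by the submitter instead of being parsed out of the description text.
@dataclass
class DietaryClaims:
    contains_gluten: Optional[bool] = None  # None means "not declared"
    contains_dairy: Optional[bool] = None
    contains_nuts: Optional[bool] = None

# The submitter ticks "gluten-free" on the form; dairy stays undeclared.
claim = DietaryClaims(contains_gluten=False)
```

The three-valued field (True/False/None) matters here: an undeclared claim is not the same as a declared negative, and keeping that distinction avoids reintroducing the ambiguity you were trying to parse your way out of.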