Extracting brands from documents using keyword and shingles

Hi,

I'm trying to use Elasticsearch to find brands in documents. My idea is to create an index containing brand names (each indexed as a single keyword token) and then search it with free text, so that any brands occurring in the text come back as hits.

So I tried this index and mapping:

PUT brand
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "brand": {
      "properties": {
        "keyword": {
          "type": "text",
          "analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}

Some documents:

POST /brand/brand/1
{
  "id": 1,
  "keyword": "nike"
}
POST /brand/brand/2
{
  "id": 2,
  "keyword": "adidas originals"
}

I then search like this:

POST /brand/brand/_search
{
  "query": {
    "match": {
      "keyword": "I like nike shoes and adidas originals"
    }
  }
}

I expect to get nike and adidas originals back as results, but I don't get anything.
I'm using Elasticsearch 5.4.3.

Is my thinking wrong?

Use the _analyze API to understand what is happening at index time and at query time.

You will see exactly which tokens are indexed and exactly which terms are compared against the inverted index at search time.
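
For example, you can run each analyzer by name against the brand index and compare the outputs (using the analyzer names defined in the mapping above):

GET brand/_analyze
{
  "analyzer": "my_analyzer_keyword",
  "text": "adidas originals"
}

GET brand/_analyze
{
  "analyzer": "my_analyzer_shingle",
  "text": "I like nike shoes and adidas originals"
}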

Yes, I have done that, and from what I can see it should work, but it doesn't. Here are my analyses:

Query-time analysis:

GET _analyze
{
   "tokenizer": "standard",
   "filter": [
      "asciifolding",
      "lowercase",
      "shingle"
   ],
   "char_filter": [
      "html_strip"
   ],
   "text": [
      "I like nike shoes and adidas originals"
   ]
}

Result (shortened for brevity):

{
   "tokens": [
      ...
      {
         "token": "nike",
         "start_offset": 7,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "nike shoes",
         "start_offset": 7,
         "end_offset": 17,
         "type": "shingle",
         "position": 2,
         "positionLength": 2
      },
      ...
      {
         "token": "adidas",
         "start_offset": 22,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "adidas originals",
         "start_offset": 22,
         "end_offset": 38,
         "type": "shingle",
         "position": 5,
         "positionLength": 2
      },
      {
         "token": "originals",
         "start_offset": 29,
         "end_offset": 38,
         "type": "<ALPHANUM>",
         "position": 6
      }
   ]
}

Index-time analysis:

GET _analyze
{
   "tokenizer": "keyword",
   "filter": [
      "asciifolding",
      "lowercase"
   ],
   "char_filter": [
      "html_strip"
   ],
   "text": [
      "adidas originals"
   ]
}

Result:

{
   "tokens": [
      {
         "token": "adidas originals",
         "start_offset": 0,
         "end_offset": 16,
         "type": "word",
         "position": 0
      }
   ]
}

The query-time shingle adidas originals is exactly the token that was indexed, so I would expect document 2 to match. I tried the same setup on Elasticsearch 2.4.4 and there it worked just fine.

Could it be a bug in the latest version of Elasticsearch?

On 2.4.4 I got the expected result:

{
   "took": 73,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.003867892,
      "hits": [
         {
            "_index": "brand",
            "_type": "brand",
            "_id": "2",
            "_score": 0.003867892,
            "_source": {
               "id": 2,
               "keyword": "adidas originals"
            }
         },
         {
            "_index": "brand",
            "_type": "brand",
            "_id": "1",
            "_score": 0.003867892,
            "_source": {
               "id": 1,
               "keyword": "nike"
            }
         }
      ]
   }
}

I agree that this looks wrong.

Could you open an issue with this reproduction script?

DELETE brand
PUT brand
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "brand": {
      "properties": {
        "keyword": {
          "type": "text",
          "analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}

POST /brand/brand
{
  "keyword": "nike"
}
POST /brand/brand
{
  "keyword": "adidas originals"
}
GET brand/_search
{
  "query": {
    "match": {
      "keyword": {
        "query": "I like nike shoes and adidas originals"
      }
    }
  }
}

Unfortunately, this is a side effect of improvements we made around handling query-time synonyms. The shingle filter generates a token graph that confuses the query parser. See for instance the output of:

GET brand/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "keyword": {
        "query": "I like nike shoes and adidas originals"
      }
    }
  }
}
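
The shingle tokens span two positions (positionLength: 2), which is what makes the token stream a graph. If it helps, you can also inspect the per-filter token attributes by passing explain to _analyze against the index (using the shingle analyzer name from the brand mapping):

GET brand/_analyze
{
  "analyzer": "my_analyzer_shingle",
  "text": "I like nike shoes and adidas originals",
  "explain": true
}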

+1 to opening an issue

OK, I'll open an issue.

Awesome! And thanks for the detailed script. It helped a lot.

Bug filed!
