Extracting brands from documents using keyword and shingles

Hi,

I'm trying to use Elasticsearch to find brands in documents. My idea is to create an index containing brand names (each indexed as a single keyword token) and then search it with free text, so that any brands occurring in the text come back as hits.

So I tried this index and mapping:

PUT brand
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "brand": {
      "properties": {
        "keyword": {
          "type": "text",
          "analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}

Some documents:

POST /brand/brand/1
{
  "id": 1,
  "keyword": "nike"
}
POST /brand/brand/2
{
  "id": 2,
  "keyword": "adidas originals"
}

I then search like this:

POST /brand/brand/_search
{
  "query": {
    "match": {
      "keyword": "I like nike shoes and adidas originals"
    }
  }
}

I expect to get nike and adidas originals back as results, but I don't get anything.
I'm using Elasticsearch 5.4.3.

Is my thinking wrong?

Use the _analyze API to understand what is happening at index time and at query time.

You will see exactly which tokens are indexed and exactly which terms are compared against the inverted index at search time.
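
For example, you can run each analyzer by name against the brand index and compare the outputs (using the analyzer names defined in the mapping above):

GET brand/_analyze
{
  "analyzer": "my_analyzer_keyword",
  "text": "adidas originals"
}

GET brand/_analyze
{
  "analyzer": "my_analyzer_shingle",
  "text": "I like nike shoes and adidas originals"
}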

Yes, I have done that, and from what I can see it should work, but it doesn't. Here are my analyses:

Query-time analysis:

GET _analyze
{
   "tokenizer": "standard",
   "filter": [
      "asciifolding",
      "lowercase",
      "shingle"
   ],
   "char_filter": [
      "html_strip"
   ],
   "text": [
      "I like nike shoes and adidas originals"
   ]
}

Result (shortened for brevity):

{
   "tokens": [
      ...
      {
         "token": "nike",
         "start_offset": 7,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "nike shoes",
         "start_offset": 7,
         "end_offset": 17,
         "type": "shingle",
         "position": 2,
         "positionLength": 2
      },
      ...
      {
         "token": "adidas",
         "start_offset": 22,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "adidas originals",
         "start_offset": 22,
         "end_offset": 38,
         "type": "shingle",
         "position": 5,
         "positionLength": 2
      },
      {
         "token": "originals",
         "start_offset": 29,
         "end_offset": 38,
         "type": "<ALPHANUM>",
         "position": 6
      }
   ]
}

Index-time analysis:

GET _analyze
{
   "tokenizer": "keyword",
   "filter": [
      "asciifolding",
      "lowercase"
   ],
   "char_filter": [
      "html_strip"
   ],
   "text": [
      "adidas originals"
   ]
}

Result:

{
   "tokens": [
      {
         "token": "adidas originals",
         "start_offset": 0,
         "end_offset": 16,
         "type": "word",
         "position": 0
      }
   ]
}

The query-time shingle adidas originals is exactly the token that was indexed, so I would expect document 2 to match. I tried the same setup on Elasticsearch 2.4.4 and there it worked just fine.

Could it be a bug in the latest version of Elasticsearch?

On 2.4.4 I got the expected result:

{
   "took": 73,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.003867892,
      "hits": [
         {
            "_index": "brand",
            "_type": "brand",
            "_id": "2",
            "_score": 0.003867892,
            "_source": {
               "id": 2,
               "keyword": "adidas originals"
            }
         },
         {
            "_index": "brand",
            "_type": "brand",
            "_id": "1",
            "_score": 0.003867892,
            "_source": {
               "id": 1,
               "keyword": "nike"
            }
         }
      ]
   }
}

I agree that this looks wrong.

Could you open an issue with this reproduction script?

DELETE brand
PUT brand
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "brand": {
      "properties": {
        "keyword": {
          "type": "text",
          "analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}

POST /brand/brand
{
  "keyword": "nike"
}
POST /brand/brand
{
  "keyword": "adidas originals"
}
GET brand/_search
{
  "query": {
    "match": {
      "keyword": {
        "query": "I like nike shoes and adidas originals"
      }
    }
  }
}

Unfortunately, this is a side effect of improvements we made around handling query-time synonyms. The shingle filter generates a token graph that confuses the query parser. See for instance the output of:

GET brand/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "keyword": {
        "query": "I like nike shoes and adidas originals"
      }
    }
  }
}
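
The shingle tokens span two positions (positionLength: 2), which is what makes the token stream a graph. If it helps, you can also inspect the per-filter token attributes by passing explain to _analyze against the index (using the shingle analyzer name from the brand mapping):

GET brand/_analyze
{
  "analyzer": "my_analyzer_shingle",
  "text": "I like nike shoes and adidas originals",
  "explain": true
}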

+1 to opening an issue

OK, I'll open an issue.

Awesome! And thanks for the detailed script. It helped a lot.

Bug filed!
