Setting up index for maximum search ability (with front end typeahead)


#1

Most of my experience with Elasticsearch has been storing time-based data that comes from logs we track on the network. We are trying to expand its use as a database for a web-based application being written. I am just looking for tips on what I should study to learn how to do this a little better. Here is the current issue I am working on:

Data being indexed (coming from a different application)
example:

sites = [
    {
        sys_id: "13c04e140f92b500d55ae498b1050e8a",
        name: "Receivables Performance Management, LLC"
    },
    {
        sys_id: "13c04e140f92b500d55ae498b1050e8b",
        name: "ROCHESTER NY"
    },
    {
        sys_id: "13c04e140f92b500d55ae498b1050e8c",
        name: "ROSELAND NJ"
    },
    {
        sys_id: "17c04e140f92b500d55ae498b1050e8a",
        name: "LAYTON UT"
    }
]

In an effort to make the data sortable and searchable, I have chosen to map this as a "nested" type with a multi-field, using the keyword sub-field purely for sorting. Here are the current mapping and settings:

PUT /sev_sites
{
  "settings": {
    "analysis": {
      "analyzer": {
        "site_analyzer": {
          "type": "custom",
          "tokenizer": "site_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "site_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "object": {
      "properties": {
        "site": {
          "type": "nested",
          "properties": {
            "sys_id": {
              "type": "text"
            },
            "name": {
              "type": "text",
              "analyzer": "site_analyzer",
              "fields":{
                "raw": {
                  "type": "keyword"
                }
              }
            }
          }
        }
      }
    }
  }
}
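(As a sanity check on what the custom analyzer actually emits, the _analyze API can be run against the index; for "ROSELAND NJ" with min_gram 3 it should produce lowercased tokens such as "ros", "rose", "osel", and so on:)

GET /sev_sites/_analyze
{
  "analyzer": "site_analyzer",
  "text": "ROSELAND NJ"
}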

So now for the problem: I am trying to set up a query that returns partial matches. Can someone explain the "proper" way to set up this index, and which query type is best for returning partial matches on the given data set? I set up the analyzer using the Elasticsearch documentation, but a simple "match" query still only returns exact matches (duh!). I have been playing with the fuzzy and regexp queries, which seem to be working well. If anyone could tell me whether I am setting this up in a logical way and/or what would work better, I would really appreciate it.

Also, I am still trying to get the sorting to work correctly; if anyone could help me sort by the "raw" keyword sub-field, I would definitely appreciate it.
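(A sketch of what I expect the sort to look like, assuming the documents are actually indexed under the nested "site" object as the mapping above defines, and using the 5.x nested_path syntax:)

GET /sev_sites/_search
{
  "sort": [
    {
      "site.name.raw": {
        "order": "asc",
        "nested_path": "site"
      }
    }
  ]
}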

Current testing of the fuzzy search is working "ok"...

GET /sev_sites/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "roseland",
        "boost": 1.0,
        "fuzziness": 2,
        "prefix_length": 2,
        "max_expansions": 100
      }
    }
  }
}

Returns:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 3.6795852,
    "hits": [
      {
        "_index": "sev_sites",
        "_type": "object",
        "_id": "AV4qoQ8DsabpF_YTeXzo",
        "_score": 3.6795852,
        "_source": {
          "sys_id": "dec04e140f92b500d55ae498b1050e2c",
          "name": "ROCKLAND"
        }
      },
      {
        "_index": "sev_sites",
        "_type": "object",
        "_id": "AV4qoQXDsabpF_YTeXvT",
        "_score": 3.5934165,
        "_source": {
          "sys_id": "13c04e140f92b500d55ae498b1050e8c",
          "name": "ROSELAND NJ"
        }
      }
    ]
  }
}

However, this query does not return what I had hoped:

GET /sev_sites/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "rose",
        "boost": 1.0,
        "fuzziness": 2,
        "prefix_length": 2,
        "max_expansions": 100
      }
    }
  }
}

Returns:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4354112,
    "hits": [
      {
        "_index": "sev_sites",
        "_type": "object",
        "_id": "AV4qoQ77sabpF_YTeXzn",
        "_score": 1.4354112,
        "_source": {
          "sys_id": "dec04e140f92b500d55ae498b1050e2b",
          "name": "ROCK TAVERN NY"
        }
      }
    ]
  }
}

(Ivan Brusic) #2

The issue might be the combination of using a fuzzy query with ngram
tokens. Have you tried either using a simple match query against the
ngram'd field, or applying an analyzer that does not produce ngrams?
It is easy enough to change the query, or to add another multi-field
to test, if reindexing is not an issue.
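For example, a plain match query should already hit the partial ngram tokens (a sketch; note that since the mapping does not set a separate search_analyzer, the query string "rose" is itself ngrammed at search time as well):

GET /sev_sites/_search
{
  "query": {
    "match": {
      "name": "rose"
    }
  }
}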


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.