Edge NGrams with Aggregations on Indexed chess games

Jack_Shannon · January 10, 2016, 9:58pm

Hi,

I am indexing chess games to create a game explorer. I have a database of millions of games and I will need to index them.

The aim is to be able to query some opening moves (e.g. "e4 e5 Nf3 Nc6") and get a response with the 10 most popular next moves, how often each has been played, and the number of times white won, black won, and the game was drawn.
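
To make the target concrete, the desired result can be sketched in plain Python, independent of Elasticsearch. This is only an illustration: the function name next_move_stats, the sample games, and the winner values "white", "black" and "draw" are assumptions, not part of the actual index.

```python
from collections import Counter, defaultdict


def next_move_stats(games, opening, top=10):
    """Rank the moves played right after `opening`.

    `games` is an iterable of (moves, winner) pairs,
    e.g. ("e4 e5 Nf3 Nc6", "white"). Both the shape and the
    winner labels are hypothetical.
    """
    prefix = opening.split()
    stats = defaultdict(Counter)
    for moves, winner in games:
        played = moves.split()
        # Keep only games that start with the opening and continue past it.
        if played[:len(prefix)] == prefix and len(played) > len(prefix):
            stats[played[len(prefix)]][winner] += 1
    # Most frequently played next moves first.
    ranked = sorted(stats.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
    return [(move, sum(c.values()), dict(c)) for move, c in ranked[:top]]


# Hypothetical sample data.
games = [
    ("e4 e5 Nf3 Nc6 Bb5", "white"),
    ("e4 e5 Nf3 Nc6 Bc4", "draw"),
    ("e4 e5 Nf3 Nf6", "black"),
    ("d4 d5 c4", "white"),
]
result = next_move_stats(games, "e4 e5 Nf3")
```

Each entry of result holds the next move, how many games played it, and the outcome counts for those games.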

Here is what I have so far:

The Analyzer:
PUT game/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "moves_analyzer": {
          "tokenizer": "moves_tokenizer"
        }
      },
      "tokenizer": {
        "moves_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "3",
          "token_chars": [ "letter", "digit", "whitespace", "punctuation" ]
        }
      }
    }
  }
}

The Mapping:
PUT game/game/_mapping
{
  "properties": {
    "winner": {
      "type": "string"
    },
    "moves": {
      "type": "string",
      "analyzer": "moves_analyzer"
    }
  }
}

The Query and Aggregation (hardcoded to search for 'e4 e5'):
GET game/game/_search
{
  "query": {
    "query_string": {
      "query": "moves:e4 e5*",
      "analyzer": "moves_analyzer"
    }
  },
  "aggs": {
    "nextmoves": {
      "terms": {
        "field": "moves",
        "script": "_value.split(' ')[2]",
        "size": 10
      },
      "aggs": {
        "winners": {
          "terms": {
            "field": "winner"
          }
        }
      }
    }
  }
}

I'm getting some strange results. What I'm looking for is pure prefix matching that doesn't use any fuzziness to account for typos. I also don't like the way I am using scripting in the aggregation. Is there another way of doing this?

Any input on a better way to do this would be appreciated.

Jack_Shannon · January 11, 2016, 4:36pm

Please could someone try and help me with this, or let me know if my questions are unclear.

shaunak · January 11, 2016, 4:46pm

Right off the bat I noticed a couple of discrepancies that might be affecting the results:

You have named your custom analyzer, moves_analyzer, but you are referring to move_analyzer in the mapping for the moves field. Notice the difference in names (plural vs. singular).
In your query, you are using m for the field name (e.g. m:e4 e5*). Shouldn't this be moves instead?

Jack_Shannon · January 11, 2016, 5:03pm

Thank you for the response!

Both are errors I introduced while translating my elastic4s (https://github.com/sksamuel/elastic4s) wrapped queries into a more readable form. I have just checked my code and it is correct there. Sorry, I will edit the post.

I read somewhere that Edge NGrams account for typos, could this be affecting the results?

dadoonet · January 11, 2016, 5:24pm

In case you need some ideas around your project, check http://fr.lichess.org/

He is indexing all games in Elasticsearch. Source code here: https://github.com/ornicar/lila

shaunak · January 11, 2016, 6:37pm

I think I found the issue with the prefix matching not working as expected. In the custom tokenizer definition you are setting max_gram to 3. This means, when a string like "e4 e5 Nf3 Nc6" is indexed using this tokenizer, it will create only these two tokens: "e4" and "e4 " (note the trailing whitespace). This can be tested using the _analyze API as mentioned in the previous comment.

I'm assuming what you want for prefix matching is tokens like this to be produced: "e4", "e4 ", "e4 e", "e4 e5", "e4 e5 ", "e4 e5 N", "e4 e5 Nf", "e4 e5 Nf3", etc. To achieve this you will want to increase the max_gram size to the maximum prefix length you plan to search for.
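
The effect of max_gram can be sketched in plain Python. This is a rough simulation, not Elasticsearch's implementation: it assumes the whole moves string is treated as a single piece of text (no character splits it), and the helper name edge_ngrams is made up for illustration.

```python
def edge_ngrams(text, min_gram, max_gram):
    """Return the prefixes of `text` with lengths min_gram..max_gram (inclusive).

    Simplified stand-in for an edge n-gram tokenizer that keeps all characters.
    """
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


moves = "e4 e5 Nf3 Nc6"

# With max_gram = 3 only two very short prefixes are produced.
small = edge_ngrams(moves, 2, 3)

# With a larger max_gram (20 is an arbitrary illustrative value)
# every prefix up to the full string is produced.
large = edge_ngrams(moves, 2, 20)
```

Here small contains only "e4" and "e4 ", so a search for "e4 e5" has nothing to match, while large contains "e4 e5" along with every other prefix of the game.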

Then, at query time, you can use a filter to perform efficient prefix searches like so:

POST game/_search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "moves": "e4 e5"
        }
      }
    }
  }
}

Or, if you are using Elasticsearch 2.0 or newer, you can use the new syntax as the filtered query is deprecated (but will still work if you really want to use it):

POST game/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "moves": "e4 e5"
        }
      }
    }
  }
}
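
Why a term filter behaves like a prefix match here can be sketched in plain Python. This is a simplified model under stated assumptions, not how Elasticsearch is implemented: the term is compared as-is (a term filter does not analyze its input) against the edge n-grams produced at index time. The helper names edge_ngrams and term_matches, and the default max_gram of 20, are hypothetical.

```python
def edge_ngrams(text, min_gram, max_gram):
    """Prefixes of `text` with lengths min_gram..max_gram (inclusive)."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def term_matches(indexed_moves, term, min_gram=2, max_gram=20):
    """True if `term` is identical to one of the tokens indexed for a game."""
    return term in set(edge_ngrams(indexed_moves, min_gram, max_gram))


starts_with_opening = term_matches("e4 e5 Nf3 Nc6", "e4 e5")
different_opening = term_matches("e4 e6 d4", "e4 e5")
opening_in_middle = term_matches("d4 e4 e5", "e4 e5")
max_gram_too_small = term_matches("e4 e5 Nf3 Nc6", "e4 e5", max_gram=3)
```

Only the first call matches. The last one shows why max_gram has to be at least as long as the longest prefix you intend to search for.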

Also, I noticed this setting in the custom tokenizer definition:

"token_chars": [ "letter", "digit", "whitespace", "punctuation" ]

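This is nearly equivalent to:

"token_chars": []

... which is another way of saying, "keep all characters in the tokens". The explicit list leaves out the "symbol" class, so a character such as "+" would still split the text; the empty list avoids that. See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html for details.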
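
One detail worth checking for chess notation is how token_chars treats the check sign. The following plain-Python sketch approximates the character classes with Unicode categories; it is an assumption-laden illustration, not Elasticsearch code, and the helpers char_class and split_on_token_chars are made up. In particular, the digit class is approximated by all Unicode number categories.

```python
import unicodedata


def char_class(ch):
    """Approximate the token_chars class of a character (illustrative only)."""
    if ch.isspace():
        return "whitespace"
    major = unicodedata.category(ch)[0]
    return {"L": "letter", "N": "digit", "P": "punctuation", "S": "symbol"}.get(major, "other")


def split_on_token_chars(text, token_chars):
    """Split `text` wherever a character falls outside the listed classes.

    An empty list keeps every character, so the text stays in one piece.
    """
    if not token_chars:
        return [text]
    pieces, current = [], ""
    for ch in text:
        if char_class(ch) in token_chars:
            current += ch
        else:
            if current:
                pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces


listed = ["letter", "digit", "whitespace", "punctuation"]
with_list = split_on_token_chars("e4 d5 Bb5+ c6", listed)
with_empty = split_on_token_chars("e4 d5 Bb5+ c6", [])
```

Under these assumptions "+" is classified as a symbol, so with_list breaks the game into two pieces at the check sign, while with_empty keeps the game as a single string from which edge n-grams can be taken.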

Jack_Shannon · January 11, 2016, 6:41pm

I am actually developing this as a feature for lichess. Currently the moves are not indexed, and they need to be for a game explorer. I am trying to develop something like this: http://chessforge.com/

Jack_Shannon · January 11, 2016, 6:42pm

Thank you so much for taking a look at this. I will give this a go tomorrow and let you know how it goes.
