Edge NGrams with Aggregations on Indexed chess games


(Jack Shannon) #1

Hi,

I am indexing chess games to create a game explorer. I have a database of millions of games and I will need to index them.

The aim is to be able to query some opening moves (e.g. "e4 e5 Nf3 Nc6") and get a response with the 10 most popular next moves, how often each has been played, and the number of times white won, black won, and the game was drawn.
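To make the target output concrete, here is a minimal sketch in plain Python of the statistics described above. It does not use Elasticsearch at all, and the sample games and the function name `next_move_stats` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (space-separated moves, winner) pairs.
GAMES = [
    ("e4 e5 Nf3 Nc6 Bb5", "white"),
    ("e4 e5 Nf3 Nc6 Bc4", "black"),
    ("e4 e5 Nf3 Nf6", "draw"),
    ("e4 e5 Nf3 Nc6 Bb5", "draw"),
    ("d4 d5 c4", "white"),
]

def next_move_stats(games, opening, size=10):
    """Tally the move played right after `opening`, with results per move."""
    prefix = opening.split()
    results = defaultdict(Counter)
    for moves, winner in games:
        played = moves.split()
        # Exact, whole-move prefix comparison: no fuzziness involved.
        if len(played) > len(prefix) and played[:len(prefix)] == prefix:
            results[played[len(prefix)]][winner] += 1
    # Most popular next moves first, limited to `size` entries.
    ranked = sorted(results.items(), key=lambda kv: -sum(kv[1].values()))
    return [(move, sum(c.values()), dict(c)) for move, c in ranked[:size]]

stats = next_move_stats(GAMES, "e4 e5 Nf3")
for move, total, by_winner in stats:
    print(move, total, by_winner)
```

Each returned entry holds a next move, how many matching games played it, and the win/draw breakdown, which is the shape a terms aggregation with a nested terms sub-aggregation on winner would need to reproduce.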

Here is what I have so far:

The Analyzer:
PUT game/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "moves_analyzer": {
          "tokenizer": "moves_tokenizer"
        }
      },
      "tokenizer": {
        "moves_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "3",
          "token_chars": [ "letter", "digit", "whitespace", "punctuation" ]
        }
      }
    }
  }
}

The Mapping:
PUT game/game/_mapping
{
  "properties": {
    "winner": {
      "type": "string"
    },
    "moves": {
      "type": "string",
      "analyzer": "moves_analyzer"
    }
  }
}

The Query and Aggregation (hardcoded to search for 'e4 e5'):
GET game/game/_search
{
  "query": {
    "query_string": {
      "query": "moves:e4 e5*",
      "analyzer": "moves_analyzer"
    }
  },
  "aggs": {
    "nextmoves": {
      "terms": {
        "field": "moves",
        "script": "_value.split(' ')[2]",
        "size": 10
      },
      "aggs": {
        "winners": {
          "terms": {
            "field": "winner"
          }
        }
      }
    }
  }
}

I'm getting some strange results. What I'm looking for is pure prefix matching that doesn't use any fuzziness to account for typos. I also don't like the way I am using scripting in the aggregation. Is there another way of doing this?

Any input on a better way to do this would be appreciated.


(Jack Shannon) #2

Please could someone try to help me with this, or let me know if my questions are unclear?


(Shaunak Kashyap) #3

Right off the bat I noticed a couple of discrepancies that might be affecting the results:

  1. You have named your custom analyzer, moves_analyzer, but you are referring to move_analyzer in the mapping for the moves field. Notice the difference in names (plural vs. singular).

  2. In your query, you are using m for the field name (e.g. m:e4 e5*). Shouldn't this be moves instead?


(Jack Shannon) #4

Thank you for your response!

Both are errors I made while translating my elastic4s (https://github.com/sksamuel/elastic4s) wrapped queries into plain JSON to make them readable. I have just checked my code and it is correct there. Sorry, I will edit the post.

I read somewhere that Edge NGrams account for typos. Could this be affecting the results?


(David Pilato) #5

In case you need some ideas around your project, check http://fr.lichess.org/

He is indexing all games in Elasticsearch. Source code here: https://github.com/ornicar/lila


(Shaunak Kashyap) #7

I think I found the issue with the prefix matching not working as expected. In the custom tokenizer definition you are setting max_gram to 3. This means, when a string like "e4 e5 Nf3 Nc6" is indexed using this tokenizer, it will create only these two tokens: "e4" and "e4 " (note the trailing whitespace). This can be tested using the _analyze API as mentioned in the previous comment.

I'm assuming what you want for prefix matching is tokens like this to be produced: "e4", "e4 ", "e4 e", "e4 e5", "e4 e5 ", "e4 e5 N", "e4 e5 Nf", "e4 e5 Nf3", etc. To achieve this you will want to increase the max_gram size to the maximum prefix length you plan to search for.
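The effect of max_gram can be checked without a cluster. Below is a rough Python sketch that imitates the edge n-gram output for a single field value; it assumes every character is kept in the token (no splitting on token_chars), and the helper name `edge_ngrams` is made up for illustration.

```python
def edge_ngrams(text, min_gram, max_gram):
    """Prefixes of `text` from min_gram to max_gram characters long."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]

moves = "e4 e5 Nf3 Nc6"

# max_gram = 3, as in the original tokenizer definition.
short = edge_ngrams(moves, 2, 3)
print(short)  # ['e4', 'e4 ']

# A larger max_gram (20 is an arbitrary value for this sketch).
longer = edge_ngrams(moves, 2, 20)
print("e4 e5" in short, "e4 e5" in longer)  # False True
```

With max_gram at 3 the prefix "e4 e5" is never produced as a token, so no exact-term lookup can find it. Once max_gram covers the prefix length, it is.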

Then, at query time, you can use a filter to perform efficient prefix searches like so:

POST game/_search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "moves": "e4 e5"
        }
      }
    }
  }
}

Or, if you are using Elasticsearch 2.0 or newer, you can use the new syntax as the filtered query is deprecated (but will still work if you really want to use it):

POST game/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "moves": "e4 e5"
        }
      }
    }
  }
}
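To see why an exact term lookup behaves like a prefix search here, the following Python sketch builds a toy inverted index from edge n-grams. It is only an illustration of the idea, not how Elasticsearch is implemented; the sample games, `edge_ngrams`, and `term_lookup` are made up, and a max_gram of 20 is assumed.

```python
from collections import defaultdict

def edge_ngrams(text, min_gram=2, max_gram=20):
    """Prefixes of `text` from min_gram to max_gram characters long."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

# Hypothetical games, keyed by document id.
GAMES = {
    1: "e4 e5 Nf3 Nc6",
    2: "e4 e5 Nf3 Nf6",
    3: "e4 c5 Nf3",
    4: "d4 d5 c4",
}

# Index time: every prefix of the move string becomes a term.
index = defaultdict(set)
for game_id, moves in GAMES.items():
    for token in edge_ngrams(moves):
        index[token].add(game_id)

def term_lookup(term):
    """Exact match on one term: no analysis, no fuzziness."""
    return sorted(index.get(term, set()))

matches = term_lookup("e4 e5")
print(matches)             # [1, 2]
print(term_lookup("e5"))   # []
```

Because the search string is compared verbatim against the stored prefixes, "e5" on its own finds nothing: only strings that start the game can match. In this sketch the comparison is character-based, so a term such as "d4 d" would match any game starting with "d4 d5" or "d4 d6"; ending the term on a complete move avoids most of that.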

Also, I noticed this setting in the custom tokenizer definition:

"token_chars": [ "letter", "digit", "whitespace", "punctuation" ] 

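This is nearly equivalent to:

"token_chars": []

... which is another way of saying, "keep all characters in the tokens". The one difference is that your list omits the symbol class, so a symbol character (such as the "+" used to mark a check) would still split a token. See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html for details.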
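One thing worth checking for chess notation is how token_chars treats characters outside the listed classes. The Python sketch below is a rough stand-in for that behaviour, assuming the tokenizer splits the text at any character whose class is not listed and emits edge n-grams for each resulting run. The classification via `unicodedata` only approximates the real character classes, and the function names are made up.

```python
import unicodedata

def char_class(ch):
    """Approximate character class of `ch` (an assumption, not the real code)."""
    if ch.isspace():
        return "whitespace"
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category == "Nd":
        return "digit"
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("S"):
        return "symbol"
    return "other"

def edge_ngrams(text, min_gram, max_gram, token_chars=()):
    """Split on characters outside token_chars, then emit prefixes of each run."""
    runs, current = [], ""
    for ch in text:
        # An empty token_chars means "keep all characters".
        if not token_chars or char_class(ch) in token_chars:
            current += ch
        else:
            runs.append(current)
            current = ""
    runs.append(current)
    return [run[:n] for run in runs
            for n in range(min_gram, min(max_gram, len(run)) + 1)]

listed = ("letter", "digit", "whitespace", "punctuation")

# No symbols in the text: the listed classes and "keep all" agree.
plain = edge_ngrams("e4 e5 Nf3", 2, 20, listed)
keep_all = edge_ngrams("e4 e5 Nf3", 2, 20)

# "+" falls in the symbol class, so it splits the move list in two.
with_check = edge_ngrams("e4 f5 Qh5+ g6", 2, 20, listed)
```

In this sketch `with_check` contains "e4 f5 Qh5" and " g6" as separate tokens but never the full "e4 f5 Qh5+ g6", so prefixes that run past a check mark would not be searchable unless all characters are kept.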


(Jack Shannon) #8

I am actually developing this as a feature for lichess. Currently the moves are not indexed, and they need to be for a game explorer. I am trying to develop something like this: http://chessforge.com/


(Jack Shannon) #9

Thank you so much for taking a look at this. I will give this a go tomorrow and let you know how it goes :)

