Implement collocation in Elasticsearch

I am new to Elasticsearch. I am trying to build an analyzer or an ingest pipeline that would create collocations of words (unigrams, bigrams, and trigrams, with a step of up to 2, i.e. skip-grams that may skip up to two intervening words). I am aware this is feasible in Python, but I am only interested in an ES solution.
So far I have tried to do it using shingles, like this:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "predicate_token_filter",
      "script": {
        "source": "token.getPosition() % 2 == 0"
      }
    },
    {
      "type": "shingle",
      "max_shingle_size": 5,
      "min_shingle_size": 3,
      "output_unigrams": false,
      "token_separator": " ",
      "filler_token": ""
    },
    "trim",
    "unique",
    {
      "type": "pattern_replace",
      "pattern": "\\s+",
      "replacement": " "
    }
  ],
  "text": "aerial photo airplane taken with a nice camera"
}

It gives me this output:

{
  "tokens" : [
    {
      "token" : "aerial airplane",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "aerial airplane with",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "airplane",
      "start_offset" : 13,
      "end_offset" : 28,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "airplane with",
      "start_offset" : 13,
      "end_offset" : 32,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "airplane with nice",
      "start_offset" : 13,
      "end_offset" : 39,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 3
    },
    {
      "token" : "with",
      "start_offset" : 28,
      "end_offset" : 35,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "with nice",
      "start_offset" : 28,
      "end_offset" : 39,
      "type" : "shingle",
      "position" : 2,
      "positionLength" : 2
    },
    {
      "token" : "nice",
      "start_offset" : 35,
      "end_offset" : 46,
      "type" : "shingle",
      "position" : 3
    }
  ]
}

but my ideal output would be (just outputting the tokens):

['aerial', 'photo', 'airplane', 'taken', 'with', 'a', 'nice', 'camera', ('aerial', 'photo'), ('aerial', 'airplane'), ('aerial', 'taken'), ('photo', 'airplane'), ('photo', 'taken'), ('photo', 'with'), ('airplane', 'taken'), ('airplane', 'with'), ('airplane', 'a'), ('taken', 'with'), ('taken', 'a'), ('taken', 'nice'), ('with', 'a'), ('with', 'nice'), ('with', 'camera'), ('a', 'nice'), ('a', 'camera'), ('nice', 'camera'), ('aerial', 'photo', 'airplane'), ('aerial', 'photo', 'taken'), ('aerial', 'photo', 'with'), ('aerial', 'airplane', 'taken'), ('aerial', 'airplane', 'with'), ('aerial', 'taken', 'with'), ('photo', 'airplane', 'taken'), ('photo', 'airplane', 'with'), ('photo', 'airplane', 'a'), ('photo', 'taken', 'with'), ('photo', 'taken', 'a'), ('photo', 'with', 'a'), ('airplane', 'taken', 'with'), ('airplane', 'taken', 'a'), ('airplane', 'taken', 'nice'), ('airplane', 'with', 'a'), ('airplane', 'with', 'nice'), ('airplane', 'a', 'nice'), ('taken', 'with', 'a'), ('taken', 'with', 'nice'), ('taken', 'with', 'camera'), ('taken', 'a', 'nice'), ('taken', 'a', 'camera'), ('taken', 'nice', 'camera'), ('with', 'a', 'nice'), ('with', 'a', 'camera'), ('with', 'nice', 'camera'), ('a', 'nice', 'camera')]

An interesting problem.
I think doing this through analysis in Elasticsearch will not work, as the current shingle filters don't support a step parameter and only build shingles from adjacent words.
There is a multiplexer filter that helps to produce parallel token streams, but it doesn't work with shingles.
You could open a GitHub issue in the elasticsearch repository with a feature request for shingle filters to gain this additional step parameter, which would allow forming shingles with non-adjacent words as well.
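
For what it's worth, adjacent-only shingles can still be precomputed at index time with a custom analyzer on a multi-field. A minimal sketch of such a mapping (the index name my-index and the field name description are made up for illustration; this only yields adjacent bigrams and trigrams, without the step behaviour you want):

PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "bigram_trigram": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "bigram_trigram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "shingles": {
            "type": "text",
            "analyzer": "shingle_analyzer"
          }
        }
      }
    }
  }
}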


Finding collocations can instead be done at query time with a match_phrase query and a corresponding slop parameter. In that case, indexing with just the standard analyzer should be sufficient.
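
For example, a minimal sketch of such a query against the made-up my-index / description field above (slop: 2 lets the two terms match even with, roughly, up to two other words between them):

GET /my-index/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "aerial airplane",
        "slop": 2
      }
    }
  }
}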

But search only helps you answer whether the terms of a given phrase are collocated or not; it won't enumerate the collocations for you.
