How do I index documents that contain pascal-case strings and set up some stemming associations?

For example, I have a document like the following:

{
  "filename": "SRV_MODA_ACCT_FSCK",
  "svcName": "VerifyAddOperationForUserTransactions"
}

If I search with terms such as "srv", "moda", "acct", etc., I'd want this record to turn up in the results. Similarly, if I type "verify", "add", or "transaction", I'd want the above document to turn up in a search.

Further, there may be other documents where, for example, the word "acct" is replaced by "account" or "accounts" (either in the filename or the svcName field). When the user searches for "account", I'd still want the above document to show up. Similarly for "chrg" (short for "charge"): if the user typed "chrg", I'd want all documents where either the filename or the svcName contains the word "chrg". Those fields may be pascal-case, e.g. ComputeTxnFeeChrgs, where "transaction" is abbreviated as "txn". So if the user searched for "chrgs" and "txn" together, all documents containing them need to be returned, and the same results should come back if the user typed the whole words: "charges" and "transactions", or "charge" and "transaction", etc.

How can I make Elasticsearch achieve this?

Welcome!

You need to play with analyzers for that.
As an example, I wrote this script, which applies a custom analyzer to the filename field.

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "lowercase",
          "filter": [ "my_synonyms" ]
        }
      },
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["acct, account, accounts => acct"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "filename": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
PUT test/_doc/1
{
  "filename": "SRV_MODA_ACCT_FSCK",
  "svcName": "VerifyAddOperationForUserTransactions"
}
GET test/_search
{
  "query": {
    "match": {
      "filename": "srv"
    }
  }
}
GET test/_search
{
  "query": {
    "match": {
      "filename": "account"
    }
  }
}

This matches what you asked for.
I hope this helps as a starting point.
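
If you want to check what the analyzer actually produces for a given input, the _analyze API is handy (this simply reuses the test index and the my_analyzer defined above):

GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "SRV_MODA_ACCT_FSCK"
}

The lowercase tokenizer splits on the underscores and lowercases each part, so you should get srv, moda, acct and fsck back as tokens; if the text contained "accounts" instead, it should still come out as acct because of the synonym filter.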

Then have a look at the documentation to understand what analyzers are and how they work.

HTH

Here are examples of the tokenizations I need:

  1. verifyLAMBForCare => verify, LAMB, For, Care
  2. populateTXNSForBalanceComputation => populate, TXNS, For, Balance, Computation
  3. fetchDtlsOfCommodity => fetch, Dtls, Of, Commodity

I tried some pattern-based tokenization. However, I am unable to crack the first two examples with it.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(?=[A-Z])"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "verifyLAMBForCare"
}

The result I got for the above:

{
  "tokens" : [
    {
      "token" : "verify",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "L",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "A",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "M",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "B",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "For",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "Care",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 6
    }
  ]
}

Not quite what I was looking for. Is there a way I can index the words that are in capital letters? Perhaps I have to tokenize in two or more passes(?). For example, for verifyLAMBForCare I may have to extract LAMB on its own and make Elasticsearch treat it as a single token.

Alternatively, with the previous tokenization strategy (where the same word was tokenized as verify, L, A, M, B, For, Care), perhaps there is some way to tell Elasticsearch to "discard" L, A, M, B(?).

Hoping I can get more guidance here.

Hi, I think this is not strictly an Elasticsearch question, but you could obtain the desired result by making the regex a little more complicated.

Using this regex: (?=[A-Z][a-z])|((?<=[a-z])(?=[A-Z]))
you break the string at an uppercase char only if:

  • it is followed by a lowercase char, or
  • it is preceded by a lowercase char

Thus, any sequence made up only of uppercase chars is not split.

I've tested it on verifyLAMBForCareLAMB and it gives the desired result: verify, LAMB, For, Care, LAMB as tokens.
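
For reference, here is a sketch of how that regex could be dropped into the earlier index definition; it keeps the my_index and my_analyzer names from the previous attempt and only changes the pattern (delete the old index first so the PUT succeeds):

DELETE my_index
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(?=[A-Z][a-z])|((?<=[a-z])(?=[A-Z]))"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "verifyLAMBForCareLAMB"
}

As an aside, the built-in word_delimiter_graph token filter (which splits on case changes by default) might give you something similar without a custom regex, though I haven't verified it against these exact inputs.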
