How do I index documents that contain pascal-case strings and set up some stemming associations?

For example, I have a document like the following:

{
  "filename": "SRV_MODA_ACCT_FSCK",
  "svcName": "VerifyAddOperationForUserTransactions"
}

If I search with terms such as "srv", "moda", "acct", etc., I'd want this record to turn up in the results. Similarly, if I type "verify", "add", or "transaction", I'd want the above document to turn up in a search.

Further, there may be other documents where, for example, the word "acct" is replaced by "account" or "accounts" (either in the filename or the svcName field). When the user searches for "account", I'd still want the above document to show up. Similarly for "chrg" (short for "charge"): if the user typed "chrg", I'd want all documents where either the filename or the svcName contains the word "chrg". Those fields may be pascal-case, e.g. ComputeTxnFeeChrgs, where "transaction" is abbreviated as "txn". So if the user searched for "chrgs" and "txn" together, all documents containing them need to be returned, and the same results should come back if the user typed the whole words: "charges" and "transactions", or "charge" and "transaction", etc.

How can I make Elasticsearch achieve this?

Welcome!

You need to play with analyzers for that.
As an example, I wrote this script, which applies a custom analyzer to the filename field.

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "lowercase",
          "filter": [ "my_synonyms" ]
        }
      },
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["acct, account, accounts => acct"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "filename": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
PUT test/_doc/1
{
  "filename": "SRV_MODA_ACCT_FSCK",
  "svcName": "VerifyAddOperationForUserTransactions"
}
GET test/_search
{
  "query": {
    "match": {
      "filename": "srv"
    }
  }
}
GET test/_search
{
  "query": {
    "match": {
      "filename": "account"
    }
  }
}

This matches what you asked for.
I hope this helps as a starting point.
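
If you want to check what the analyzer actually produces for a given input, the _analyze API is handy (this simply reuses the test index and the my_analyzer defined above):

GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "SRV_MODA_ACCT_FSCK"
}

The lowercase tokenizer splits on the underscores and lowercases each part, so you should get srv, moda, acct and fsck back as tokens; if the text contained "accounts" instead, it should still come out as acct because of the synonym filter.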

Then have a look at the documentation to understand what analyzers are and how they work.

HTH

Here are examples of the tokenizations I need:

  1. verifyLAMBForCare => verify, LAMB, For, Care
  2. populateTXNSForBalanceComputation => populate, TXNS, For, Balance, Computation
  3. fetchDtlsOfCommodity => fetch, Dtls, Of, Commodity

I tried some pattern-based tokenization. However, I am unable to crack the first two examples with it.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(?=[A-Z])"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "verifyLAMBForCare"
}

The result I got for the above:

{
  "tokens" : [
    {
      "token" : "verify",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "L",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "A",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "M",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "B",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "For",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "Care",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 6
    }
  ]
}

Not quite what I was looking for. Is there a way I can index the words that are in capital letters? Perhaps I have to tokenize in two or more passes(?). For example, for verifyLAMBForCare I may have to extract LAMB on its own and make Elasticsearch treat it as a single token.

Alternatively, with the previous tokenization strategy (where the same word was tokenized as verify, L, A, M, B, For, Care), perhaps there is some way to tell Elasticsearch to "discard" L, A, M, B(?).

Hoping I can get more guidance here.

Hi, I think this is not strictly an Elasticsearch question, but you could obtain the desired result by making the regex a little more complicated.

Using this regex: (?=[A-Z][a-z])|((?<=[a-z])(?=[A-Z]))
you break the string at an uppercase char only if:

  • it is followed by a lowercase char, or
  • it is preceded by a lowercase char

Thus, any sequence made up only of uppercase chars is not split.

I've tested it on verifyLAMBForCareLAMB and it gives the desired result: verify, LAMB, For, Care, LAMB as tokens.
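
For reference, here is a sketch of how that regex could be dropped into the earlier index definition; it keeps the my_index and my_analyzer names from the previous attempt and only changes the pattern (delete the old index first so the PUT succeeds):

DELETE my_index
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(?=[A-Z][a-z])|((?<=[a-z])(?=[A-Z]))"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "verifyLAMBForCareLAMB"
}

As an aside, the built-in word_delimiter_graph token filter (which splits on case changes by default) might give you something similar without a custom regex, though I haven't verified it against these exact inputs.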
