Search string with space in a long text


(Emrah) #1

Hi,

I would like to hear from anyone who has a solid settings-and-mappings structure for an index with long-text fields that I can search with spaces in the query.

To use wildcard queries, the field type must be keyword, which, as I understand it, is not recommended for long text.

Currently, I use match_phrase_prefix and it works.

However, the results are not exactly what I want. For example, when I search for 'street n', I see that 'street - n' is returned as well.

Thanks,
emsi


(Jaspreet Singh) #2

I will try to help with the info you have shared.
If your mapping uses the default analyzer, i.e. standard, you get the standard tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-tokenizer.html). It tokenizes text based on the Unicode Text Segmentation algorithm (Unicode Standard Annex #29).
What that really means is that, both while indexing and while querying against a standard-analyzed field, it will break 'street - n' (or even 'street-n') into street and n and treat them as separate tokens.
Then, depending on the logic of your search query, you get back results. For example, a match query would return any and all documents that generate the tokens street and n, including 'street - n'. For your case, you will need to use another tokenizer.
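You can check this yourself with the _analyze API:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "street - n"
}
```

Both 'street - n' and 'street-n' produce the same two tokens, street and n, at consecutive positions, which is why your match_phrase_prefix for 'street n' matches 'street - n' as well.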


(Viseshini Reddy) #3

Hi,

I have a field with the following mapping - "content": {"type": "string", "index": "not_analyzed"}. I don't want ES to analyze it because I would like to search for strings with spaces. I'm using the following regular expression content:/[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}/ (Lucene query syntax) to search for credit or debit card numbers in the content field, but I don't get any results.

How do I fix it?


(Emrah) #4

Thanks. Have you also worked with grams? I have been trying them. Apparently (n/edge_)gram is the key to handling autocomplete, but I haven't managed to get the right query :frowning:

For example, I get results that contain the letters or keywords in the middle, even though there are documents that have them at the beginning.


(Jaspreet Singh) #5

Can you give a concrete example of what you are trying to search for, so I can share relevant information? Talking about specifics will help us be more efficient.


(Jaspreet Singh) #6

@Viseshini Btw I hope you are leveraging multi-fields (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html) to index content as both text and keyword, because it is likely you need both.
I'm not sure what you mean by not getting results - did you not get any numbers back? Or did you not get back valid credit card numbers?
The other thing to keep in mind is that every credit card company has a different number format. The above regex will likely match any random 16-digit number, not just credit or debit card numbers.
Check this - https://www.regular-expressions.info/creditcard.html
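For reference, a multi-field mapping along those lines could look like this (the index name my_index and the sub-field name raw are just placeholders):

```json
PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
```

Queries can then hit content for full-text matching and content.raw for exact or regex matching.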


(Emrah) #7

Of course. My goal is to see search results instantly - so-called search-as-you-type.
I have been trying different approaches. Now I am using fuzzy query.

At first it seemed to work, but then I realized it is not as accurate as I expected; I want the best matches on top, followed by the rest.

For example, I got results where 'Cafe' appears in the middle of the text, even though there are documents whose content starts with 'Cafe'.

Is there any way to achieve this? Below you may find the mapping along with settings.

BTW, this won't be the actual index once I have figured out what the settings should be, but I want to have an index like this so I can play with the queries.

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "gram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 50,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 50,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 50,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        },
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        },
        "gram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "gram_filter"
          ]
        },
        "edge_ngram_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "edge_ngram_tokenizer"
        },
        "edge_ngram_search_analyzer": {
          "tokenizer": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": true
      },
      "dynamic": false,
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "autocomplete": {
              "type": "text",
              "analyzer": "autocomplete"
            },
            "ngram": {
              "type": "text",
              "analyzer": "ngram_analyzer"
            },
            "gram": {
              "type": "text",
              "analyzer": "gram_analyzer"
            },
            "edge_ngram": {
              "type": "text",
              "analyzer": "edge_ngram_analyzer",
              "search_analyzer": "edge_ngram_search_analyzer"
            }
          }
        }
      }
    }
  }
}

(Viseshini Reddy) #8

I have changed the mapping to "content": {"type": "text", "fields": { "raw": { "type": "keyword" }}, "index": "not_analyzed"}. I did not get any numbers back with either mapping.

Thanks for the inputs on credit card number formats. I'm just testing how to use regexes on Elasticsearch string fields.


(Jaspreet Singh) #9

I would really try simple things first, just to make sure I get the syntax right. Check this - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html.
Once you are confident, then try the credit card regex, because it is complex.
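For example, assuming content is indexed as a multi-field with a content.raw keyword sub-field, a minimal regexp query sketch would be:

```json
GET /my_index/_search
{
  "query": {
    "regexp": {
      "content.raw": "[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}"
    }
  }
}
```

One thing to be aware of: the regexp query runs against indexed terms and must match an entire term. On an analyzed text field the terms are single tokens that never contain spaces, so a pattern with spaces in it can never match there - it can only match on a keyword field, and only when the pattern matches the whole value.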


(Jaspreet Singh) #10

Have you seen the completion suggester (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html) and the related context suggester (https://www.elastic.co/guide/en/elasticsearch/reference/current/suggester-context.html)?
They are the recommended approaches for a search-as-you-type feature (they are optimized for fast lookups).
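A minimal sketch of what that can look like (the index, type, and field names here are just placeholders):

```json
PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name_suggest": {
          "type": "completion"
        }
      }
    }
  }
}

GET /my_index/_search
{
  "suggest": {
    "name_suggestions": {
      "prefix": "cafe",
      "completion": {
        "field": "name_suggest"
      }
    }
  }
}
```

By default the completion field is indexed with the simple analyzer, which lowercases the input; it can be overridden via the analyzer setting in the mapping.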


(Emrah) #11

The completion suggester could be the key to achieving my goal, but how can I apply a lowercase analyzer?

Currently, I am sending a pre-normalized string; however, I would rather have Elasticsearch do it during indexing.

And I have tried - perhaps I am missing something.