Search string with space in a long text


(Emrah) #1

Hi,

I would like to hear from anyone who has a solid settings-and-mappings structure for an index with long-text fields that I can search with spaces in the query.

To use wildcard queries, the field type must be keyword, which, as I understand it, is not recommended for long text.

Currently, I use match_phrase_prefix and it works.

However, the results are not exactly what I want. For example, when I search for 'street n', I see that 'street - n' is returned as well.

Thanks,
emsi


(Jaspreet Singh) #2

I will try to help with the info you have shared.
If your mapping uses the default analyzer, i.e. standard, you get the standard tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-tokenizer.html). It tokenizes text based on the Unicode Text Segmentation algorithm (Unicode Standard Annex #29).
What that really means is that, both while indexing and while querying against a standard-analyzed field, it will break 'street - n' (or even 'street-n') into street and n and treat them as separate tokens.
Then, depending on the logic of your search query, you get back results. For example, a match query would return any and all documents that generate the tokens street and n, including 'street - n'. For your case, you will need to use another tokenizer.
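You can check this yourself with the _analyze API:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "street - n"
}
```

Both 'street - n' and 'street-n' produce the same two tokens, street and n, at consecutive positions, which is why your match_phrase_prefix for 'street n' matches 'street - n' as well.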


(Viseshini Reddy) #3

Hi,

I have a field with the following mapping - "content": {"type": "string", "index": "not_analyzed"}. I don't want ES to analyze it because I would like to search for strings with spaces. I'm using the following regular expression content:/[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}/ (Lucene query syntax) to search for credit or debit card numbers in the content field, but I don't get any results.

How do I fix it?


(Emrah) #4

Thanks. Have you also worked with grams? I have been trying them. Apparently (n/edge_)gram is the key to handling autocomplete, but I haven't managed to get the right query :frowning:

For example, I get results that contain the letters or keywords in the middle, even though there are documents that have them at the beginning.


(Jaspreet Singh) #5

Can you give a concrete example of what you are trying to search for, so I can share relevant information? Talking about specifics will help us be more efficient.


(Jaspreet Singh) #6

@Viseshini Btw I hope you are leveraging multi-fields (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html) to index content as both text and keyword, because it is likely you need both.
I'm not sure what you mean by not getting results - did you not get any numbers back? Or did you not get back valid credit card numbers?
The other thing to keep in mind is that every credit card company has a different number format. The above regex will likely match any random 16-digit number, not just credit or debit card numbers.
Check this - https://www.regular-expressions.info/creditcard.html
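For reference, a multi-field mapping along those lines could look like this (the index name my_index and the sub-field name raw are just placeholders):

```json
PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
```

Queries can then hit content for full-text matching and content.raw for exact or regex matching.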


(Emrah) #7

Of course. My goal is to see search results instantly - so-called search-as-you-type.
I have been trying different approaches. Now I am using fuzzy query.

At first it seemed to work, but then I realized it is not as accurate as I expected; I want the best matches on top, followed by the rest.

For example, I got results where 'Cafe' appears in the middle of the text, even though there are documents whose content starts with 'Cafe'.

Is there any way to achieve this? Below you may find the mapping along with settings.

BTW, this won't be the actual index once I have figured out what the settings should be, but I want to have an index like this so I can play with the queries.

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "gram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 50,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 50,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 50,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        },
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        },
        "gram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "gram_filter"
          ]
        },
        "edge_ngram_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "edge_ngram_tokenizer"
        },
        "edge_ngram_search_analyzer": {
          "tokenizer": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": true
      },
      "dynamic": false,
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "autocomplete": {
              "type": "text",
              "analyzer": "autocomplete"
            },
            "ngram": {
              "type": "text",
              "analyzer": "ngram_analyzer"
            },
            "gram": {
              "type": "text",
              "analyzer": "gram_analyzer"
            },
            "edge_ngram": {
              "type": "text",
              "analyzer": "edge_ngram_analyzer",
              "search_analyzer": "edge_ngram_search_analyzer"
            }
          }
        }
      }
    }
  }
}

(Viseshini Reddy) #8

I have changed the mapping to "content": {"type": "text", "fields": { "raw": { "type": "keyword" }}, "index": "not_analyzed"}. I did not get any numbers back with either mapping.

Thanks for the inputs on credit card number formats. I'm just testing how to use regexes on Elasticsearch string fields.


(Jaspreet Singh) #9

I would really try simple things first, just to make sure I get the syntax right. Check this - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html.
Once you are confident, then try the credit card regex, because it is complex.
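For example, assuming content is indexed as a multi-field with a content.raw keyword sub-field, a minimal regexp query sketch would be:

```json
GET /my_index/_search
{
  "query": {
    "regexp": {
      "content.raw": "[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}"
    }
  }
}
```

One thing to be aware of: the regexp query runs against indexed terms and must match an entire term. On an analyzed text field the terms are single tokens that never contain spaces, so a pattern with spaces in it can never match there - it can only match on a keyword field, and only when the pattern matches the whole value.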


(Jaspreet Singh) #10

Have you seen the completion suggester (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html) and the related context suggester (https://www.elastic.co/guide/en/elasticsearch/reference/current/suggester-context.html)?
They are the recommended approaches for a search-as-you-type feature (they are optimized for fast lookups).
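A minimal sketch of what that can look like (the index, type, and field names here are just placeholders):

```json
PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name_suggest": {
          "type": "completion"
        }
      }
    }
  }
}

GET /my_index/_search
{
  "suggest": {
    "name_suggestions": {
      "prefix": "cafe",
      "completion": {
        "field": "name_suggest"
      }
    }
  }
}
```

By default the completion field is indexed with the simple analyzer, which lowercases the input; it can be overridden via the analyzer setting in the mapping.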


(Emrah) #11

The completion suggester could be the key to achieving my goal, but how can I apply a lowercase analyzer?

Currently, I am sending a pre-normalized string; however, I would rather have Elasticsearch do it during indexing.

And I have tried - perhaps I am missing something.