How to search for text within text?

Hi

I have an index that contains data, including an email field, and I want to search within this field (email) for specific values, for example:

mary.smith@sakilacustomer.org

I want to find documents whose email contains smith in the middle and ends with .org or .com.
What is the most suitable approach so that the query is very fast?

Try this:

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "[.@]"
        }
      }
    }
  },
  "mappings": {
    "properties": {
        "email": {
            "type": "text",
            "analyzer": "my_analyzer"
        }
    }
  }
}
POST /test/_doc
{
    "email": "mary.smith@sakilacustomer.org"
}
GET /test/_search
{
    "query": {
        "match": {
          "email": "mary"
        }
    }
}
GET /test/_search
{
    "query": {
        "match": {
          "email": "sakilacustomer"
        }
    }
}

If you want to understand what is happening behind the scenes, run this:

GET /test/_analyze
{
    "analyzer": "my_analyzer",
    "text": ["mary.smith@sakilacustomer.org"]
}

This gives:

{
  "tokens": [
    {
      "token": "mary",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "smith",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "sakilacustomer",
      "start_offset": 11,
      "end_offset": 25,
      "type": "word",
      "position": 2
    },
    {
      "token": "org",
      "start_offset": 26,
      "end_offset": 29,
      "type": "word",
      "position": 3
    }
  ]
}
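
To make the analysis output above concrete, here is a minimal Python sketch that imitates what a simple_pattern_split tokenizer with the pattern [.@] does to the address. It is only an approximation for illustration, not Elasticsearch code, and the function name split_tokens is made up for this example.

```python
import re


def split_tokens(text):
    """Imitate splitting on '.' and '@'.

    Returns (token, start_offset, end_offset, position) tuples.
    """
    tokens = []
    # Every maximal run of characters that are not '.' or '@' becomes one token.
    for position, match in enumerate(re.finditer(r"[^.@]+", text)):
        tokens.append((match.group(), match.start(), match.end(), position))
    return tokens


for token in split_tokens("mary.smith@sakilacustomer.org"):
    print(token)
# ('mary', 0, 4, 0)
# ('smith', 5, 10, 1)
# ('sakilacustomer', 11, 25, 2)
# ('org', 26, 29, 3)
```

The tuples line up with the token, start_offset, end_offset and position values in the _analyze response.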

I want the query to return all the emails that satisfy the conditions in the query.
For example:
GET index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "email": {
              "value": "*smith*"
            }
          }
        },
        {
          "wildcard": {
            "email": {
              "value": "*org"
            }
          }
        }
      ]
    }
  }
}

I want this query to return all the emails that contain the term smith in the middle and .org at the end.

Your suggestion returns all the matching values, whether they are in the middle, at the beginning, or at the end, and I do not want that. I want exactly the value that I specify in the query.
For example, if I ask for a value in the middle, I do not want matches for the same value at the beginning or the end. I only want the one in the middle.

Please reproduce the problem, the same way I did. That will help to understand exactly what you have and what you want.

There is no problem with the query, but it is very slow.

There are 2 possibilities from here:

  • you continue to use wildcards and the query will remain slow
  • you change the query / mapping to find a faster way to get your results

I proposed something which is much faster, but you answered that it does not fit your use case.
So I asked you to reproduce your use case so we can help you.
You answered that you don't have any problem with the query, but that it is just slow.

Not sure how to help if you don't want to be helped.

Create the mapping David presented above and index the document he used as an example. You should then be able to query it as follows without requiring any wildcard queries, which should speed it up considerably.

GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "email": "smith"
          }
        },
        {
          "match": {
            "email": "org"
          }
        }
      ]
    }
  }
}

As the field is tokenised you can query directly on the individual components, which you can see from the analysis output above.
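
As a rough illustration of why querying tokens directly is fast, here is a small Python sketch of a term lookup over a toy inverted index. It is a simplified model, not how Elasticsearch is implemented; documents 2 and 3 and all function names are hypothetical and exist only for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Split on '.' and '@', like the analyzer defined in the mapping above.
    return re.findall(r"[^.@]+", text)


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, email in docs.items():
        for token in tokenize(email):
            index[token].add(doc_id)
    return index


def match_all(index, *terms):
    """Ids of documents containing every term (like bool/must of match clauses)."""
    result = None
    for term in terms:
        ids = index.get(term, set())  # one dictionary lookup per term, no scanning
        result = ids if result is None else result & ids
    return sorted(result or [])


docs = {
    1: "mary.smith@sakilacustomer.org",
    2: "john.smith@example.com",  # hypothetical
    3: "org.team@smith.com",      # hypothetical
}
index = build_index(docs)
print(match_all(index, "smith", "org"))  # prints [1, 3]
```

Each term is found with a direct lookup, which is why this avoids the cost of wildcard matching. Note that document 3 also matches: token lookups do not care where in the address a token appears.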

I am sorry.
I really do want help, because the query has become very slow and tiresome.

What do you want me to do? I want to understand exactly what you mean.

Yes, you are right, but this returns any similar data, and my use case does not allow that. It must return only the exact matches.

I do not understand what you mean by that. You need to provide explicit examples of what does and what does not work.

Please show an example with a query and sample data and what is returned that you do not want and a sample document that is not returned even though you expect it to be.

If you want to search on specific parts of the string and not treat it all as a list of tokens you can break out components into separate fields, e.g. like this:

POST /test/_doc
{
    "email": "mary.smith@sakilacustomer.org",
    "domain": "sakilacustomer.org",
    "domain_suffix": "org"
}

Here you can search on the whole email address but also on the domain or just the suffix without using wildcards, which should be fast and scale well.
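
A minimal Python sketch of how such a document could be derived from the address before indexing. The helper name email_fields is made up for this example, and it assumes a well-formed address with one "@" and a dot in the domain.

```python
def email_fields(email):
    """Build a document holding the email plus derived domain fields."""
    # Everything after the last '@' is the domain, e.g. "sakilacustomer.org".
    _, _, domain = email.rpartition("@")
    # Everything after the last '.' in the domain is the suffix, e.g. "org".
    _, _, suffix = domain.rpartition(".")
    return {"email": email, "domain": domain, "domain_suffix": suffix}


print(email_fields("mary.smith@sakilacustomer.org"))
# {'email': 'mary.smith@sakilacustomer.org', 'domain': 'sakilacustomer.org', 'domain_suffix': 'org'}
```

With fields like these, a condition such as "ends with org" becomes an exact lookup on domain_suffix instead of a wildcard on the whole address.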

If you do not want to break out parts into separate fields in the source document you may also be able to achieve this through multi fields using different analysers.
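
One way the multi-field idea might look, written as a Python dict that could be sent as the index creation body. This is a sketch under assumptions and has not been run against a cluster: the names keep_suffix, suffix_analyzer and the suffix sub-field are hypothetical, and the pattern_replace char filter is just one possible way to keep only the text after the last dot. The suffix_of helper merely imitates that analyzer locally.

```python
import json
import re

# Hypothetical names: "keep_suffix", "suffix_analyzer", and the "suffix" sub-field.
index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                # Assumption: drop everything up to and including the last '.'.
                "keep_suffix": {
                    "type": "pattern_replace",
                    "pattern": "^.*\\.",
                    "replacement": "",
                }
            },
            "tokenizer": {
                "my_tokenizer": {"type": "simple_pattern_split", "pattern": "[.@]"}
            },
            "analyzer": {
                "my_analyzer": {"tokenizer": "my_tokenizer"},
                "suffix_analyzer": {
                    "type": "custom",
                    "char_filter": ["keep_suffix"],
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "email": {
                "type": "text",
                "analyzer": "my_analyzer",
                # Same source value, indexed a second time as its suffix only.
                "fields": {"suffix": {"type": "text", "analyzer": "suffix_analyzer"}},
            }
        }
    },
}

# The token "smith" anywhere in the address, and the suffix exactly "org".
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"email": "smith"}},
                {"match": {"email.suffix": "org"}},
            ]
        }
    }
}


def suffix_of(email):
    """Local imitation of suffix_analyzer: text after the last '.', lowercased."""
    return re.sub(r"^.*\.", "", email).lower()


print(suffix_of("mary.smith@sakilacustomer.org"))  # prints org
print(json.dumps(query, indent=2))
```

The source document stays unchanged; only the mapping decides that email is also indexed as email.suffix.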

{
  "mappings": {
    "properties": {
      "active": {
        "type": "long"
      },
      "activebool": {
        "type": "boolean"
      },
      "address_id": {
        "type": "long"
      },
      "create_date": {
        "type": "date"
      },
      "customer_id": {
        "type": "long"
      },
      "email": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "first_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "last_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "last_update": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "store_id": {
        "type": "long"
      }
    }
  }
}

These are my mappings

POST /uep/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "email": "*smith*"
          }
        },
        {
          "wildcard": {
            "email": "*.org"
          }
        }
      ]
    }
  }
}

This is my query. It returns exactly the right data, and I want any faster alternative to return exactly the same data as this query does.

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2,
    "hits": [
      {
        "_index": "uep",
        "_id": "ha0j8ZEBH-KoI_5wv58R",
        "_score": 2,
        "_source": {
          "customer_id": 1,
          "store_id": 1,
          "first_name": "Mary",
          "last_name": "Smith",
          "email": "mary.smith@sakilacustomer.org",
          "address_id": 5,
          "activebool": true,
          "create_date": "2006-02-14",
          "last_update": "2013-05-26 14:49:45.738",
          "active": 1
        }
      },
      {
        "_index": "uep",
        "_id": "bq0j8ZEBH-KoI_5wy6Iv",
        "_score": 2,
        "_source": {
          "customer_id": 1,
          "store_id": 1,
          "first_name": "Mary",
          "last_name": "Smith",
          "email": "mary.smith@sakilacustomer.org",
          "address_id": 5,
          "activebool": true,
          "create_date": "2006-02-14",
          "last_update": "2013-05-26 14:49:45.738",
          "active": 1
        }
      }
    ]
  }
}

These results are returned by the query

In order to achieve what you want I believe you will need to change your mappings and get away from the slow wildcard queries. I would recommend setting up a small test index with the mappings David suggested and load this with a small subset of your data. You can then test and iterate on mappings and queries until you find something that works for you.

Conclusion:
I want a faster way to do this that returns exactly the same data shown above.

We have told you what to do, but you do not seem to listen. I do not have time for this type of pointless back-and-forth without any progress, so I wish you good luck.

If you insist on using wildcard queries, changing your mappings to include the wildcard field type would make this faster, even though it likely will be slower than the other approaches described.
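
A sketch of what that could look like, written as Python dicts for the mapping and the query. It assumes an Elasticsearch version that provides the wildcard field type; the sub-field name "wildcard" is a choice made for this example, and none of this has been run against a cluster. The wildcard_matches helper only imitates the matching semantics locally (* for any run of characters, ? for exactly one, applied to the whole value).

```python
import json
import re

# Assumed mapping: keep "email" as text and add a sub-field of type "wildcard".
mapping = {
    "mappings": {
        "properties": {
            "email": {
                "type": "text",
                "fields": {"wildcard": {"type": "wildcard"}},
            }
        }
    }
}

# Same two conditions as the original query, aimed at the sub-field.
query = {
    "query": {
        "bool": {
            "must": [
                {"wildcard": {"email.wildcard": {"value": "*smith*"}}},
                {"wildcard": {"email.wildcard": {"value": "*.org"}}},
            ]
        }
    }
}


def wildcard_matches(pattern, value):
    """Local imitation of wildcard matching against a whole value."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


email = "mary.smith@sakilacustomer.org"
print(all(wildcard_matches(p, email) for p in ("*smith*", "*.org")))  # prints True
print(wildcard_matches("*.org", "mary.smith@sakilacustomer.com"))     # prints False
print(json.dumps(query, indent=2))
```

Because the pattern is applied to the whole address rather than to individual tokens, "*.org" matches only when the address really ends in ".org".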

Sorry about that.
I have done everything you suggested, and I greatly appreciate your efforts.

Look, this is what you suggested, friend.
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "[.@]"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST /test/_doc
{
  "email": "mary.ssmith@sakilacustomergg.org"
}

Then I ran a query like this:

GET /test/_search
{
  "query": {
    "match": {
      "email": "*org"
    }
  }
}

And this is the output:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Note that it does not return anything. If you delete the *, it returns documents where the word appears anywhere in the email, and I do not want that. I want the word to match only at the end of the email; if it is not at the end of the email, the document should not be returned.
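
A likely explanation for both observations, sketched in Python: a match query does not interpret wildcard characters. It runs the query text through the field's analyzer and looks up the resulting tokens, and the tokens carry no notion of "this one is at the end". The tokenize function below only imitates the [.@] split from the mapping above; it is not Elasticsearch code.

```python
import re


def tokenize(text):
    # Imitates the simple_pattern_split tokenizer with the pattern [.@].
    return re.findall(r"[^.@]+", text)


indexed = tokenize("mary.ssmith@sakilacustomergg.org")
print(indexed)  # ['mary', 'ssmith', 'sakilacustomergg', 'org']

# '*' is not a separator, so the query text "*org" stays a single token "*org",
# which is not among the indexed tokens. Hence zero hits.
print(tokenize("*org"))                # ['*org']
print(tokenize("*org")[0] in indexed)  # False

# Without the '*', the token "org" is found, wherever it sits in the address.
print("org" in indexed)                # True

# Tokens are whole words: "smith" does not match the indexed token "ssmith",
# although the wildcard pattern *smith* would match the raw address.
print("smith" in indexed)              # False

# "Ends with org" is a statement about the last token only.
print(indexed[-1] == "org")            # True
```

This is why the earlier suggestion of a separate field such as domain_suffix fits the "only at the end" requirement: the position is encoded in which field the value lives in, not in the query.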

I think that if I am going to use wildcards, there is no way around it.