How to search for text within text?

dsagent · September 14, 2024, 4:31pm

Hi

I have an index whose documents include an email field, and I want to search within that field for specific values, for example:

mary.smith@sakilacustomer.org

I want to find documents whose email contains smith in the middle and ends with .org or .com.
What is the most suitable approach so that the query is very fast?

dadoonet · September 14, 2024, 5:29pm

Try this:

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "[.@]"
        }
      }
    }
  },
  "mappings": {
    "properties": {
        "email": {
            "type": "text",
            "analyzer": "my_analyzer"
        }
    }
  }
}
POST /test/_doc
{
    "email": "mary.smith@sakilacustomer.org"
}
GET /test/_search
{
    "query": {
        "match": {
          "email": "mary"
        }
    }
}
GET /test/_search
{
    "query": {
        "match": {
          "email": "sakilacustomer"
        }
    }
}

If you want to understand what is happening behind the scenes, run this:

GET /test/_analyze
{
    "analyzer": "my_analyzer",
    "text": ["mary.smith@sakilacustomer.org"]
}

This gives:

{
  "tokens": [
    {
      "token": "mary",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "smith",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "sakilacustomer",
      "start_offset": 11,
      "end_offset": 25,
      "type": "word",
      "position": 2
    },
    {
      "token": "org",
      "start_offset": 26,
      "end_offset": 29,
      "type": "word",
      "position": 3
    }
  ]
}

dsagent · September 14, 2024, 5:39pm

I want the query to return all the emails according to the condition in the query
For example
GET index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "email": {
              "value": "smith"
            }
          }
        },
        {
          "wildcard": {
            "email": {
              "value": "*org"
            }
          }
        }
      ]
    }
  }
}

I want this query to return all the emails that contain smith in the middle and .org at the end.

dsagent · September 14, 2024, 5:45pm

This returns all the matching values, whether they are in the middle, at the beginning, or at the end, and I do not want that. I want only the value that I specify in the query, in exactly the position I specify.
For example, I want the match in the middle, and I do not want similar values that match at the end or the beginning.

dadoonet · September 14, 2024, 9:38pm

Please reproduce the problem, the same way I did. That will help to understand exactly what you have and what you want.

dsagent · September 16, 2024, 5:51am

The query itself works, but it is very slow.

dadoonet · September 16, 2024, 6:30am

There are 2 possibilities from here:

you continue to use wildcards and the query will remain slow
you change the query / mapping to find a faster way to get your results

I proposed something which is much faster, but you answered that it does not fit your use case.
So I asked you to reproduce your use case so we can help you.
You answered that you don't have any problem with the query, it is just slow.

Not sure how to help if you don't want to be helped.

Christian_Dahlqvist · September 16, 2024, 6:39am

Create the mapping David presented above and index the document he used as an example. You should then be able to query it as follows without requiring any wildcard queries, which should speed it up considerably.

GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "email": "smith"
          }
        },
        {
          "match": {
            "email": "org"
          }
        }
      ]
    }
  }
}

As the field is tokenised you can query directly on the individual components, which you can see from the analysis output above.

dsagent · September 16, 2024, 6:48am

I am sorry.
I really do want help, because the query has become very slow and tiresome.

What would you like me to do? I do not yet understand exactly what you mean.

dsagent · September 16, 2024, 6:52am

Yes, what you say is correct, but this returns any similar data, and the use case does not allow that. It needs to return only the exact matches.

Christian_Dahlqvist · September 16, 2024, 6:54am

I do not understand what you mean by that. You need to provide explicit examples of what does and what does not work.

Please show an example with a query and sample data and what is returned that you do not want and a sample document that is not returned even though you expect it to be.

If you want to search on specific parts of the string and not treat it all as a list of tokens you can break out components into separate fields, e.g. like this:

POST /test/_doc
{
    "email": "mary.smith@sakilacustomer.org",
    "domain": "sakilacustomer.org",
    "domain_suffix": "org"
}

Here you can search on the whole email address but also on the domain or just the suffix without using wildcards, which should be fast and scale well.
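
A query against the broken-out fields could look like the sketch below. It assumes the document above was indexed into the test index with default dynamic mapping, so that domain_suffix is an analysed text field; combining it with a match on smith is only an illustration.

```
GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "email": "smith" } },
        { "match": { "domain_suffix": "org" } }
      ]
    }
  }
}
```

The second clause only looks at domain_suffix, so an org token elsewhere in the address does not count as a match.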

If you do not want to break out parts into separate fields in the source document you may also be able to achieve this through multi fields using different analysers.
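
A minimal sketch of the multi-field idea follows. The index name test_multi, the analyser and tokenizer names, and the suffix sub-field are all made up for illustration. The sub-field uses a pattern tokenizer with a capture group so that only the part after the last dot is indexed.

```
# Hypothetical names throughout: test_multi, email_parts, email_suffix, last_label, suffix
PUT /test_multi
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_parts": { "tokenizer": "split_on_dot_at" },
        "email_suffix": { "tokenizer": "last_label" }
      },
      "tokenizer": {
        "split_on_dot_at": {
          "type": "simple_pattern_split",
          "pattern": "[.@]"
        },
        "last_label": {
          "type": "pattern",
          "pattern": "\\.([^.@]+)$",
          "group": 1
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "email_parts",
        "fields": {
          "suffix": { "type": "text", "analyzer": "email_suffix" }
        }
      }
    }
  }
}

# smith must appear as a token of the email, and org must be the final label
GET /test_multi/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "email": "smith" } },
        { "term": { "email.suffix": "org" } }
      ]
    }
  }
}
```

The suffix clause is a term query rather than a match query because a bare org contains no dot, so the email_suffix analyser would produce no tokens from it at search time. Neither analyser lowercases, so the comparison is case sensitive as written.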

dsagent · September 16, 2024, 7:06am

{
  "mappings": {
    "properties": {
      "active": {
        "type": "long"
      },
      "activebool": {
        "type": "boolean"
      },
      "address_id": {
        "type": "long"
      },
      "create_date": {
        "type": "date"
      },
      "customer_id": {
        "type": "long"
      },
      "email": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "first_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "last_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "last_update": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "store_id": {
        "type": "long"
      }
    }
  }
}

These are my mappings

dsagent · September 16, 2024, 7:09am

POST /uep/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "email": "smith"
          }
        },
        {
          "wildcard": {
            "email": "*.org"
          }
        }
      ]
    }
  }
}

This is my query. It returns exactly the data I want, and any alternative must return exactly the same data.

dsagent · September 16, 2024, 7:11am

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2,
    "hits": [
      {
        "_index": "uep",
        "_id": "ha0j8ZEBH-KoI_5wv58R",
        "_score": 2,
        "_source": {
          "customer_id": 1,
          "store_id": 1,
          "first_name": "Mary",
          "last_name": "Smith",
          "email": "mary.smith@sakilacustomer.org",
          "address_id": 5,
          "activebool": true,
          "create_date": "2006-02-14",
          "last_update": "2013-05-26 14:49:45.738",
          "active": 1
        }
      },
      {
        "_index": "uep",
        "_id": "bq0j8ZEBH-KoI_5wy6Iv",
        "_score": 2,
        "_source": {
          "customer_id": 1,
          "store_id": 1,
          "first_name": "Mary",
          "last_name": "Smith",
          "email": "mary.smith@sakilacustomer.org",
          "address_id": 5,
          "activebool": true,
          "create_date": "2006-02-14",
          "last_update": "2013-05-26 14:49:45.738",
          "active": 1
        }
      }
    ]
  }
}

These results are returned by the query

Christian_Dahlqvist · September 16, 2024, 7:12am

In order to achieve what you want I believe you will need to change your mappings and get away from the slow wildcard queries. I would recommend setting up a small test index with the mappings David suggested and load this with a small subset of your data. You can then test and iterate on mappings and queries until you find something that works for you.

dsagent · September 16, 2024, 7:13am

In conclusion, I want a way to do this that returns exactly the same data as shown above.

Christian_Dahlqvist · September 16, 2024, 7:14am

We have told you what to do but you do not seem to listen. I do not have time for this type of pointless back-and-forth without any progress, so I wish you good luck.

If you insist on using wildcard queries, changing your mappings to include the wildcard field type would make this faster, even though it likely will be slower than the other approaches described.
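
A rough sketch of that option is shown below, assuming an Elasticsearch version that supports the wildcard field type. The index name test_wildcard is made up for illustration. On a wildcard field the pattern is compared against the whole stored value rather than against individual tokens.

```
# Hypothetical index name; requires a version with the wildcard field type
PUT /test_wildcard
{
  "mappings": {
    "properties": {
      "email": { "type": "wildcard" }
    }
  }
}

POST /test_wildcard/_doc
{
  "email": "mary.smith@sakilacustomer.org"
}

# smith somewhere before a trailing .org
GET /test_wildcard/_search
{
  "query": {
    "wildcard": {
      "email": { "value": "*smith*.org" }
    }
  }
}
```

Note that this pattern also matches addresses where smith appears in the domain part, so it may need tightening depending on what "in the middle" should mean.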

dsagent · September 16, 2024, 7:37am

Sorry about that.
I have done everything you suggested and I highly appreciate your efforts with me

Look, this is what I tried, following your suggestion:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "[.@]"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST /test/_doc
{
  "email": "mary.ssmith@sakilacustomergg.org"
}

Then I ran this query:

GET /test/_search
{
  "query": {
    "match": {
      "email": "*org"
    }
  }
}

And this is the output:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Note that it does not return anything. If I remove the *, it returns documents where the word appears anywhere in the email, and I do not want that. I want a match only when the word is at the end of the email; if it is not at the end, the document should not be returned.
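
The empty result is consistent with how the match query treats its input: it does not interpret * as a wildcard, so the query text is simply run through the field's analyzer. One quick check, reusing the test index and my_analyzer from above, is to analyse the query string itself. Because the tokenizer only splits on . and @, *org should come back as a single token, which no indexed email contains.

```
GET /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "*org"
}
```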

dsagent · September 16, 2024, 7:39am

I think if I'm going to use it, there's no way out of it.
