Searching emails by domain


#1

I've been trying to find out how to search through millions of emails, using wildcards if possible.

Example searches would look like this:

  1. *@domain
  2. user@*
  3. *keyword@domain

I'm still new to ES and have been trying to find a solution now for quite a while.
Any suggestions?


(Adrien Grand) #2

The most efficient way would be to index the name and domain parts of your email into different properties of your document, e.g.

{
  "email": "user@example.org",
  "email_name": "user",
  "email_domain": "example.org"
}

Otherwise you can use wildcard queries https://www.elastic.co/guide/en/elasticsearch/reference/5.1/query-dsl-wildcard-query.html but they might be slow when there are few characters before the first wildcard (especially leading wildcards).
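
A minimal sketch of the name/domain split suggested above, written in Python purely for illustration. The function name split_email is hypothetical; the field names are the ones from the example document. It assumes each address contains at least one "@" and splits on the last one.

```python
def split_email(email: str) -> dict:
    """Build a document with separate name and domain fields (illustrative only)."""
    # rpartition splits on the LAST "@", so the domain is everything after it.
    name, sep, domain = email.rpartition("@")
    if not sep or not name or not domain:
        raise ValueError(f"not a usable email address: {email!r}")
    return {
        "email": email,
        "email_name": name,      # field name taken from the example document
        "email_domain": domain,  # field name taken from the example document
    }


doc = split_email("user@example.org")
print(doc)
```

With documents shaped like this, a search such as *@domain should become an exact match on email_domain, and user@* an exact match on email_name, with no wildcard involved.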


#3

It would be very difficult to parse all the data I've been given in order to do that.
I was using wildcard queries before, but once I reached a certain number of emails the node timed out on the searches.

I was told earlier that a reverse token filter and a suffix query would speed up the search significantly, but I'm a bit puzzled trying to get it to work.
I'm using Elasticsearch-PHP to handle the searches.


(Adrien Grand) #4

Why do you say that? Tools like Logstash and ingest nodes could possibly help there.

Indeed, a reverse token filter could help replace leading wildcards with trailing ones (a suffix search becomes a prefix search on the reversed value), but it will require reindexing too, just like extracting the user and domain fields.

Here is an example if you'd like:

DELETE index

PUT index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "reverse_keyword": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["reverse"]
          }
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "email": {
          "type": "keyword",
          "fields": {
            "reverse": {
              "type": "text",
              "analyzer": "reverse_keyword"
            }
          }
        }
      }
    }
  }
}

PUT index/type/1
{
  "email": "user@example.org"
}

GET index/_search
{
  "query": {
    "wildcard": {
      "email": {
        "value": "*.org"
      }
    }
  }
}

GET index/_search
{
  "query": {
    "wildcard": {
      "email.reverse": {
        "value": "gro.*"
      }
    }
  }
}

The two queries will match the same documents, but the second one should perform much faster since it does not have a leading wildcard.
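
On the client side, the only extra step is to reverse the search pattern before sending it. Here is a rough sketch in Python (for illustration only; the same logic should translate directly to Elasticsearch-PHP). The helper name reversed_wildcard_query is hypothetical, the email.reverse field is the one from the mapping above, and the pattern is assumed to contain no backslash escapes.

```python
def reversed_wildcard_query(pattern: str, field: str = "email.reverse") -> dict:
    """Build a wildcard query body against the reversed sub-field (illustrative only)."""
    # Reversing the whole pattern moves a leading "*" to the end,
    # so "*.org" becomes "gro.*". Assumes no backslash-escaped characters.
    reversed_pattern = pattern[::-1]
    return {
        "query": {
            "wildcard": {
                field: {"value": reversed_pattern}
            }
        }
    }


body = reversed_wildcard_query("*.org")
print(body)
```

A pattern such as *keyword@domain would turn into niamod@drowyek* in the same way. A pattern that starts with a literal and ends with a wildcard (user@*) is better sent to the plain email field, since reversing it would put the wildcard at the front again.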

