Analyzers and Search

I use ES to store key-value pairs. I did not define any mappings when
creating the index.

UpdateRequestBuilder up = elasticClient.prepareUpdate(indexName, indexType, key)
    .setConsistencyLevel(WriteConsistencyLevel.DEFAULT)
    .setReplicationType(ReplicationType.ASYNC)
    .setDoc(map)
    .setUpsert(map);

Here map is my set of K-V pairs.

One of the keys I store is an email address, and sometimes I also store
values such as URLs and product IDs that contain non-alphabetic characters.
I don't specify an analyzer and haven't defined any mappings, so I assume
the standard analyzer is used and all my values are lowercased and
tokenized before indexing.

Here is what is stored in the index:

{"took":27,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test1","_id":"097d4c2c-8c34-49bb-ac4b-977a5ea8abbf","_score":1.0, "_source" : {"ATID":"38318394","WTY":"Stocks","CREATED":"2013-04-09T18:30:52.845Z","CTN":"503-703-1115","EMAIL":"abcd.efgh@gmail.com","PRODUCT":"116/56L/A"}}]}}

Now I find that I cannot search for an exact email address, e.g.
abcd.efgh@gmail.com. I get false positives for "@gmail.com", "abcd.efgh",
and even "abcd@gmail.com". Is there a way to use an analyzer only at search
time, without changing my mapping definition? Scores don't matter to me, so
filters are fine as well.

Query String:
{
  "bool" : {
    "must" : [ {
      "match" : {
        "EMAIL" : {
          "query" : "abcd.efgh@gmail.com",
          "type" : "boolean"
        }
      }
    }, {
      "range" : {
        "CREATED" : {
          "from" : "2013-04-08",
          "to" : "2013-04-18",
          "include_lower" : true,
          "include_upper" : true
        }
      }
    } ]
  }
}

Hits:1

Query String:
{
  "bool" : {
    "must" : [ {
      "match" : {
        "EMAIL" : {
          "query" : "abcd.efgh",
          "type" : "boolean"
        }
      }
    }, {
      "range" : {
        "CREATED" : {
          "from" : "2013-04-08",
          "to" : "2013-04-18",
          "include_lower" : true,
          "include_upper" : true
        }
      }
    } ]
  }
}

Hits:1

Query String:
{
  "bool" : {
    "must" : [ {
      "match" : {
        "EMAIL" : {
          "query" : "@gmail.com",
          "type" : "boolean"
        }
      }
    }, {
      "range" : {
        "CREATED" : {
          "from" : "2013-04-08",
          "to" : "2013-04-18",
          "include_lower" : true,
          "include_upper" : true
        }
      }
    } ]
  }
}

Hits:1

Query String:
{
  "bool" : {
    "must" : [ {
      "match" : {
        "EMAIL" : {
          "query" : "abcd@gmail.com",
          "type" : "boolean"
        }
      }
    }, {
      "range" : {
        "CREATED" : {
          "from" : "2013-04-08",
          "to" : "2013-04-18",
          "include_lower" : true,
          "include_upper" : true
        }
      }
    } ]
  }
}

Hits:1

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,

The standard analyzer breaks email addresses and URLs apart. You can check
this with the analyze API:

» curl -XGET 'localhost:9200/_analyze?pretty&analyzer=standard' -d 'foo@bar.de'
{
  "tokens" : [ {
    "token" : "foo",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "",
    "position" : 1
  }, {
    "token" : "bar.de",
    "start_offset" : 4,
    "end_offset" : 10,
    "type" : "",
    "position" : 2
  } ]
}
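
The split above is also why the partial queries hit: a match query analyzes the query text the same way and, by default, ORs the resulting tokens, so a single shared token such as "gmail.com" is enough. Here is a minimal sketch of that behaviour in plain Java. It is a simulation, not Elasticsearch code: analyze() is a crude stand-in for the standard analyzer (lowercase, split on "@" and whitespace), and the class and method names are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class MatchOrSketch {

    // Rough approximation of the standard analyzer for email-like input:
    // lowercase, then split on '@' and whitespace. The real tokenizer
    // follows Unicode word-break rules; this only mimics the '@' split.
    static Set<String> analyze(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[@\\s]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Simulates a match query with OR semantics: a document is a hit
    // as soon as one query token equals one indexed token.
    static boolean matchesOr(String indexed, String query) {
        Set<String> docTokens = analyze(indexed);
        for (String q : analyze(query)) {
            if (docTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // Simulates a field indexed as a single lowercased token:
    // only the complete value matches.
    static boolean matchesWhole(String indexed, String query) {
        return indexed.toLowerCase(Locale.ROOT).equals(query.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String indexed = "abcd.efgh@gmail.com";
        String[] queries = {"abcd.efgh@gmail.com", "abcd.efgh", "@gmail.com", "abcd@gmail.com"};
        for (String q : queries) {
            System.out.println(q + " -> or=" + matchesOr(indexed, q)
                    + ", whole=" + matchesWhole(indexed, q));
        }
    }
}
```

Under this approximation all four queries from the original post hit with OR semantics, while only the full address matches when the value is kept as one token.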

If you use the uax_url_email tokenizer, this does not happen, and
emails/URLs are indexed as a single token:
» curl -XGET 'localhost:9200/_analyze?pretty&tokenizer=uax_url_email' -d 'foo@bar.de'
{
  "tokens" : [ {
    "token" : "foo@bar.de",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "",
    "position" : 1
  } ]
}

So use that tokenizer in your mapping so that URLs and emails are not
split, and everything should work as you expect. More at
http://www.elasticsearch.org/guide/reference/index-modules/analysis/uaxurlemail-tokenizer/
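
For illustration, an index-creation body along these lines might wire the tokenizer into a custom analyzer and apply it to the EMAIL field. This is only a sketch under assumptions: the analyzer name "email_url" is made up, the "test1" type and "EMAIL" field are taken from the document shown in the question, and the "string" field type and typed mappings reflect the Elasticsearch versions current when this thread was written. The Java wrapper just assembles the JSON so it can be printed or passed on as a request body. Since analysis happens at index time, existing documents would need to be reindexed for the new tokens to take effect.

```java
public class EmailAnalyzerSettingsSketch {

    // Builds a JSON body for creating an index whose EMAIL field uses
    // a custom analyzer based on the uax_url_email tokenizer.
    // "email_url" is a hypothetical analyzer name chosen for this example.
    static String createIndexBody() {
        return String.join("\n",
            "{",
            "  \"settings\": {",
            "    \"analysis\": {",
            "      \"analyzer\": {",
            "        \"email_url\": {",
            "          \"type\": \"custom\",",
            "          \"tokenizer\": \"uax_url_email\",",
            "          \"filter\": [\"lowercase\"]",
            "        }",
            "      }",
            "    }",
            "  },",
            "  \"mappings\": {",
            "    \"test1\": {",
            "      \"properties\": {",
            "        \"EMAIL\": { \"type\": \"string\", \"analyzer\": \"email_url\" }",
            "      }",
            "    }",
            "  }",
            "}");
    }

    public static void main(String[] args) {
        // Print the body; it could be sent with curl -XPUT 'localhost:9200/<index>' -d '...'
        System.out.println(createIndexBody());
    }
}
```

The lowercase filter keeps matching case-insensitive, as it was under the standard analyzer.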

Hope this helps...

--Alexander

On Tue, Apr 9, 2013 at 8:45 PM, avinsemail@gmail.com wrote:

> [original message quoted in full; see above]
