Trouble finding the documents I want (whitespace analyzer)


(SomeUser) #1

I'm having some trouble setting up an index and getting my queries to find
the documents I want.

I have a field containing human-written text that can include emails, URLs,
FQDNs, IP addresses, names with "-" in them (e.g. Mary-Anne), etc. When I
search for an email, for instance (abc.def@qwe.com*), I get matches for
documents containing "abc", "def", "qwe", or "com". This is a problem since
many users have an email from the same domain (such as @gmail.com).

If I write "ab*" I want to find documents containing "abc.def@qwe.com" or
"abort" in the same way as writing "addr*" finds documents containing
"address". If I write "abc.def@qwe.com" I want to find documents containing
"abc.def@qwe.com" and not for example "xyz@qwe.com".

What am I missing? How should I index and query for text containing these
types of tokens?

Mapping:
{
  "mappings": {
    "mytype": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "uax_url_email"
        },
        "status": {
          "type": "string",
          "index": "not_analyzed"
        },
        "description": {
          "type": "string",
          "analyzer": "custom_whitespace"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "uax_url_email": {
          "filter": ["standard", "lowercase", "stop"],
          "tokenizer": "uax_url_email"
        },
        "custom_whitespace": {
          "filter": ["lowercase"],
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

Sample document
{
  "status": "somestatus",
  "email": "abc.def@qwe.com",
  "Description": "Description about something containing some difficult to search for terms 192.168.0.1, abc.def@qwe.com, Mary-Ann etc"
}

Query
{
  "query": {
    "query_string": {
      "query": "abc.def@qwe.com*",
      "default_operator": "AND",
      "analyze_wildcard": true
    }
  }
}

--


(Ivan Brusic) #2

When using a query_string query, you need to specify default_field.
If you don't, the _all field is searched, and it uses the standard analyzer
rather than your custom_whitespace analyzer.
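
For illustration, here is a minimal sketch of the query from the original post with default_field added. It assumes that "description" (the field mapped with custom_whitespace) is the one to search; that choice is an assumption, not something stated in the thread.

```json
{
  "query": {
    "query_string": {
      "default_field": "description",
      "query": "abc.def@qwe.com*",
      "default_operator": "AND",
      "analyze_wildcard": true
    }
  }
}
```

Field names are case-sensitive, so the sample document's "Description" key would have to be spelled "description" to pick up the mapped analyzer.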

--
Ivan

On Thu, Sep 20, 2012 at 2:44 AM, SomeUser fannkeentoong@gmail.com wrote:


--

