Best way to search/index the data - with and without whitespace


(thale jacobs) #1

Hello - I am having a problem indexing and searching for words that may or
may not contain whitespace. Below is an example.

Here is how the index is created:

curl -s -XPUT 'localhost:9200/test/name/1' -d '{ "street": "Lakeshore Dr" }'
curl -s -XPUT 'localhost:9200/test/name/2' -d '{ "street": "Sunnyshore Dr" }'
curl -s -XPUT 'localhost:9200/test/name/3' -d '{ "street": "Lake View Dr" }'
curl -s -XPUT 'localhost:9200/test/name/4' -d '{ "street": "Shore Dr" }'

If I want to query for record 1 ("Lakeshore Dr"), I can use the following
query:

curl -s -XGET 'localhost:9200/test/name/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "street": {
              "query": "lakeshore dr",
              "type": "phrase"
            }
          }
        }
      ]
    }
  }
}'

This returns the desired result of document id 1. But if a user searches
for "Lake Shore Dr" (a space between Lake and Shore), it is still desired
to return document id 1.

And the inverse of this problem is if a user searches for "Lakeview Dr"
(but indexed as "Lake View Dr"):
curl -s -XGET 'localhost:9200/test/name/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "street": {
              "query": "lakeview dr",
              "type": "phrase"
            }
          }
        }
      ]
    }
  }
}'

The search matches no documents. If the search is changed to a boolean
search instead of a phrase search, many docs will match on "dr", but
doc #3, "Lake View Dr", is not necessarily returned as the top match.
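
A small Python sketch (not part of the original post) of why the phrase queries miss. It assumes a simplified analyzer that only lowercases and splits on whitespace, which is a rough stand-in for the default analysis of these street names; `tokenize` and `phrase_match` are hypothetical helper names.

```python
def tokenize(text):
    """Rough stand-in for the default analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def phrase_match(query, doc):
    """True if the query tokens occur in the document as one contiguous run."""
    q, d = tokenize(query), tokenize(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


print(tokenize("Lakeshore Dr"))    # ['lakeshore', 'dr']
print(tokenize("Lake Shore Dr"))   # ['lake', 'shore', 'dr']
print(phrase_match("lakeshore dr", "Lakeshore Dr"))
print(phrase_match("lake shore dr", "Lakeshore Dr"))
print(phrase_match("lakeview dr", "Lake View Dr"))
```

In this model the indexed and searched token sequences simply differ ("lakeshore" versus "lake", "shore"), so a phrase match can only succeed when the whitespace is in the same places on both sides.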

NGrams at index time?? Ngrams at search time?? Remove whitespace at index
time/search time??

Any suggestions would be appreciated. Thanks.
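
For the "remove whitespace" idea, here is a minimal Python sketch of what it would mean, outside Elasticsearch. The names `squash` and `search` are made up for illustration, and the comparison is an exact match on the whole squashed street name, which is an assumption and may be stricter than what is really wanted.

```python
# The four example documents from this post, keyed by document id.
streets = {
    1: "Lakeshore Dr",
    2: "Sunnyshore Dr",
    3: "Lake View Dr",
    4: "Shore Dr",
}


def squash(text):
    """Lowercase the text and drop all whitespace."""
    return "".join(text.lower().split())


def search(query):
    """Return ids of documents whose squashed street equals the squashed query."""
    key = squash(query)
    return [doc_id for doc_id, street in streets.items() if squash(street) == key]


print(search("Lake Shore Dr"))
print(search("Lakeview Dr"))
```

Applied to both the indexed value and the query, this makes "Lake Shore Dr" and "Lakeshore Dr" identical, but it no longer supports partial searches such as "lake" on its own.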

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/06538a83-17d1-446c-9b27-cebf12c6fc47%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #2

Thale, I would try edge ngrams (both index and search) and see how that
works. I don't see why it wouldn't work for your 2 cases - just make your
queries into match queries and use the "AND" operator. Good luck!
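
To reason about this suggestion it may help to see which terms an edge n-gram filter would emit. The Python sketch below is only a model: it assumes words are split on whitespace and lowercased first, and the function names and default gram sizes are illustrative, not Elasticsearch settings taken from this thread.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of the token with lengths from min_gram up to max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text, min_gram=1, max_gram=20):
    """Lowercase, split on whitespace, then expand each word into edge n-grams."""
    grams = []
    for token in text.lower().split():
        grams.extend(edge_ngrams(token, min_gram, max_gram))
    return grams


print(analyze("Lake View Dr", min_gram=2))
print(analyze("lakeview dr", min_gram=2))
```

Comparing the two printed lists shows which grams the indexed text and the query have in common, which is what an "AND" over the query terms would depend on.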



(thale jacobs) #3

Hello Binh Ly - Thanks for the reply. I thought I had read that ngrams
should only be used at either index time or search time, but not
both... Is that not the case? Thanks again. Thale



(Binh Ly) #4

Thale, you are correct - ngrams are usually used at index time only, but
given your case and requirements, you might want to experiment with both
index and search time. I'd probably just increase the edge min ngram size to
something reasonable like maybe 4(?) and see if that works or not.



(thale jacobs) #5

This is how I set up the mappings:

curl -s -XPUT 'localhost:9200/test' -d '{
  "mappings": {
    "name": {
      "properties": {
        "street": {
          "type": "string",
          "index_analyzer": "index_ngram",
          "search_analyzer": "search_ngram"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "desc_ngram": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 20
        }
      },
      "analyzer": {
        "index_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "desc_ngram", "lowercase" ]
        },
        "search_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  }
}'

This is how I built the index:

curl -s -XPUT 'localhost:9200/test/name/1' -d '{ "street": "Lakeshore Dr" }'
curl -s -XPUT 'localhost:9200/test/name/2' -d '{ "street": "Sunnyshore Dr" }'
curl -s -XPUT 'localhost:9200/test/name/3' -d '{ "street": "Lake View Dr" }'
curl -s -XPUT 'localhost:9200/test/name/4' -d '{ "street": "Shore Dr" }'

If a user attempts to search for "Lake Shore Dr", I want to only match to
document 1/"Lakeshore Dr"
If a user attempts to search for "Lakeview Dr", I want to only match to
document 3/"Lake View Dr"

Here is an example of the query that is not working correctly:

curl -s -XGET 'localhost:9200/test/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "street": {
              "query": "lake shore dr",
              "type": "boolean"
            }
          }
        }
      ]
    }
  }
}'

So is the issue with how I am setting up the mappings (tokenizer? edge ngrams
vs ngrams? size of the ngrams?) or with the query? I have tried things like
setting minimum_should_match and the analyzer to use, but I have not been
able to get the desired results.
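
A rough Python model of the two analyzers above, added as an illustration rather than a diagnosis. It assumes `index_ngram` and `search_ngram` really are applied to `street`: the keyword tokenizer keeps the whole value as one token, the edge n-gram filter (3 to 20) emits its prefixes, and the search side is just the lowercased query. The function names are hypothetical.

```python
def index_terms(text, min_gram=3, max_gram=20):
    """keyword tokenizer + edge n-grams + lowercase: prefixes of the whole value."""
    longest = min(len(text), max_gram)
    return {text[:n].lower() for n in range(min_gram, longest + 1)}


def search_term(text):
    """keyword tokenizer + lowercase: the whole query is a single term."""
    return text.lower()


def matches(query, doc):
    """A document matches when the single query term is one of its indexed prefixes."""
    return search_term(query) in index_terms(doc)


print(sorted(index_terms("Shore Dr"), key=len))
print(matches("lakeshore", "Lakeshore Dr"))
print(matches("lake shore dr", "Lakeshore Dr"))
```

Under these assumptions the setup behaves like prefix matching on the full street name, so a query whose whitespace differs from the indexed value ("lake shore dr" against "lakeshore dr") is not a prefix of it and cannot match.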

Thanks all.



(Binh Ly) #6

Thale,

I played with your data a little and it turns out it is more complex than I
thought. Something like this works somewhat but may require some
fine-tuning depending on your exact requirements. Anyway give this a try
and see how it works (BTW I did this in ES 1.0 RC 2):

  1. PUT http://localhost:9200/test
    {
      "settings": {
        "index": {
          "number_of_shards": 1,
          "number_of_replicas": 0,
          "analysis": {
            "analyzer": {
              "en1": {
                "tokenizer": "standard",
                "filter": [
                  "standard",
                  "lowercase",
                  "en1"
                ]
              }
            },
            "filter": {
              "en1": {
                "type": "ngram",
                "min_gram": 4,
                "max_gram": 4
              }
            }
          }
        }
      },
      "mappings": {
        "doc": {
          "properties": {
            "street": {
              "type": "string",
              "analyzer": "en1"
            }
          }
        }
      }
    }

  2. POST http://localhost:9200/test/doc/_bulk
    { "index": {} }
    { "street": "Lakeshore Dr" }
    { "index": {} }
    { "street": "Sunnyshore Dr" }
    { "index": {} }
    { "street": "Lake View Dr" }
    { "index": {} }
    { "street": "Shore Dr" }

Query example:

GET http://localhost:9200/test/doc/_search
{
  "query": {
    "match": {
      "street": {
        "query": "lake shore dr",
        "minimum_should_match": "3<50%"
      }
    }
  }
}
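
The idea behind this setup can be sketched in Python as a simplified model. It assumes every distinct 4-gram of the query acts as an independent optional clause and reads "3<50%" as: with 3 or fewer clauses all are required, otherwise 50% of them, rounded down. Real Lucene position handling and scoring may differ, and the ids 1 to 4 are only labels for the four example streets (the bulk request above auto-generates ids).

```python
# Example streets; ids are illustrative labels, not the auto-generated ones.
streets = {
    1: "Lakeshore Dr",
    2: "Sunnyshore Dr",
    3: "Lake View Dr",
    4: "Shore Dr",
}


def grams(text, n=4):
    """Lowercase, split on whitespace, emit all n-grams; shorter words yield none."""
    out = set()
    for token in text.lower().split():
        out.update(token[i:i + n] for i in range(len(token) - n + 1))
    return out


def required(n_clauses, threshold=3, percent=50):
    """Model of "3<50%": all clauses if at most 3, else 50% rounded down."""
    if n_clauses <= threshold:
        return n_clauses
    return n_clauses * percent // 100


def search(query):
    """Ids of streets sharing at least the required number of grams with the query."""
    q = grams(query)
    need = required(len(q))
    return [doc_id for doc_id, street in streets.items()
            if len(q & grams(street)) >= need]


print(search("lake shore dr"))
print(search("lakeview dr"))
print(search("shore dr"))
```

In this model "dr" contributes nothing because it is shorter than 4 characters, which is why the common suffix stops dominating the match. The last query still returns several streets, which is the kind of fine-tuning mentioned above.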


