Best way to index the data for this problem

Robin_Boutros · March 29, 2013, 1:11am

Hey,

My question isn't reeeally specific to ES, but maybe you can help me.

I'm building a search engine for a game. In this game, items have sockets.

items can have from 0 to 6 sockets
some sockets can be linked
there are 3 types of sockets (Red, Green, Blue)
when 3 sockets are linked (for instance), we consider that they're all
linked to each other, not just to the closest one.

My goal is to allow users to search for items with specific colors (with a
degree of incertainty)

The way a user would describe the socket combination like that:

"rrrbb" would be an item with 5 linked sockets, 3 red and 2 blue.
"ggg rb" would have 3 green socket linked and 1 red and 1 blue linked.

Still following?

When I index an item, I order the linked sockets alphabetically and by
group size (group size might not be necessary): "rb ggb" would become "bgg
br".

And I would reorder the user input the same way.

So, searching for an exact combination is super easy and fast(?). I
wouldn't even need to analyze the data, would I?

But I would like these scenarios to work:

"ggg" would find "ggg rb" -> this is easy with just a whitespace analyzer?

But what I'm struggling with would be:

"bg* r b" would match "bgrr r b" -> the way I'm doing things right now, is:
as soon as a group has a , I add * between every subgroup, so I would
search for: bg r b.
It works, but I think it's really inefficient, isn't it?

Do you have an idea on how to index the data (analyzers, as a string or
terms or...) to make this possible an fast?

Thanks a looooot if you read until the end <3

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

ppearcy · March 29, 2013, 6:23am

Hey,
Yeah, most of your questions around efficiency are correct, exact match
is fast, whitespace analyzer would work, and if your data set is small, it
may not matter if you do crazy wild carding, but won't scale.

I think you should look into what can be done with ngrams (likely the
token filter). When wildcards aren't used just stick to a field with white
space analyzer.

When wildcards are used, switch to searching by ngrams. Play around with it
and it might work. Sorry, I don't have a concrete example, others might,
but it seems with your wild cards, you'll want to be matching boundary
pairs (bg, br, gr) which ngrams will help with.

Best Regards.
Paul

On Thursday, March 28, 2013 7:11:32 PM UTC-6, Robin Boutros wrote:

Hey,

My question isn't reeeally specific to ES, but maybe you can help me.

I'm building a search engine for a game. In this game, items have sockets.

items can have from 0 to 6 sockets

some sockets can be linked

there are 3 types of sockets (Red, Green, Blue)

when 3 sockets are linked (for instance), we consider that they're all
linked to each other, not just to the closest one.

My goal is to allow users to search for items with specific colors (with a
degree of incertainty)

The way a user would describe the socket combination like that:

"rrrbb" would be an item with 5 linked sockets, 3 red and 2 blue.
"ggg rb" would have 3 green socket linked and 1 red and 1 blue linked.

Still following?

When I index an item, I order the linked sockets alphabetically and by
group size (group size might not be necessary): "rb ggb" would become "bgg
br".

And I would reorder the user input the same way.

So, searching for an exact combination is super easy and fast(?). I
wouldn't even need to analyze the data, would I?

But I would like these scenarios to work:

"ggg" would find "ggg rb" -> this is easy with just a whitespace analyzer?

But what I'm struggling with would be:

"bg* r b" would match "bgrr r b" -> the way I'm doing things right now,
is: as soon as a group has a , I add * between every subgroup, so I would
search for: bg r b.
It works, but I think it's really inefficient, isn't it?

Do you have an idea on how to index the data (analyzers, as a string or
terms or...) to make this possible an fast?

Thanks a looooot if you read until the end <3

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Robin_Boutros · March 29, 2013, 12:29pm

Thanks for taking the time Paul.

So, how small is small enough for regexps to be efficient. Is 3 or 4
million documents ok?

The problem with ngram if I'm not mistaken is that it will fail to find
cases like this:

Indexed data: rgb
User search: rb

The way is understand ngram works is that it will index r - rg - rgb - g -
gb - b, so... no rb in there

I was trying to think of a way to convert these strings into numbers. Like
r = 1, b = 6 * 1 + 1, g = 6 * 6 + 1, it would work well for exact matches,
but I hoped there was a clever way where if x > y then the string
associated with x contains the string associated with y, but I couldn't
find one.

One last question: let's say an item with 6 sockets has 2 * 3 linked
sockets: "bgr bgr". If i search for bgr, it will return this item. Is there
a way to make it return only items that have this pattern twice?

Thanks!

On Thursday, March 28, 2013 9:11:32 PM UTC-4, Robin Boutros wrote:

Hey,

My question isn't reeeally specific to ES, but maybe you can help me.

I'm building a search engine for a game. In this game, items have sockets.

items can have from 0 to 6 sockets

some sockets can be linked

there are 3 types of sockets (Red, Green, Blue)

when 3 sockets are linked (for instance), we consider that they're all
linked to each other, not just to the closest one.

My goal is to allow users to search for items with specific colors (with a
degree of incertainty)

The way a user would describe the socket combination like that:

"rrrbb" would be an item with 5 linked sockets, 3 red and 2 blue.
"ggg rb" would have 3 green socket linked and 1 red and 1 blue linked.

Still following?

When I index an item, I order the linked sockets alphabetically and by
group size (group size might not be necessary): "rb ggb" would become "bgg
br".

And I would reorder the user input the same way.

So, searching for an exact combination is super easy and fast(?). I
wouldn't even need to analyze the data, would I?

But I would like these scenarios to work:

"ggg" would find "ggg rb" -> this is easy with just a whitespace analyzer?

But what I'm struggling with would be:

"bg* r b" would match "bgrr r b" -> the way I'm doing things right now,
is: as soon as a group has a , I add * between every subgroup, so I would
search for: bg r b.
It works, but I think it's really inefficient, isn't it?

Do you have an idea on how to index the data (analyzers, as a string or
terms or...) to make this possible an fast?

Thanks a looooot if you read until the end <3

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Clinton_Gormley · March 29, 2013, 1:35pm

The problem with ngram if I'm not mistaken is that it will fail to
find cases like this:

Indexed data: rgb
User search: rb

The way is understand ngram works is that it will index r - rg - rgb -
g - gb - b, so... no rb in there

Those are edge ngrams, not ngrams. Ngrams would give you:

r,g,b,rg,gb,rgb

If you apply the ngrams to the search term 'rb' then it'd search on
r,b,rb, so 'r' and 'b' would match. Of course, you'll end up with lots
of false positives.

Regexes may be a solution, in v0.90+ but you want to be sure that you
don't have a wildcard at the beginning of your regex, otherwise you will
be searching all terms in the index.

One last question: let's say an item with 6 sockets has 2 * 3 linked
sockets: "bgr bgr". If i search for bgr, it will return this item. Is
there a way to make it return only items that have this pattern twice?

using a match_phrase query "bgr bgr"

I think you need to use a multi field to analyze the same data in
multiple ways, then combine several queries to make exact matches more
relevant, but still to include partial ngram matches

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Robin_Boutros · March 29, 2013, 3:21pm

@Clinton Thanks

Seems like the n-grams I listed are the same as yours in a different
order
"if you apply the ngrams to the search term". I don't understand that.
Usually it would just search the term within the n-grams, not all the
n-grams from the user input, right? Still, it's a lot of false positive...
But I see your point in being ok with false positive and boosting the score
of best matches. The thing is, users will probably search for scoket color,
but order by other things, so I'm not sure how that would work.
I understand that starting with a wildcard is really bad, but in somes
cases, I don't think I'd have a choice if I go the regexp route.

Also, let's say I search for "gr* b" with a default operator "AND", would
it first search for b, and then do the regexp on only documents that match
b, or not?

Thanks

On Thursday, March 28, 2013 9:11:32 PM UTC-4, Robin Boutros wrote:

Hey,

My question isn't reeeally specific to ES, but maybe you can help me.

I'm building a search engine for a game. In this game, items have sockets.

items can have from 0 to 6 sockets

some sockets can be linked

there are 3 types of sockets (Red, Green, Blue)

when 3 sockets are linked (for instance), we consider that they're all
linked to each other, not just to the closest one.

My goal is to allow users to search for items with specific colors (with a
degree of incertainty)

The way a user would describe the socket combination like that:

"rrrbb" would be an item with 5 linked sockets, 3 red and 2 blue.
"ggg rb" would have 3 green socket linked and 1 red and 1 blue linked.

Still following?

When I index an item, I order the linked sockets alphabetically and by
group size (group size might not be necessary): "rb ggb" would become "bgg
br".

And I would reorder the user input the same way.

So, searching for an exact combination is super easy and fast(?). I
wouldn't even need to analyze the data, would I?

But I would like these scenarios to work:

"ggg" would find "ggg rb" -> this is easy with just a whitespace analyzer?

But what I'm struggling with would be:

"bg* r b" would match "bgrr r b" -> the way I'm doing things right now,
is: as soon as a group has a , I add * between every subgroup, so I would
search for: bg r b.
It works, but I think it's really inefficient, isn't it?

Do you have an idea on how to index the data (analyzers, as a string or
terms or...) to make this possible an fast?

Thanks a looooot if you read until the end <3

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Clinton_Gormley · March 30, 2013, 8:29am

Hiya

"if you apply the ngrams to the search term". I don't understand
that. Usually it would just search the term within the n-grams, not
all the n-grams from the user input, right? Still, it's a lot of false
positive... But I see your point in being ok with false positive and
boosting the score of best matches. The thing is, users will probably
search for scoket color, but order by other things, so I'm not sure
how that would work.

You can mitigate the number of false positives by using
minimum_should_match, which understands a variety of formats, eg:

minimum_should_match: 2
at least two terms should match
minimum_should_match: "60%"
at least 60% of terms should match
minimum_should_match: "2<75% 8<-3"
for 1 or 2 clauses, both are required
for 3 to 8 clauses, 75% of them are required
for 9 or more clauses, all but three of them are required

Also, let's say I search for "gr* b" with a default operator "AND",
would it first search for b, and then do the regexp on only documents
that match b, or not?

I've looked at your question on stackoverflow a few times, and started
answering it then discarded my answer. The problem is that I'm not
entirely sure what you want to achieve.

If you could fill it out with more details, specifically "given items
BGGR, BBBGG R, etc I want to be able to match X, Y but not Z, in order
Foo"

With the requirements more clearly delineated, it'll be easier to
suggest the right approach

ta

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Topic		Replies	Views
Advice on my approach to this search problem Elasticsearch	11	527	July 6, 2017
Issues with elasticsearch + autocomplete Elasticsearch	10	1085	July 6, 2017
edgeNGram minimum length omits shorter words Elasticsearch	12	2865	July 6, 2017
Help needed with the query Elasticsearch	10	979	July 6, 2017
Hypenation and superfluous results with ngram analyser for autocomplete Elasticsearch	6	440	July 6, 2017

Best way to index the data for this problem

Related topics