Best way to index the data for this problem

Hey,

My question isn't reeeally specific to ES, but maybe you can help me.

I'm building a search engine for a game. In this game, items have sockets.

  • items can have from 0 to 6 sockets
  • some sockets can be linked
  • there are 3 types of sockets (Red, Green, Blue)
  • when 3 sockets are linked (for instance), we consider that they're all
    linked to each other, not just to the closest one.

My goal is to allow users to search for items with specific colors (with a
degree of incertainty)

The way a user would describe the socket combination like that:

"rrrbb" would be an item with 5 linked sockets, 3 red and 2 blue.
"ggg rb" would have 3 green socket linked and 1 red and 1 blue linked.

Still following?

When I index an item, I order the linked sockets alphabetically and by
group size (group size might not be necessary): "rb ggb" would become "bgg
br".

And I would reorder the user input the same way.

So, searching for an exact combination is super easy and fast(?). I
wouldn't even need to analyze the data, would I?

But I would like these scenarios to work:

"ggg" would find "ggg rb" -> this is easy with just a whitespace analyzer?

But what I'm struggling with would be:

"bg* r b" would match "bgrr r b" -> the way I'm doing things right now, is:
as soon as a group has a , I add * between every subgroup, so I would
search for: bg
r b.
It works, but I think it's really inefficient, isn't it?

Do you have an idea on how to index the data (analyzers, as a string or
terms or...) to make this possible an fast?

Thanks a looooot if you read until the end <3

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,
Yeah, most of your questions around efficiency are correct, exact match
is fast, whitespace analyzer would work, and if your data set is small, it
may not matter if you do crazy wild carding, but won't scale.

I think you should look into what can be done with ngrams (likely the
token filter). When wildcards aren't used just stick to a field with white
space analyzer.

When wildcards are used, switch to searching by ngrams. Play around with it
and it might work. Sorry, I don't have a concrete example, others might,
but it seems with your wild cards, you'll want to be matching boundary
pairs (bg, br, gr) which ngrams will help with.

Best Regards.
Paul

On Thursday, March 28, 2013 7:11:32 PM UTC-6, Robin Boutros wrote:

Hey,

My question isn't reeeally specific to ES, but maybe you can help me.

I'm building a search engine for a game. In this game, items have sockets.

  • items can have from 0 to 6 sockets
  • some sockets can be linked
  • there are 3 types of sockets (Red, Green, Blue)
  • when 3 sockets are linked (for instance), we consider that they're all
    linked to each other, not just to the closest one.

My goal is to allow users to search for items with specific colors (with a
degree of incertainty)

The way a user would describe the socket combination like that:

"rrrbb" would be an item with 5 linked sockets, 3 red and 2 blue.
"ggg rb" would have 3 green socket linked and 1 red and 1 blue linked.

Still following?

When I index an item, I order the linked sockets alphabetically and by
group size (group size might not be necessary): "rb ggb" would become "bgg
br".

And I would reorder the user input the same way.

So, searching for an exact combination is super easy and fast(?). I
wouldn't even need to analyze the data, would I?

But I would like these scenarios to work:

"ggg" would find "ggg rb" -> this is easy with just a whitespace analyzer?

But what I'm struggling with would be:

"bg* r b" would match "bgrr r b" -> the way I'm doing things right now,
is: as soon as a group has a , I add * between every subgroup, so I would
search for: bg
r b.
It works, but I think it's really inefficient, isn't it?

Do you have an idea on how to index the data (analyzers, as a string or
terms or...) to make this possible an fast?

Thanks a looooot if you read until the end <3

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Thanks for taking the time Paul.

So, how small is small enough for regexps to be efficient. Is 3 or 4
million documents ok?

The problem with ngram if I'm not mistaken is that it will fail to find
cases like this:

Indexed data: rgb
User search: rb

The way is understand ngram works is that it will index r - rg - rgb - g -
gb - b, so... no rb in there :frowning:

I was trying to think of a way to convert these strings into numbers. Like
r = 1, b = 6 * 1 + 1, g = 6 * 6 + 1, it would work well for exact matches,
but I hoped there was a clever way where if x > y then the string
associated with x contains the string associated with y, but I couldn't
find one.

One last question: let's say an item with 6 sockets has 2 * 3 linked
sockets: "bgr bgr". If i search for bgr, it will return this item. Is there
a way to make it return only items that have this pattern twice?

Thanks!

On Thursday, March 28, 2013 9:11:32 PM UTC-4, Robin Boutros wrote:

Hey,

My question isn't reeeally specific to ES, but maybe you can help me.

I'm building a search engine for a game. In this game, items have sockets.

  • items can have from 0 to 6 sockets
  • some sockets can be linked
  • there are 3 types of sockets (Red, Green, Blue)
  • when 3 sockets are linked (for instance), we consider that they're all
    linked to each other, not just to the closest one.

My goal is to allow users to search for items with specific colors (with a
degree of incertainty)

The way a user would describe the socket combination like that:

"rrrbb" would be an item with 5 linked sockets, 3 red and 2 blue.
"ggg rb" would have 3 green socket linked and 1 red and 1 blue linked.

Still following?

When I index an item, I order the linked sockets alphabetically and by
group size (group size might not be necessary): "rb ggb" would become "bgg
br".

And I would reorder the user input the same way.

So, searching for an exact combination is super easy and fast(?). I
wouldn't even need to analyze the data, would I?

But I would like these scenarios to work:

"ggg" would find "ggg rb" -> this is easy with just a whitespace analyzer?

But what I'm struggling with would be:

"bg* r b" would match "bgrr r b" -> the way I'm doing things right now,
is: as soon as a group has a , I add * between every subgroup, so I would
search for: bg
r b.
It works, but I think it's really inefficient, isn't it?

Do you have an idea on how to index the data (analyzers, as a string or
terms or...) to make this possible an fast?

Thanks a looooot if you read until the end <3

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The problem with ngram if I'm not mistaken is that it will fail to
find cases like this:

Indexed data: rgb
User search: rb

The way is understand ngram works is that it will index r - rg - rgb -
g - gb - b, so... no rb in there :frowning:

Those are edge ngrams, not ngrams. Ngrams would give you:

r,g,b,rg,gb,rgb

If you apply the ngrams to the search term 'rb' then it'd search on
r,b,rb, so 'r' and 'b' would match. Of course, you'll end up with lots
of false positives.

Regexes may be a solution, in v0.90+ but you want to be sure that you
don't have a wildcard at the beginning of your regex, otherwise you will
be searching all terms in the index.

One last question: let's say an item with 6 sockets has 2 * 3 linked
sockets: "bgr bgr". If i search for bgr, it will return this item. Is
there a way to make it return only items that have this pattern twice?

using a match_phrase query "bgr bgr"

I think you need to use a multi field to analyze the same data in
multiple ways, then combine several queries to make exact matches more
relevant, but still to include partial ngram matches

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

@Clinton Thanks :slight_smile:

  1. Seems like the n-grams I listed are the same as yours in a different
    order :wink:

  2. "if you apply the ngrams to the search term". I don't understand that.
    Usually it would just search the term within the n-grams, not all the
    n-grams from the user input, right? Still, it's a lot of false positive...
    But I see your point in being ok with false positive and boosting the score
    of best matches. The thing is, users will probably search for scoket color,
    but order by other things, so I'm not sure how that would work.

  3. I understand that starting with a wildcard is really bad, but in somes
    cases, I don't think I'd have a choice if I go the regexp route.

Also, let's say I search for "gr* b" with a default operator "AND", would
it first search for b, and then do the regexp on only documents that match
b, or not?

Thanks :slight_smile:

On Thursday, March 28, 2013 9:11:32 PM UTC-4, Robin Boutros wrote:

Hey,

My question isn't reeeally specific to ES, but maybe you can help me.

I'm building a search engine for a game. In this game, items have sockets.

  • items can have from 0 to 6 sockets
  • some sockets can be linked
  • there are 3 types of sockets (Red, Green, Blue)
  • when 3 sockets are linked (for instance), we consider that they're all
    linked to each other, not just to the closest one.

My goal is to allow users to search for items with specific colors (with a
degree of incertainty)

The way a user would describe the socket combination like that:

"rrrbb" would be an item with 5 linked sockets, 3 red and 2 blue.
"ggg rb" would have 3 green socket linked and 1 red and 1 blue linked.

Still following?

When I index an item, I order the linked sockets alphabetically and by
group size (group size might not be necessary): "rb ggb" would become "bgg
br".

And I would reorder the user input the same way.

So, searching for an exact combination is super easy and fast(?). I
wouldn't even need to analyze the data, would I?

But I would like these scenarios to work:

"ggg" would find "ggg rb" -> this is easy with just a whitespace analyzer?

But what I'm struggling with would be:

"bg* r b" would match "bgrr r b" -> the way I'm doing things right now,
is: as soon as a group has a , I add * between every subgroup, so I would
search for: bg
r b.
It works, but I think it's really inefficient, isn't it?

Do you have an idea on how to index the data (analyzers, as a string or
terms or...) to make this possible an fast?

Thanks a looooot if you read until the end <3

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hiya

  1. "if you apply the ngrams to the search term". I don't understand
    that. Usually it would just search the term within the n-grams, not
    all the n-grams from the user input, right? Still, it's a lot of false
    positive... But I see your point in being ok with false positive and
    boosting the score of best matches. The thing is, users will probably
    search for scoket color, but order by other things, so I'm not sure
    how that would work.

You can mitigate the number of false positives by using
minimum_should_match, which understands a variety of formats, eg:

  • minimum_should_match: 2
    at least two terms should match

  • minimum_should_match: "60%"
    at least 60% of terms should match

  • minimum_should_match: "2<75% 8<-3"
    for 1 or 2 clauses, both are required
    for 3 to 8 clauses, 75% of them are required
    for 9 or more clauses, all but three of them are required

Also, let's say I search for "gr* b" with a default operator "AND",
would it first search for b, and then do the regexp on only documents
that match b, or not?

I've looked at your question on stackoverflow a few times, and started
answering it then discarded my answer. The problem is that I'm not
entirely sure what you want to achieve.

If you could fill it out with more details, specifically "given items
BGGR, BBBGG R, etc I want to be able to match X, Y but not Z, in order
Foo"

With the requirements more clearly delineated, it'll be easier to
suggest the right approach

ta

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.