Search 'U2' in music catalogue

Hi,

I have quite a large music catalogue and I'm playing with
Elasticsearch. I have already indexed my albums (properties: name
(boost 5.0), genre, artist (boost 3.0), array of tracks) using a
lowercase tokenizer with an asciifolding filter (the same analyzer is
used for search). Now I would like to search for the query 'U2', but it
keeps returning albums with names like 'R U Mine?' or 'Got 2 Luv U',
and I couldn't find a way to either rank albums with artist name 'U2'
highest or ignore documents containing just 'U'. Is there any way to
configure the request to do this?

I have already tried disabling fuzziness, and so on, but nothing seems
to help. Apparently, ES performs a left and right trim of numeric
characters. What should I do to make ES search for an exact
alphanumeric phrase?

Thanks

For artists, you can use a multi_field mapping and, for the second
sub-field, apply a keyword (+ lowercase) analyzer.

With a keyword analyzer, 'U2' won't be broken into tokens.
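A sketch of what that could look like (the sub-field and analyzer names below are invented for illustration; the multi_field syntax matches the ES versions of that era): the default sub-field keeps the existing tokenized analysis, while an 'exact' sub-field uses the keyword tokenizer plus lowercase (and asciifolding) so the whole artist name becomes a single token:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "artist_exact": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "album": {
      "properties": {
        "artist": {
          "type": "multi_field",
          "fields": {
            "artist": { "type": "string", "boost": 3.0 },
            "exact":  { "type": "string", "analyzer": "artist_exact" }
          }
        }
      }
    }
  }
}
```

A query against artist.exact for 'u2' then matches only documents whose entire artist name analyzes to that one token, and a match there can be boosted above the tokenized fields.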

HTH
David :wink:
Twitter : @dadoonet / @elasticsearchfr

On 2 Apr 2012, at 03:26, Ales Kafka ales.kafka@gmail.com wrote:


On Mon, Apr 2, 2012 at 03:26, Ales Kafka ales.kafka@gmail.com wrote:

I have already tried disabling fuzziness, and so on, but nothing seems to
help. Apparently, ES performs a left and right trim of numeric
characters. What should I do to make ES search for an exact
alphanumeric phrase?

It doesn't do that, here's how "U2" is analyzed with the standard analyzer:

$ searchanalyze -t U2 -a standard
{
  'tokens' => [
    {
      'end_offset' => 2,
      'position' => 1,
      'start_offset' => 0,
      'token' => 'u2',
      'type' => '<ALPHANUM>'
    }
  ]
}

You are right. The standard analyzer analyzes the word 'U2' correctly,
as one token, as intended.

Now I have found two other problems I wish to capture. Even the standard
analyzer splits the query 'Ke$ha' into the tokens 'ke' and 'ha'. The same
problem occurs with the query 'p!nk'. Is there any way to adjust the
standard analyzer not to split these kinds of queries?

I assume that I could use the 'whitespace' tokenizer, but I would lose
power in other types of queries, like '1998/1999', which would no longer
be split into two years.
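As a rough illustration of that trade-off (this is only a crude stand-in for Lucene's actual tokenizers, not their real implementation), splitting on non-alphanumeric characters versus splitting on whitespace gives:

```python
import re

def standardish(text):
    # Crude stand-in for the standard analyzer: lowercase, then
    # split on anything that is not an ASCII letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def whitespaceish(text):
    # Stand-in for the whitespace tokenizer plus a lowercase filter.
    return text.lower().split()

for q in ["U2", "Ke$ha", "p!nk", "1998/1999"]:
    print(q, standardish(q), whitespaceish(q))
```

'Ke$ha' and 'p!nk' survive as single tokens only under whitespace splitting, while '1998/1999' is split into two years only by the standard-style splitting.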


Now I have found two other problems I wish to capture. Even the standard
analyzer splits the query 'Ke$ha' into the tokens 'ke' and 'ha'. The same
problem occurs with the query 'p!nk'. Is there any way to adjust the
standard analyzer not to split these kinds of queries?

You could build a custom set of synonyms based on artist names that
you pull automatically from DBpedia or some other source. This would
both allow you to use different names as synonyms and prevent
certain character sequences from being broken apart if they have a
symbol in them. For instance, I am considering the following as all
being the same token:
plug-in, plug in, plugin, plug-ins, plug ins, plugins
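In Elasticsearch terms, such a list could be wired in as a synonym token filter (the filter and analyzer names below are invented for illustration). Note that a token filter runs after the tokenizer, so this works for variants that tokenize cleanly, like the plug-in example:

```json
{
  "analysis": {
    "filter": {
      "artist_synonyms": {
        "type": "synonym",
        "synonyms": [
          "plug-in, plug in, plugin, plug-ins, plug ins, plugins"
        ]
      }
    },
    "analyzer": {
      "with_synonyms": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "artist_synonyms"]
      }
    }
  }
}
```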

Thanks for the interesting idea. But I have gone through the topics that
cover synonyms in ES and I am not sure how to implement it. As far as I
understand it, the standard tokenizer splits text into tokens first
(i.e. 'p!nk' into 'p' and 'nk'), so a synonym definition in the Synonym
Token Filter ('p!nk => pink') couldn't be applied.

But I think I can live with the default standard analyzer, since the
results are correct. (If default_operator: 'AND' is used, it basically
doesn't matter whether 'p!nk' is indexed as one token or two.)
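A toy sketch of that last point (pure illustration, not how Elasticsearch actually evaluates queries): with AND semantics every analyzed query token must occur in the document, so 'p!nk' splitting into 'p' and 'nk' still only matches documents that produce the same two tokens:

```python
import re

def analyze(text):
    # Stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def matches_with_and(query, document):
    # default_operator 'AND': every query token must appear in the document.
    doc_tokens = set(analyze(document))
    return all(t in doc_tokens for t in analyze(query))

print(matches_with_and("p!nk", "P!nk Greatest Hits"))  # True
print(matches_with_and("p!nk", "Pink Floyd"))          # False
```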

On Monday, 2 April 2012 at 18:54:48 UTC+2, Greg Brown wrote:
