Multiple analyzers in "_all" field?


(Paweł Młynarczyk) #1

Hi

I've got multilingual documents to index. I want to build a full-text
search, so my first idea was to use a query_string query against the _all
field. The problem is that the _all field has its own analyzer, so the
field-specific analyzers are not used (the data is not analyzed properly).
Is there a way to use each field's appropriate analyzer when copying data
to _all, instead of just reanalyzing it with the _all analyzer?

Creating a separate index for each language is not a good solution in my
case, because I've got millions of documents and every one of them contains
fields in more than one language plus a number of language-independent
fields. That means I would end up with heavily duplicated data in every
index.

Thanks in advance

Paweł Młynarczyk

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

I would not use the _all field for that. I would probably disable it and then use the multi_field type on your "content" field, probably with one sub-field per language.

See: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
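
A minimal sketch of what such a mapping could look like, written as a Python dict so it can be dumped to JSON. It assumes the `multi_field` mapping syntax of the Elasticsearch releases current at the time of this thread (as far as I know, later releases express the same idea with a `fields` parameter on a regular field). The type name `doc`, the helper `build_mapping` and the language list are hypothetical; `french`, `english` and `german` are built-in language analyzers.

```python
import json

# Hypothetical sub-field suffixes mapped to built-in language analyzers.
LANGUAGE_ANALYZERS = {"fr": "french", "en": "english", "de": "german"}


def build_mapping(doc_type, field, analyzers):
    """Return a mapping that disables _all and indexes `field` once per analyzer."""
    sub_fields = {
        # The sub-field named like the parent acts as the default one.
        field: {"type": "string"},
    }
    for suffix, analyzer in analyzers.items():
        sub_fields[suffix] = {"type": "string", "analyzer": analyzer}
    return {
        doc_type: {
            "_all": {"enabled": False},
            "properties": {
                field: {"type": "multi_field", "fields": sub_fields},
            },
        }
    }


mapping = build_mapping("doc", "content", LANGUAGE_ANALYZERS)
print(json.dumps(mapping, indent=2, sort_keys=True))
```

With this mapping the sub-fields would be addressed as content.fr, content.en and content.de at query time.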

HTH

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

In reply to Paweł Młynarczyk (zwarios@gmail.com), 8 November 2013 at 16:40:23.



(Paweł Młynarczyk) #3

Thanks for your reply.

You suggest creating a multi-field equivalent of the '_all' field, but
isn't it wasteful to analyze all the language-dependent data with every
analyzer? I mean, if I created that kind of custom '_all' field and put
the aggregated data from all the language-dependent fields there, then I
would end up with X '_all' fields (where X is the number of languages),
right? Additionally, would I have any way to boost a particular, more
important field? (In my case, every language has more than one field, and
some of them are more important.)

Paweł Młynarczyk

In reply to David Pilato, Friday, 8 November 2013, 17:51:41 UTC+1.



(David Pilato) #4

I don't understand. What can you do?

There are two options for me:

For that last item, how can you proceed? Pick a random analyzer? Try more than one analyzer? Use no analyzer at all? (You could use ngrams to try to make it work, but I guess that would generate a lot of false positives.)
More than one analyzer means multi_field. You're right: behind the scenes, it's like having X fields, one for each language. But instead of providing a JSON document like:

{
  "content_fr": "mon contenu francais",
  "content_en": "mon contenu francais",
  "content_de": "mon contenu francais"
}

you will be able to provide:

{
  "content": "mon contenu francais"
}

So in terms of _source storage, you won't pay the price three times. In terms of the inverted index, yes, you will consume space for content.fr, content.en and content.de.
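
To make the trade-off concrete, here is a rough sketch of the query side, assuming a "content" multi_field with fr, en and de sub-fields. The helper name `build_language_query` and the language tuple are hypothetical; `multi_match` is the standard query for matching one text against several fields.

```python
import json


def build_language_query(text, field="content", languages=("fr", "en", "de")):
    """Search every per-language sub-field of a multi_field with one query."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["%s.%s" % (field, lang) for lang in languages],
            }
        }
    }


# The document is sent (and stored in _source) once; only the inverted
# index holds one analyzed variant per language sub-field.
document = {"content": "mon contenu francais"}
query = build_language_query("contenu")
print(json.dumps(document))
print(json.dumps(query, indent=2))
```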

Makes sense?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

In reply to Paweł Młynarczyk (zwarios@gmail.com), 8 November 2013 at 19:22:55.



(Paweł Młynarczyk) #5

Yes it does make sense.
The thing is, I've got a doc that has fields:
name.en,
name.fr,
name.de,
name.pl,
description.en,
description.fr,
description.de,
description.pl,
etc

plus a lot of other, language-independent fields.

Creating the 'content' multi_field would aggregate data from multiple
languages in each document. That would make the index huge, wouldn't it?
Additionally, I'd have no way to boost a particular field (name in this
case) so that it counts for more in the search.

I've already tried a couple of solutions, but every one of them has some
drawbacks:

  1. Using the '_all' field - in this case I can't use language-specific
    analyzers.
  2. Using a separate index for every language - in this case the
    language-independent data of each doc has to be indexed multiple times.
  3. Using the 'fields' attribute of the query_string query to run a text
    query on the right language fields. This seems like a good one, but
    since I've got a lot of fields to run the query against, query
    performance drops compared with a query against the '_all' field.
  4. Aggregating the language-dependent data into one field and running a
    query_string query against that field. In this case I cannot boost a
    particular field.
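
For reference, option 3 can be sketched roughly as follows, assuming the name.<lang> and description.<lang> fields listed above. The boost values and the helper `build_query_for_language` are hypothetical; the `field^boost` notation is the per-field boost syntax accepted in the `fields` list of a query_string query.

```python
import json

# Hypothetical boosts: a match in "name" counts three times as much.
FIELD_BOOSTS = {"name": 3, "description": 1}


def build_query_for_language(text, lang, boosts=FIELD_BOOSTS):
    """Build a query_string query restricted to the fields of one language."""
    fields = []
    for field, boost in sorted(boosts.items()):
        name = "%s.%s" % (field, lang)
        # Append "^boost" only when the boost differs from the default of 1.
        fields.append(name if boost == 1 else "%s^%s" % (name, boost))
    return {"query": {"query_string": {"query": text, "fields": fields}}}


query = build_query_for_language("rower", "pl")
print(json.dumps(query, indent=2))
```

Restricting the field list to the user's language keeps the number of queried fields small, which is the part that affects performance here.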

I am looking to minimise the drawbacks. Any other ideas?

In reply to David Pilato, Friday, 8 November 2013, 19:37:44 UTC+1.



(Jörg Prante) #6

You can use the _all field with the combo analyzer.

It can concatenate the tokens from many analyzers into one field. Note
that scoring is a little bit skewed.

Do not forget to use the unique token filter to reduce multiple
occurrences of the same token:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html
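
To illustrate the idea only (this is not the combo analyzer plugin or its configuration), here is a toy sketch in plain Python of merging the token streams of several analyzers and dropping duplicates that land on the same position. Both "analyzers" and the `combine` helper are made up for the example; the deduplication mimics what the unique token filter does when restricted to tokens at the same position.

```python
def analyze_plain(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def analyze_strip_plural(text):
    """Toy analyzer: like analyze_plain, but drops a trailing 's'."""
    return [t[:-1] if t.endswith("s") and len(t) > 1 else t
            for t in analyze_plain(text)]


def combine(text, analyzers, unique=False):
    """Merge token streams position by position, optionally deduplicating.

    Assumes every analyzer emits exactly one token per input word, which
    holds for the two toy analyzers above.
    """
    streams = [analyzer(text) for analyzer in analyzers]
    combined = []
    for position, tokens in enumerate(zip(*streams)):
        seen = set()
        for token in tokens:
            if unique and token in seen:
                continue  # same token already emitted at this position
            seen.add(token)
            combined.append((position, token))
    return combined


analyzers = [analyze_plain, analyze_strip_plural]
print(combine("Red cars", analyzers))
print(combine("Red cars", analyzers, unique=True))
```

Without deduplication the token "red" is emitted twice for a single occurrence in the text, which inflates its term frequency; that is one plausible reason why scoring ends up skewed.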

Jörg


