Multiple analyzers in "_all" field?

Pawel_Mlynarczyk · November 8, 2013, 3:40pm

Hi

I've got multilingual documents to index. I want to create a full-text
search, so the first thing on my mind was to use a string query with the
_all field. The problem is that the _all field has its own analyzer, so the
field-specific analyzers are not used (data is not analyzed properly). Is
there a way to use each field's appropriate analyzer when copying data to
_all instead of just reanalyzing it with _all's analyzer?

Creating a separate index for each language is not a good solution for my
case, because I've got millions of documents and every one of them contains
fields in more than one language plus a number of language-independent
fields. That means I would end up with heavily duplicated data in every
index.

Thanks in advance

Paweł Młynarczyk

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · November 8, 2013, 4:51pm

I would not use the _all field for that. I would probably disable it and then use the multifield type on your "content" field, probably with one sub-field per language.

See: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
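
A minimal sketch of what such a mapping could look like, built as a plain Python dict so it can be inspected before being sent to the cluster. It assumes the multi_field syntax of the Elasticsearch releases current at the time of this thread; the type name "doc", the choice of three languages, and the helper name build_mapping are placeholders, not anything from the original posts.

```python
import json

# Assumed set of languages, mapped to Elasticsearch's built-in language analyzers.
LANGUAGE_ANALYZERS = {"fr": "french", "en": "english", "de": "german"}


def build_mapping(type_name="doc", field="content"):
    """Return a mapping with _all disabled and one sub-field per language."""
    # The sub-field named like the parent is the default one for a multi_field.
    sub_fields = {field: {"type": "string", "analyzer": "standard"}}
    for lang, analyzer in sorted(LANGUAGE_ANALYZERS.items()):
        # Searchable as content.fr, content.en, content.de.
        sub_fields[lang] = {"type": "string", "analyzer": analyzer}
    return {
        type_name: {
            "_all": {"enabled": False},
            "properties": {
                field: {"type": "multi_field", "fields": sub_fields},
            },
        }
    }


mapping = build_mapping()
print(json.dumps(mapping, indent=2, sort_keys=True))
```

The document itself then carries only a single "content" value; the per-language analysis happens at index time.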

HTH

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Pawel_Mlynarczyk · November 8, 2013, 6:22pm

Thanks for your reply.

You suggest creating a multi-field equivalent of the '_all' field, but
isn't it a waste to analyze all the language-dependent data with every
analyzer? I mean, if I created that kind of custom '_all' field and put
aggregated data from all the language-dependent fields there, then I would
end up having X '_all' fields (where X is the number of languages), right?
Additionally, would I have any option to boost a particular, more important
field? (In my case, every language has more than one field and some of them
are more important.)

Paweł Młynarczyk

dadoonet · November 8, 2013, 6:37pm

I don't understand. What can you do?

There are two options for me:

1. You know the language in advance (because you can detect it or whatever). In that case, use the _analyzer field to extract the analyzer to apply from the JSON document itself: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-analyzer-field.html
2. You don't know the language.
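
For the case where the language is known in advance, here is a rough sketch of the _analyzer mapping and a matching document, again as Python dicts. It assumes the _analyzer mapping of the Elasticsearch releases current at the time of this thread, where "path" names the document field that holds the analyzer name. The field name "lang_analyzer", the type name "doc", and the helper make_document are made up for illustration.

```python
import json

# Hypothetical name of the document field that carries the analyzer to use.
ANALYZER_FIELD = "lang_analyzer"

# "doc" is a placeholder type name.
mapping = {
    "doc": {
        # Tell Elasticsearch to read the index-time analyzer from this field.
        "_analyzer": {"path": ANALYZER_FIELD},
        "properties": {
            # No explicit analyzer here, so the per-document one can apply.
            "content": {"type": "string"},
        },
    }
}


def make_document(text, analyzer):
    """Build a document that names its own analyzer (e.g. 'french')."""
    return {"content": text, ANALYZER_FIELD: analyzer}


document = make_document("mon contenu francais", "french")
print(json.dumps(mapping, indent=2, sort_keys=True))
print(json.dumps(document, sort_keys=True))
```

This only helps when each document is in a single language; a document mixing several languages still needs one field per language.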

For the second case, how can you proceed? A random analyzer? Try more than one analyzer? No analyzer at all? (You may want to use ngrams to try to make it work, but I guess that will generate a lot of false positives.)
More than one analyzer is multifield. You're right: behind the scenes, it's like having X fields, one for each language. But instead of providing a JSON document like:

{
"content_fr":"mon contenu francais",
"content_en":"mon contenu francais",
"content_de":"mon contenu francais"
}

You will be able to provide:
{
"content":"mon contenu francais"
}

So in terms of _source storage, you won't pay the price three times. In terms of the inverted index, yes, you will consume space for content.fr, content.en and content.de.

Makes sense?

Pawel_Mlynarczyk · November 10, 2013, 11:15am

Yes it does make sense.
The thing is, I've got a doc that has fields:
name.en,
name.fr,
name.de,
name.pl,
description.en,
description.fr,
description.de,
description.pl,
etc.

plus a lot of other, language-independent fields.

Creating the 'content' multi field would aggregate data from multiple
languages in each document. That would cause the index to be huge, wouldn't
it? Additionally, I'd have no option to boost a particular field (name in
this case) to make it more important within the search.

I've already tried a couple of solutions, but every one of them has some
drawbacks:

1. Using the '_all' field: in this case I can't use language-specific
   analyzers.
2. Using a separate index for every language: in this case the
   language-independent data of each doc has to be indexed multiple times.
3. Using the 'fields' attribute of the 'string query' to run a text query on
   the right language fields. This seems a good one, but since I've got a lot
   of fields to run this query against, query performance drops compared to
   a query against the '_all' field.
4. Aggregating language-dependent data into one field and executing a string
   query against that field. In this case I cannot boost a particular field.

I am looking to minimise the drawbacks. Any other ideas?
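
For reference, the 'fields' approach with a boost on the name field can be sketched like this. It builds a query_string request body as a Python dict, using the field^boost syntax that query_string accepts in its "fields" list. The helper name build_query, the boost value of 3, and the sample search term are arbitrary choices for illustration, and the sketch does nothing about the performance concern mentioned above.

```python
import json


def build_query(text, lang, name_boost=3):
    """Build a query_string body limited to one language's fields."""
    fields = [
        # "^N" boosts matches in this field relative to the others.
        "name.%s^%d" % (lang, name_boost),
        "description.%s" % lang,
    ]
    return {"query": {"query_string": {"query": text, "fields": fields}}}


query = build_query("rower", "pl")
print(json.dumps(query, indent=2, sort_keys=True))
```

Language-independent fields could be appended to the same list when they should be searched regardless of the chosen language.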

jprante · November 11, 2013, 9:05am

You can use the _all field with the combo analyzer.

It can concatenate tokens from many analyzers into one field. Note that
scoring is a little bit skewed.

Do not forget to use the unique token filter to reduce multiple occurrences
of tokens:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html
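
A rough sketch of how the index settings and the _all mapping might be wired together, as Python dicts. The combo analyzer comes from a separate plugin, and the option names used here ("type": "combo" and "sub_analyzers") are written from memory of that plugin's README, so treat them as assumptions and check them against the plugin version you install. The analyzer name "multilingual" and the type name "doc" are placeholders. The unique token filter is left out because how it chains onto the combined token stream depends on the plugin.

```python
import json

# Assumed list of built-in language analyzers to combine.
SUB_ANALYZERS = ["english", "french", "german"]

# Index settings: option names for the combo analyzer are an assumption
# (recalled from the plugin README), not verified against a given version.
settings = {
    "analysis": {
        "analyzer": {
            "multilingual": {  # hypothetical analyzer name
                "type": "combo",
                "sub_analyzers": SUB_ANALYZERS,
            }
        }
    }
}

# Mapping: point the _all field at the combined analyzer.
mapping = {"doc": {"_all": {"analyzer": "multilingual"}}}

print(json.dumps(settings, indent=2, sort_keys=True))
print(json.dumps(mapping, indent=2, sort_keys=True))
```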

Jörg
