Using different analysers based on the document language

Lucas_Corte_Real_Sal · May 2, 2013, 1:41pm

Hi everyone!

Sorry if this is a dumb question, but I can't find anything about this in
the documentation or in forums on the web.

In my office we are using ES to index news crawled from the web. We crawl
news in three languages: English, Portuguese and Spanish. All news items
go into the same index (they are not separated by language during
indexing).

We have a service that queries this index and returns some documents to
the client. Here is my problem: I would like to choose which analyser is
used based on the document language, mainly because of the stopwords
(which are specific to each language).

Here is a simple example. Imagine that my query is "Rio de Janeiro", so I
have 3 tokens ("Rio", "de", "Janeiro"). For English and Spanish there
are no stopwords in this query, but in Portuguese the token "de" is a
stopword. So, when the query is evaluated against a document, I want ES
to treat "de" as a stopword if the document language is Portuguese, and
not if it is English or Spanish.
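
To make the example concrete, here is a minimal Python sketch of what
language-dependent stopword removal does to this query. The stopword sets
are tiny illustrative subsets, not the real Lucene lists, and the
`analyze` helper is hypothetical. Spanish is left out on purpose: the
stock Spanish stop list shipped with Lucene also lists "de", so it is
worth checking each language with the _analyze API before relying on the
assumption above.

```python
# Tiny illustrative stopword subsets; NOT the real Lucene stop lists.
STOPWORDS = {
    "english": {"the", "of", "and"},
    "portuguese": {"de", "da", "do", "e"},
}


def analyze(text, language):
    """Lowercase, split on whitespace, and drop that language's stopwords."""
    stop = STOPWORDS[language]
    return [token for token in text.lower().split() if token not in stop]


print(analyze("Rio de Janeiro", "english"))     # ['rio', 'de', 'janeiro']
print(analyze("Rio de Janeiro", "portuguese"))  # ['rio', 'janeiro']
```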

Doing some research, I figured out that I have 3 choices:

1- Index my documents separately by language, specifying for each index
which analyser must be used. (HARD because the software is already in
production);

2- Build an adapter between my service and the ES API. This adapter would
split the original query into three queries, one for each language. Each
of these queries would search for documents of its respective language
using the respective analyser;

3- Find some magic functionality that allows me to choose the correct
analyser that should be used for each document, based on the document
language (I don't think that this option exists).
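
As a rough sketch of option 2, the adapter could build one clause per
language and send them together in a single request. Everything named
here is an assumption: that each document carries a `language` field,
that the text lives in a `body` field, that analysers called `en_stop`,
`pt_stop` and `es_stop` are defined in the index settings, and that the
ES version supports `bool` queries with a `filter` clause. Because the
index was built with a single analyser, the query-time analysers should
probably only remove stopwords and not stem.

```python
# Hypothetical names: the "language" and "body" fields and the *_stop
# analysers are assumptions, not taken from the original setup.
ANALYZER_BY_LANGUAGE = {
    "english": "en_stop",
    "portuguese": "pt_stop",
    "spanish": "es_stop",
}


def language_clause(text, language, analyzer, field="body"):
    """Match `text` with one analyser, restricted to one document language."""
    return {
        "bool": {
            "must": {"match": {field: {"query": text, "analyzer": analyzer}}},
            "filter": {"term": {"language": language}},
        }
    }


def build_query(text, field="body"):
    """Combine the per-language clauses into a single search request body."""
    clauses = [
        language_clause(text, language, analyzer, field)
        for language, analyzer in ANALYZER_BY_LANGUAGE.items()
    ]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}
```

Sending the three clauses in one request keeps from/size pagination on
the ES side, instead of merging three result lists in the adapter.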

So, I am looking for guidance on which way I should choose. If anyone has
another option, please let me know.

Best regards,
Lucas Saldanha

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Caio_D_Angelo · May 2, 2013, 8:11pm

We dealt with something like this here. In short:

Your first option was our final choice, although our system wasn't in
production yet at that time.

Your second option gets a little complicated when it comes to pagination.
We considered it, but didn't take it.

Your third would be great, but we couldn't find the trick in time to
deploy our solution.

It turned out later that you can specify different analyzers at indexing
time, but we didn't spend much time trying this.

Maybe you can go this way: index with different analyzers and see what
happens when searching. I would surely give it a try.
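
For reference, the per-language index setup from the first option could
look roughly like the sketch below. The `english`, `portuguese` and
`spanish` analyzers are built into Elasticsearch; the index naming scheme,
the language codes, the field names and the typeless `text` mapping (which
is the syntax of recent ES versions, not the one available in 2013) are
all assumptions made for illustration.

```python
# Built-in Elasticsearch language analyzers, keyed by a hypothetical
# two-letter language code.
ANALYZERS = {"en": "english", "pt": "portuguese", "es": "spanish"}


def index_name(language, prefix="news"):
    """Name of the per-language index, e.g. 'news_pt' (naming is assumed)."""
    if language not in ANALYZERS:
        raise ValueError(f"unsupported language: {language}")
    return f"{prefix}_{language}"


def index_body(language, text_fields=("title", "body")):
    """Create-index body that maps each text field to the language analyzer."""
    analyzer = ANALYZERS[language]
    return {
        "mappings": {
            "properties": {
                name: {"type": "text", "analyzer": analyzer}
                for name in text_fields
            }
        }
    }
```

Each crawled document would then be sent to `index_name(language)`, and a
search across all languages can target the three indices at once.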

