Searching for non-English text

I have designed and implemented full-text search, proximity search, match, match phrase, more-like-this (MLT), and wildcard queries for English.

Now I am trying to identify which features will be lost or degraded when we have documents in other languages such as French, Italian, and Dutch.

I need to understand what happens when an English-specific index is used to store non-English text, and which search features will not work.

Any answers?

The most common things that are language specific are:

  • Stemming: Typically a user expects a search for "dog" to also find documents that contain "dogs", and a search for "run" to return documents with "running" in the text. To support this, you typically want to store the root form of each word in the index and search against it as well. The rules for things like pluralization vary by language, so "remove 's' from every word and store that" doesn't make sense in many languages. You'll want to use a language-appropriate stemmer/analyzer for optimal retrieval (there's a French analyzer sketch after this list).
  • Lemmatization: Going a step beyond most stemming algorithms, which tend to lop off word endings, some languages/words follow entirely different rules for things like plurals. "Mice" is the plural of "mouse" in English, and there's no algorithm that gets you between the two. For these, you can add dictionary-based approaches (see the stemmer_override sketch after this list).
  • Folding: In some languages, particularly those with a lot of borrowed words, diacritics may be used inconsistently. English is one of them: one document may have "café" indexed while another has "cafe". It's common to use folding to normalize words both in the index and at search time, in case somebody leaves out the diacritic in one of the two places (the asciifolding filter in the French sketch below does this).
  • Normalizing other characters: Some languages/character sets have multiple ways of writing things like numbers. For example, in Japanese the numbers 1, 2, and 3 are sometimes written 一, 二, and 三, but sometimes the Arabic characters (1, 2, 3) are used. There are token filters for this (see the kuromoji_number sketch after this list).
  • Collation: In English, the alphabet is a, b, c, ..., z, so if you give your users the ability to sort alphabetically, this is what they'd expect for English documents. The Greek alphabet is very different (α, β, γ, ..., ω), and even languages that use the Latin script can have slightly different expected sort orders. There are also languages like German that add extra characters (e.g. ß) to a mostly-Latin alphabet. If you introduce alphabetic sorting in your application, you'll want to look at ICU collation (sketch after this list).
  • Stopwords: Stopwords aren't always helpful (they can be a net negative to introduce), but if you want them, they're language dependent. For example, the term "con" is very common in Spanish and is often a stopword there, but it's much rarer and means something very important in English, so you probably wouldn't want "con" to be a stopword in most indices that have English text. The built-in language analyzers include stopword lists, though you can specify your own, since stopwords tend to be not just language dependent but data dependent (the French sketch below uses the built-in _french_ list).
  • Decompounding: Some languages, German being the classic example, combine lots of concepts into one long string of characters without any spaces. For those languages there's decompounding (see the dictionary_decompounder sketch after this list).
  • Word segmentation: Some languages (particularly some Asian languages) don't put spaces between words, and you ideally want to determine where one concept stops and the next starts at both index and search time. You can simply split the text into 1-, 2-, or 3-character strings and index them all, or you can do something more intelligent like a probabilistic or dictionary-based approach. If you have Chinese, Korean, or Japanese text, you'll want to look at this type of thing (see the icu_tokenizer sketch after this list).
  • Synonyms: Synonyms tend to be not just language-specific but data-specific as well. Elasticsearch supports synonyms, and you may want to create a separate synonym list per language based on your use case (see the synonym_graph sketch after this list).
  • Word frequencies: Related to stopwords, the frequency of words varies by language. Again, "con" is less common in English than in Spanish. Elasticsearch estimates the relevance of a term by how common it is in the corpus and in your query. If it sees "con" in a huge portion of documents, it assumes "con" carries less importance, so when you search for "con man", the "con" part will contribute less to the relevance score than "man". That's what happens if you have a big jumble of English and Spanish documents in one index, for example.
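
To make the stemming, folding, and stopword points concrete, here's a rough sketch of a custom French index in Kibana Dev Tools syntax (recent Elasticsearch versions; the index, analyzer, and filter names are made up, but the token filters are all built in):

```
# Hypothetical French index: elision + stopwords + light stemming + ASCII folding
PUT /mydocs-fr
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c"]
        },
        "french_stop": {
          "type": "stop",
          "stopwords": "_french_"
        },
        "french_light_stemmer": {
          "type": "stemmer",
          "language": "light_french"
        }
      },
      "analyzer": {
        "french_custom": {
          "tokenizer": "standard",
          "filter": [
            "french_elision",
            "lowercase",
            "french_stop",
            "french_light_stemmer",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "french_custom"
      }
    }
  }
}
```

With something like this, a search for "chiens" should also match documents containing "chien", and the trailing asciifolding means "café" and "cafe" end up as the same indexed term.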
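
For the dictionary-based side of lemmatization, the stemmer_override token filter lets you hard-code irregular forms ahead of an algorithmic stemmer. A quick inline test (the rules here are just illustrative):

```
# Map irregular plurals to their lemma before the algorithmic stemmer runs
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stemmer_override",
      "rules": ["mice => mouse", "feet => foot"]
    },
    "porter_stem"
  ],
  "text": "mice and feet"
}
```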
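
For the Japanese number case, the analysis-kuromoji plugin (installed separately) has a kuromoji_number token filter that rewrites kanji numerals to Arabic digits; the analysis-icu plugin's icu_normalizer similarly handles things like full-width digits. A sketch:

```
# Requires the analysis-kuromoji plugin; the output token here is "1000"
GET /_analyze
{
  "tokenizer": "kuromoji_tokenizer",
  "filter": ["kuromoji_number"],
  "text": "一〇〇〇"
}
```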
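
For collation, the analysis-icu plugin provides an icu_collation_keyword field type. You typically add it as a sub-field and sort on it; a sketch for German, with hypothetical index and field names:

```
# Requires the analysis-icu plugin
PUT /mydocs-de
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "sort": {
            "type": "icu_collation_keyword",
            "index": false,
            "language": "de"
          }
        }
      }
    }
  }
}

# Sorts hits using German collation rules
GET /mydocs-de/_search
{
  "sort": "title.sort"
}
```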
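
For decompounding, Elasticsearch ships dictionary_decompounder and hyphenation_decompounder token filters. A quick inline test with a tiny made-up word list:

```
# Splits the German compound into its dictionary parts
# (emits donau, dampf, schiff, fahrt alongside the original token)
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "dictionary_decompounder",
      "word_list": ["donau", "dampf", "schiff", "fahrt"]
    }
  ],
  "text": "Donaudampfschifffahrt"
}
```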
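
For word segmentation, the icu_tokenizer from the analysis-icu plugin uses a dictionary-based approach for unspaced scripts, and there are dedicated plugins per language (analysis-kuromoji for Japanese, analysis-nori for Korean, analysis-smartcn for Chinese). A sketch:

```
# Requires the analysis-icu plugin; splits the unspaced Japanese
# sentence ("I live in Tokyo") into word-level tokens
GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "東京都に住んでいます"
}
```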
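
For synonyms, a common setup is a synonym_graph filter applied at search time only, so you can update the list without reindexing. A sketch with made-up names and rules:

```
# Hypothetical English index with search-time synonym expansion
PUT /mydocs-en-syn
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["tv, television", "car, automobile"]
        }
      },
      "analyzer": {
        "english_with_synonyms": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "english_with_synonyms"
      }
    }
  }
}
```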

Because different languages have different rules, different term frequencies, and different types of queries that users ask, I'd strongly encourage you to separate documents into language-specific indices if language-specific search is important to you. A common approach is to set up something like mydocs-en for English documents, mydocs-fr for French, and so on. If a user really wants to search across languages, you can query multiple indices at once with a wildcard pattern like mydocs-*, as sketched below.
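
For example (index names hypothetical, each index getting its own language-specific analyzer as sketched above):

```
# One index per language
PUT /mydocs-en
PUT /mydocs-fr
PUT /mydocs-nl

# Search across all of them with a wildcard index pattern
GET /mydocs-*/_search
{
  "query": {
    "match": { "body": "your search terms" }
  }
}
```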
