How to create a custom analyzer that ignores accents and pt-BR stopwords using the Elasticsearch NEST API?

Hi community!

First of all, consider that I am using a "News" class (Noticia in Portuguese) that has a string field called "Content" (Conteudo in Portuguese):

public class Noticia
{
	public string Conteudo { get; set; } 
}

I am trying to create an index that is configured to ignore accents and pt-BR stopwords, and to allow up to 40 million characters to be analyzed in a highlighted query.

I can create such an index using this code:

var createIndexResponse = client.Indices.Create(indexName, c => c
    .Settings(s => s
        .Setting("highlight.max_analyzed_offset" , 40000000)
        .Analysis(analysis => analysis
            .TokenFilters(tokenfilters => tokenfilters
                .AsciiFolding("folding-accent", ft => ft
                )
                .Stop("stoping-br", st => st
                    .StopWords("_brazilian_")
                )
            )
            .Analyzers(analyzers => analyzers
                .Custom("folding-analyzer", cc => cc
                    .Tokenizer("standard")
                    .Filters("folding-accent", "stoping-br")
                )
            )
        )
    )
    .Map<Noticia>(mm => mm
        .AutoMap()
        .Properties(p => p
            .Text(t => t
                .Name(n => n.Conteudo)
                .Analyzer("folding-analyzer")
            )
        )
    )
);

If I test this analyzer using Kibana Dev Tools, I get the result that I want: No accents and stopwords removed!

POST intranet/_analyze
{
  "analyzer": "folding-analyzer",
  "text": "Férias de todos os funcionários"
}

Result:

{
  "tokens" : [
    {
      "token" : "Ferias",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "funcionarios",
      "start_offset" : 19,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

I get the same (good) results when I use NEST to analyze a query with my folding analyzer (the tokens "Ferias" and "funcionarios" are returned):

var analyzeResponse = client.Indices.Analyze(a => a
    .Index(indexName)
    .Analyzer("folding-analyzer")
    .Text("Férias de todos os funcionários")
);

However, if I perform a search using the NEST Elasticsearch .NET client, terms like "Férias" (with accent) and "Ferias" (without accent) are being treated as different.

My goal is to perform a query that returns all results, no matter whether the word is Férias or Ferias.

This is the simplified code (C# NEST) I am using to query Elasticsearch:

var searchResponse = ElasticClient.Search<Noticia>(s => s
    .Index(indexName)
    .Query(q => q
        .MultiMatch(m => m
            .Fields(f => f
                .Field(p => p.Titulo, 4)
                .Field(p => p.Conteudo, 2)
            )
            .Query(termo)
        )
    )
);

And this is the debug information for the API call associated with searchResponse:

Successful (200) low level call on POST: /intranet/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: ###NODE ADDRESS### Took: 00:00:00.3880295
# Request:
{"query":{"multi_match":{"fields":["categoria^1","titulo^4","ementa^3","conteudo^2","attachments.attachment.content^1"],"query":"Ferias"}},"size":100}
# Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 13.788051,
    "hits" : [
      {
        "_index" : "intranet",
        "_type" : "_doc",
        "_id" : "4934",
        "_score" : 13.788051,
        "_source" : {
          "conteudo" : "blablabla ferias blablabla",
          "attachments" : [ ],
          "categoria" : "Novidades da Biblioteca - DBD",
          "publicadaEm" : "2008-10-14T00:00:00",
          "titulo" : "INFORMATIVO DE DIREITO ADMINISTRATIVO E LRF - JUL/2008",
          "ementa" : "blablabla",
          "matriculaAutor" : 900794,
          "atualizadaEm" : "2009-02-03T13:44:00",
          "id" : 4934,
          "indexacaoAtiva" : true,
          "status" : "Disponível"
        }
      }
    ]
  }
}

I have also tried to use multi-fields and Suffix in the query, without success:

.Map<Noticia>(mm => mm
    .AutoMap()
    .Properties(p => p
        .Text(t => t
            .Name(n => n.Conteudo)
            .Analyzer("folding-analyzer")
            .Fields(f => f
                .Text(ss => ss
                    .Name("folding")
                    .Analyzer("folding-analyzer")
                )
            )
		
(...)

var searchResponse = ElasticClient.Search<Noticia>(s => s
    .Index(indexName)
    .Query(q => q
        .MultiMatch(m => m
            .Fields(f => f
                .Field(p => p.Titulo, 4)
                .Field(p => p.Conteudo.Suffix("folding"), 2)
            )
            .Query(termo)
        )
    )
);

Any clue what I am doing wrong or what I can do to reach my goal?

Thanks a lot in advance!

Please share the output of retrieving the index mappings from that index to ensure that the analyzer is properly configured for the field that you are trying to query.

Also, you can call the analyze API with a field instead of an analyzer; that way you can check whether the analyzer is actually applied to that field.
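
As a rough sketch of those two checks in Kibana Dev Tools, assuming the index is named intranet and the field is conteudo (both names are taken from the search request logged above, so adjust them to your setup):

# Retrieve the full mapping of the index (index name assumed from the log above)
GET intranet/_mapping

# Analyze with the analyzer the field is mapped to, instead of naming an analyzer
POST intranet/_analyze
{
  "field": "conteudo",
  "text": "Férias de todos os funcionários"
}

If the second request returns tokens that still carry accents or stopwords (for example "férias", "de", "todos"), the field is most likely falling back to the default standard analyzer rather than folding-analyzer.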

THANKS!

After a few days I found out what I was doing wrong and it was all about the mapping.

Here are the steps I took to track down the problem and finally solve it:

1 - First of all, I opened the Kibana console and found out that only the last of my mapped fields was being assigned my custom analyzer (folding-analyzer).

To test each one of your fields, you can use the get field mapping API with a command like this in Dev Tools:

GET /<index>/_mapping/field/<field>

Then you'll be able to see whether your analyzer is assigned to the field or not.
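
For instance, assuming the index is named intranet (as in the search log earlier in this thread), the check could look like the requests below. The field names are the ones that appear in that logged multi_match request; yours may differ.

# Single field (index and field names assumed from the search log above)
GET /intranet/_mapping/field/conteudo

# Nested field path, using dot notation
GET /intranet/_mapping/field/attachments.attachment.content

# All fields at once, using a wildcard
GET /intranet/_mapping/field/*

A field that uses the custom analyzer should show "analyzer": "folding-analyzer" inside its mapping entry. When that key is missing, the field is using the default analyzer.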

2 - After that, I discovered why the last field was the only one being assigned my custom analyzer: I was getting the fluent mapping wrong in two ways:

  • First, I had to chain my text properties correctly inside a single Properties call
  • Second, I was trying to map another POCO class in a separate Map<> clause, when I was supposed to use the Object<> clause

The mapping that worked for me looked roughly like this:

.Map<Noticia>(mm => mm
        .AutoMap()
        .Properties(p => p
            .Text(t => t
                .Name(n => n.Field1)
                .Analyzer("folding-analyzer")
            )
            .Text(t => t
                .Name(n => n.Field2)
                .Analyzer("folding-analyzer")
            )
            .Object<NoticiaArquivo>(o => o
                .Name(n => n.Arquivos)
                .Properties(eps => eps
                    .Text(s => s
                        .Name(e => e.NAField1)
                        .Analyzer("folding-analyzer")
                    )
                    .Text(s => s
                        .Name(e => e.NAField2)
                        .Analyzer("folding-analyzer")
                    )
                )
            )
        )
    )

Finally, it's worth noting that when you assign an analyzer using the .Analyzer("analyzerName") clause, you're telling Elasticsearch to use that analyzer both at index time and at search time.

If you want to use an analyzer only at search time and not at index time, you should use the .SearchAnalyzer("analyzerName") clause.
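
To make the difference concrete, here is a sketch of the equivalent raw request for Dev Tools. The index name noticias-exemplo is hypothetical, and the choice of the built-in standard analyzer for search is only an illustration; the filter and analyzer definitions mirror the NEST code from the question.

# Hypothetical index name; analysis settings mirror the NEST code in the question
PUT noticias-exemplo
{
  "settings": {
    "analysis": {
      "filter": {
        "folding-accent": { "type": "asciifolding" },
        "stoping-br": { "type": "stop", "stopwords": "_brazilian_" }
      },
      "analyzer": {
        "folding-analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["folding-accent", "stoping-br"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "conteudo": {
        "type": "text",
        "analyzer": "folding-analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

Here "analyzer" corresponds to .Analyzer(...) and "search_analyzer" to .SearchAnalyzer(...). For the accent-insensitive search discussed in this thread you normally want the same analyzer on both sides, so setting only "analyzer" is enough.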
