Fewer documents indexed when using custom analyzer (NEST client)

Mario_Klisanic · July 11, 2021, 1:01pm

Hello everyone,

While working on a project of mine (using the NEST client for .NET), I noticed that the total number of documents indexed differs depending on whether the index uses a custom analyzer.

My code for creating an index WITHOUT an analyzer looks like this:

    var createIndexResponse = _elasticJsonClient.Indices.Create("jsonindex", c => c
        .Map<CustomControl>(m => m.AutoMap()));

My code for creating an index WITH an analyzer looks like this:

    var createIndexResponse = _elasticJsonClient.Indices.Create("jsonindex", c => c
        .Settings(st => st
            // Largest allowed difference between MaxGram and MinGram for this index.
            .Setting(UpdatableIndexSettings.MaxNGramDiff, 18)
            .Analysis(an => an
                .Analyzers(anz => anz
                    // Custom analyzer: n-gram tokenizer followed by lowercasing.
                    .Custom("ngram_analyzer", na => na
                        .Tokenizer("ngram_tokenizer")
                        .Filters("lowercase"))
                )
                .Tokenizers(tz => tz
                    .NGram("ngram_tokenizer", td => td
                        .MinGram(4)
                        .MaxGram(5)
                        // Character classes that are kept inside a token.
                        .TokenChars(
                            TokenChar.Letter,
                            TokenChar.Digit,
                            TokenChar.Punctuation,
                            TokenChar.Symbol
                        )
                    )
                )
            )
        )
        .Map<CustomControl>(m => m.AutoMap())
    );

About 100 fewer documents get indexed when I use the analyzer. When I changed MinGram to 2 and MaxGram to 20, even fewer documents were indexed.

Elasticsearch version used: 7.6.2

I should add that the code for indexing documents is the same for both index configurations, so I'm fairly certain that the analyzer is what causes the difference.

Thank you for any suggestions.

forloop · July 12, 2021, 7:10am

Hi @Mario_Klisanic, a couple of things to check:

How are you counting the number of documents? Is counting performed after the index has been refreshed?

Is the index response successful for each indexed document?
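
Both checks can be done in a few lines. The following is only a minimal sketch, assuming NEST 7.x; the `IndexingChecks` class, the `IndexAndCount` method, and the `documents` parameter are hypothetical names introduced for illustration and do not come from the original code:

```csharp
using System;
using System.Collections.Generic;
using Nest;

public static class IndexingChecks
{
    // Hypothetical helper (not part of NEST): indexes each document,
    // reports failed requests, then refreshes the index and counts it.
    public static long IndexAndCount<T>(
        IElasticClient client, string indexName, IEnumerable<T> documents)
        where T : class
    {
        foreach (var document in documents)
        {
            var indexResponse = client.Index(document, i => i.Index(indexName));

            // IsValid is false when Elasticsearch rejects the request,
            // for example with an HTTP 400 response.
            if (!indexResponse.IsValid)
            {
                Console.WriteLine(
                    $"Index request failed: {indexResponse.ServerError?.Error?.Reason}");
            }
        }

        // Newly indexed documents only become visible to search and count
        // requests after a refresh.
        client.Indices.Refresh(indexName);

        var countResponse = client.Count<T>(c => c.Index(indexName));
        return countResponse.Count;
    }
}
```

With the names from the question, a call might look like `IndexingChecks.IndexAndCount(_elasticJsonClient, "jsonindex", documents)`. If the returned count is lower than the number of documents passed in, the logged failure reasons should show which requests were rejected and why.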

Mario_Klisanic · July 12, 2021, 7:47am

Okay, this is a little embarrassing: it looks like I just needed to refresh the index, which I had completely forgotten about. After refreshing, the number of documents is the same for all settings. Thank you for your time.

Mario_Klisanic · July 13, 2021, 6:18pm

Hey @forloop, I've just noticed something strange, which may have been my problem in the first place. When I set MaxNGramDiff to 7, for example, with MinGram and MaxGram at 3 and 9 respectively, only around 200 documents are indexed. I've refreshed the index a couple of times, so I'm sure that number is correct. On the other hand, when I set MinGram and MaxGram to 4 and 5, all of my documents (around 500) are indexed. What could be the reason for this behavior?

Edit: I'm getting 400 responses for some of my .Index requests, with the reason "input automaton is too large 1001". That is why fewer documents end up indexed. The strings I'm indexing are closer to paragraph length (some have 1000+ characters). Is it possible to index long strings with the settings I mentioned above?
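
To get a feel for how much the MinGram/MaxGram range affects the amount of analysis output, the arithmetic below counts the n-grams produced from one unbroken run of token characters. It is only a sketch: `NGramMath` and `CountNGrams` are hypothetical names, and the calculation assumes the n-gram tokenizer emits every substring of each length between MinGram and MaxGram within a run. It does not by itself explain the "input automaton is too large" error; it only shows how quickly the token count grows with a wider range.

```csharp
using System;

public static class NGramMath
{
    // Number of n-grams produced from one unbroken run of `length`
    // token characters: for each gram size n there are (length - n + 1).
    public static long CountNGrams(int length, int minGram, int maxGram)
    {
        long total = 0;
        for (int n = minGram; n <= Math.Min(maxGram, length); n++)
        {
            total += length - n + 1;
        }
        return total;
    }

    public static void Main()
    {
        // A 10-character word under the three settings mentioned in this thread.
        Console.WriteLine(CountNGrams(10, 4, 5));  // 13
        Console.WriteLine(CountNGrams(10, 3, 9));  // 35
        Console.WriteLine(CountNGrams(10, 2, 20)); // 45
    }
}
```

For the same 10-character word, the 3 to 9 range yields almost three times as many tokens as the 4 to 5 range, and the effect adds up across every word in a paragraph-length string.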

system · August 10, 2021, 6:19pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
