Fewer documents indexed when using custom analyzer (NEST client)

Hello everyone,

While working on a project (using the NEST client for .NET), I noticed that the total number of documents indexed with a custom analyzer differs from the number indexed without one.

My code for creating an index WITHOUT an analyzer looks like this:

    var createIndexResponse = _elasticJsonClient.Indices.Create("jsonindex", c => c
        .Map<CustomControl>(m => m.AutoMap())
    );

My code for creating an index WITH an analyzer looks like this:

    var createIndexResponse = _elasticJsonClient.Indices.Create("jsonindex", c => c
        .Settings(st => st
            .Setting(UpdatableIndexSettings.MaxNGramDiff, 18)
            .Analysis(an => an
                .Analyzers(anz => anz
                    .Custom("ngram_analyzer", na => na
                        .Tokenizer("ngram_tokenizer")
                        .Filters("lowercase")
                    )
                )
                .Tokenizers(tz => tz
                    .NGram("ngram_tokenizer", td => td
                        .MinGram(4)
                        .MaxGram(5)
                        .TokenChars(
                            TokenChar.Letter,
                            TokenChar.Digit,
                            TokenChar.Punctuation,
                            TokenChar.Symbol
                        )
                    )
                )
            )
        )
        .Map<CustomControl>(m => m.AutoMap())
    );

About 100 fewer documents get indexed when I use the analyzer. When I changed MinGram to 2 and MaxGram to 20, even fewer documents were indexed.

Elasticsearch version used: 7.6.2

The code for indexing documents is the same for both index configurations, so I'm fairly certain that the difference in behavior is caused by adding the analyzer.

Thank you for any suggestions.

Hi @Mario_Klisanic, a couple of things to check:

  1. How are you counting the number of documents? Is counting performed after the index has been refreshed?
  2. Is the index response successful for each indexed document?
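
Both checks can be sketched with NEST 7.x roughly as follows. This is a minimal sketch, not the original poster's code: it assumes `client` is an already configured `IElasticClient`, and the `IndexingChecks` class, the `IndexAndCount` helper, and the `documents` parameter are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using Nest;

public static class IndexingChecks
{
    // Hypothetical helper: indexes each document, logs any failed response,
    // then refreshes the index so the count reflects everything indexed so far.
    public static long IndexAndCount<T>(
        IElasticClient client, string indexName, IEnumerable<T> documents)
        where T : class
    {
        foreach (var doc in documents)
        {
            var indexResponse = client.Index(doc, i => i.Index(indexName));

            // IsValid is false for non-success responses (for example HTTP 400);
            // DebugInformation includes the server-side error reason.
            if (!indexResponse.IsValid)
            {
                Console.WriteLine(indexResponse.DebugInformation);
            }
        }

        // Newly indexed documents only become visible to count/search after a refresh.
        client.Indices.Refresh(indexName);

        var countResponse = client.Count<T>(c => c.Index(indexName));
        return countResponse.Count;
    }
}
```

Called as `IndexingChecks.IndexAndCount(_elasticJsonClient, "jsonindex", documents)`, this should surface rejected documents at the point they fail rather than only as a lower total.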

Okay, this is a little embarrassing. It looks like I just needed to refresh the index; I totally forgot about that. After refreshing, I saw that the number of documents is the same for all settings. Thank you for your time.

Hey @forloop, I've just noticed something strange, which may have been my problem in the first place. When I set MaxNGramDiff to 7, with MinGram and MaxGram at 3 and 9 respectively, only around 200 documents are indexed. I've refreshed the index a couple of times, so I'm sure that number is correct. On the other hand, when I set MinGram and MaxGram to 4 and 5, all of my documents (around 500) are indexed. What could be the reason for this behavior?

Edit: I'm getting 400 responses for some of my .Index requests. The reason given is: input automaton is too large 1001. That explains why fewer documents are indexed. The strings I'm indexing are closer to paragraph length (some have 1000+ characters). Is it possible to index long strings with the settings I mentioned above?
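
For a sense of scale, the number of tokens an n-gram tokenizer emits for one unbroken run of token characters can be counted directly: a run of length L yields L - n + 1 grams for each gram size n between MinGram and MaxGram. The sketch below is only arithmetic on the settings mentioned above, not a diagnosis of the `input automaton is too large` error. The `NGramMath` class and `CountNGrams` method are hypothetical names, and the count applies per run, since characters outside the configured TokenChars (such as whitespace) split the text into separate runs.

```csharp
using System;

public static class NGramMath
{
    // Counts the n-grams produced from a single run of `runLength` token
    // characters, for every gram size from minGram to maxGram inclusive.
    public static long CountNGrams(int runLength, int minGram, int maxGram)
    {
        long total = 0;
        for (int n = minGram; n <= maxGram; n++)
        {
            // A run shorter than n contributes no grams of size n.
            total += Math.Max(0, runLength - n + 1);
        }
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(CountNGrams(10, 4, 5)); // prints 13
        Console.WriteLine(CountNGrams(10, 3, 9)); // prints 35
    }
}
```

For a 10-character run, widening the range from 4..5 to 3..9 roughly triples the token count, and the gap grows with longer runs and wider ranges.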