Deduplicating, stop word removal, and lowercasing for an App Search document

Is there an option or fine-tuning setting in App Search to meet the following requirements for a document indexed in App Search?

  1. Deduplicating words in a given document.
  2. Stop words removal.
  3. Storing everything as lowercase.

Currently, we observe that none of this happens by default. We verified this in the following ways:

  1. Search call
    {{domain}}/api/as/v1/engines/{{engine}}/search
    I can see that the raw result contains duplicate words and all of the stop words.

  2. GetById call
    {{domain}}/api/as/v1/engines/{{engine}}/documents/?ids = someId


You'd need to build something to do that before you index your data to App Search. We don't provide anything that would do that for you automatically.
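
For illustration, here is a minimal sketch of what such a pre-indexing step might look like. The `preprocess` function, the stop-word list, and the sample document are all hypothetical and not part of App Search; a real pipeline would likely use a fuller, language-specific stop-word list.

```python
import re

# Hypothetical stop-word list for illustration only; not App Search's own list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def preprocess(text: str) -> str:
    """Lowercase the text, drop stop words, and keep only the first
    occurrence of each remaining word (original order preserved)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    seen = set()
    kept = []
    for tok in tokens:
        if tok in STOP_WORDS or tok in seen:
            continue
        seen.add(tok)
        kept.append(tok)
    return " ".join(kept)

# Clean the field before sending the document to the indexing API.
doc = {"id": "book-1", "body": "The Cat and the Hat: the cat sat in THE hat"}
doc["body"] = preprocess(doc["body"])
print(doc["body"])  # cat hat sat
```

Note that the cleaned text is what would be stored and returned in search results, so this approach changes what users see, not just what is searchable.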

Just curious, what are you trying to achieve by doing any of that?

Hi @JasonStoltz, we know that there are prebuilt analyzers in Elasticsearch (see the Elasticsearch analysis guide).

Does any of this analysis happen in App Search by default?
For example, if we index a book's content into App Search, we would not want duplicated words or stop words to be indexed.

I don't know off the top of my head exactly how our analyzers are set up, but they almost certainly handle stop words and casing. Picking the correct "Language" option for your Engine will help.

I don't think duplicated terms are something you necessarily need to worry about. The frequency with which a term appears in a particular field is a signal that can affect relevance: if a term appears 200 times in the body of one document and only once in another, the first document may be considered more relevant. Additionally, I'm pretty sure that because Elasticsearch uses an inverted index (you can google that one), duplicated terms don't have any detrimental impact on your index.
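
To make the inverted index idea concrete, here is a toy sketch. It is not how Elasticsearch is implemented internally, and the documents and variable names are made up for illustration. Each distinct term gets one entry, and repeated occurrences only raise a per-document count that can feed into relevance.

```python
from collections import Counter, defaultdict

# Made-up sample documents.
docs = {
    "doc1": "search search search engine",
    "doc2": "search engine basics",
}

# Toy inverted index: term -> {doc_id: term frequency in that document}.
inverted_index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, freq in Counter(text.lower().split()).items():
        inverted_index[term][doc_id] = freq

# "search" is a single entry in the index even though doc1 repeats it.
print(dict(inverted_index["search"]))  # {'doc1': 3, 'doc2': 1}
```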

I'm not an expert, though. You might have better luck asking about analyzers in the Elasticsearch discuss group.

The reason I asked the OP what they are trying to achieve is that I don't think this is something they need to worry about. When you index a document, its content is analyzed and indexed for search, and the raw content of the document is also stored, but separately. Just because they see the raw document in their search response doesn't mean stop words are being searched.
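
The distinction between stored content and indexed terms can be sketched with a toy engine. Everything here (`ToyEngine`, `analyze`, the stop-word list) is hypothetical and only mimics the general idea; it does not reflect App Search's actual analyzer configuration.

```python
# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "of"}

def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

class ToyEngine:
    def __init__(self):
        self.stored = {}  # doc_id -> raw content, returned in results
        self.index = {}   # analyzed term -> set of doc_ids, used for matching

    def add(self, doc_id, body):
        self.stored[doc_id] = body  # raw text is kept untouched
        for term in analyze(body):
            self.index.setdefault(term, set()).add(doc_id)

    def search(self, query):
        hits = set()
        for term in analyze(query):
            hits |= self.index.get(term, set())
        return [self.stored[d] for d in sorted(hits)]

engine = ToyEngine()
engine.add("1", "The Return of the King")
print(engine.search("KING"))  # ['The Return of the King']
print(engine.search("the"))   # []
```

The response shows the raw title with its stop words and original casing, yet the stop word "the" is not in the index and matches nothing.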
