Deduplicating, Stop words removal and Lowercasing for an Appsearch document

Balaji_Sudharsanam · October 13, 2021, 8:17am

Is there an option or setting in App Search that applies the following to a document indexed in App Search?

Deduplicating words in a given document.
Stop words removal.
Storing everything as lowercase.

As far as we can observe, none of this happens by default. We verified it in the following ways:

Search call
{{domain}}/api/as/v1/engines/{{engine}}/search
The raw result still contains the duplicate words and all of the stop words.
GetById call
{{domain}}/api/as/v1/engines/{{engine}}/documents/?ids = someId

JasonStoltz · October 13, 2021, 1:00pm

You'd need to build something to do that before you index your data to App Search. We don't provide anything that would do that for you automatically.
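
As a rough illustration of what such a pre-indexing step could look like, here is a minimal Python sketch. The `preprocess` function and the `STOP_WORDS` set are hypothetical names made up for this example, the stop-word list is deliberately tiny, and nothing here is part of App Search itself.

```python
import re

# Hypothetical, deliberately small stop-word list; swap in one that fits your language.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def preprocess(text: str) -> str:
    """Lowercase the text, drop stop words, and keep only the first occurrence of each word."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    seen = set()
    kept = []
    for token in tokens:
        if token in STOP_WORDS or token in seen:
            continue
        seen.add(token)
        kept.append(token)
    return " ".join(kept)


print(preprocess("The cat and the Cat sat in the hat"))  # cat sat hat
```

You would run each text field through something like this and then send the result to the documents API. Keep in mind that it is lossy: the original casing, word order around removed words, and repeat counts are gone from what you store.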

Just curious, what are you trying to achieve by doing any of that?

Subhasis_Dash · October 13, 2021, 1:27pm

Hi @JasonStoltz, we know that Elasticsearch has prebuilt analyzers (see the Elasticsearch analysis guide).

Does any of this analysis happen in App Search by default?
For example, if we index a book's content into App Search, we would not want duplicated words or stop words to be indexed.

JasonStoltz · October 13, 2021, 3:30pm

I don't know off the top of my head the exact setup we have for analyzers, but they almost certainly handle stop words and casing. Picking the correct "Language" option for your Engine will help.

I don't think duplicated terms are something you necessarily need to worry about. The frequency with which a term appears in a particular field is a signal that can determine relevance, meaning that if a term appears 200 times in the body of one document and only once in another, the first document may be considered more relevant. Additionally, I'm pretty sure that since Elasticsearch uses an Inverted Index (you can google that one), duplicated terms don't have any sort of detrimental impact on your index.
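
To make the inverted index idea concrete, here is a toy Python sketch. It is a simplified model and not how Elasticsearch is actually implemented (real postings also carry positions, and scoring is more involved than a raw count); `build_inverted_index` and the sample documents are made up for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}. Each term gets one entry per document."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


# Hypothetical sample documents.
docs = {
    "doc1": "whale whale whale ship",
    "doc2": "whale harbor",
}
index = build_inverted_index(docs)

print(index["whale"])  # {'doc1': 3, 'doc2': 1}
print(len(index))      # 3 distinct terms: whale, ship, harbor
```

Repeating "whale" three times in doc1 does not create three entries. It only raises the count stored for that term and document, and that count is the kind of signal relevance scoring can use.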

I'm not an expert though, you might have better luck inquiring about analyzers in the Elasticsearch discuss group.

JasonStoltz · October 13, 2021, 3:34pm

The reason I asked the OP what they are trying to achieve is because I don't think it's something that they need to worry about. When you index a document, a document is indexed and the raw content of that document is stored as well, but separately. Just because they see the raw document in their search response doesn't mean stop words are being searched.
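
Here is a small Python sketch of that separation, under the assumption that the analyzer drops stop words. `TinyEngine` and `STOP_WORDS` are hypothetical and only model the idea; they are not App Search or Elasticsearch code.

```python
# Hypothetical stop-word list used by this toy analyzer.
STOP_WORDS = {"the", "a", "of"}


class TinyEngine:
    """Toy engine that keeps the raw document and the searchable terms separately."""

    def __init__(self):
        self.stored = {}  # doc_id -> raw text, returned as-is in results
        self.index = {}   # analyzed term -> set of doc_ids

    def add(self, doc_id, text):
        self.stored[doc_id] = text
        for term in text.lower().split():
            if term not in STOP_WORDS:
                self.index.setdefault(term, set()).add(doc_id)

    def search(self, query):
        ids = self.index.get(query.lower(), set())
        return [self.stored[doc_id] for doc_id in sorted(ids)]


engine = TinyEngine()
engine.add("1", "The Call of the Wild")

print(engine.search("wild"))  # ['The Call of the Wild']
print(engine.search("the"))   # []
```

The hit for "wild" comes back with its stop words and original casing because the raw text is what gets returned, while a search for "the" matches nothing because that term never made it into the index.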

system · November 10, 2021, 3:35pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
