Constructing custom analyser for full-text queries

Hi everyone. I'm a little confused about the general flow of constructing custom analyzers and mappings, both at document indexing time and at query time.

The reason I'm asking all of this is that I'm trying to construct my own analyzer using n-grams/shingles as token filters, along with other character filters, so that I can perform full-text queries even if the user's input is full of typos or local slang. The queries would therefore be at the phrase/sentence level, where word order might be very important!

  1. How does the "analyzer" work? I often see tokenizers and filters being specified at the same level as the "analyzer" in the settings, yet within the "analyzer" itself there is already a nested tokenizer and filter specified.

  2. Mappings come after the settings block. I'm confused as to why I often see articles saying one can specify an analyzer within the mapping. Is this necessary even after creating my custom analyzer earlier in the settings?

  3. Is it correct to say that the above custom analyzer will therefore automatically be used when I add documents to Elasticsearch, meaning the analyzer is used at index time? If not, do I have to specify something when adding the documents?

  4. Does this custom_analyzer automatically apply to my search queries as well?

  5. I read in the documentation that if I were to use a shingle token filter, I shouldn't apply a stop words token filter. Could someone advise me on this?

Thank you in advance!

  1. An analyzer is just a container for a tokenizer, a list of token filters, and optionally char filters. The combination of those defines the uniqueness of an analyzer (see the sketch after this list).

  2. The order of mappings or settings when creating an index does not matter. The mapping defines how certain fields within documents should be indexed. If that is a string field, you can define how those strings should be stored in the inverted index based on the analyzer and its configuration.

  3. If you configure your custom analyzer in any field in your mapping, it is used exactly for that field. If you only configure an analyzer in your settings, but you do not apply it to any field, it will not be used.

  4. Unless configured otherwise in the mapping (or in the query), index and search analyzers will be the same.

  5. Where did you read that? Let's see if we can improve the docs if it is not clear.
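Putting points 1-4 together, here is a minimal sketch of what that could look like (the index and field names are made up, and this assumes 7.x-style typeless mappings): a custom analyzer is defined under the index settings, and the mapping then applies it to a specific text field.

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
```

Because no separate search_analyzer is configured on the description field, the same my_custom_analyzer is used both when documents are indexed into that field and when query text is analyzed against it.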

--Alex

Thanks for your advice, Alex. Analyzers are just containers, and mappings are used to specify whether and how a certain field should be analyzed. Understood! Just a few more clarifications here, if I may.

So here's the problem I'm trying to solve: I have thousands of documents, and the user inputs a query as a phrase/sentence that possibly contains typos and colloquial language.

Do correct me if I'm wrong. Here's my thought process if I use an n-gram token filter:

custom_analyzer at index time (when adding documents) with this sequence:
lowercase filter -> n-gram token filter

at query time:
match_phrase: sentence_input
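
Roughly what I have in mind is something like this (made-up index/field names and placeholder gram sizes, just to illustrate the flow):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "content": "sentence input with possible typos"
    }
  }
}
```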

Should my query be analyzed by the above analyzer?
If yes, that means I'm checking how many of a document's indexed n-grams match the query's n-grams, am I right? Does this mean that I should use a bool query or a match_phrase query?

Should a stop words token filter be added at both index time and query time? This might also affect the words being queried, since some may contain typos and colloquial language.

Thank you and pardon my ignorance!
