I may be misunderstanding the question, but removing stop words is something that happens during indexing, not after. What you're looking at in Kibana is the original JSON source, not the indexed tokens that the analysis process created. Some docs that may help: Text analysis | Elasticsearch Guide [8.11] | Elastic
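To make the source-vs-tokens distinction concrete, here is a minimal plain-Python sketch (no Elasticsearch needed) of what the analysis chain does at index time. The tokenizer and stop list here are crude stand-ins, not Elasticsearch's actual implementation:

```python
# Rough simulation of index-time analysis: the stored _source keeps the
# original text untouched, while the inverted index holds processed tokens.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # small subset of an English stop list

def analyze(text):
    """Crude stand-in for a standard tokenizer plus a stop-token filter."""
    tokens = [t.lower().strip(".,!?") for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

source = "The quick brown fox jumps over the lazy dog"
tokens = analyze(source)
print(source)   # what Kibana shows you: the untouched _source
print(tokens)   # roughly what the index matches against: stop words gone
```

Kibana's Discover view shows you the `source` string; the `tokens` list is the kind of thing only the index ever sees.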
Choice of tokenizers and/or analyzers is defined in the index mapping.
The docs I linked to show the REST APIs that help you test and define mappings.
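For example (a sketch in Kibana Dev Tools style; the index and field names are placeholders), the `_analyze` API shows you the tokens an analyzer would emit, and a mapping pins an analyzer to a field:

```
POST _analyze
{
  "analyzer": "stop",
  "text": "The quick brown fox"
}

PUT my-index
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "english" }
    }
  }
}
```

The first call returns the token stream with stop words removed, which is a quick way to test analyzer choices before committing them to a mapping.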
You can use any client of your choice (curl, Perl, Python, Logstash, Ruby, ...) to talk to the Elasticsearch REST API.
If you create indices every day with similar mappings, you can define those mappings once in a template that automatically applies to any new index whose name matches the template's index pattern.
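A sketch of such a template (the template name, pattern, and field names are placeholders), using the composable index template API:

```
PUT _index_template/pdf-template
{
  "index_patterns": ["pdf-*"],
  "template": {
    "mappings": {
      "properties": {
        "content": { "type": "text", "analyzer": "english" }
      }
    }
  }
}
```

Any index created afterwards with a name like `pdf-2024-01-15` would pick up this mapping automatically.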
So while indexing the PDFs I need to define the mappings with analyzers to make the text usable for further processing (calling via API after indexing, for further analysis with a third-party library).
In addition, I should also store the pages of the PDF, as I need them for full-text search.
Can this be achieved with a single mapping?
Not necessarily. If the PDFs are insanely large then maybe you'd have to resort to separate docs, but I'd expect a single string for the text and maybe separate strings for any structured fields Tika might give you, e.g. filename and author.
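The doc shape being suggested would look something like this (field names are illustrative; use whatever metadata Tika actually returns for your files):

```json
{
  "filename": "report.pdf",
  "author": "Jane Doe",
  "content": "The full plain text that Tika extracted from the PDF ..."
}
```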
I need to perform both full-text search and text transformation/analytics, so in each document I need to store both the tokenized/analysed text and the content of the PDF/page to retrieve while searching.
1) The original binary PDF file, markup/font choices and all.
2) The plain-text string held as a field in a JSON doc.
3) The individual words of the text stored as tokens inside a search index.
Your parsing app uses a tool like Tika to make 2) from 1). It sends 2) to Elasticsearch.
Elasticsearch stores 2) and uses the analyzer chosen in the index mapping to create and store 3).
Elasticsearch doesn't do anything with 1), or with other 1-like document formats such as Word, Excel, PowerPoint, etc. It works only with 2), i.e. JSON.
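The 1) → 2) → 3) flow above can be sketched as a small Python script. This assumes the `tika` and `elasticsearch` Python packages and a local Elasticsearch node; the index name, file path, and metadata keys are placeholders, and `build_doc` is a hypothetical helper, not part of either library:

```python
def build_doc(content, metadata):
    """Shape the JSON doc (form 2) that Elasticsearch will store,
    and whose text field its analyzer will turn into tokens (form 3)."""
    return {
        "content": (content or "").strip(),
        "filename": metadata.get("resourceName"),  # metadata keys depend on what Tika returns
        "author": metadata.get("Author"),
    }

if __name__ == "__main__":
    from tika import parser              # makes 2) from 1): binary PDF -> plain text
    from elasticsearch import Elasticsearch

    parsed = parser.from_file("report.pdf")
    doc = build_doc(parsed.get("content"), parsed.get("metadata", {}))

    es = Elasticsearch("http://localhost:9200")
    es.index(index="pdf-docs", document=doc)  # Elasticsearch builds 3) via the mapping's analyzer
```

Note the division of labour: the script never touches tokens; it only ships JSON, and the analyzer configured in the mapping does the rest.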