Analyzers at index time and search time do not match

Hi, I am trying to use synsets in Elasticsearch to build a pseudo QA system.

My mapping file looks like this:
 
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synset": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer",
            "synonym",
            "english_stop"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "synonym": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms_path": "analysis/wn_s.pl"
        },
        "my_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      }
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "text_fields": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "synset",
            "copy_to": "unified_field"
          }
        }
      }
    ],
    "properties": {
      "unified_field": {
        "type": "text"
      }
    }
  }
}

When I test the analyzer on sample text, I get the result I expect, i.e., no stop words in the output.
For example (only part of the output is attached because of the character limit):

{
  "field": "230",
  "text": "Persepolis was the ritual center of the ancient kingdom of Achaemenids, and the figures at Persepolis remain bound by the rules of grammar and syntax of visual language."
}

The output is:

{
    "tokens": [
        {
            "token": "persepoli",
            "start_offset": 0,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "wa",
            "start_offset": 11,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "washington",
            "start_offset": 11,
            "end_offset": 14,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "evergreen",
            "start_offset": 11,
            "end_offset": 14,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "state",
            "start_offset": 15,
            "end_offset": 18,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "ritual",
            "start_offset": 19,
            "end_offset": 25,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "rite",
            "start_offset": 19,
            "end_offset": 25,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "ritualis",
            "start_offset": 19,
            "end_offset": 25,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "ceremoni",
            "start_offset": 19,
            "end_offset": 25,
            "type": "SYNONYM",
            "position": 3
        }
     
    ]
}
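
For comparison, the same check can be run against unified_field, the field the query targets, instead of field "230". This is only a sketch of the request body, with the sample text shortened, and no output has been captured for it. The mapping declares unified_field as type text with no analyzer setting, so this request would show which analyzer that field really uses, which may differ from synset.

```json
{
  "field": "unified_field",
  "text": "Persepolis was the ritual center of the ancient kingdom of Achaemenids"
}
```

If tokens such as "the" and "of" appear in that output, stop words are not being removed for unified_field.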

But at query time, a lot of junk and stop words get highlighted, and the highlighting is inconsistent.

My query:

{
  "from": 0,
  "size": 10,
  "query": {
    "query_string": {
      "query": "Timur established the Timurid Empire in Iran in what year",
      "fields": ["unified_field"]
    }
  },
  "highlight": {
    "require_field_match": "false",
    "order": "score",
    "fields": {
      "*": {}
    }
  },
  "_source": "false"
}
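
To see which terms this query string is actually parsed into at search time, the query part can also be sent to the _validate/query endpoint with explain=true. The body below is only a sketch; the index name nlpproject2 used in the note after it is taken from the _index value in the response.

```json
{
  "query": {
    "query_string": {
      "query": "Timur established the Timurid Empire in Iran in what year",
      "fields": ["unified_field"]
    }
  }
}
```

Sent as GET nlpproject2/_validate/query?explain=true, the returned explanation shows the query that was built, so it reveals whether stop words such as "the" and "in" survive search-time analysis on unified_field.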
    

The response (only part of it is attached because of the character limit):

{
    "took": 2454,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 30,
            "relation": "eq"
        },
        "max_score": 15.215088,
        "hits": [
            {
                "_index": "nlpproject2",
                "_type": "_doc",
                "_id": "400",
                "_score": 15.215088,
                "_source": {},
                "highlight": {
                    "0": [
                        "<em>Iran</em> (/aɪˈræn/ or <em>i</em>/ɪˈrɑːn/; Persian: Irān – ایران\u200e\u200e [ʔiːˈɾɒːn] ( listen)), also known as <em>Persia</em> (/ˈpɜːrʒə",
                        "/ or /<em>ˈpɜːrʃə</em>/), officially the <em>Islamic Republic of Iran</em> (جمهوری اسلامی ایران – Jomhuri ye Eslāmi ye"
                    ],
                    "3": [
                        "With 78.4 million inhabitants, <em>Iran</em> is the <em>world's</em> 17th-most-populous country."
                    ],
                    "5": [
                        "<em>Iran</em> has long <em>been</em> of geostrategic importance because of its central location in Eurasia and Western"
                    ],
                    "6": [
                        "<em>Iran</em> is home <em>to</em> one of the world's oldest civilizations, beginning with the formation of the Proto-Elamite"
                    ],
                    "8": [
                        "<em>Iran</em> reached the <em>pinnacle</em> of its power during the Achaemenid Empire founded by Cyrus the Great in 550"
                    ],
                    "10": [
                        "Parthian Empire emerged from the ashes and was succeeded by the Sassanid Dynasty in 224 AD, under which <em>Iran</em>",
                        "again became <em>one</em> of the leading powers in the world, along with the Roman-Byzantine Empire, for a period"
                    ],
                    "11": [
                        "In 633 AD, Rashidun Arabs invaded <em>Iran</em> and conquered <em>it</em> by 651 AD, largely converting Iranian people"
                    ],
                    "13": [
                        "<em>Iran</em> became a <em>major</em> contributor to the Islamic Golden <em>Age</em>, producing many influential scientists, scholars"
                    ],
                    "14": [
                        "people from Sunni Islam to Twelver Shia Islam, and made Twelver Shia Islam the official religion of <em>Iran</em>"
                    ],
                    "15": [
                        "Safavid conversion of <em>Iran</em> from Sunnism <em>to</em> Shiism marked one of the most important turning points in"
                    ],
                    "16": [
                        ", briefly possessing <em>what</em> was arguably the most powerful empire at the time.",
                        "Starting in 1736 under Nader Shah, <em>Iran</em> <em>reached</em> its <em>greatest</em> territorial extent since the Sassanid Empire"
                    ],
                    "17": [
                        "of the concept of <em>Iran</em> for centuries, <em>to</em> neighboring Imperial Russia.",
                        "During the 19th century, <em>Iran</em> irrevocably lost <em>swaths</em> of its territories in the Caucasus which made part"
                    ],
                    "19": [
                        "Following a coup d'état instigated by the U.K. and the U.S. in 1953, <em>Iran</em> gradually became <em>close</em> allies"
                    ],
                    "21": [
                        "Tehran is <em>the</em> country's capital and largest city, as well as its leading cultural and economic center"
                    ],
                    "22": [
                        "<em>Iran</em> is a <em>major</em> regional and middle power, exerting considerable influence in international energy security"
                    ],
                    "24": [
                        "The term <em>Iran</em> derives directly <em>from</em> Middle Persian Ērān, first attested in a 3rd-century inscription"
                    ],
                    "27": [
                        "Historically, <em>Iran</em> has been <em>referred</em> to as <em>Persia</em> by the <em>West</em>, due mainly to the writings of Greek historians",
                        "who called <em>Iran</em> Persis (Greek: <em>Περσίς</em>), meaning \"land of the Persians.\""
                    ],
                    "31": [
                        "<em>Iran</em>.",
                        "In 1935, Reza Shah requested <em>the</em> international community to refer to the country by its native name,"
                    ],
                    "32": [
                        "the Persian New <em>Year</em>, Nowruz, March 21, 1935, substituted <em>Iran</em> for <em>Persia</em> <em>as</em> the <em>official</em> name of the",
                        "explained at the time, \"At the suggestion of the Persian Legation in Berlin, the Tehran government, <em>on</em>"
                    ],
                    "33": [
                        "and <em>Iran</em> <em>interchangeably</em>.",
                        "the decision, and Professor Ehsan Yarshater, editor of Encyclopædia Iranica, propagated a move to use <em>Persia</em>"
                    ],
                    "34": [
                        "Today, both <em>Persia</em> and <em>Iran</em> <em>are</em> used <em>in</em> cultural contexts; although, <em>Iran</em> is the <em>name</em> used officially"
                    ],
                    "35": [
                        ", attest to a human presence in <em>Iran</em> since the <em>Lower</em> Paleolithic era, c. 800,000–200,000 BC.",
                        "The earliest archaeological artifacts in <em>Iran</em>, like those <em>excavated</em> at the Kashafrud and Ganj Par sites"
                    ],
                    "37": [
                        ", as well <em>as</em> Susa and Chogha Mish developing in and around the Zagros region.",
                        "millennium BC, early agricultural communities such as Chogha Golan and Chogha Bonut began to flourish in <em>Iran</em>"
                    ],
                    "40": [
                        "During the Bronze <em>Age</em>, <em>Iran</em> was home <em>to</em> several civilizations including Elam, Jiroft, and Zayande River"
                    ],
                    "41": [
                        "<em>in</em> Mesopotamia.",
                        "Elam, the most prominent of these civilizations, developed in the southwest of <em>Iran</em>, alongside those"
                    ],
                    "46": [
                        "and the <em>eastern</em> Anatolia.",
                        "single ruler in 728 BC led to the foundation of the Median Empire which, by 612 BC, controlled the whole <em>Iran</em>"
                    ],
                    "49": [
                        "The conquest of Media was a result of <em>what</em> is called the Persian Revolt."
                    ],
                    "52": [
                        "At its greatest extent, the Achaemenid Empire included the modern territories of <em>Iran</em>, Azerbaijan, Armenia",
                        ", <em>Georgia</em>, Turkey, much of the Black Sea coastal regions, northeastern Greece and southern Bulgaria ("
                    ],
                    "57": [
                        "<em>Furthermore</em>, one of the Seven Wonders of the Ancient World, the Mausoleum at Halicarnassus, was built"
                    ],
                    "59": [
                        "Following the premature death of Alexander, <em>Iran</em> came under <em>the</em> control of the Hellenistic Seleucid Empire"
                    ],
              
                    "65": [
                        "The prolonged and gradual process of the Islamization of <em>Iran</em> began following <em>the</em> conquest."
                    ]
           
        

My questions:

  1. Why are stop words like "the", "in", etc. getting highlighted? Aren't the search and index analyzers the same by default?
  2. Also, since stop words are not indexed because of the custom analyzer, why do they show up in the highlights?
  3. If you look closely at the highlighting results, stop words are highlighted in some cases but not in others.
    For example:
"65": [
                        "The prolonged and gradual process of the Islamization of <em>Iran</em> began following <em>the</em> conquest."
                    ],

Here, two occurrences of "the" are not highlighted, whereas one is.
Why is there such an inconsistency?

Hoping for a quick response.
Thanks in advance.
