Hi, I am trying to use WordNet synsets in Elasticsearch to build a pseudo QA system.
My mapping looks like this:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synset": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer",
            "synonym",
            "english_stop"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "synonym": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms_path": "analysis/wn_s.pl"
        },
        "my_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      }
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "text_fields": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "synset",
            "copy_to": "unified_field"
          }
        }
      }
    ],
    "properties": {
      "unified_field": {
        "type": "text"
      }
    }
  }
}
When I test the analyzer on a sample text, I get the expected result, i.e. the output contains no stop words.
Example request (the output further down is truncated because of the character limit):
{
  "field": "230",
  "text": "Persepolis was the ritual center of the ancient kingdom of Achaemenids, and the figures at Persepolis remain bound by the rules of grammar and syntax of visual language."
}
The output is:
{
  "tokens": [
    {
      "token": "persepoli",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "wa",
      "start_offset": 11,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "washington",
      "start_offset": 11,
      "end_offset": 14,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "evergreen",
      "start_offset": 11,
      "end_offset": 14,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "state",
      "start_offset": 15,
      "end_offset": 18,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "ritual",
      "start_offset": 19,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "rite",
      "start_offset": 19,
      "end_offset": 25,
      "type": "SYNONYM",
      "position": 3
    },
    {
      "token": "ritualis",
      "start_offset": 19,
      "end_offset": 25,
      "type": "SYNONYM",
      "position": 3
    },
    {
      "token": "ceremoni",
      "start_offset": 19,
      "end_offset": 25,
      "type": "SYNONYM",
      "position": 3
    }
  ]
}
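One detail in this output may be relevant: the synonym token "state" carries offsets 15-18, and in the input text those offsets cover the word "the". The Python sketch below is one way to check a full _analyze output for tokens whose offsets point at a stop word in the source text. It is only an illustration: `STOP_WORDS` is a small hand-picked set (an assumption, not Elasticsearch's `_english_` list), and `tokens_over_stop_words` is a hypothetical helper name.

```python
# Hand-picked subset for illustration only; not Elasticsearch's _english_ list.
STOP_WORDS = {"a", "an", "and", "as", "at", "by", "in", "is", "it",
              "of", "on", "the", "to", "was"}


def tokens_over_stop_words(text, tokens):
    """Return (token, covered_text) pairs whose offsets cover a stop word."""
    flagged = []
    for tok in tokens:
        # Slice the original text using the offsets reported by _analyze.
        covered = text[tok["start_offset"]:tok["end_offset"]]
        if covered.lower() in STOP_WORDS:
            flagged.append((tok["token"], covered))
    return flagged


# First part of the sample text and the first tokens from the output above.
text = "Persepolis was the ritual center of the ancient kingdom of Achaemenids"
tokens = [
    {"token": "persepoli", "start_offset": 0, "end_offset": 10},
    {"token": "wa", "start_offset": 11, "end_offset": 14},
    {"token": "washington", "start_offset": 11, "end_offset": 14},
    {"token": "evergreen", "start_offset": 11, "end_offset": 14},
    {"token": "state", "start_offset": 15, "end_offset": 18},
    {"token": "ritual", "start_offset": 19, "end_offset": 25},
]

for token, covered in tokens_over_stop_words(text, tokens):
    print(f"token {token!r} covers source text {covered!r}")
```

On the tokens shown, this reports "wa", "washington" and "evergreen" as covering "was", and "state" as covering "the".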
But at query time, a lot of junk and stop words get highlighted, and the highlighting is inconsistent.
My query:
{
  "from": 0,
  "size": 10,
  "query": {
    "query_string": {
      "query": "Timur established the Timurid Empire in Iran in what year",
      "fields": ["unified_field"]
    }
  },
  "highlight": {
    "require_field_match": "false",
    "order": "score",
    "fields": {
      "*": {}
    }
  },
  "_source": "false"
}
The response (only part of it is shown because of the character limit):
{
"took": 2454,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 30,
"relation": "eq"
},
"max_score": 15.215088,
"hits": [
{
"_index": "nlpproject2",
"_type": "_doc",
"_id": "400",
"_score": 15.215088,
"_source": {},
"highlight": {
"0": [
"<em>Iran</em> (/aɪˈræn/ or <em>i</em>/ɪˈrɑːn/; Persian: Irān – ایران\u200e\u200e [ʔiːˈɾɒːn] ( listen)), also known as <em>Persia</em> (/ˈpɜːrʒə",
"/ or /<em>ˈpɜːrʃə</em>/), officially the <em>Islamic Republic of Iran</em> (جمهوری اسلامی ایران – Jomhuri ye Eslāmi ye"
],
"3": [
"With 78.4 million inhabitants, <em>Iran</em> is the <em>world's</em> 17th-most-populous country."
],
"5": [
"<em>Iran</em> has long <em>been</em> of geostrategic importance because of its central location in Eurasia and Western"
],
"6": [
"<em>Iran</em> is home <em>to</em> one of the world's oldest civilizations, beginning with the formation of the Proto-Elamite"
],
"8": [
"<em>Iran</em> reached the <em>pinnacle</em> of its power during the Achaemenid Empire founded by Cyrus the Great in 550"
],
"10": [
"Parthian Empire emerged from the ashes and was succeeded by the Sassanid Dynasty in 224 AD, under which <em>Iran</em>",
"again became <em>one</em> of the leading powers in the world, along with the Roman-Byzantine Empire, for a period"
],
"11": [
"In 633 AD, Rashidun Arabs invaded <em>Iran</em> and conquered <em>it</em> by 651 AD, largely converting Iranian people"
],
"13": [
"<em>Iran</em> became a <em>major</em> contributor to the Islamic Golden <em>Age</em>, producing many influential scientists, scholars"
],
"14": [
"people from Sunni Islam to Twelver Shia Islam, and made Twelver Shia Islam the official religion of <em>Iran</em>"
],
"15": [
"Safavid conversion of <em>Iran</em> from Sunnism <em>to</em> Shiism marked one of the most important turning points in"
],
"16": [
", briefly possessing <em>what</em> was arguably the most powerful empire at the time.",
"Starting in 1736 under Nader Shah, <em>Iran</em> <em>reached</em> its <em>greatest</em> territorial extent since the Sassanid Empire"
],
"17": [
"of the concept of <em>Iran</em> for centuries, <em>to</em> neighboring Imperial Russia.",
"During the 19th century, <em>Iran</em> irrevocably lost <em>swaths</em> of its territories in the Caucasus which made part"
],
"19": [
"Following a coup d'état instigated by the U.K. and the U.S. in 1953, <em>Iran</em> gradually became <em>close</em> allies"
],
"21": [
"Tehran is <em>the</em> country's capital and largest city, as well as its leading cultural and economic center"
],
"22": [
"<em>Iran</em> is a <em>major</em> regional and middle power, exerting considerable influence in international energy security"
],
"24": [
"The term <em>Iran</em> derives directly <em>from</em> Middle Persian Ērān, first attested in a 3rd-century inscription"
],
"27": [
"Historically, <em>Iran</em> has been <em>referred</em> to as <em>Persia</em> by the <em>West</em>, due mainly to the writings of Greek historians",
"who called <em>Iran</em> Persis (Greek: <em>Περσίς</em>), meaning \"land of the Persians.\""
],
"31": [
"<em>Iran</em>.",
"In 1935, Reza Shah requested <em>the</em> international community to refer to the country by its native name,"
],
"32": [
"the Persian New <em>Year</em>, Nowruz, March 21, 1935, substituted <em>Iran</em> for <em>Persia</em> <em>as</em> the <em>official</em> name of the",
"explained at the time, \"At the suggestion of the Persian Legation in Berlin, the Tehran government, <em>on</em>"
],
"33": [
"and <em>Iran</em> <em>interchangeably</em>.",
"the decision, and Professor Ehsan Yarshater, editor of Encyclopædia Iranica, propagated a move to use <em>Persia</em>"
],
"34": [
"Today, both <em>Persia</em> and <em>Iran</em> <em>are</em> used <em>in</em> cultural contexts; although, <em>Iran</em> is the <em>name</em> used officially"
],
"35": [
", attest to a human presence in <em>Iran</em> since the <em>Lower</em> Paleolithic era, c. 800,000–200,000 BC.",
"The earliest archaeological artifacts in <em>Iran</em>, like those <em>excavated</em> at the Kashafrud and Ganj Par sites"
],
"37": [
", as well <em>as</em> Susa and Chogha Mish developing in and around the Zagros region.",
"millennium BC, early agricultural communities such as Chogha Golan and Chogha Bonut began to flourish in <em>Iran</em>"
],
"40": [
"During the Bronze <em>Age</em>, <em>Iran</em> was home <em>to</em> several civilizations including Elam, Jiroft, and Zayande River"
],
"41": [
"<em>in</em> Mesopotamia.",
"Elam, the most prominent of these civilizations, developed in the southwest of <em>Iran</em>, alongside those"
],
"46": [
"and the <em>eastern</em> Anatolia.",
"single ruler in 728 BC led to the foundation of the Median Empire which, by 612 BC, controlled the whole <em>Iran</em>"
],
"49": [
"The conquest of Media was a result of <em>what</em> is called the Persian Revolt."
],
"52": [
"At its greatest extent, the Achaemenid Empire included the modern territories of <em>Iran</em>, Azerbaijan, Armenia",
", <em>Georgia</em>, Turkey, much of the Black Sea coastal regions, northeastern Greece and southern Bulgaria ("
],
"57": [
"<em>Furthermore</em>, one of the Seven Wonders of the Ancient World, the Mausoleum at Halicarnassus, was built"
],
"59": [
"Following the premature death of Alexander, <em>Iran</em> came under <em>the</em> control of the Hellenistic Seleucid Empire"
],
"65": [
"The prolonged and gradual process of the Islamization of <em>Iran</em> began following <em>the</em> conquest."
]
My questions:
- Why are stop words like "the" and "in" getting highlighted? Aren't the search and index analyzers the same by default?
- Since the custom analyzer removes stop words at index time, they should not be stored in the index. Why do they show up in the highlights?
- If you look closely at the highlighting results, stop words are highlighted in some cases but not in others.
For example:
"65": [
"The prolonged and gradual process of the Islamization of <em>Iran</em> began following <em>the</em> conquest."
],
Here, two occurrences of "the" are not highlighted, whereas a third one is.
Why is there such an inconsistency?
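For anyone who wants to reproduce the count on the full response, a rough Python sketch along these lines lists, per field, the highlighted terms that are stop words. It assumes the `highlight` object of a hit has been loaded into a dict; `STOP_WORDS` is again a small hand-picked set rather than the `_english_` list, and `highlighted_stop_words` is a hypothetical helper name.

```python
import re

# Hand-picked subset for illustration only; not Elasticsearch's _english_ list.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "by", "in", "is", "it",
              "of", "on", "the", "to", "was"}

# Matches the text between the default <em>...</em> highlight tags.
EM_RE = re.compile(r"<em>(.*?)</em>")


def highlighted_stop_words(highlight):
    """Map field name -> highlighted terms in its fragments that are stop words."""
    result = {}
    for field, fragments in highlight.items():
        hits = [term
                for fragment in fragments
                for term in EM_RE.findall(fragment)
                if term.lower() in STOP_WORDS]
        if hits:
            result[field] = hits
    return result


# Two fragments copied from the response above.
highlight = {
    "3": [
        "With 78.4 million inhabitants, <em>Iran</em> is the <em>world's</em> 17th-most-populous country."
    ],
    "65": [
        "The prolonged and gradual process of the Islamization of <em>Iran</em> began following <em>the</em> conquest."
    ],
}

print(highlighted_stop_words(highlight))
```

For the two fragments shown, only field "65" is reported, with the single highlighted term "the".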
Hoping for a quick response.
Thanks in advance.