Hi David,
I will try to explain the steps I follow more clearly.
First, I create a pipeline to extract the content of the PDFs that I load into the indices:
PUT _ingest/pipeline/pip1
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1
      }
    },
    {
      "set": {
        "field": "_source.indexed_at",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
Then I create an index with the following configuration:
PUT indextext
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_keywords": {
          "type": "keyword_marker",
          "keywords": [
            ""
          ]
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        },
        "synonym": {
          "type": "synonym",
          "synonyms_path": "sinonimos.txt"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym"
          ]
        },
        "rebuilt_spanish": {
          "tokenizer": "standard",
          "filter": [
            "synonym",
            "lowercase",
            "spanish_stop",
            "spanish_keywords",
            "spanish_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "data": {
        "type": "text",
        "analyzer": "rebuilt_spanish",
        "search_analyzer": "rebuilt_spanish",
        "search_quote_analyzer": "my_analyzer"
      },
      "attachment.content": {
        "type": "text",
        "analyzer": "rebuilt_spanish",
        "search_analyzer": "rebuilt_spanish",
        "search_quote_analyzer": "my_analyzer"
      }
    }
  }
}
In the next step, I load the PDF document into Elasticsearch, passing the previously created pipeline as a parameter.
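As a minimal sketch of this step, the request could look roughly like the one below. The document id 1 and the placeholder for the base64-encoded PDF are hypothetical; the attachment processor reads the encoded file from the "data" field configured in the pipeline.

```console
PUT indextext/_doc/1?pipeline=pip1
{
  "data": "<base64-encoded PDF content goes here>"
}
```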
Once the document is uploaded, I add custom fields that describe it. For example, if the document is a manual about Samsung televisions, I assign the value "samsung tv fix 2015" to the custom field "variation1".
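One way to add such a field is a partial update, sketched below. The document id 1 is hypothetical. Since "variation1" is not declared in the mapping above, this assumes it is created through dynamic mapping, so it would be analyzed with the default analyzer rather than "rebuilt_spanish".

```console
POST indextext/_update/1
{
  "doc": {
    "variation1": "samsung tv fix 2015"
  }
}
```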
At this point I can run the query I wrote above to find information. Now, if the user asks for "fix samsung 2015", for example, I would like the query to look first for documents whose "variation1" field contains something similar, and then to search the entire body of the document, with results ranked higher when the "variation1" field is similar to the user's query.
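To make the desired behaviour concrete, here is a rough sketch of the kind of query I have in mind, not a verified solution. It assumes a bool query with two "should" clauses, so a document can match on either field, and a boost on "variation1" so that matches there score higher. The boost value 5 is an arbitrary guess and would need tuning.

```console
GET indextext/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "variation1": {
              "query": "fix samsung 2015",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "attachment.content": {
              "query": "fix samsung 2015"
            }
          }
        }
      ]
    }
  }
}
```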