Search one field first and, if it returns no results, search the whole document

castano_guill · September 13, 2021, 10:01am

Hello, I'm new to the community.

I currently ingest PDF documents into an Elasticsearch index, and I add custom fields to those documents holding various information (state, a short way of referring to the document, etc.).
I want to change my query so that it first prioritizes some specific fields and, if nothing is found in those fields, searches the entire document.

Would this be possible?

My current query, which searches the entire PDF document, has this format:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "text to find"
          }
        }
      ],
      "filter": [
        {
          "match": {
            "state": "validada"
          }
        }
      ]
    }
  },
  "size": 5,
  "highlight": {
    "pre_tags": ["<em><b>"],
    "post_tags": ["</b></em>"],
    "fields": {
      "attachment.content": {
        "fragment_size": 500,
        "number_of_fragments": 5
      }
    },
    "order": "score",
    "fragmenter": "span"
  }
}

Could you help me?

Thanks!

dadoonet · September 13, 2021, 10:47am

Welcome!

So far, the exact content of your documents and what you'd like to search for are not clear to me.

Could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script is something anyone can copy and paste in Kibana dev console, click on the run button to reproduce your use case. It will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

castano_guill · September 13, 2021, 11:11am

Hi David,

I will try to explain the steps I follow more clearly.

First, I create a pipeline to extract the content of the PDFs that I load into the indices:

PUT _ingest/pipeline/pip1
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1
      }
    },
    {
      "set": {
        "field": "_source.indexed_at",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

Then I create an index with the following configuration

PUT indextext
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_keywords": {
          "type": "keyword_marker",
          "keywords": [
            ""
          ]
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        },
        "synonym": {
          "type": "synonym",
          "synonyms_path": "sinonimos.txt"
        
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym"
          ]
        },
        "rebuilt_spanish": {
          "tokenizer": "standard",
          "filter": [
            "synonym",
            "lowercase",
            "spanish_stop",
            "spanish_keywords",
            "spanish_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "data": {
        "type": "text",
        "analyzer": "rebuilt_spanish",
        "search_analyzer": "rebuilt_spanish",
        "search_quote_analyzer": "my_analyzer"
      },
      "attachment.content": {
        "type": "text",
        "analyzer": "rebuilt_spanish",
        "search_analyzer": "rebuilt_spanish",
        "search_quote_analyzer": "my_analyzer"
      }
    }
  }
}

In the next step, I load the PDF document into Elasticsearch, passing the previously created pipeline as a parameter.
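A minimal sketch of this step, assuming the document ID is 1 and the request is sent as PUT indextext/_doc/1?pipeline=pip1. The "data" value below is only a placeholder for the base64-encoded bytes of the PDF, which is what the attachment processor reads from that field:

```json
{
  "data": "<base64-encoded PDF content>"
}
```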

Once the document is uploaded, I add custom fields to refer to it. For example, if the document is a manual about Samsung televisions, I assign the value "samsung tv fix 2015" to the custom field "variation1".
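As a sketch, assuming the document was indexed with ID 1, a partial update sent as POST indextext/_update/1 with a body like the following would add that field:

```json
{
  "doc": {
    "variation1": "samsung tv fix 2015"
  }
}
```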

At this point I can run the query I posted above to find information. Now I would like the query to behave differently: if the user asks, for example, "fix samsung 2015", it should first look for documents whose "variation1" field contains something similar and then search the entire body of the document, with results ranked higher when the "variation1" field has content similar to the user's query.

dadoonet · September 13, 2021, 6:02pm

If you could also provide a full document as it has been generated in elasticsearch, that'd help.

castano_guill · September 14, 2021, 7:38am

This is an example of a document indexed in the collection. I want to search first for the user's query in the "variacion0" field and, if that returns no results, search the content of the document.

	
_index	"maquinav_novena_documents"
_type	"_doc"
_id	"1"
_version	6
_seq_no	89
_primary_term	27
found	true
_source	
data	"........"
attachment	
date	"2021-03-17T10:02:29Z"
content_type	"application/pdf"
author	"lanjaron beats"
language	"es"
content	"Manual del usuario V2 \n \n \n \n\nS*H85* \n \n \n\n \nEl color y el aspecto pueden variar según el producto; las especificaciones están sujetas a cambios sin previo aviso para \n\nmejorar el rendimiento del producto. \n\n \n\nEl contenido de este manual está sujeto a cambios sin previo aviso a fin de mejorar su calidad. \n\n© Samsung Electronics \n\nSamsung Electronics es el propietario del copyright de este manual. \n\nEl uso o la reproducción de este manual parcial o totalmente sin la autorización de Samsung Electronics están prohibidos. \n\nLas marcas comerciales distintas de Samsung Electronics son propiedad de sus respectivos propietarios. \n\n \nAntes de usar el equipo lea este manual para evitar fallas y guarde para \n\nfuturas referencias. \n\n \n\n• Se puede cobrar una tarifa administrativa si. \n\n– (a) el usuario solicita un técnico y el producto no tiene ningún defecto. \n\n(es decir, si el usuario no se ha leído este manual del usuario). \n\n– (b) el usuario lleva el producto a reparar a un centro de servicio y el producto no tiene ningún defecto. \n\n(es decir, si el usuario no se ha leído este manual del usuario). \n\n•  \n\n— \nSi tras la instalación no reinicia el ordenador quizás el software no funcione adecuadamente. \n\n— \nEl icono Easy Setting Box puede que no aparezca según el sistema del ordenador y las especificaciones del \n\nproducto. \n\n— \nSi no aparece el icono de acceso directo, pulse la tecla F5. \n\n \nRestricciones y problemas con la instalación \n\nLa instalación de Easy Setting Box puede resultar afectada por la tarjeta gráfica, la placa base y el \n\nentorno de red. \n\n• Windows XP 32Bit/64Bit \n\n• Windows Vista 32Bit/64Bit \n\n• Windows 7 32Bit/64Bit \n\n• Windows 8 32Bit/64Bit \n\n• Windows 8.1 32Bit/64Bit \n\n• Windows 10 32Bit/64Bit \n\n• Al menos 32 MB de memoria. \n\n• Al menos 60 MB de espacio libre en la unidad del disco \n\nduro. 
\n\nCapítulo 08 \n\nInstalación del software \n\nhttp://www.samsung.com/\n\n\n50 \n\n \n\n \n\n \n \n\nRequisitos previos para ponerse en contacto \n\ncon el Centro de servicio al cliente de \n\nSamsung \n\n— \nAntes de llamar al Centro de servicio técnico de Samsung, pruebe el producto de este modo. Si el problema \n\ncontinúa, póngase en contacto con el Centro de servicio técnico de Samsung. \n\n \n\nPrueba del producto \n\nUtilice la función de prueba para comprobar si el producto funciona normalmente. \n\nSi la pantalla se apaga y el indicador de alimentación parpadea aunque el producto esté correctamente \n\nconectado a un PC, lleve a cabo una prueba de autodiagnóstico. \n\n1 Apague el PC y el producto. \n\n2 Desconecte el cable del producto. \n\n3 Encender el producto. \n\n4 Si aparece el mensaje Comprobar cable señal, el producto funciona normalmente. \n\n— \nSi la pantalla permanece en blanco, compruebe el sistema de PC, la controladora de vídeo y el cable. \n\n \nComprobación de la resolución y la frecuencia \n\nCompruebe lo siguiente. \n\n \n Problema de instalación (modo PC) \n\n \n\nLa pantalla se enciende y se apaga continuamente. \n\nCompruebe que el cable esté bien conectado al producto y al PC, así como que los conectores estén \n\nfirmemente enchufados. \n \n\n \n\nAparecen espacios en blanco a los cuatro lados de la pantalla cuando se conecta un cable HDMI o \n\nHDMI-DVI al producto y al PC. \n\nLos espacios en blanco de la pantalla no tienen nada que ver con el producto. \n\nLos espacios en blanco de la pantalla los crea el PC o la tarjeta gráfica"
content_length	96254
indexed_at	"2021-05-26T08:48:54.513715800Z"
title	"manual samsung.pdf"
url	"public/maquinav/manual samsung.pdf"
state	"validada"
variacion0	"como arreglo tv samsung 2015"

Thanks for the help

castano_guill · September 16, 2021, 6:46am

Hello, could someone help me?
Thanks!

dadoonet · September 16, 2021, 9:38am

You could do something like this:

DELETE /indextext
PUT /indextext
PUT /indextext/_doc/1
{
  "attachment": {
    "content": "como arreglo tv samsung 2015",
    "variacion0": "como arreglo tv samsung 2015"
  }
}
PUT /indextext/_doc/2
{
  "attachment": {
    "content": "como arreglo tv samsung 2015",
    "variacion0": "something else"
  }
}

GET /indextext/_search
{
  "query": {
    "multi_match": {
      "query": "fix samsung 2015",
      "fields": [
        "attachment.variacion0^10.0",
        "attachment.content"
      ]
    }
  }
}

It gives:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 11.794991,
    "hits" : [
      {
        "_index" : "indextext",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 11.794991,
        "_source" : {
          "attachment" : {
            "content" : "como arreglo tv samsung 2015",
            "variacion0" : "como arreglo tv samsung 2015"
          }
        }
      },
      {
        "_index" : "indextext",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.36464313,
        "_source" : {
          "attachment" : {
            "content" : "como arreglo tv samsung 2015",
            "variacion0" : "something else"
          }
        }
      }
    ]
  }
}

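Note the order of the result set: document 1, which also matches in the boosted "variacion0" field, scores far higher than document 2, which matches only in the content.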

castano_guill · September 16, 2021, 2:31pm

Hello, thanks!

One last thing:
How can I combine "multi_match" with "bool" to check that the state of the document is "validated"?

Thanks

Regards

dadoonet · September 17, 2021, 4:12am

Create a bool query. Put the multi_match query in the must clause and add a term query on the field that holds the validated state to the filter clause.
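A minimal sketch of that combination, to be sent to the _search endpoint of your index. It assumes the field paths from the example document posted earlier (a top-level "variacion0" and "state", with the PDF text in "attachment.content") and that "state" was mapped dynamically, so that a "state.keyword" sub-field exists. Adjust the names to your actual mapping:

```json
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "fix samsung 2015",
            "fields": [
              "variacion0^10.0",
              "attachment.content"
            ]
          }
        }
      ],
      "filter": [
        {
          "term": {
            "state.keyword": "validada"
          }
        }
      ]
    }
  }
}
```

The filter clause only restricts which documents can match and does not contribute to the score, so the boost on "variacion0" still decides the ranking.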

castano_guill · September 17, 2021, 10:54am

Thanks! The new query is running now.

system · October 15, 2021, 10:54am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
