Elastic with PDF, Excel and Word documents

Hello, I need help. I am building a project in which I need to index files of different types and search their contents with Elasticsearch. The files are extracted directly from an Oracle 12c database. I would like their content to be indexed whether they are Excel, Word, PDF, etc., without having to convert them to plain text before uploading. Is there a way to achieve this?

If you have a directory containing the files, you can use FSCrawler. There's a tutorial to help you get started.

You can use the ingest attachment plugin.

There's an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is the Base64 representation of your binary file.
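
As a rough illustration, here is how a file could be pushed through that pipeline from Python. This is only a sketch, assuming the 8.x elasticsearch Python client, the attachment pipeline created above, and a hypothetical local file report.pdf indexed into my_index with placeholder credentials; adjust names and connection details to your environment. On a 7.x client, pass the same payload with body= instead of document=.

import base64

from elasticsearch import Elasticsearch

# Connect to the cluster (URL and credentials are placeholders).
es = Elasticsearch("http://localhost:9200", basic_auth=("user", "password"))

# Read the binary file and Base64-encode it; this is the value the
# attachment processor expects in the "data" field.
with open("report.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# Index the document through the attachment pipeline so the extracted
# text ends up in the "attachment.content" field.
es.index(index="my_index", id="my_id", pipeline="attachment", document={"data": encoded})

# Fetch the document back to inspect the extracted content.
print(es.get(index="my_index", id="my_id")["_source"]["attachment"]["content"])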

The documents are not stored in any directory; they are retrieved through a model within the project and loaded by a function. They are never stored locally, because the amount of data is too large to keep on local disk.

from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver
from elasticsearch_dsl import Binary, Document, Text, connections

# DATABASE_SCHEMA and set_sql_for_field are project-level helpers defined elsewhere.

class AdmDocumentos(models.Model):
    id = models.BigIntegerField(primary_key=True, blank=True)
    nombre = models.CharField(max_length=100)
    archivo = models.TextField()  # Changed to TextField to store text

    class Meta:
        managed = False
        db_table = '"' + DATABASE_SCHEMA + '"."adm_documentos"'
        db_table_comment = 'Documentos'
        verbose_name = 'Documento'
        verbose_name_plural = 'Documentos'
        ordering = ['nombre']
        
    @set_sql_for_field('id', 'SELECT ' + DATABASE_SCHEMA + '.id_seq.NEXTVAL FROM dual')
    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)

    def __str__(self):
        return self.nombre


# Elasticsearch connection
connections.create_connection(hosts=['http://localhost:9200'], http_auth=('*******', '**********'))

# Elasticsearch index definition
class DocumentoIndex(Document):
    nombre = Text()
    archivo = Binary() 

    class Index:
        name = 'documento_index'

        # Mapping definition
        mappings = {
            "properties": {
                "nombre": {
                    "type": "text"
                },
                "archivo": {
                    "type": "binary",
                }
            }
        }

# Function to index new documents via a post_save signal
@receiver(post_save, sender=AdmDocumentos)
def indexar_documento(sender, instance, created, **kwargs):
    if created:
        # If a new document was created, index it in Elasticsearch
        documento_index = DocumentoIndex(
            meta={'id': instance.id},
            nombre=instance.nombre,
        )
        try:
            # Check whether the file is a bytes object
            if isinstance(instance.archivo, bytes):
                # Convert the file to a text string before indexing it
                contenido_decodificado = instance.archivo.decode('utf-8')
                documento_index.archivo = contenido_decodificado
            else:
                # If the file is already a text string, assign it directly
                documento_index.archivo = instance.archivo

            # Save the indexed document in Elasticsearch
            documento_index.save()
        except Exception as e:
            print(f"Error al indexar el documento: {e}")
