Elastic with PDF, Excel and Word documents

Hello, I need help. I am building a project in which I need to index files of different types and search their contents with Elasticsearch. The files are extracted directly from an Oracle 12c database. I would like their content to be indexed whether they are Excel, Word, PDF, etc., without having to convert them to plain text before uploading. Is there a way to achieve this?

If you have a directory containing the files, you can use FSCrawler. There's a tutorial to help you get started.

You can use the ingest attachment plugin.

There's an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is the Base64 representation of your binary file.
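
As a rough illustration, here is how a file could be pushed through that pipeline from Python. This is only a sketch, assuming the 8.x elasticsearch Python client, the attachment pipeline created above, and a hypothetical local file report.pdf indexed into my_index with placeholder credentials; adjust names and connection details to your environment. On a 7.x client, pass the same payload with body= instead of document=.

import base64

from elasticsearch import Elasticsearch

# Connect to the cluster (URL and credentials are placeholders).
es = Elasticsearch("http://localhost:9200", basic_auth=("user", "password"))

# Read the binary file and Base64-encode it; this is the value the
# attachment processor expects in the "data" field.
with open("report.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# Index the document through the attachment pipeline so the extracted
# text ends up in the "attachment.content" field.
es.index(index="my_index", id="my_id", pipeline="attachment", document={"data": encoded})

# Fetch the document back to inspect the extracted content.
print(es.get(index="my_index", id="my_id")["_source"]["attachment"]["content"])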

The documents are not stored in any directory; they are retrieved through a model within the project and loaded by a function. They are never stored locally, because the amount of data is too large to keep on local disk.

from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver
from elasticsearch_dsl import Binary, Document, Text, connections

# DATABASE_SCHEMA and set_sql_for_field are project-level helpers defined elsewhere.

class AdmDocumentos(models.Model):
    id = models.BigIntegerField(primary_key=True, blank=True)
    nombre = models.CharField(max_length=100)
    archivo = models.TextField()  # Changed to TextField to store text

    class Meta:
        managed = False
        db_table = '"' + DATABASE_SCHEMA + '"."adm_documentos"'
        db_table_comment = 'Documentos'
        verbose_name = 'Documento'
        verbose_name_plural = 'Documentos'
        ordering = ['nombre']
        
    @set_sql_for_field('id', 'SELECT ' + DATABASE_SCHEMA + '.id_seq.NEXTVAL FROM dual')
    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)

    def __str__(self):
        return self.nombre


# Elasticsearch connection
connections.create_connection(hosts=['http://localhost:9200'], http_auth=('*******', '**********'))

# Elasticsearch index definition
class DocumentoIndex(Document):
    nombre = Text()
    archivo = Binary() 

    class Index:
        name = 'documento_index'

        # Mapping definition
        mappings = {
            "properties": {
                "nombre": {
                    "type": "text"
                },
                "archivo": {
                    "type": "binary",
                }
            }
        }

# Function to index new documents via a post_save signal
@receiver(post_save, sender=AdmDocumentos)
def indexar_documento(sender, instance, created, **kwargs):
    if created:
        # If a new document was created, index it in Elasticsearch
        documento_index = DocumentoIndex(
            meta={'id': instance.id},
            nombre=instance.nombre,
        )
        try:
            # Check whether the file is a bytes object
            if isinstance(instance.archivo, bytes):
                # Convert the file to a text string before indexing it
                contenido_decodificado = instance.archivo.decode('utf-8')
                documento_index.archivo = contenido_decodificado
            else:
                # If the file is already a text string, assign it directly
                documento_index.archivo = instance.archivo

            # Save the indexed document in Elasticsearch
            documento_index.save()
        except Exception as e:
            print(f"Error al indexar el documento: {e}")
