Hello, I need help. I am building a project in which I need to index files of different types so I can search them with Elasticsearch. The files are extracted directly from an Oracle 12c database. I would like their content to be indexed regardless of whether they are Excel, Word, PDF, etc., without having to convert them to plain text before uploading. Is there a way to achieve this?
If you have a directory containing the files, you can use FSCrawler. There's a tutorial to help you get started.
You can use the ingest attachment plugin.
There's an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id
The data field is basically the BASE64 representation of your binary file.
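To make the encoding step concrete, here is a minimal Python sketch (stdlib only) of producing that Base64 string; the index name, document id, and file name in the comment are placeholders:

```python
import base64


def file_to_base64(raw: bytes) -> str:
    """Encode raw file bytes as the Base64 string the attachment processor expects."""
    return base64.b64encode(raw).decode("ascii")


# Hypothetical usage: read any binary file (PDF, Word, Excel, ...) and send
# the encoded result through the attachment pipeline defined above, e.g.:
#
#   PUT my_index/_doc/my_id?pipeline=attachment
#   { "data": "<result of file_to_base64(file_bytes)>" }

print(file_to_base64(b"Lorem ipsum"))  # → TG9yZW0gaXBzdW0=
```

The attachment processor then runs Apache Tika on the decoded bytes server-side, so the extracted text becomes searchable without you converting the file yourself.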
The documents are not stored in any directory; they are queried through a model within the project and loaded through a function. They are never stored locally, because the amount of data is too large to keep on disk.
class AdmDocumentos(models.Model):
    id = models.BigIntegerField(primary_key=True, blank=True)
    nombre = models.CharField(max_length=100)
    archivo = models.TextField()  # Changed to TextField to store text

    class Meta:
        managed = False
        db_table = '"' + DATABASE_SCHEMA + '"."adm_documentos"'
        db_table_comment = 'Documentos'
        verbose_name = 'Documento'
        verbose_name_plural = 'Documentos'
        ordering = ['nombre']

    @set_sql_for_field('id', 'SELECT ' + DATABASE_SCHEMA + '.id_seq.NEXTVAL FROM dual')
    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)

    def __str__(self):
        return self.nombre
# Elasticsearch connection
connections.create_connection(hosts=['http://localhost:9200'], http_auth=('*******', '**********'))

# Index definition in Elasticsearch
class DocumentoIndex(Document):
    nombre = Text()
    archivo = Binary()

    class Index:
        name = 'documento_index'

    # Mapping definition
    mappings = {
        "properties": {
            "nombre": {
                "type": "text"
            },
            "archivo": {
                "type": "binary",
            }
        }
    }
# Function to index documents
@receiver(post_save, sender=AdmDocumentos)
def indexar_documento(sender, instance, created, **kwargs):
    if created:
        # When a new document is created, index it in Elasticsearch
        documento_index = DocumentoIndex(
            meta={'id': instance.id},
            nombre=instance.nombre,
        )
        try:
            # Check whether the file is a bytes object
            if isinstance(instance.archivo, bytes):
                # Convert the file to a text string before indexing it
                contenido_decodificado = instance.archivo.decode('utf-8')
                documento_index.archivo = contenido_decodificado
            else:
                # If the file is already a text string, assign it directly
                documento_index.archivo = instance.archivo
            # Save the indexed document in Elasticsearch
            documento_index.save()
        except Exception as e:
            print(f"Error indexing the document: {e}")
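Note that `decode('utf-8')` will raise for genuinely binary formats such as PDF or Word. A hedged alternative, assuming the `attachment` pipeline from the earlier reply exists and that your elasticsearch-dsl version forwards `save()` keyword arguments (such as `pipeline`) to the underlying index call, is to always send Base64 and let the pipeline extract the text. The helper below is self-contained; the commented signal handler reuses the names from your code but is only a sketch:

```python
import base64


def preparar_archivo(archivo) -> str:
    """Normalize the model's file payload (bytes or str) to the Base64
    string the attachment processor expects."""
    if isinstance(archivo, str):
        archivo = archivo.encode("utf-8")
    return base64.b64encode(archivo).decode("ascii")


# Sketch of the signal handler, adapted from the code above (the
# pipeline id 'attachment' and the save(pipeline=...) kwarg are assumptions):
#
# @receiver(post_save, sender=AdmDocumentos)
# def indexar_documento(sender, instance, created, **kwargs):
#     if created:
#         doc = DocumentoIndex(meta={'id': instance.id}, nombre=instance.nombre)
#         doc.archivo = preparar_archivo(instance.archivo)
#         # Route the save through the attachment pipeline so Elasticsearch
#         # extracts the searchable text from the binary payload server-side
#         doc.save(pipeline='attachment')
```

This way nothing needs to be written to disk: the bytes come straight from the Oracle BLOB, get Base64-encoded in memory, and the text extraction happens inside Elasticsearch.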
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.