Attachment downloads are really slow

We have TheHive and Elasticsearch set up together on Docker. When we attempt to download any attachment from TheHive, which stores its documents in Elasticsearch, the download is really slow. For example, a 780KB file (less than 1MB) takes approximately 2 minutes of waiting before the download even starts.

We have ruled out the network, since the other Docker containers share the same internal subnet and file downloads from those containers run at normal speed.

We suspect we need to tune a parameter in Elasticsearch. Could it be an index issue? Are there any Elasticsearch parameters we should modify or increase? And how do I trace what is happening in Elasticsearch while I am attempting the download?

Apologies, I come from an Oracle background (relational databases) and I am new to Elasticsearch.

We are using Elasticsearch 5.6.2 on Docker 18.03.1-ce, running on CentOS Linux release 7.4.1708.

Many thanks!

It's actually not a great idea, IMHO, to store big blobs in Elasticsearch.
It has not been designed for that.
I'm not saying that explains what you are seeing; I'm just sharing some thoughts.

Thanks for your prompt response.

Please forgive this silly question, but how did you know it's stored as a BLOB datatype? Also, how do I query Elasticsearch to find out where the files are being stored (the default location)?

Again I am from an Oracle background and elasticsearch is new to me.

Thanks!

That's a guess. I believe you have a field with BASE64 content; that is what I'm calling the blob.
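For illustration only, here is roughly how an application ends up with such a field: the file is BASE64-encoded and embedded as a string in the JSON document. The file name and JSON shape below are invented, not TheHive's actual schema:

```shell
# Hypothetical sketch: BASE64-encode a file the way an application might
# before indexing its content as a field of an Elasticsearch document.
printf 'hello world' > /tmp/attachment.txt      # stand-in for a real file
B64=$(base64 /tmp/attachment.txt | tr -d '\n')  # strip line wrapping for JSON
echo "{ \"filename\": \"attachment.txt\", \"attachment\": \"$B64\" }"
# → { "filename": "attachment.txt", "attachment": "aGVsbG8gd29ybGQ=" }
```

Note that BASE64 inflates the payload by about a third, so a 780KB file becomes roughly 1MB of text that has to be parsed, stored and returned as part of the JSON document.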

Thank you!

One final request: how do I query Elasticsearch to find out the location where the files are being stored?

You don't have to know that.
It's an internal Lucene format in the data directory.

Again, thank you for your prompt response.

So am I correct to think that storing document attachments in Elasticsearch should be avoided, full stop?

I did a bit of research, and it appears that Elasticsearch is TheHive Project's most recommended database, but we need to add attachments to TheHive, so I am confused. Does that mean we need to ditch Elasticsearch as a database and use a different database technology?

Again, I appreciate your thoughts on the matter.

Thanks!

Do you really need the attachments in the datastore? If they are large, you could just save a URL or file path in the datastore and keep the actual attachment on the file system or in a blob store.

Generally speaking, Elasticsearch (like most other datastores) is built for lots of quick, parallel queries, but it is not an ideal filesystem in the sense of "load this big binary for me". It's a bit academic, but I like the paper "To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?", which describes the tradeoffs and when to use datastores or filesystems for binaries. It doesn't apply 1:1 to Elasticsearch, but many of the concepts still match.
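A minimal sketch of that reference-instead-of-blob approach, assuming invented index, field and path names (this is not TheHive's actual schema):

```shell
# Sketch: keep the binary on disk or in a blob store and index only small
# metadata plus its location. All names and paths here are invented.
cat > /tmp/attachment-meta.json <<'EOF'
{
  "filename": "report.pdf",
  "size_bytes": 798720,
  "storage_path": "/data/attachments/report.pdf"
}
EOF
# Indexing it is then an ordinary small document, e.g.:
#   curl -XPUT 'http://localhost:9200/attachments/doc/1' \
#        -H 'Content-Type: application/json' -d @/tmp/attachment-meta.json
cat /tmp/attachment-meta.json
```

The application then resolves storage_path itself when a user asks for the file, so Elasticsearch only ever handles the small metadata document.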

Having said that, 2 minutes for 780KB still sounds totally wrong. I would take a look at two things:

  1. How did you configure Elasticsearch's data storage? Quoting from our docs: "Always use a volume bound on /usr/share/elasticsearch/data"

  2. What does the Elasticsearch query look like? You'll have to peel through some layers to get to the raw query, I'm afraid, since we only know that part and not the Hive layer(s) on top. Also, the Elasticsearch response contains a field called took, which tells you how long the Elasticsearch query took; that would be interesting as well.
    If you can't easily find the underlying query, you can log everything by issuing the following command:

     curl -XPUT 'http://localhost:9200/_all/_settings' -H "Content-Type: application/json" -d '{
         "index.indexing.slowlog.threshold.index.info" : "0s",
         "index.search.slowlog.threshold.fetch.info" : "0s",
         "index.search.slowlog.threshold.query.info" : "0s"
     }'
    

    You'll find the queries in the Docker logs afterwards (docker logs <container-id>). Be sure to reset these values (replace "0s" with null and run the query again), because this is pretty heavy.
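Spelled out, that reset is the same _settings call with the thresholds set back to null; the guard is only there because this sketch assumes Elasticsearch may or may not be reachable on localhost:9200:

```shell
# Reset the slowlog thresholds set above back to their defaults.
RESET_BODY='{
    "index.indexing.slowlog.threshold.index.info" : null,
    "index.search.slowlog.threshold.fetch.info" : null,
    "index.search.slowlog.threshold.query.info" : null
}'
# Only send the request if Elasticsearch is actually reachable:
if curl -s -o /dev/null 'http://localhost:9200'; then
    curl -XPUT 'http://localhost:9200/_all/_settings' \
         -H 'Content-Type: application/json' -d "$RESET_BODY"
else
    echo "Elasticsearch not reachable on localhost:9200; nothing sent."
fi
```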


Hi xeraa,

Thank you very much for your comprehensive response. When you mention "if they are large you could just save a URL or file path in the datastore and keep the actual attachment on the file system or a blob store", how do I actually do this?

I will review our configuration and see if it matches the documentation you referenced.

Thanks again and for the query, will use it accordingly.

Take care!!

My assumption was that you base64 encode the attachment and don't search in it — otherwise please correct me.

In that case you wouldn't need to store the attachment in Elasticsearch, since you are not using it in your query. Thus you / your application could retrieve the attachment as required as long as you have a reference to it. That reference might just be a URL or a path where you can find it. Does that make sense?

Hi xeraa,

Thanks again for your prompt response. Yes, I found out that we are not storing the documents directly in Elasticsearch, and we are using a volume bound on /usr/share/elasticsearch/data.

I put in the following settings thinking they would help, but they didn't; it's still taking 2 minutes to retrieve a 780KB file:

    # Datastore
    datastore {
      name = data

      # Size of stored data chunks
      chunksize = 50k

      hash {
        # Main hash algorithm /!\ Don't change this value
        main = "SHA-256"
        # Additional hash algorithms (used in attachments)
        extra = ["SHA-1", "MD5"]
      }
      attachment.password = "xxxxxxx"
    }

We are following the recommended settings: https://github.com/TheHive-Project/TheHiveDocs/blob/master/admin/configuration.md

I am still researching why we are experiencing such a delay. I came across this article: https://www.datadoghq.com/blog/elasticsearch-performance-scaling-problems/

I am looking at Problem #4, "How can I speed up my index-heavy workload?" I'm not sure whether this is going to help us in this scenario or not. What do you think?

Many Thanks!

Those settings must be Hive-specific. I'm not sure how they will influence performance, but IMO this needs more than just some parameter tuning.

That directory is the Elasticsearch data directory, where Elasticsearch / Lucene store the data directly.

I thought the performance problem was with searching, not with writing / indexing? Do you have so much data in Elasticsearch that it's a scaling problem? I would suspect a problematic Docker setup or bad Elasticsearch queries (or a combination of both). Can you get us one of those slow queries? Otherwise it's all pretty much guesswork as to what is going on.

For general performance data we would recommend our own monitoring plugin. If you are using our Docker containers for 5.6 that should already be included.

Hi xeraa,

Thanks for your response. I will try and get the slow query.

It's definitely a search issue and it only happens when we are fetching a document. For all other search operations the speed is fine; it's slow just when fetching.

We don't seem to have the monitoring plugin installed, so I will install it and monitor.

Thanks again, I will provide update once I have more information.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.