Loading Elasticsearch via Logstash on large dataset is going very slowly


#1

I have a large dataset in MySql (around 2.2 million rows) and my importing to Elasticsearch via Logstash works, but now is going incredibly slowly.

On my local machine in vagrant instances with 4GB RAM, each, it went relatively quickly (took 3 days) compared to taking an estimate 80+ days for a server-to-server transfer.

The query is quite complex (using a subquery, etc).

I switched the mysql server from using the the /tmp directory to using the /data/tmp_mysqldirectory, but even then I was occasionally running out of temporary space.

e.g: I was getting the error "‘message=>"Exception when executing JDBC query", :exception=>#<Sequel::DatabaseError: Java::JavaSql::SQLException: Error writing file '/data/tmp_mysql/MYHPf8X5' (Errcode: 28)>’"

I updated my query to have this limit (200):
UPDATE p_results set computed_at="0000-00-00 00:00:00" WHERE computed_at IS NULL LIMIT 200;

My configuration file looks like this: (notice that I'm using paging with a page size of 10000).

input {
jdbc {
    jdbc_connection_string => "jdbc:mysql://xxx.xxx.xxx.xxx:3306/xxx_production"
    jdbc_user => "xxx"
    jdbc_password => "xxx"
    jdbc_driver_library => "/usr/share/java/mysql.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement_filepath => "./sql/req.sql"
    jdbc_paging_enabled => "true"
    jdbc_page_size => 10000
}

}

output {
    elasticsearch {
        index => "xxx_resultats_preprod2"
        document_type => "resultats"
        hosts => ["localhost:9200"]
        codec => "plain"
        template => "./resultats_template.json"
        template_name => "xxx_resultats"
        template_overwrite => true
        document_id => "%{result_id}"
    }
}

running free -m on my logstash/elasticsearch server, I see this:

total used free shared buffers cached
Mem: 3951 2507 1444 0 148 724
-/+ buffers/cache: 1634 2316
Swap: 4093 173 3920
So total RAM= 4GB, and 2.5GB of it is used.

running free -m on my MySql server I see this:

         total       used       free     shared    buffers     cached

Mem: 3951 3836 115 0 2 1154
-/+ tampons/cache: 2679 1271
swap: 4093 813 3280
So total Ram = 4GB and ~3,8GB or 97% is used.
So my theory is that I'm occasionally swapping to disk and that is part of the reason why it's slow. Another theory is that I'm using a combination of LIMIT in my sql and using the logstash paging, and that could be the issue..

The load average on the Mysql server is relatively low right now.

top
load average: 1,00, 1,00, 1,00

under /data I see:
sudo du -h -d 1
13G ./tmp_mysql
4,5G ./production
18G .

using df-h I see:
total used utilization%
/dev/sdb1 32G 6,2G 24G 21% /data

If someone can help me make my queries execute faster I'd very much appreciate it!


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.