Slow indexing with Ruby Chewy gem via ActiveRecord

(Yang Li) #1

We use ES for analytics, but we are running into an indexing performance problem. Here are some details:

  • We have a MySQL database that stores the processed data
  • We use a Ruby gem called Chewy (DSL) made by Toptal to interact with ElasticSearch
  • We index with the Chewy gem to import data from MySQL to ElasticSearch
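
As a rough illustration of the write side of this pipeline, here is a plain-Ruby sketch of turning a batch of rows into a newline-delimited bulk request body (one action line plus one source line per document). It does not show what Chewy does internally. The `bulk_body` helper, the `books` index name, and the `book` type name are hypothetical placeholders.

```ruby
require 'json'

# Builds a newline-delimited bulk body for a batch of rows.
# Each row becomes two lines: an "index" action line and the document source.
# index_name and type_name are hypothetical placeholders, not the real names.
def bulk_body(rows, index_name: 'books', type_name: 'book')
  lines = rows.flat_map do |row|
    [
      JSON.generate({ index: { _index: index_name, _type: type_name, _id: row[:id] } }),
      JSON.generate(row)
    ]
  end
  # The bulk format requires a trailing newline after the last line.
  lines.join("\n") + "\n"
end

# Splits rows into fixed-size batches so each request stays a manageable size.
def bulk_batches(rows, batch_size)
  rows.each_slice(batch_size).map { |batch| bulk_body(batch) }
end
```

Larger batches mean fewer round trips but bigger request bodies, so the batch size is one of the knobs worth measuring.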

The issue is slow indexing. We currently index about 5000 MySQL rows per second, which seems well below what we should be getting. At this point we are not sure whether the bottleneck is MySQL, ActiveRecord, Chewy, or ElasticSearch. The diagnostics and profiling we have done so far:

  • Running iotop to monitor disk reads and writes. We typically see about 1MB/s of reads for a few seconds, then a burst of 10MB/s of writes. This makes me think the read stream from MySQL could be what is slowing things down.
  • Separating the read and write streams by pointing Chewy at another MySQL server that is isolated from the ElasticSearch indexing cluster. There was an improvement, but not a significant one: throughput went up by roughly 2-3K records/second, which is still slow.
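
One way to narrow this down further is to time each stage of the import separately instead of inferring it from disk activity. The following is a minimal sketch using Ruby's standard `Benchmark` module. `StageTimer`, `fetch_batch`, `build_documents`, `bulk_index`, and `run_import` are hypothetical stand-ins for the MySQL read, the document construction, and the ElasticSearch write. They are not part of Chewy or ActiveRecord and would need to be replaced with the real calls.

```ruby
require 'benchmark'

# Accumulates wall-clock seconds per named stage so the slowest one stands out.
class StageTimer
  attr_reader :totals

  def initialize
    @totals = Hash.new(0.0)
  end

  # Runs the block, adds its elapsed time to the stage total,
  # and returns the block's own return value.
  def measure(stage)
    result = nil
    @totals[stage] += Benchmark.realtime { result = yield }
    result
  end

  # Name of the stage with the largest accumulated time, or nil if none ran.
  def slowest
    pair = @totals.max_by { |_stage, seconds| seconds }
    pair && pair.first
  end
end

# Hypothetical stand-in for reading one batch of rows from MySQL.
def fetch_batch(offset, size)
  (offset...(offset + size)).map { |id| { id: id, timestamp: 1_400_000_000 + id } }
end

# Hypothetical stand-in for building the documents to index.
def build_documents(rows)
  rows.map { |row| { book_id: row[:id], entry_time: row[:timestamp] * 1000 } }
end

# Hypothetical stand-in for the bulk write; returns the number of documents sent.
def bulk_index(docs)
  docs.size
end

# Walks the data in batches, timing read, transform, and write separately.
def run_import(timer, total: 10_000, batch_size: 1_000)
  indexed = 0
  (0...total).step(batch_size) do |offset|
    rows = timer.measure(:read) { fetch_batch(offset, batch_size) }
    docs = timer.measure(:transform) { build_documents(rows) }
    indexed += timer.measure(:write) { bulk_index(docs) }
  end
  indexed
end
```

After a run, `timer.totals` holds the seconds spent per stage and `timer.slowest` names the stage to look at first.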

Since there are many pieces in this indexing process, we'd appreciate any help or insight into why indexing is slow, or pointers on how to diagnose the problem.

Some ElasticSearch configs:

bootstrap.mlockall: true
script.groovy.sandbox.enabled: true
action.disable_delete_all_indices: true. false

# Heap size defaults to 256m min, 1g max
# Set ES_HEAP_SIZE to 50% of available RAM, but no more than 31g

# Maximum number of open files

# Maximum amount of locked memory

# Maximum number of VMA (Virtual Memory Areas) a process can own

Some MySQL configs:

# This will be passed to all mysql clients
# It has been reported that passwords should be enclosed with ticks/quotes
# escpecially if they contain "#" chars...
# Remember to edit /etc/mysql/debian.cnf when changing the socket location.
port            = 63306
socket          = /var/run/mysqld/mysqld.sock

# Here is entries for some specific programs
# The following values assume you have at least 32M ram

# This was formally known as [safe_mysqld]. Both versions are currently parsed.
socket          = /var/run/mysqld/mysqld.sock
nice            = 0

# * Basic Settings
user            = mysql
pid-file        = /var/run/mysqld/
socket          = /var/run/mysqld/mysqld.sock
port            = 63306
basedir         = /usr
datadir         = /var/lib/mysql
tmpdir          = /tmp
lc-messages-dir = /usr/share/mysql

# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address            =
# * Fine Tuning
key_buffer              = 128M
max_allowed_packet      = 128M
thread_stack            = 1536K
thread_cache_size       = 64

# * Query Cache Configuration
query_cache_limit       = 1M
query_cache_size        = 16M

max_allowed_packet      = 256M

#no-auto-rehash # faster start of mysql but no tab completition

key_buffer              = 16M

Environment information:

  • 3 nodes in the ES cluster with 16GB RAM 384GB SSD and Quad Core Intel Xeon CPU
  • Ubuntu Linux 14.04
  • ElasticSearch 1.4.4 (We have some dependency issues so haven't upgraded to 2.0+)
  • Ruby 2.2.2, Rails 4.2, latest Chewy gem

Chewy model example:

define_type ::TypeName.includes(:type_table_from_mysql) do
  field :book_id
  field :entry_time, type: 'date', value: -> { timestamp * 1000 }
  field :in_stock, type: 'boolean', value: -> (o) { ChewyHelper.in_stock?(o) } 
  field :price, type: 'float'
  field :day, type: 'integer', value: -> { }
  field :hour, type: 'integer', value: -> { }
  field :date, type: 'integer', value: -> { timestamp / (24*60*60) }
end
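
To make the unit conversions in the `entry_time` and `date` fields concrete, here is a small standalone sketch of the same arithmetic. It assumes `timestamp` is an integer number of Unix seconds, which the post does not state explicitly. The helper names `entry_time_ms` and `day_number` are hypothetical.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Converts Unix seconds to epoch milliseconds, as in the entry_time field.
def entry_time_ms(timestamp)
  timestamp * 1000
end

# Whole days since the Unix epoch, as in the date field.
# Integer division drops the time-of-day remainder.
def day_number(timestamp)
  timestamp / SECONDS_PER_DAY
end
```

For example, a timestamp of 1_430_000_000 seconds gives 1_430_000_000_000 milliseconds and day number 16550. If `timestamp` were a Float, the division would not truncate and the `integer` mapping would receive a fractional value.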
