Discrepancy in Document Count When Importing Data from MySQL to Elasticsearch Using Logstash

Hello Elasticsearch Community,

I am encountering an issue with importing data from MySQL to Elasticsearch using Logstash. Here’s the context and details:


Setup Information:

  • MySQL Table Name: criteria_sites
  • Elasticsearch Index Name: mysql-criterion-site-data
  • Pipeline Input Configuration: I am using the JDBC plugin in Logstash to fetch data from MySQL.
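
For reference, the input side follows the usual JDBC input pattern. This is only a sketch — the connection string, credentials, and driver path below are placeholders, not my real values:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => "/path/to/mysql-connector-j.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    statement => "SELECT * FROM criteria_sites"
  }
}
```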

MySQL Table Details:

  • The table criteria_sites has 16,626 rows.
  • It has a composite primary key consisting of:
    • criterion_value_id
    • site_id
SHOW KEYS FROM criteria_sites WHERE Key_name = 'PRIMARY';
+----------------+------------+----------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table          | Non_unique | Key_name | Seq_in_index | Column_name        | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------------+------------+----------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| criteria_sites |          0 | PRIMARY  |            1 | criterion_value_id | A         |       14030 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| criteria_sites |          0 | PRIMARY  |            2 | site_id            | A         |       16626 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+----------------+------------+----------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
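
A quick way to compare the row count with the key-component cardinality directly (the `Cardinality` value in `SHOW KEYS` is only an estimate) is:

```sql
SELECT COUNT(*)                          AS total_rows,
       COUNT(DISTINCT criterion_value_id) AS distinct_criterion_values
FROM criteria_sites;
```

This should return 16,626 total rows and about 14,030 distinct `criterion_value_id` values.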

Elasticsearch Output Configuration in Logstash:

output {
  elasticsearch {
    hosts => ["http://ip-address:9200"]
    index => "mysql-criterion-site-data"
    document_id => "criterion-site-%{criterion_value_id}"  # Uses only `criterion_value_id` as the ID
  }
}

Issue:

When I import the table using Logstash, the document count in Elasticsearch doesn’t match the number of rows in the MySQL table.

  • MySQL Table Rows: 16,626
  • Elasticsearch Index Statistics:
yellow open   mysql-criterion-site-data FSkeQzvVQKa4PwIDbnYV0Q   1   1      14030         2596      1.3mb  

You are creating the document id from only one component of the primary key, so it is natural and expected that the number of documents in Elasticsearch matches the cardinality of that column, which as you pointed out is 14,030. If you want all rows indexed as separate documents, you need to build the document id from both components of the primary key.
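The effect can be illustrated in a few lines of Python (hypothetical data, just to show the principle — Elasticsearch keeps one document per unique id, and a write to an existing id is an update, not a new document):

```python
# Simulate indexing rows into a key-value store where the document id
# is built from only one component of a composite key.
rows = [
    {"criterion_value_id": 1, "site_id": 10},
    {"criterion_value_id": 1, "site_id": 11},  # same criterion_value_id, different site
    {"criterion_value_id": 2, "site_id": 10},
]

# Document id from the partial key: rows sharing a criterion_value_id collide,
# and later rows overwrite earlier ones (an "update" in Elasticsearch terms).
partial_index = {}
for row in rows:
    doc_id = f"criterion-site-{row['criterion_value_id']}"
    partial_index[doc_id] = row

print(len(rows), len(partial_index))  # 3 rows, but only 2 documents survive

# Document id from both key components: one document per row.
full_index = {f"{r['site_id']}-{r['criterion_value_id']}": r for r in rows}
print(len(full_index))  # 3
```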

How can I solve this issue so that all 16,626 rows show up in Elasticsearch?

Change your output configuration to something like this:

output {
  elasticsearch {
    hosts => ["http://ip-address:9200"]
    index => "mysql-criterion-site-data"
    document_id => "%{site_id}-%{criterion_value_id}"
  }
}

Naturally you will need to delete everything from the index before rerunning, as you will otherwise end up with duplicates.
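Something like this should do it (adjust the host and index name to yours):

```shell
curl -X DELETE "http://ip-address:9200/mysql-criterion-site-data"
```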

I will try it and give you an update.

Thanks, it worked!
yellow open mysql-criterion-site-data l8otXpdqQOOeRnK8cAfGwg 1 1 16626 49878 2.2mb 2.2mb

Can you explain why the numbers were mismatched at first? I'm new to Elasticsearch and Logstash.

You were indexing documents with a document id (effectively the primary key in Elasticsearch) that contained only one component of the composite key you have in your database. Per your statistics there are only 14,030 unique values in that key component, so there can never be more unique document ids than that in Elasticsearch. Rows sharing a criterion_value_id simply overwrote each other. You can even see this in your first set of index statistics: the 2,596 deleted documents correspond exactly to 16,626 - 14,030, i.e. the rows that overwrote an existing document.