Elasticsearch Index Contains More Documents Than MySQL Query Results

I’m encountering an issue where the number of documents in my Elasticsearch index exceeds the number of rows returned by my MySQL query. Specifically:

  • MySQL Query Result: 4802 rows.
  • Elasticsearch Index Document Count: 5515 documents.

This discrepancy persists even though the query results in MySQL are consistent. I suspect something is happening in my Logstash pipeline or Elasticsearch indexing process.

Logstash Configuration:

Here is my Logstash pipeline:

Input Configuration:

jdbc {
  jdbc_connection_string => "jdbc:mysql://ipadresss:3306/SUPER_TEST_TRACK"
  jdbc_user => "admin"
  jdbc_password => "something"
  jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
  jdbc_driver_library => "/usr/share/java/mysql-connector-java-9.1.0.jar"
  jdbc_paging_enabled => true
  jdbc_page_size => 9000
  statement => "SELECT
                  s.site AS siteName,
                  DATE_FORMAT(p.date_production, '%Y-%m-%d') AS productionDate,
                  SUM(p.initial_quantity * pcv.criterion_value) AS ptotalMeterage
                FROM
                  palet p
                JOIN palettes_criteria pc ON p.palet_id = pc.palette_id
                JOIN criterion_value pcv ON pc.criterion_value_id = pcv.criterion_value_id
                JOIN criterion c ON pcv.criterion_id = c.id_criterion
                JOIN site s ON p.id_site = s.id_site
                WHERE
                  c.criterion_name = 'unp'
                GROUP BY
                  s.site, productionDate
                ORDER BY
                  productionDate"
  schedule => "* * * * *"
}

Filter Configuration:

mutate {
  rename => {
    "siteName" => "site_name"
    "productionDate" => "production_date"
    "ptotalMeterage" => "p_total_meterage"
  }
  remove_field => ["@timestamp", "@version"]
}

Output Configuration:

elasticsearch {
  hosts => ["http://ipaddress:9200"]
  index => "produced-palette-meterage-per-site-and-date"
  document_id => "palet-%{sitename}-%{productiondate}-%{ptotalmeterage}"
}

Observations:

  1. Consistent MySQL Query Results:
  • Running the SQL query directly in MySQL consistently returns 4802 rows.
  • Example row from MySQL:
| siteName  | productionDate | ptotalMeterage     |
|-----------|----------------|--------------------|
| BERRECHID | 2023-12-17     | 14364.279999999972 |
  2. Elasticsearch Index:
  • Document count (docs.count): 5515
  • Example document in Elasticsearch:
{
  "_index": "produced-palette-meterage-per-site-and-date",
  "_type": "_doc",
  "_id": "palet-BERRECHID-2023-12-17-14364.279999999972",
  "_source": {
    "sitename": "BERRECHID",
    "productiondate": "2023-12-17",
    "ptotalmeterage": 14364.279999999972
  }
}

Where is the issue and how do I solve it? Also, how do I determine the document_id for a complex statement?

The document id in Elasticsearch should be the primary key of the record. The ptotalmeterage field looks like something that could be updated and change over time (could that also be the case for productiondate?). If so, a new document will be generated for each change, as the document id itself will change.
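
For example, a minimal sketch of the output, assuming the event fields really are named sitename and productiondate as in your example document, would build the id only from the identifying columns and leave the aggregated value out, so a changed sum overwrites the same document instead of creating a new one:

elasticsearch {
  hosts => ["http://ipaddress:9200"]
  index => "produced-palette-meterage-per-site-and-date"
  # The id contains only the columns that identify a row; the metric
  # (ptotalmeterage) is deliberately left out so it can change over time.
  document_id => "palet-%{sitename}-%{productiondate}"
}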

Sorry for asking so many times, but how do I determine the document_id for a complex statement without running into this problem, and how do I solve my issue?
Thanks in advance.

In Elasticsearch it is very simple: the document ID is the primary key (and therefore unique) of a document within an index. Whenever you want to update or delete a document you need to provide the document id, so it needs to be static and not change over time. If it changes at any point, a new document will be created, as there is now a new unique document id.
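
As an illustration (a sketch only, reusing the stable id from above), the elasticsearch output can also be told to update in place rather than index, so repeated runs of the scheduled query keep touching the same document:

elasticsearch {
  hosts => ["http://ipaddress:9200"]
  index => "produced-palette-meterage-per-site-and-date"
  document_id => "palet-%{sitename}-%{productiondate}"
  # "update" with doc_as_upsert creates the document on the first run
  # and updates the existing one on every following run.
  action => "update"
  doc_as_upsert => true
}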

If you determine that a table, or a denormalised data set, has a primary key that consists of multiple columns, you need to concatenate these into a single composite field and use this as the document id. One of your first queries was a good example of this.
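
As a concrete sketch of building such a composite key in Logstash (the field names below are taken from your example document and are assumptions), the fingerprint filter can concatenate and hash the key columns into a single value that is then used as the document id:

filter {
  # Hash the columns that together form the primary key into one stable value.
  fingerprint {
    source => ["sitename", "productiondate"]
    concatenate_sources => true
    method => "SHA256"
    target => "[@metadata][doc_id]"
  }
}

output {
  elasticsearch {
    hosts => ["http://ipaddress:9200"]
    index => "produced-palette-meterage-per-site-and-date"
    document_id => "%{[@metadata][doc_id]}"
  }
}

Keeping the hash in [@metadata] means it is used as the id but never stored as a field inside the document itself.
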

Based on your questions it sound like you will need to improve your knowledge around relational database data modelling and normalisation/denormalisation. This will allow you to better define suitable primary keys that can be used for document ids in Elasticsearch. Unfortunately this is completely outside of the area of Elasticsearch, so not something we really can help with here.