Incomplete Data Indexing from Logstash to Elasticsearch: Missing Rows

I’m experiencing an issue with incomplete data being indexed from Logstash into Elasticsearch. Despite my SQL query returning 453,237 rows from the database, only 99,000 documents are indexed in Elasticsearch. I’m trying to understand why this discrepancy occurs and how to resolve it.


Input Configuration:

input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://database-ip:3306/SUPER_TEST_TRACK"
    jdbc_user => "admin"
    jdbc_password => "P@ssword2020"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_driver_library => "/usr/share/java/mysql-connector-java-9.1.0.jar"
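    # fetch the query results in pages of 1,000 rows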
    jdbc_paging_enabled => true
    jdbc_page_size => 1000
    statement => "SELECT
                    s.site AS siteName,
                    p.palet_id AS palet_id,
                    p.date_production AS dateProduction,
                    COALESCE(p.delivery_date, '1970-01-01') AS deliveryDate,
                    p.palet_state AS paletState,
                    p.quantity AS quantity,
                    p.initial_quantity AS initialQuantity,
                    c.criterion_name AS criterionName,
                    cv.criterion_value AS criterionValue,
                    (p.quantity * cv.criterion_value) AS calculatedQuantity
                  FROM
                    palet p
                  JOIN
                    palettes_criteria pc ON p.palet_id=pc.palette_id
                  JOIN
                    site s ON p.id_site = s.id_site
                  JOIN
                    criterion_value cv ON pc.criterion_value_id = cv.criterion_value_id
                  JOIN
                    criterion c ON cv.criterion_id = c.id_criterion
                  WHERE
                    c.criterion_name = 'unp'"
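    # cron-style schedule: run the query every minute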
    schedule => "* * * * *"
  }
}

Filter Configuration:

filter {
  mutate {
    remove_field => ["@timestamp", "@version"]
  }
}

Output Configuration:

output {
  elasticsearch {
    hosts => ["http://ip:9200"]
    index => "mysql-denormalized_palet-data"
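    # build the document _id from event fields using sprintf references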
    document_id => "%{siteName}-%{dateProduction}-%{deliveryDate}-%{palet_id}"
  }
}

Observations:

  1. The SQL query, executed directly in MySQL, returns all 453,237 rows.
  2. Elasticsearch indexes only 99,000 documents after the pipeline runs.
  3. When I query Elasticsearch:
GET /mysql-denormalized_palet-data/_count

It returns:

{
  "count": 99000
}
  4. One indexed document looks like this:
{
  "_index": "mysql-denormalized_palet-data",
  "_type": "_doc",
  "_id": "%{siteName}-%{dateProduction}-%{deliveryDate}-37969",
  "_source": {
    "sitename": "BERRECHID",
    "palet_id": 37969,
    "dateproduction": "2023-01-04T00:00:00.000Z",
    "deliverydate": null,
    "paletstate": "INSTORAGE",
    "quantity": 36,
    "initialquantity": 36,
    "criterionvalue": "1.44",
    "criterionname": "UNP",
    "calculatedquantity": 51.84
  }
}

You are using a custom id that requires the fields siteName, dateProduction, deliveryDate, and palet_id to exist in your document.

In the example you shared, the _id generated is:

"_id": "%{siteName}-%{dateProduction}-%{deliveryDate}-37969"

This means that the fields siteName, dateProduction, and deliveryDate do not exist in your document. When a referenced field is missing, Logstash keeps the %{...} reference as a literal string, so every row sharing the same palet_id ends up with the same _id and overwrites the previous document, which is why you get far fewer documents than rows.

Looking at the sample document you shared, the field names are different.

You have a field named sitename, not siteName; the same applies to the other two fields, which appear as dateproduction and deliverydate.

Change your output to use the correct field names and see if that fixes your issue.
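Alternatively, if you want to keep the camelCase aliases, the jdbc input has a lowercase_column_names option, which defaults to true and is what is lowercasing your column aliases. A minimal sketch of that alternative:

input {
  jdbc {
    # ... same connection settings and statement as in your config ...
    # keep column aliases such as siteName and dateProduction unchanged
    lowercase_column_names => false
  }
}

With that set, your existing document_id template should resolve as written.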

Can you provide the output part that solves the issue?

I would recommend you look at the fields that actually exist in your document and use these when you define the document id in the Logstash output. As Leandro pointed out, it seems the JDBC input plugin lowercases all column names, which is why you are having this issue. Please try it yourself and let us know if you are not able to understand how it works based on this thread as well as the related one I responded to earlier.

Once you understand how it works you will be able to better troubleshoot this type of issue yourself.
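That said, for future readers, the corrected output would presumably look like this, using the lowercased names that appear in the sample document:

output {
  elasticsearch {
    hosts => ["http://ip:9200"]
    index => "mysql-denormalized_palet-data"
    # these references match the lowercased field names emitted by the jdbc input
    document_id => "%{sitename}-%{dateproduction}-%{deliverydate}-%{palet_id}"
  }
}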

Thanks, I now get the exact number of documents.
One more question: I see that there are a lot of deleted docs. Does this affect query performance, and does it count toward the store size?
health status index                         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   mysql-denormalized_palet-data mZXkQsb_Tpab4si_k-519A 1   1   453237     1198755      471mb      471mb
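The deleted docs are expected with this setup: the schedule runs the query every minute, each run re-indexes the same rows with the same _id, and every overwrite marks the previous version of the document as deleted. Deleted docs do count toward the store size and are cleaned up gradually as Lucene merges segments in the background; the impact on query performance is usually modest. As a sketch, you could inspect the counts and, if the ratio stays high, expunge the deletes (index name taken from your _cat output above):

GET _cat/indices/mysql-denormalized_palet-data?v&h=docs.count,docs.deleted,store.size

POST /mysql-denormalized_palet-data/_forcemerge?only_expunge_deletes=true

Note that force merging is an expensive operation, and with this pipeline the deletes will simply accumulate again on the next scheduled run.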