Hey,
I'm using Elasticsearch 5.2.2, Logstash 5.2.2 and PostgreSQL 9.5.2.
I'm trying to index over 5,000,000 documents, split across two types in one index.
But after indexing, some documents are missing.
The first type "A" has all the contents of one of the database tables (~4,000,000); and
The second type "B" is a subset of type "A" (~1,000,000).
All I'm indexing is a single string.
The ids of A can't be duplicated. The query returning B can return duplicates, but far fewer than the number of missing documents.
I'm using the jdbc input plugin with the option jdbc_paging_enabled, and its documentation states: "Be aware that ordering is not guaranteed between queries."
So I put ORDER BY updated in the query, but I'm still getting missing documents.
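One thing that may be worth ruling out: ORDER BY updated only gives a stable page order if updated is unique. The sketch below is a toy simulation, not the plugin's code; it assumes six hypothetical rows that share one updated value, and it emulates a database that breaks the tie differently on each paged query. The function names and the data are made up for illustration.

```python
# Hypothetical rows as (id, updated); all six share the same timestamp.
rows = [(i, 100) for i in range(1, 7)]


def run_page(rows, tie_order, limit, offset):
    """Emulate ORDER BY updated LIMIT/OFFSET where tied rows may come
    back in any order; tie_order picks the order for this one query."""
    ordered = sorted(rows, key=lambda r: (r[1], tie_order(r[0])))
    return ordered[offset:offset + limit]


def run_page_stable(rows, limit, offset):
    """Emulate ORDER BY updated, id: the unique id makes the order total."""
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    return ordered[offset:offset + limit]


all_ids = {r[0] for r in rows}

# Query 1 happens to break ties by ascending id, query 2 by descending id.
page1 = run_page(rows, lambda i: i, limit=3, offset=0)
page2 = run_page(rows, lambda i: -i, limit=3, offset=3)
missing = sorted(all_ids - {r[0] for r in page1 + page2})

# Same paging with a unique tie-breaker in the sort key.
stable = run_page_stable(rows, 3, 0) + run_page_stable(rows, 3, 3)
stable_missing = sorted(all_ids - {r[0] for r in stable})

print("missing with ORDER BY updated:", missing)
print("missing with ORDER BY updated, id:", stable_missing)
```

If updated does have many ties in the real table, sorting on updated plus a unique column such as id would be the corresponding change to try in the statement.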
I kept monitoring the thread pool, but the bulk queue never showed more than 5 queued and 0 rejected.
Any idea of what I should check?
When indexing ends I have:
GET /I/A/_count
{"count":3672436,...}
GET /I/B/_count
{"count":1154472,...}
GET /I/_stats/indexing?pretty&types=A,B
{
  ...
  "_all" : {
    "primaries" : {
      "indexing" : {
        "index_total" : 5494806,
        ...
        "types" : {
          "A" : {
            "index_total" : 4309319,
            ...
          },
          "B" : {
            "index_total" : 1185487,
            ...
          }
        }
      }
    },
    ...
  },
  ...
}
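The gap between index_total and the document counts can be worked out directly from the figures above. This is only a small arithmetic sketch; the variable names are made up, and the interpretation assumes no documents were deleted during the run.

```python
# Figures copied from the _count and _stats output above.
count = {"A": 3672436, "B": 1154472}
index_total = {"A": 4309319, "B": 1185487}
index_total_all = 5494806

# Indexing operations that did not add a new document to the type.
extra_ops = {t: index_total[t] - count[t] for t in count}

# Sanity check: the per-type totals add up to the overall total.
totals_match = sum(index_total.values()) == index_total_all

for t in sorted(extra_ops):
    print(t, "index operations beyond the document count:", extra_ops[t])
print("per-type totals match _all:", totals_match)
```

Since index_total counts indexing operations rather than distinct documents, the 636,883 extra operations on A and 31,015 on B would be consistent with the same _id being sent more than once and overwriting an existing document, which is one way for rows to be both duplicated and missing within a single run.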
The Logstash config is the same for both types apart from "type" and "statement":
input {
  jdbc {
    jdbc_driver_library => "/usr/local/lib/postgresql-9.4.1212.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://postgres:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres"
    jdbc_validate_connection => true
    jdbc_paging_enabled => true
    jdbc_page_size => 100000
    schedule => "* * * * *"
    last_run_metadata_path => "/usr/share/logstash/data/A.logstash_jdbc_last_run"
    type => "A"
    statement => "SELECT id,str FROM a WHERE updated >= :sql_last_value ORDER BY updated ASC"
  }
}
output {
  if [type] == "A" {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "I"
      document_type => "A"
      document_id => "%{id}"
    }
  }
}
Thanks.
