Hey,
I'm using Elasticsearch 5.2.2, Logstash 5.2.2 and PostgreSQL 9.5.2, and I'm trying to index over 5,000,000 documents split across two types in one index.
After indexing finishes, some documents are missing.
The first type, "A", holds all the rows of one database table (~4,000,000), and the second type, "B", is a subset of "A" (~1,000,000).
All I'm indexing is a single string per document.
- The ids of A can't be duplicated. The query returning B can return duplicates, but far fewer than the number of missing documents.
- I'm using the jdbc input plugin with the jdbc_paging_enabled option, whose documentation states: "Be aware that ordering is not guaranteed between queries." So I put ORDER BY updated in the query, but I'm still getting missing documents.
- I kept monitoring the thread pool, but the bulk queue shows no more than 5 queued and 0 rejected.
Any idea of what I should check?
When indexing ends I have:
GET /I/A/_count
{"count":3672436,...}
GET /I/B/_count
{"count":1154472,...}
GET /I/_stats/indexing?pretty&types=A,B
{
  ...
  "_all" : {
    "primaries" : {
      "indexing" : {
        "index_total" : 5494806,
        ...
        "types" : {
          "A" : {
            "index_total" : 4309319,
            ...
          },
          "B" : {
            "index_total" : 1185487,
            ...
          }
        }
      }
    },
    ...
  },
  ...
}
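To put numbers on the gap, here is a small Python sketch that uses only the figures above. It assumes that index_total counts every index operation, including ones that overwrite an existing _id, and that nothing was deleted from the index; under those assumptions the difference between index_total and _count is roughly the number of times an already indexed id was written again.

```python
# Figures copied from the _count and _stats output above.
counts = {"A": 3672436, "B": 1154472}
index_totals = {"A": 4309319, "B": 1185487}


def extra_index_ops(index_total, doc_count):
    """Index operations that did not result in an additional document."""
    return index_total - doc_count


# Assumption: these are overwrites of ids that were already indexed.
extra = {t: extra_index_ops(index_totals[t], counts[t]) for t in counts}

for doc_type, n in extra.items():
    print(f"type {doc_type}: {n} index operations beyond the document count")
print(f"total: {sum(extra.values())}")
```

So for type A there were more index operations than rows in the table, yet fewer documents than rows, which would be consistent with some rows being sent more than once while others were never sent.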
The Logstash config is the same for both types; only the "type" and "statement" change:
input {
  jdbc {
    jdbc_driver_library => "/usr/local/lib/postgresql-9.4.1212.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://postgres:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres"
    jdbc_validate_connection => true
    jdbc_paging_enabled => true
    jdbc_page_size => 100000
    schedule => "* * * * *"
    last_run_metadata_path => "/usr/share/logstash/data/A.logstash_jdbc_last_run"
    type => "A"
    statement => "SELECT id,str FROM a WHERE updated >= :sql_last_value ORDER BY updated ASC"
  }
}
output {
  if [type] == "A" {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "I"
      document_type => "A"
      document_id => "%{id}"
    }
  }
}
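One thing that might be worth checking is whether ORDER BY updated alone is enough to make the pages stable. The Python sketch below is a toy simulation, not the plugin's actual code: it assumes that paging runs the statement once per page with a LIMIT and OFFSET, and that the database may return rows with equal "updated" values in a different order on each query. The rows, the run_query helper and the alternating tie order are all hypothetical and exist only to illustrate the effect.

```python
def run_query(rows, order_by, page, page_size, reverse_ties):
    """Simulate one paged query (ORDER BY ... LIMIT page_size OFFSET page * page_size)."""
    # The database is free to return tied rows in any order; model that by
    # arranging the rows differently before a stable sort on the ORDER BY columns.
    arranged = sorted(rows, key=lambda r: r["id"], reverse=reverse_ties)
    ordered = sorted(arranged, key=lambda r: tuple(r[c] for c in order_by))
    start = page * page_size
    return ordered[start:start + page_size]


def ids_seen(rows, order_by, page_size):
    """Collect the ids returned across all pages, one separate query per page."""
    seen = set()
    pages = -(-len(rows) // page_size)  # ceiling division
    for page in range(pages):
        # Hypothetical worst case: tie order flips on every other query.
        for row in run_query(rows, order_by, page, page_size, page % 2 == 1):
            seen.add(row["id"])
    return seen


# Six hypothetical rows that all share the same "updated" value.
rows = [{"id": i, "updated": 100} for i in range(1, 7)]
all_ids = {r["id"] for r in rows}

missing_without_tiebreak = all_ids - ids_seen(rows, ["updated"], page_size=3)
missing_with_tiebreak = all_ids - ids_seen(rows, ["updated", "id"], page_size=3)

print("ORDER BY updated     -> missing ids:", sorted(missing_without_tiebreak))
print("ORDER BY updated, id -> missing ids:", sorted(missing_with_tiebreak))
```

In this toy case, ordering by "updated" alone returns ids 1 to 3 twice and never returns 4 to 6, while adding the unique id as a tie-breaker returns every row exactly once. If many rows in the real table share the same "updated" value, trying ORDER BY updated, id in the statement would be a cheap way to test this assumption.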
Thanks.