I'm setting up a new ES 6.5.4 cluster with 6 data nodes. Each node has 16 CPUs and an 8 GB heap.
Here's the current node overview:
[root@data01~]# curl -XGET 'https://localhost:9201/_cat/nodes?v'
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.100.115.12 41 83 0 0.13 0.13 0.14 mdi - tt2_6_5_4-data02
10.100.115.16 22 99 1 0.06 0.44 1.25 mdi - tt2_6_5_4-master02
10.100.115.14 43 99 2 2.12 1.63 2.06 mdi - tt2_6_5_4-data04
10.100.115.13 32 97 1 0.20 0.20 0.37 mdi * tt2_6_5_4-data03
10.100.115.15 26 99 2 1.03 1.04 1.87 mdi - tt2_6_5_4-master01
10.100.115.11 36 99 2 4.53 4.24 3.81 mdi - tt2_6_5_4-data01
10.100.115.17 43 99 1 0.35 0.70 1.11 mi - tt2_6_5_4-master03
I need to import transactions from an Oracle database. My main goal is fast query results: fetch the last 10, 20, 100 or 1000 transactions for a given account_id and currency_id, sorted by date_currency and transaction_id. I have the same setup on a 2.3 cluster with 3 big indices (transactions_2017, transactions_2018, transactions_2019). Each index has circa 150M documents and is about 150 GB in size. Query results are very bad, around 5-8 seconds (I tried bool, filter, force_merge etc., nothing helped, so I'm moving to new infrastructure and a newer version of ES).
Questions:
- How should I organize my indices? I was thinking of yearly indices again (transactions_year), maybe with 6 shards, or should I try something different this time? Maybe indices keyed by the last digit of account_id, so a query hits just one index instead of 3; that should be faster, or not?
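To make the last-digit idea concrete: the naming scheme (my assumption, e.g. transactions_0 through transactions_9) would let a lookup for one account touch a single index, something like:

```json
GET transactions_0/_search?routing=1234567890
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "account_id": "1234567890" } },
        { "term": { "currency_id": "011" } }
      ]
    }
  }
}
```

(account_id 1234567890 ends in 0, so it would live in transactions_0.)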
- What should I do about my mapping? I guess account_id and currency_id should be type keyword, but what about index: false and doc_values: false on the other fields?
- Should I sort the data while inserting it with Logstash? Will it perform better, given that my query sorts by date_currency and transaction_id?
- Should I try 1 or more replicas for better search performance?
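On the replica question: as I understand it, number_of_replicas (unlike number_of_shards) can be changed on a live index, so both settings should be cheap to test without reimporting, e.g.:

```json
PUT transactions_2019/_settings
{
  "index": { "number_of_replicas": 2 }
}
```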
Here's my template (for now):
{
  "transactions": {
    "order": 0,
    "index_patterns": ["transactions*"],
    "settings": {
      "index": {
        "sort.field": ["date_currency", "transaction_id"],
        "sort.order": ["desc", "desc"],
        "number_of_shards": "6",
        "number_of_replicas": "1"
      }
    },
    "mappings": {
      "transaction_item": {
        "_routing": { "required": true },
        "properties": {
          "@timestamp": { "type": "date", "format": "strict_date_optional_time||epoch_millis" },
          "transaction_id": { "type": "keyword" },
          "date_booking": { "type": "date", "format": "date", "index": false, "doc_values": false },
          "date_currency": { "type": "date", "format": "date" },
          "message": { "type": "keyword", "index": false, "doc_values": false },
          "path": { "type": "keyword", "index": false, "doc_values": false },
          "account_id": { "type": "keyword" },
          "description": { "type": "keyword", "index": false, "doc_values": false },
          "v_pnb": { "type": "keyword", "index": false, "doc_values": false },
          "currency_id": { "type": "keyword" },
          "vd_amount": { "type": "double", "index": false, "doc_values": false },
          "vd_name": { "type": "keyword", "index": false, "doc_values": false },
          "vd_vbdi": { "type": "keyword", "index": false, "doc_values": false }
        }
      }
    }
  }
}
(Side question on date_currency: can't I use keyword here instead of date? I originally had this as an inline // comment, but JSON doesn't allow comments.)
and this is how I plan to do queries:
GET transactions_2017,transactions_2018,transactions_2019/_search?routing=1234567890
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "account_id": "1234567890" } },
        { "match": { "currency_id": "011" } }
      ]
    }
  },
  "size": 10000,
  "from": 0,
  "sort": [
    { "date_currency": { "order": "desc" } },
    { "transaction_id": { "order": "desc" } }
  ]
}
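Since account_id and currency_id are keyword fields, I'm also considering term instead of match (no analysis needed on exact values), and search_after instead of a big size with from for paging. A sketch of what I have in mind (the search_after values are placeholders; in practice they come from the sort values of the last hit of the previous page):

```json
GET transactions_2017,transactions_2018,transactions_2019/_search?routing=1234567890
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "account_id": "1234567890" } },
        { "term": { "currency_id": "011" } }
      ]
    }
  },
  "size": 100,
  "sort": [
    { "date_currency": { "order": "desc" } },
    { "transaction_id": { "order": "desc" } }
  ],
  "search_after": ["2019-03-01", "TX-000123"]
}
```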
Any suggestion is welcome and I'm willing to experiment on my own. I know there is no silver-bullet solution, but I want to avoid redoing the template/shard setup later, because importing the data takes 3 days. The main goal is to show the last XX transactions very fast, and nothing else (no full-text search, calculations etc.). If needed, I can post more info about the hardware. Thanks!