What storage type faster than HDD is best suited to Elasticsearch?

The data lifecycle is as follows:
When an index reaches 20,000,000 documents, or its data becomes 14 days old, it moves to the cold layer
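For reference, this lifecycle maps directly onto an Elasticsearch ILM (index lifecycle management) policy. A minimal sketch, where the policy name and the cold-node attribute are illustrative placeholders; note that max_docs is counted per index, and with min_age of 0d the cold phase begins immediately after rollover:

```json
PUT _ilm/policy/hot_to_cold
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_docs": 20000000, "max_age": "14d" }
        }
      },
      "cold": {
        "min_age": "0d",
        "actions": {
          "allocate": { "require": { "data": "cold" } }
        }
      }
    }
  }
}
```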

There is a lot of querying of the data

Yes, it is possible, but most queries will be on the hot layer; for the cold layer, it will be a bit light

The cold layer will store a lot: all data older than 14 days, or data past the 200,000,000-document threshold.
I want to store data in the cold layer to reduce cost while keeping a lot of data, and also to make access to the data from the last 14 days very quick

Yes, I am ready for that. I have the data, and it will be in the cloud

And I want a paid consultation on some matters

I'll keep this in mind.

If you are looking for a paid consultant in the future, why not now, at the design phase?

There were various additions since I last looked at this thread. I do not mean to be impolite, but I still fear the focus on this cold tier feels a bit off. Just a feeling.

Off the bat, my own preference is generally for more horizontally-scaled solutions. I'd rather have 10+ smaller servers than 5 bigger servers. There are times when that general rule won't fit, but in my experience it's less common.

"the most query will be on the hot layer, for the cold layer, it will be a bit light"

I really hope the users understand the "bit light" queries are also probably going to be a LOT slower.

What sort of data are we talking here?

"SAN is better than NAS"

Is it? Is every SAN better than any NAS? No, that's not true; it depends on so many factors. Certainly there are advantages to SAN-based storage. There are also advantages to just plain archiving data when it's no longer required online - if you keep 5 years' worth of data on your cold layer, users WILL query it if you allow that possibility.

Another thing is your ratios between CPU count, RAM, hot data, and cold data. 5 servers @ 1.5 TB RAM, 2× Intel Xeon Platinum 8352V (36 cores each), 70 TB SSD - that's a lot of CPU power. How many SSDs make up the "70 TB"? Are these arranged in some RAID configuration? (Please say yes, disks do fail, and you need to plan for it!)

I am asking because my hunch here is that the I/O to the RAID/SSDs will (maybe surprisingly) effectively become your bottleneck, and a lot of your CPUs will be idling.

For such a hefty setup I am presuming there's a close to real-time element here? Users often are greedy - if you give them rapid access to the hot data, they'll gobble up all the available resources for fancy (and expensive) dashboards. Your "slow" data will be even slower than you think. Then they'll complain.

Last, and I'm maybe making myself unpopular with the "solution architects" here, but again from experience, all the planning and modeling in the world doesn't help as much as people think it should. There's almost always some implicit or explicit assumption that ends up being way wrong. Rather than wait for that sh!t to hit the fan, and then play a blame game, better to plan for ongoing flexibility.

Anyways, it's an interesting thread and I wish you success with the project.


I am still in the research stage, learning a few things. Soon I will enter the design stage, and then I will hire and pay a consultant

I respect your feelings, but I'm not focusing only on the cold layer. I just thought it would reduce costs; the costs would be high if this amount of data were stored on SSDs.

I welcome any correction or advice and will take it into consideration.
Do you have any advice, corrections, or a suggested approach?

Yes, this approach is useful, and I also think it is better than splitting large servers into several nodes. With one node per server, all of the server's resources go to that node and are fully used; splitting a server into several nodes loses some resources.

But will it be practical? Because when a node is that big, recovery takes a long time.

What's the solution then?

The data has many types: text, numbers, booleans, and dates

I don't know exactly; it was only an example. I read that SAN is fast and so on, so I gave it as an example, and that is what I originally asked about

The number of drives is 24.
Yes, they are arranged in RAID 5. Is this good, or should I use RAID 6?

Quick access is needed for the last 14 days only; anything older than that will be served from HDD or from searchable snapshots
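As a side note, "searchable snapshots" means mounting a snapshotted index so it can still be queried without keeping a regular copy on local disk (this feature requires an Enterprise-level license). A sketch with placeholder repository, snapshot, and index names:

```json
POST /_snapshot/my_repository/my_snapshot/_mount?storage=shared_cache
{
  "index": "index_name"
}
```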

Given this picture, what do you think should be done?

That's what I want.

Thank you

It would help if you could provide some additional details about the use case and how you will be querying the data. Let's focus on the hot tier requirements for now. You have asked questions about this in the past, but I am not sure whether that is relevant to the use case or not, so I would like to clarify.

What is the approximate average document size?

Is it correct that this is a search use case where you will be returning documents and not just aggregations?

Is it also correct that most queries will query most data, e.g. all data in the hot tier?

What type of query clauses will you be using? Will queries be relatively simple or large and complex?

How many documents will each query return on average (order of magnitude is enough)?

How many concurrent queries do you envision needing to support against the hot tier?

What are the latency expectations for these queries? Please give an indication, not just 'fast'.

Three things for free; maybe more if you answer questions more specifically than

"Data has many types, texts, numbers, bool and date"

Please understand that this sort of answer is, er, useless.

  1. I'd almost always suggest RAID6 here. RAID5 minus a failed disk is a PITA - until you replace the disk and the rebuild is complete (this can take days) your data is at risk - drinking in the last chance saloon before some data is gone. Elastic itself helps you here, as data is replicated, but ...

(If using RAID5, have spare disks on site and ready to drop in place.)

  2. I don't generally like the idea of "dividing large servers into several nodes". That's a quasi-virtualization approach. Virtualization is fantastic, but not IMHO a great fit here. As written before, my personal experience-driven preference is just more, smaller servers. Period!

  3. "I read that SAN is fast" is a worrying construction. In a previous answer I implicitly criticized solution architects, but that's not to say the role isn't helpful. Everything you are writing here suggests a sysadmin trying to put together an enterprise-level solution, while (sorry!) lacking some key core skills and experience. The questions are veering towards "what sort of concrete should I use to support the big new bridge I'm building?".
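To put rough numbers on the RAID question: with 24 drives, RAID 6 costs only one extra drive's worth of capacity compared to RAID 5, but survives a second failure during a rebuild. A back-of-envelope sketch, assuming a 3.2 TB drive size (an assumption, not a figure from this thread):

```python
# Usable capacity for a RAID array: (total drives - parity drives) * drive size.
# The drive size below is an assumption; 24 drives is the count given in the thread.
def usable_tb(n_drives: int, drive_tb: float, parity_drives: int) -> float:
    return (n_drives - parity_drives) * drive_tb

drives, size_tb = 24, 3.2

raid5 = usable_tb(drives, size_tb, 1)  # tolerates 1 failed drive
raid6 = usable_tb(drives, size_tb, 2)  # tolerates 2 failed drives

print(f"RAID 5: {raid5:.1f} TB usable")  # RAID 5: 73.6 TB usable
print(f"RAID 6: {raid6:.1f} TB usable")  # RAID 6: 70.4 TB usable
```

The ~3 TB difference is the price of surviving a second drive failure mid-rebuild, which, given multi-day rebuild times on large drives, is usually worth paying.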


When searching I use match, term, aggs and sometimes wildcard

Query clauses I use:
match
term
aggs
wildcard
script

There are simple queries and there are large and relatively complex queries

Approximate document size: up to 1 MB

Yes, the queries will return the documents and also the aggregations

Yes, true, but it depends on the user's needs.
For example, a query normally returns 20,000 to 40,000 documents. If the user wants to return more data, they either scroll or apply specific conditions, such as a date range or a filter on a specific field in the project. Users have a lot of flexibility to display their data in whatever way suits them, so results must come back quickly

The queries I use vary from one query to another; they include term, match, and aggs, and sometimes wildcard and script

The queries will vary: some of them are fairly simple, and some are large

The interface shows 20,000 documents by default, but a user is expected to query up to 100,000, and possibly double that in review scenarios

I don't know exactly how many, but there are many details, so we will need a relatively large number of queries

This is an example of a query.

GET index_name/_search
{ 
  "aggs": {
    "duplicate_values": {
      "composite": {
        "sources": [
          { "field_name": { "terms": { "field": "field_name" } } },
          { "field_name": { "terms": { "field": "field_name" } } },
          { "field_name": { "terms": { "field": "field_name" } } },
          { "field_name": { "terms": { "field": "field_name" } } }
        ],
        "size": 100
      },
      "aggs": {
        "duplicate_count": {
          "value_count": {
            "field": "field_name"
          }
        },
        "duplicate_script": {
          "bucket_script": {
            "buckets_path": {
              "count": "duplicate_count"
            },
            "script": "params.count >= 1 ? 1 : 0"
          }
        }
      }
    }
  },
  "collapse": {
    "field": "field_name"
  },
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "field_name": ""
          }
        },
        {
          "term": {
            "field_name": ""
          }
        },
        {
          "term": {
            "field_name": ""
          }
        },
        {
          "term": {
            "field_name": ""
          }
        }
      ]
    }
  },
  "track_total_hits": true
}

This is just one query; here is its "took":


And this is another query

GET /index_name/_search
{
  "size": 10,
  "aggs": {
    "unique_combinations": {
      "composite": {
        "sources": [
          {"field1": {"terms": {"field": "field_name"}}},
          {"field2": {"terms": {"field": "field_name"}}},
          {"field3": {"terms": {"field": "field_name"}}}
        ]
      }
    }
  }
}

This query's "took" was 21677 ms.
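As an aside, a "took" this long often comes from asking a composite aggregation for everything at once. Composite aggregations are designed to be paginated: take "after_key" from one response and feed it back via "after". A sketch reusing the placeholder field names above:

```json
GET /index_name/_search
{
  "size": 0,
  "aggs": {
    "unique_combinations": {
      "composite": {
        "size": 1000,
        "sources": [
          { "field1": { "terms": { "field": "field_name" } } }
        ],
        "after": { "field1": "last_after_key_value_from_previous_page" }
      }
    }
  }
}
```

The first request omits "after"; each response's "after_key" then becomes the next request's "after" value.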

I am not asking about what latencies you have seen so far, but rather what an acceptable latency would be from a user's perspective. If you have a medium-complexity query that queries just the hot tier, returning 10000 documents of 1MB each, how long are you willing to let this take? 1 second? 10 seconds? 1 minute? 10 minutes?

This is critical, as a large number of concurrent queries will quickly add load to the cluster and affect latencies, especially if you generally query most of the data set rather than targeting queries.

From an Elasticsearch perspective these are very large documents and returning that many results for each query will result in a lot of IOPS. I would not be surprised if latencies will be rather long for these types of queries even if you use NVMe SSDs. This is why I am recommending you run a POC before committing to buying hardware for this project.
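To illustrate the IOPS concern with rough arithmetic (the document size and result count are taken from earlier in this thread; the concurrency figure is the poster's own estimate):

```python
# Rough data volume moved per query, using figures mentioned in this thread.
doc_size_mb = 1.0          # worst-case document size
docs_per_query = 10_000    # documents returned by one query
concurrent = 1_000         # concurrent queries the poster expects

gb_per_query = doc_size_mb * docs_per_query / 1024
tb_in_flight = gb_per_query * concurrent / 1024

print(f"~{gb_per_query:.1f} GB returned per query")        # ~9.8 GB returned per query
print(f"~{tb_in_flight:.1f} TB for {concurrent} queries")  # ~9.5 TB for 1000 queries
```

Even if only a fraction of documents hit the 1 MB worst case, volumes like these make a 1-2 second latency target implausible without aggressive filtering, which is exactly what a POC would expose.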

Text, numbers, booleans, dates, and keywords.
These are the data types I store, specifically.
But the number of fields is large.

Well, I'll do it.
One addition... I planned to leave 10 TB as a reserve. Is that good?

I left 10 TB

Yes, you are right; it never occurred to me that some resources would be lost.
Thank you for this information as well.

Yes, we currently have a QNAP NAS, but I think there are other, better options.

But what do you think is right?

An acceptable response time is 1-2 seconds

It will be 1,000 concurrent queries

Not all of the documents are this big. Most of the data is small, up to a few kilobytes; only some documents reach 1 MB, and they are a small share of the data

Okay, I'll run the proof of concept.

Can I just add one more general point?

This whole thread is dominated by unspecific generalities. The longer reply to Christian particularly. You can't design a solution on that basis. Numerically expressible requirements are a must, so you can balance things the solution MUST provide to be in any way useful, against other things that might be "nice", but are not requirements.

As well as helping develop the elastic cluster, please use any POC to also help develop the details on the critical use cases, prioritizing those which are really key to the business case / stakeholders.

To use the bridge example again, I'm not sure yet if you are looking for a railway bridge, a road bridge, or a foot bridge across a ravine, a river, or an ocean. If you really need a road bridge over a river, start working on #cars/hour, any weight limits for trucks, how many lanes, etc, and stop worrying so much (for now) about anybody going over on foot.


Excuse me if it was not clear; I am trying to describe everything reasonably accurately

What are the things you need to know? I will try to answer briefly

I didn't understand what you meant by the example.

If you mean that everything changes according to the use case and whether the load is big or not:

There is a big load

It was just an analogy, don't worry.

The point is you need to tightly define, as soon/early as possible, what the really core requirements are. And this cannot involve terms like big/small, fast, etc.; rather, it needs to be numeric and measurable. Do not be surprised if you need to tune quite a lot, e.g. a well-defined cluster will still perform badly with terribly inefficient queries.


I'll do it.
Can you give me a simple example of this? I mean the requirements, and how to define them numerically.
Something like:
Response time within 1 second? Like this?

Yes. Write down clear, unambiguous, numerical, measurable requirements. Much like you were writing an SLA.

Just examples:

The cluster/solution has to ingest on average N documents per second, measured over a 24 hour day, with a peak rate of M documents per second.

This query (and be specific about the query) should respond within an average of X1 seconds, and in the worst case (less than 1% of the time) within Y1 seconds when executed against the hot layer.

That query (and be specific about the query) should respond within an average of X2 seconds, and in the worst case (less than 5% of the time) within Y2 seconds when executed against the cold layer.

That other query (and be specific about the query) should respond within an average of X3 seconds, and in the worst case (less than 5% of the time) within Y3 seconds when executed against the hot or cold layers.

Repeat for all the important use cases, queries, reports, dashboards, alerts, whatever.

Data will be queryable within N seconds of capture.

Dashboards D1 and D2 will be used to monitor the system; alerts will be fired in the following scenarios: < list when you want to know if the system is not performing correctly >

Total data volume will be so many kilo/mega/terabytes per day/second/whatever. X GB of data will be in hot storage, which will cover a period of N days. Y GB will be in the cold layer, which will cover a period of M weeks/months.
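The last requirement above can be turned into a sizing sanity check with simple arithmetic. A sketch where every input is a placeholder to replace with measured values (on-disk index size differs from raw document size, so a POC measurement should replace the per-document figure):

```python
# Hot-tier sizing sketch; all inputs are placeholders, not measured values.
docs_per_day = 5_000_000   # placeholder ingest rate
avg_doc_kb = 50            # placeholder average indexed size per document
retention_days = 14        # hot retention period from this thread
replicas = 1               # replica shards per primary

primary_gb = docs_per_day * avg_doc_kb * retention_days / 1024**2
total_gb = primary_gb * (1 + replicas)

print(f"primary data: ~{primary_gb:.0f} GB")   # primary data: ~3338 GB
print(f"with replicas: ~{total_gb:.0f} GB")    # with replicas: ~6676 GB
```

Working through this once with real numbers immediately tells you whether the planned 70 TB per server, the 10 TB reserve, and the 14-day hot window are even mutually consistent.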

Thank you very much

I'll do this.