Why is ElasticSearch SQL excruciatingly slow?

I have a very simple SQL query

select hashstring from "flink-index-deduplicated" where bearing = 60

over about 118 million documents of which around 1/19th will match bearing = 60. The bearing values are defined as Long, and each hash string is a 16 word string of around 200-250 characters. ES is taking an excruciating 5-6 seconds to extract each page of results of 1000 rows. The query is being run on the same machine, a MacBook Pro with 10G out 16G allocated to ES, as the database so network latency should not be a problem.

What is going on? This is making ES unusable.

Hi,

I think the problem is the way you are mapping the bearing field on ElasticSearch. For reference:

https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html#map-ids-as-keyword

bearing = 60 is transform to a term query in Elasticsearch. Terms queries over numeric fields are not very efficient as that field type is optimise for range queries. I would advise to map that field as a keyword. If you are doing range queries as well then it would be better to have two fields, one mapped to keyword and the other mapped too long.

Thanks for getting back to me so quickly, and I can see that may be a solution. However, how can I change the mapping to make bearing a keyword or add a new element to each document which is the same as bearing but mapped as as keyword. Bear in mind I have 118 million documents.

Unfortunately, the only way to do that would be to reindex your documents and add /change the bearing field mapped as a keyword.

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

That makes ElasticSearch unusable, it takes 13 hours to load the data from scratch. I'm finding ES unusable, there are too many gotchas hidden away in the documentation like this.

@Nick_Walton I opened an issue to discuss how to avoid this trap.
What do you feel would be the best way to help avoid the problem you ran into:

  1. Better documentation?
  2. More descriptive/specialized field types (long-for-range-queries Vs long-for-id-queries)?
  3. Better data modelling tooling (tools to suggest keyword vs long datatype mappings from looking at your data).
  4. Automatic indexing of numerics for both range and exact-match purposes (would require extra disk space for most users).

Nick, respectively, you have asked a couple of questions, which have been very quickly and clearly answered, and which have highlighted a couple of gaps in your own knowledge.

Now as it appears you're starting with some new tools, no-one expects you to know everything on day 1, or even day 1000, but "gotchas" seems more than a tad unfair. As does then claiming "that makes ElasticSearch unusable".

Let me answer your point, ES sells itself on being a fast text based search engine, running on the NoSQL principle. All well and good, but the documentation is online, unorganised in as much as start here and watch out for some simple mistakes. The latter has cost me a huge amount of time and since time is money, money. NoSQL has as a principle that no schema is need, unlike a RDBMS, unless I'm mistaken. Then it appears that I do not have to define my document structure beforehand. So, I load up my data which takes the best part of 13+ hours and that's only a subset of my full dataset. To load the data as quickly as possible, since it's streaming of a pseudo real-time system, I do not define complicated tokenisers. Then I go to add a whitespace tokeniser to a field, only to find out that I can't do so without reindexing the entire dataset which takes a similar amount of time. I then for reasons of time need to get at one slice of data, identified by a numeric value, and find I'm extracting around 1000 records every five seconds which is ludicrous given I'm intending to extract 6.2 million or thereabouts. I am then told that indexes on numeric feeds are inefficient, and I need to change the field to a keyword and that will require reindexing the data again.

ES is to be admired for what it does well i.e. inexact text indexing, and that is necessary for what I need, but these issues have made the task of loading the data and indexing it as I need it to be well nigh impossible.

The documentation which seems initially to be thorough needs a serious look at. It seems, after the fact, to be more a stream of consciousness or a single stream of thought assuming there reader will start at the beginning and read all the way through with the same level of comprehension.

In summary, I should have been able to do simple things simply. Load up a single stream of records of fixed number of fields and types into a single index, and queried them. Instead, I've found myself spending hours trying to work out how to do simple things. Even the text books seem to fail to point out such simple 'gotchas'.

I had assumed ES as simple to use for simple cases, at least from reading through one or two of the available text books. The inability to add a tokeniser after you have loaded the data without reindexing the entire index seems unforgivable. I have went plus years of RDBMs experience and I have come to expect such capabilities.

As for documentation, the online documentation while thorough is written as one long stream. I would have expected layered approach, overview, tutorial, in depth reference. A simple tutorial detailing how to map out indexes with tokenisers instead of adding them after the data was loaded would have saved me hours, and since hours equate to money, money also.

Hi Nick

I thank you for taking the time to reply and in detail.

The "Elastic Team Members" will likely consider your points carefully and reply helpfully and diplomatically.

I suspect an (perhaps subconscious, maybe based on misleading NoSQL "marketing"?) expectation of some magic bullet. There's a learning curve. It's not a RDBMS. I became aware of need (and costs) of re-indexing on mapping changes within a few weeks of being exposed to ELK stack, and simply didn't and don't consider it as a "gotcha". I personally find the online documentation excellently organised, almost best-in-breed in fact, but that's not to say it cant still be improved.

YMMV on all these things.

Last, your Q was: on a very simple SQL query

select hashstring from "flink-index-deduplicated" where bearing = 60

over about 118 million documents of which around 1/19th will match bearing = 60

Without knowing any details of your use case, that's very explicitly a "RDBMS" oriented question for which it appears you likely have a decent solution? Why are you looking at ELK at all in that case? Not that I'm discouraging you from doing so, I'm just curious as to the why? If someone has said ELK will be way faster than Oracle or Postgres or whatever for this type of thing, I'm not surprised that you are disappointed.

I'd say you were mistaken. A schema is always there - we just don't require you to think about it when you first kick the tires on elasticsearch. You can throw JSON at us to index it then query it. Developers being able to get somewhere in the first 5 minutes is a big part of our adoption.

The next step is not to do a full data load without any further consideration about optimising the index schema for your data or query access patterns. You should review the schema that was inferred from your example document including field types, tokenisers etc and adapt to your needs. For large scale systems that can include optimising the storage required - e.g you may not need to tokenize all string fields (some may be keyword only type fields with no need for data structures used in partial-match searching). We don't know this up front and it's your responsibility as any kind of database admin to consider the various trade-offs required.
Going back to my previous point 4 - we could have indexed your long fields for both range and exact-matching purposes but then we'd have users complaining that we used too much disk space on data structures that weren't required. It's down to the admin again to inform us of how to best organise the data. There's no free lunch here. I admit there's maybe something we can do better with the docs for long mappings and I've opened an issue to discuss ways forward. The solution will involve a trade off between you having to read more docs/engage with modelling tools, or us making automated mapping decisions that could be wrong/wasteful.

What you gain from Lucene's immutable segment files in terms of query speeds you lose in the flexibility to update them. Trade offs, again.

Thanks for the reply. I can’t go into details of the use case but to put it simply the data is a list of hash codes (16) each around 30+ characters long, and formed into a single space separated string I need to be able to match any one hash code against any other with match being partial and ranked eg if I had a hash code "1-XXX 2-YYY …. 16-ZZZ” it must match “1-AAA 2-YYY ….. 16-BBB” , which is I believe ideal for ES esp. given a small example is 1.2 billion hash codes and which would scale up by several orders of magnitude.

At full pelt data will arrive on the database at a rate of around 360,000 values per second, but that is for the future. For the present I need to process a smaller set of around 111 million taken off a CSV file and processed before loading to the database, without considering time to process too much.

Many thanks again

Nick Walton

1 Like

another modelling consideration here - you could avoid any tokenisation processing and word-position storing if you instead mapped these as structured keyword fields and passed as a pre-tokenised array of strings eg.

"hashes": [ "1-XXX", "2-YYY" ...., "16-ZZZ"]

rather than

"hash_string": "1-XXX 2-YYY ... 16-ZZZ"
1 Like

Hey @Nick_Walton, despite all of the other feedback here, I wanted to follow up on one point that was not mentioned by the others.

You said that you are running on a Mac. Is this using Docker or the directly on the OS from the Elasticsearch tarball download?

Using a 15" MBP with the top available i9 and 32GB RAM (16GB given to the ES JVM), indexing fairly complex ~750 byte documents, I can only achieve about 850 docs/s while running in Docker. However running natively on the OS from the tarball, I can achieve over 12,000 docs/s.

This isn't an Elasticsearch issue, rather the fact that Docker containers don't run natively on MacOS, rather within a linux VM in the background. This isn't very efficient. While it is useful for some experimentation and DEV work, it isn't the best performing option.

HI,

Thanks for the note. I’m running on a 2.3Ghz i7 (2013) on bare metal, with ES(10Gb allocated on a 16Gb machine) acting as a sink from Flink pipleline. When stable and running, as I write the email, I’m seeing a throughput of around 2200 documents a second, via wireless, dropping on to two external drives over USB3.0 connections. Most of those documents are being rejected as duplicates, because of the way I need to hold the data.

Interestingly, the Mac is showing ES is using only about 10% of one core, yet after a while the Flink pipeline tends to fail with timeout issues. I don’t know the reason why; it could be Flnk’s ES sink code, platform issues or ES itself. I do have a separate job running in parallel over exactly the same data on the Mac alone but writing to a JDBC connected RDBMS which is a achieving a rate of around 2800 records a second and which has previously completed my full test set of approx 120 million records.

The RDBMS has the benefit that I can add the indexes after I have loaded and secured the data, which with 120million records is a benefit. Though, indexing is not quick.

NIck Walton