Migrate bulk lookup from Oracle - performance problem

Hi,

I'm trying to migrate a text search tool from Oracle to Elasticsearch. It's working, but unfortunately searching is a lot slower than the Oracle version, and I'm hoping for clues as to what can be done about it.

We're indexing millions of text documents, but rather than indexing the text directly, we index a number of fingerprints for each document. Each fingerprint is an integer of approximately 50 bits, and each document produces on the order of hundreds of fingerprints (occasionally thousands).

The query side needs to find documents that are similar to an input document. Two documents are considered similar if there is any overlap in their fingerprints, so the query is a terms query containing the fingerprints of the input document.
This query is the slow part.

The problem is mainly CPU usage. When Oracle and ES/Windows have the data in cache, a single query takes Oracle 5 to 25 ms, while it takes ES 200 to 350 ms.

Suggestions as to how to make it faster will be much appreciated.

Setup:
Windows Server 2012 R2
Single node ES cluster
Elasticsearch 2.2.0
ES memory pool: 10GB
Client: NEST 1 / .NET
CPU: E5-2620 (same for oracle and ES)

Sample query:

POST /myindex-v8/document/_search
{ "size": 5000,
"fields": [ "properties" ],
"filter": { "terms": { "fingerprints": [ 458555129998123,
426113387683010,
1047636941817882,
339603061195725,
496322075763366,
898270748861439,
715145004425481,
1083745791071972,
364744235080270,

[ quite a few more of these ]

1042455538904631,
582951484394277,
264684524292891,
827184922645852 ] } } }

Mapping:

{
  "myindex-v8": {
    "mappings": {
      "document": {
        "_all": {
          "enabled": false
        },
        "properties": {
          "fingerprints": {
            "type": "double",
            "precision_step": 2147483647
          },
          "properties": {
            "type": "string",
            "index": "no"
          },
          "refKey": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

Mapping notes:
The "precision_step" setting in the mapping provided a substantial indexing performance boost.

Query notes:
Most of these queries return 1 to 20 results.

One thing to try is using a constant_score query and moving your terms filter inside of it, like so: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-constant-score-query.html
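
Applied to the sample query from the original post, that would look roughly like this (same index, type, and field names; fingerprint list shortened):

POST /myindex-v8/document/_search
{
  "size": 5000,
  "fields": [ "properties" ],
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "fingerprints": [ 458555129998123, 426113387683010, 1047636941817882 ]
        }
      }
    }
  }
}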

In general, filtering against several thousand terms is not likely to be very performant. It may be worth rethinking what you are trying to accomplish; perhaps there is a better way to go about it.

Also, we would recommend not using a double type for your fingerprints when the values are integers. Using an integer type instead (long, given your roughly 50-bit values) will greatly speed up indexing and reduce the amount of disk space needed for doc values, because doubles don't get the same compression benefits that integers do. We would also recommend not using fields but rather _source filtering: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-source-filtering.html
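
As a sketch, the relevant mapping change and a query using _source filtering could look something like this (myindex-v9 is just a placeholder name for the rebuilt index; everything else is taken from the mapping and query above):

PUT /myindex-v9
{
  "mappings": {
    "document": {
      "_all": { "enabled": false },
      "properties": {
        "fingerprints": {
          "type": "long",
          "precision_step": 2147483647
        },
        "properties": { "type": "string", "index": "no" },
        "refKey": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}

POST /myindex-v9/document/_search
{
  "size": 5000,
  "_source": [ "properties" ],
  "query": {
    "constant_score": {
      "filter": {
        "terms": { "fingerprints": [ 458555129998123, 426113387683010 ] }
      }
    }
  }
}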

Thank you for your suggestions.

Using constant_score makes the queries go twice as fast, so that's a nice start :slight_smile:.

I'm building a new index with long fingerprints; we'll see what effect that has.

Using _source filtering does not seem to make a difference.

It sounds to me like the current approach was designed to get around the fact that a relational database does not do full-text search very well. If that is the case, I do not understand the rationale for moving this approach directly to Elasticsearch as-is, as it goes against the way Elasticsearch typically handles these kinds of problems.

As Elasticsearch is a search engine, it has a lot of tools that allow flexible search, e.g. more_like_this queries, which allow you to find documents that are similar to a specific set of documents. It does this by indexing and analysing the text of the documents.
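
As an illustration only, a more_like_this query over the raw text could look something like the sketch below; it assumes the document text were indexed in an analyzed field (here called body, which the current mapping does not have), and some-document-id is a placeholder:

POST /myindex-v8/document/_search
{
  "query": {
    "more_like_this": {
      "fields": [ "body" ],
      "like": [
        { "_index": "myindex-v8", "_type": "document", "_id": "some-document-id" }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}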

In order to migrate this successfully to Elasticsearch, I would recommend taking a step back, looking at the requirements, and seeing how you can use Elasticsearch's rather rich feature set to achieve this the Elasticsearch way.

This makes me think that these are not "doubles" at all, but really terms. You'll likely get much better performance by making them not_analyzed strings. They'll take up more space, I expect, unless you encode them nicely. Numerics aren't as fast as exact term matches, IIRC.
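
For the fingerprints field, that would mean a mapping along these lines (a sketch; the fingerprints would then be indexed and queried as strings such as "458555129998123"):

"fingerprints": {
  "type": "string",
  "index": "not_analyzed"
}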

I agree with @Christian_Dahlqvist that it looks like you are trying to shoehorn a search solution designed for another system into Elasticsearch. You are probably better off designing one for Elasticsearch. But if you can't, because you want to limit the scope of your project and you really just want to stop paying for Oracle licenses, then I understand wanting to continue down the route you are on. Just remember "we told you this is funky" when things don't work super well.