Elasticsearch Performance Analysis

To give you a sense - we have 16 elasticsearch nodes with "slow" disks and
easily push ~5000 docs/second without many optimizations. To get that
particular number we use:

  1. 10 processes to push the data (PHP so no threads)
  2. The bulk api
  3. We set the refresh interval to -1

We could go a ton faster but we're serving searches during this and we
don't want to hammer the machines. This adds about three points of load
average across the cluster which is fine.
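For anyone unfamiliar with it, the bulk API takes a newline-delimited JSON body: one action line followed by one source line per document. A minimal sketch of building such a body by hand (the index and type names are just placeholders):

```java
import java.util.Arrays;
import java.util.List;

public class BulkBodyDemo {
    // Build a _bulk request body: one action line plus one source line
    // per document, each terminated by a newline.
    static String buildBulkBody(String index, String type, List<String> docs) {
        StringBuilder body = new StringBuilder();
        for (String doc : docs) {
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_type\":\"").append(type).append("\"}}\n");
            body.append(doc).append("\n");
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "{\"title\":\"a\"}",
                "{\"title\":\"b\"}");
        // Two documents -> four lines in the bulk body
        System.out.print(buildBulkBody("sdk_sync_log", "sdk_sync", docs));
    }
}
```

In practice you'd POST that body to `/_bulk` (or let a client library build it for you); the point is that many documents travel in one round trip.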

A few things to keep in mind:

  • When the index is small you'll write about 2x as fast as once it has a
    couple million documents in it. It stabilizes around there for me.
  • Some write requests are serviced very fast but some will be blocked
    because the shard is flushing/merging/something else I forget. This
    happens on each shard independently so using multiple threads to input the
    documents will make a big difference.
  • My numbers come from writing whole wiki articles which, in this case,
    were in the neighborhood of 10KB each. I imagine yours are smaller.
  • I store term vectors which you won't and that costs me another 20%-30%
    performance overhead.
  • "Slow" disks means a single rotating server-grade disk. I haven't seen
    the hardware in person, so I imagine it is one of those small drives
    that spins more slowly than desktop-grade stuff so it can fit in a 1/2U
    slot or something.
  • You probably can't get away with setting the refresh interval to -1.
    That tells Elasticsearch to make documents searchable only when it runs
    out of memory to buffer them. Since we're rebuilding an index for an
    atomic swap we don't care. You probably could get away with setting it
    to 1m or something, depending.

Seriously though, use the bulk api. You might want to look at how logstash
does it. Something as simple as buffering 100 events or 10 seconds
(whichever comes first) will help.
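That "100 events or 10 seconds, whichever comes first" idea can be sketched in a few lines: accumulate events and flush when either a size or an age threshold is hit. The class and threshold names below are just illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class EventBuffer {
    private final int maxEvents;       // e.g. 100
    private final long maxAgeMillis;   // e.g. 10000 (10 seconds)
    private final List<String> pending = new ArrayList<String>();
    private long oldestEventAt = -1;
    private int flushCount = 0;        // number of flushes, for demonstration

    public EventBuffer(int maxEvents, long maxAgeMillis) {
        this.maxEvents = maxEvents;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Add one event; flush if the buffer is full or its oldest event is stale.
    public void add(String event, long nowMillis) {
        if (pending.isEmpty()) {
            oldestEventAt = nowMillis;
        }
        pending.add(event);
        if (pending.size() >= maxEvents
                || nowMillis - oldestEventAt >= maxAgeMillis) {
            flush();
        }
    }

    // In real code this would send one bulk request with all pending docs.
    private void flush() {
        flushCount++;
        pending.clear();
    }

    public int getFlushCount() { return flushCount; }
    public int getPendingCount() { return pending.size(); }
}
```

Note this sketch only checks the age when a new event arrives; a real implementation would also want a background timer so an idle buffer still gets flushed.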

Nik

On Wed, Mar 5, 2014 at 2:55 AM, Itamar Syn-Hershko <itamar@code972.com> wrote:

Writing 500 documents per second is pretty easy to achieve, given a decent
machine. Your code should just work and achieve that.

Multithreading on the client side, and splitting the index to shards
residing on different servers is usually the solution for achieving higher
write throughput. But as I said I don't think you are going to need this
now.

Other than the bulk API, using the native Java client is one optimization
you can make - it communicates with the cluster faster.

--

Itamar Syn-Hershko
http://code972.com | @synhershko (https://twitter.com/synhershko)
Freelance Developer & Consultant
Author of RavenDB in Action (http://manning.com/synhershko/)

On Wed, Mar 5, 2014 at 8:20 AM, Isaac Hazan <isaac.yann.hazan@gmail.com> wrote:

We are currently evaluating Elasticsearch as our solution for Analytics.
The main driver is the fact that once the data is populated into
Elasticsearch, the reporting comes for free with Kibana.

Before adopting it, I am tasked to do a performance analysis of the tool.

The main requirement is supporting a PUT rate of 500 evt/sec.

I am currently starting with a small setup as follows just to get a sense
of the API before I upload that to a more serious lab.

My strategy is basically going over CSVs of analytics that correspond to
the format I need and putting them into Elasticsearch. I am not using the
bulk API because in reality the events will not arrive in a bulk fashion.

Following is the main code that does this:

    // Created once, used for creating a JSON from a bean
    ObjectMapper mapper = new ObjectMapper();

    // Creating a measurement for checking the count of sent events vs
    // ES stored events
    AnalyticsMetrics metrics = new AnalyticsMetrics();
    metrics.startRecording();

    File dir = new File(mFolder);
    for (File file : dir.listFiles()) {

        // Assuming opencsv's CSVReader(Reader, separator), pipe-delimited
        CSVReader reader = new CSVReader(new FileReader(file), '|');
        try {
            String[] nextLine;
            while ((nextLine = reader.readNext()) != null) {
                AnalyticRecord record = new AnalyticRecord();
                record.serializeLine(nextLine);

                // Generate JSON
                String json = mapper.writeValueAsString(record);

                // One synchronous request per document: actionGet()
                // blocks until the cluster acknowledges the write
                IndexResponse response = mClient.getClient()
                        .prepareIndex("sdk_sync_log", "sdk_sync")
                        .setSource(json)
                        .execute()
                        .actionGet();

                // Recording metrics
                metrics.sent();
            }
        } finally {
            // Close the reader even if indexing throws
            reader.close();
        }
    }

    metrics.stopRecording();

    return metrics;

I have the following questions:

  1. How do I know through the API when all the requests are completed
    and the data is saved into Elasticsearch? I could query Elasticsearch for
    the objects counts in my particular index but doing that would be a new
    performance factor by itself, hence I am eliminating this option.
  2. Is the above the fastest way to insert objects into Elasticsearch, or
    are there other optimizations I could do? Keep in mind the bulk API is
    not an option for now.
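One common pattern that would answer question 1 without querying the index is to fire the requests asynchronously and count completions, so you know exactly when every document has been acknowledged. A stripped-down sketch (the class name is just illustrative; the Elasticsearch-specific wiring is stubbed out):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class CompletionTracker {
    private final CountDownLatch latch;
    private final AtomicInteger succeeded = new AtomicInteger();

    public CompletionTracker(int totalRequests) {
        this.latch = new CountDownLatch(totalRequests);
    }

    // Call this from the response callback of each index request
    // (e.g. an ActionListener's onResponse/onFailure in the Java client).
    public void onComplete(boolean success) {
        if (success) {
            succeeded.incrementAndGet();
        }
        latch.countDown();
    }

    // Block until every request has reported back; return the success count.
    public int awaitAll() throws InterruptedException {
        latch.await();
        return succeeded.get();
    }
}
```

With the Java client you would pass an ActionListener to execute() instead of calling actionGet(), and let its callbacks drive onComplete(). That also removes the per-document blocking round trip, which is the other easy win short of the bulk API.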

Thx in advance.

P.S: the Elasticsearch version I am using on both client and server is
1.0.0.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/2afba995-4ae2-41a5-b395-7a90ea2fcc86%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
