Elastic Search as cache

Hi, all,

I am interested in using Elasticsearch to replace memcached. That way, I
have one less piece of software to maintain.

First of all, I learned that I can retrieve a doc by id right after writing
it into an index. There is no need to wait for indexing to finish.

Second, I set "index" to "no" in the mapping of the index reserved for
caching. I played with that and things work great. I can retrieve a doc by
id but nothing shows up when searching.

The last thing I worry about is this: when updating a doc, Elasticsearch
creates a new doc and marks the original doc for deletion. The actual
deletion happens during a merge. In that case, if I need to constantly
update a value in a cache, would it cause significant delays and/or
performance issues? Elasticsearch may not perform as well as memcached, but
if the difference is not that bad, I still want to do everything in
Elasticsearch.

I appreciate your insights, especially on the last item.

Thanks a lot!
Jingzhao

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/680c222c-ede2-40cd-b3b8-323ac22fd660%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

If you set "index: no" on all fields and disable _all and _source, you
have low overhead because Lucene does not need to index or merge anything.

But your concerns are correct. Elasticsearch does not have in-place
updates, so there is a tradeoff between time and space for that operation.
If you have a large number of deletes, you will notice slower response
times. If gets far outweigh deletes, you can get response times comparable
to memcached once the data is loaded.
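
To make the tradeoff concrete, here is a toy Python model of an append-only store. It is only a sketch of the idea described above and not how Lucene is implemented; the class and method names are invented for illustration. Each update appends a new entry and marks the old one as deleted, and the deleted entries only go away when a merge rewrites the data.

```python
class ToySegmentStore:
    """Toy append-only store: an update adds a new entry and tombstones the old one."""

    def __init__(self):
        self.entries = []   # each entry is [key, value, is_live]
        self.live = {}      # key -> position of the live entry

    def put(self, key, value):
        old = self.live.get(key)
        if old is not None:
            self.entries[old][2] = False  # mark the previous version as deleted
        self.entries.append([key, value, True])
        self.live[key] = len(self.entries) - 1

    def get(self, key):
        pos = self.live.get(key)
        return None if pos is None else self.entries[pos][1]

    def deleted_count(self):
        return sum(1 for entry in self.entries if not entry[2])

    def merge(self):
        """Rewrite the store without tombstoned entries (the expensive step)."""
        kept = [entry for entry in self.entries if entry[2]]
        self.entries = kept
        self.live = {entry[0]: i for i, entry in enumerate(kept)}


store = ToySegmentStore()
for i in range(1000):
    store.put("counter", i)  # constant updates to a single cache key

before_merge = (len(store.entries), store.deleted_count())
store.merge()
after_merge = (len(store.entries), store.deleted_count())
```

In this model a single hot key updated 1000 times leaves 999 dead entries behind until the merge runs, which is the space cost, while the merge itself is the time cost.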

I recommend you set up a test cluster (with mlockall and large heap) and
find out the numbers for yourself.
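
A minimal sketch of such a measurement in Python follows. The backend here is a plain dict standing in for the real thing; `fake_get` is a placeholder that you would replace with a get-by-id call against the test cluster and with a memcached get, so the numbers from this snippet as written say nothing about either system.

```python
import time


def measure_latencies(operation, keys, repeats=3):
    """Call operation(key) for every key, `repeats` times; return sorted latencies in seconds."""
    samples = []
    for _ in range(repeats):
        for key in keys:
            start = time.perf_counter()
            operation(key)
            samples.append(time.perf_counter() - start)
    samples.sort()
    return samples


def percentile(sorted_samples, fraction):
    """Simple index-based percentile of an already sorted list, fraction in [0, 1]."""
    if not sorted_samples:
        raise ValueError("no samples")
    index = min(len(sorted_samples) - 1, int(fraction * len(sorted_samples)))
    return sorted_samples[index]


# Stand-in backend (assumption): a dict instead of a real cluster or memcached.
fake_cache = {"key%d" % i: i for i in range(100)}


def fake_get(key):
    return fake_cache[key]


latencies = measure_latencies(fake_get, list(fake_cache))
summary = {
    "count": len(latencies),
    "p50": percentile(latencies, 0.50),
    "p99": percentile(latencies, 0.99),
}
```

Comparing the p99 rather than only the average is usually more telling for a cache, because merges and GC pauses show up in the tail.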

Jörg

Hi, Jörg,

Thanks a lot for your reply and the useful info. I have more reads than
writes in my case. So, the Elastic Search solution looks promising. I will
go ahead with my tests just as you suggested.

Best regards,
Jingzhao

Hi, Jörg,

I did some experiments following your advice. One issue: when "_source" is
disabled and I get a document by its id, the original JSON is not there.
How am I supposed to get my data back with "_source" disabled?

For example, I have

{
  "mappings": {
    "default": {
      "_source": { "enabled": false },
      "_all": { "enabled": false },
      "dynamic_templates": [
        {
          "noindex": {
            "match": "*",
            "mapping": {
              "index": "no"
            }
          }
        }
      ]
    }
  }
}

Then, a GET returns the following. I cannot find the JSON data associated
with the id (usually, that is in the _source field).

{"_index":"keyvaluestore","_type":"001DC0B00F00","_id":"1","_version":1,"found":true}

For a key/value store, I need to be able to get the value back. Could you
give me some more suggestions?

Thanks,
Jingzhao

I figured out a way to get data back when "_source" is disabled. That is by
setting "store" to true on a field. My question is whether setting "store"
to true for all the fields is faster than setting "_source" to true in my
specific case.
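
As a rough sketch of that setup, the Python below builds the mapping from this thread with "store" added to the dynamic template, plus the path of a GET that asks for specific stored fields. Nothing is sent over the network. The `fields` query parameter is what I understand the get API of Elasticsearch versions from this period to accept, so treat it as an assumption and check it against your version; the field name `value` is purely hypothetical.

```python
import json
from urllib.parse import quote


def cache_mapping(store_fields=True):
    """Build a mapping body like the one in this thread, optionally with stored fields."""
    field_mapping = {"index": "no"}
    if store_fields:
        field_mapping["store"] = True  # keep the value retrievable without _source
    return {
        "mappings": {
            "default": {
                "_source": {"enabled": False},
                "_all": {"enabled": False},
                "dynamic_templates": [
                    {"noindex": {"match": "*", "mapping": field_mapping}}
                ],
            }
        }
    }


def get_path(index, doc_type, doc_id, fields):
    """Path for a get-by-id that requests stored fields (assumed `fields` parameter)."""
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    return "/%s?fields=%s" % (path, ",".join(quote(f, safe="") for f in fields))


body = json.dumps(cache_mapping(), indent=2)
path = get_path("keyvaluestore", "001DC0B00F00", "1", ["value"])
```

Here `body` is what would go into the index creation request and `path` is what would be appended to the node's HTTP address.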

Thanks a lot!
Jingzhao

Actually, the _source field is a regular field that is stored by default. Do you need to fetch all the fields you're sending to ES? By default, if no other field is marked as stored, ES parses the _source field to return any particular field you request. So I don't think marking the fields as stored will be slower than using the _source field, and you get more control over what you store.

Greetings,

You should use _source in the following cases:

  • you have a larger, nested JSON data model, not only a single field (ES
    automatically sets _source as stored)

  • you want to return all original data automatically and maybe filter
    fields by the _source in search responses

  • you want to keep the original form of the data in the ES index beside
    the indexed form, e.g. for backup purposes or further processing

For a simple memcached key/value store substitution, I think _source is not
required.

To your question: in general, with "index:yes", setting "store:true" uses
more space in the index than "store:false" and therefore slows down the
get request/response. This holds for _source and for other fields. But with
"index:no" you have no choice; otherwise you won't be able to return the
data.

Look at doc values (uninverted field data) and check whether they fit your
key/value store purpose better than stored fields.

Because doc values can reside in the file system cache outside of the JVM
heap, I assume heavy load tests will reveal they are faster than stored
fields, which always have to be moved over the heap (which includes GC
overhead).

Jörg

Hi, Jörg,

"doc_values" is an awesome feature even for my normal uses! I have lots of
numeric data that does not need a string analyzer. I can move this data to
be managed by the OS file system.

For my memcached case, I set "_source": { "enabled": false } on the whole
index and enable "doc_values" for all the fields. Note that "store" needs to
be true for the field data to actually be stored.

"store": true,
"fielddata": {
  "format": "doc_values"
}

Things are working fine and look very fast. My remaining question is: how
can I verify the field data is needed on disk? I cannot find many clues in
the index stats or in Marvel.

Thanks a lot!
Jingzhao

Can you rephrase your question, I do not understand....

Jörg

Sorry about that.

I meant to ask: with "doc_values" set to true, how can I verify that the
field data is kept in the file system cache instead of the JVM heap?

Thanks,
Jingzhao

Doc values are in the .dvd files. These are loaded by Lucene on demand, and
the OS manages the file system cache for them.

The only method I know is to study the process map of the ES process (pmap,
lsof) to see whether .dvd files are in use and how many memory pages they
use.
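
On Linux, one way to do this by script is to read /proc/<pid>/maps for the ES process and add up the address ranges that belong to .dvd files. The Python below is a sketch under a few assumptions: the index files are memory-mapped at all (which depends on the store type in use), the sample paths are made up, and the result is mapped address space rather than pages actually resident in memory (the Rss lines in /proc/<pid>/smaps would be needed for that).

```python
def mapped_dvd_bytes(maps_text):
    """Sum the mapped address-range size per .dvd file from /proc/<pid>/maps text."""
    totals = {}
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # anonymous mapping without a pathname
        path = parts[5].strip()
        if not path.endswith(".dvd"):
            continue
        start, end = (int(x, 16) for x in parts[0].split("-"))
        totals[path] = totals.get(path, 0) + (end - start)
    return totals


# Hypothetical sample in the /proc/<pid>/maps format; real paths will differ.
sample = """\
7f1c00000000-7f1c00200000 r--s 00000000 08:01 131 /data/nodes/0/indices/keyvaluestore/0/index/_3_Lucene49_0.dvd
7f1c00200000-7f1c00201000 r--s 00000000 08:01 132 /data/nodes/0/indices/keyvaluestore/0/index/_3.fdt
7f1c00300000-7f1c00304000 rw-p 00000000 00:00 0
7f1c00400000-7f1c00500000 r--s 00200000 08:01 131 /data/nodes/0/indices/keyvaluestore/0/index/_3_Lucene49_0.dvd
"""
totals = mapped_dvd_bytes(sample)
```

For a live process you would pass the contents of /proc/<pid>/maps instead of `sample`.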

Jörg

I am on Windows 8, so I used VMMap to dump the process memory map. I found
that a lot of .dvd files are loaded by Lucene.

Almost every index has a few .dvd files in use. I am unable to investigate
further due to my limited knowledge. I hope that later on, tools like Marvel
can reveal such data in a better way, since this looks like a very important
feature.

I appreciate your help!
Jingzhao
