Realtime search + fast indexing


(Nico Krijnen) #1

Hi,

We have recently migrated our application from 'bare Lucene + Zoie for
realtime search' to Elasticsearch. Elasticsearch is awesome and, besides
scalability, it gives us lots of additional features. The one thing we
really miss though is realtime search.

Search is the core of our application. All our data is stored in the index
(primary data store). When a user adds a file or makes a change, their
subsequent search must reflect that change. With Zoie, the data was indexed
very quickly into a temporary Lucene memory index. Not having to write it
to disk and read it back makes the documents available for search much
faster than with NRT Lucene. The memory index is flushed to disk
asynchronously from time to time, without impacting indexing or search
performance. Zoie also allows you to
wait for a specific 'version of the index' to be available for searching.
That way we could make the user's thread wait until their data was indexed
in memory, only pausing the thread of that user without having any
performance impact for all the other users.
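
A minimal sketch of the "wait for a specific version" idea described above, in Python. This is a toy model under stated assumptions, not Zoie's API: the class `VersionedMemoryIndex` and its methods `index`, `wait_for_version` and `search` are hypothetical names, and the "index" is just a dictionary held in memory.

```python
import threading


class VersionedMemoryIndex:
    """Toy in-memory index: every indexed document bumps a version counter."""

    def __init__(self):
        self._docs = {}      # doc_id -> text, searchable as soon as it is stored
        self._version = 0    # version of the newest searchable document
        self._cond = threading.Condition()

    def index(self, doc_id, text):
        """Store a document and return the version at which it is searchable."""
        with self._cond:
            self._docs[doc_id] = text
            self._version += 1
            self._cond.notify_all()  # wake threads waiting for a version
            return self._version

    def wait_for_version(self, version, timeout=None):
        """Block only the calling thread until `version` is searchable.

        Returns True if the version became available, False on timeout.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self._version >= version, timeout)

    def search(self, term):
        """Return the ids of all documents containing `term` as a word."""
        with self._cond:
            return sorted(d for d, t in self._docs.items() if term in t.split())


if __name__ == "__main__":
    idx = VersionedMemoryIndex()
    # Another thread (hypothetical background indexer) adds the document.
    worker = threading.Timer(0.05, idx.index, args=("file-1", "quarterly report"))
    worker.start()
    # The user's thread waits for version 1; other threads are not paused.
    idx.wait_for_version(1, timeout=5)
    print(idx.search("report"))
    worker.join()
```

Only the thread that calls `wait_for_version` blocks, which is the property that matters here: one user waits for their own change while other users keep searching.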

Result: realtime search and insanely fast indexing.

With Elasticsearch we have to do a refresh to make data available for
search. Lots of refreshes, or the 1 second refresh interval, cause
significantly slower indexing. We don't know beforehand when our users
will import documents or make lots of changes, so we cannot really increase
the refresh interval when needed to make indexing faster. We know that
'get' is realtime and we make use of that as much as possible, but in lots
of cases we really require a search to find the data.
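
The difference between a realtime 'get' and a refresh-gated search can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Elasticsearch code: `ToyShard`, its internal buffer, `refresh_count` and the helper `index_all` are hypothetical names made up for the illustration.

```python
class ToyShard:
    """Toy model of one shard: 'get' is realtime, search needs a refresh."""

    def __init__(self):
        self._buffer = {}       # indexed but not yet refreshed
        self._searchable = {}   # visible to search after a refresh
        self.refresh_count = 0  # how many refreshes were performed

    def index(self, doc_id, text):
        self._buffer[doc_id] = text

    def get(self, doc_id):
        # Realtime: look at unrefreshed documents first.
        if doc_id in self._buffer:
            return self._buffer[doc_id]
        return self._searchable.get(doc_id)

    def refresh(self):
        # Make everything indexed so far visible to search.
        self._searchable.update(self._buffer)
        self._buffer.clear()
        self.refresh_count += 1

    def search(self, term):
        return sorted(d for d, t in self._searchable.items() if term in t.split())


def index_all(shard, docs, refresh_each):
    """Index all docs, refreshing after every doc or once at the end."""
    for doc_id, text in docs:
        shard.index(doc_id, text)
        if refresh_each:
            shard.refresh()
    if not refresh_each:
        shard.refresh()
    return shard.refresh_count
```

In this model, refreshing after every document gives immediate search visibility at the cost of one refresh per document, while a single refresh at the end is cheap but leaves documents invisible to search until it runs. That is the tradeoff described above.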

Our plan is to implement some mechanism in Elasticsearch to get the same
realtime search + fast indexing behavior that we had with Zoie. We need
some pointers though on what would be the best place in Elasticsearch to
do something like this. After all, it hooks into low-level Elasticsearch
and Lucene stuff.

I can imagine that 'realtime search while indexing' is important for many
other Elasticsearch users too. What are the chances of something like this
getting merged back into the main branch?

I'm planning to be at the Friday drinks tomorrow in Amsterdam. Is there
anyone attending with whom I could spar on this matter?

Thanks,
Nico

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0ed50d5f-4ade-4d56-af06-6e2c26feff9b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2

Zoie is not for distributed search. If you want to study LinkedIn's
developments in this area with Lucene, you should look at Sensei.

There was also a BalancedSegmentMergePolicy donated to Lucene 2.x from the
Zoie project

https://issues.apache.org/jira/browse/LUCENE-1924

but there was not enough energy for maintaining it. Now Lucene is at
version 4, with vast improvements in the area of segment merging.

You mention the in-memory segments for fast NRT. Lucene 4 has implemented
this by default, plus Elasticsearch has some more improvements for
distributed NRT get.

Note that not all searches are candidates for NRT. If you use mlockall and
index store type mmapfs, you can move almost all your ES/Lucene data and
files to RAM (if you can spend enough on hardware). Modifying data in the
index always means invalidating the fielddata cache and maybe the
filter/facet caches, and creating new cache generations, which is expensive
and destroys performance. There is a tradeoff; the balancing must be done
very carefully to avoid stale results. This is hard when not much is known
about the typical search workload of an application. ES allows you to cache
filters and to clear caches explicitly. Maybe this is an area to experiment
with. But it always depends.
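
Why frequent index changes hurt cached searches can be sketched with a toy cache that starts a new generation whenever the index changes. This is an illustration under simplifying assumptions, not how the ES caches are implemented: `GenerationCache` and its methods are hypothetical names, and the whole cache is dropped on every change.

```python
class GenerationCache:
    """Toy cache that is emptied whenever the index changes."""

    def __init__(self):
        self.generation = 0
        self._entries = {}  # key -> cached value for the current generation
        self.hits = 0
        self.misses = 0

    def new_generation(self):
        """Call after the index changed: cached entries are no longer valid."""
        self.generation += 1
        self._entries.clear()

    def get(self, key, compute):
        """Return the cached value for `key`, computing it on a miss."""
        if key in self._entries:
            self.hits += 1
        else:
            self.misses += 1
            self._entries[key] = compute()  # the expensive part
        return self._entries[key]


if __name__ == "__main__":
    cache = GenerationCache()
    for _ in range(3):
        cache.get("status:published", lambda: {"doc-1", "doc-2"})
    cache.new_generation()  # index modified, cache is cold again
    cache.get("status:published", lambda: {"doc-1", "doc-2", "doc-3"})
    print(cache.hits, cache.misses)
```

The more often `new_generation` is called, the more often the expensive `compute` step runs, so a workload with constant updates sees few cache hits.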

Jörg

On Thu, Jun 26, 2014 at 11:25 AM, Nico Krijnen nkr@woodwing.com wrote:



(Nico Krijnen) #3

> Zoie is not for distributed search.

We know, that's why we replaced our search layer with Elasticsearch. Zoie
and Sensei do not have as many users as Elasticsearch and as such have
much less traction, which made Elasticsearch an obvious choice for
handling our distributed search needs.

> You mention the in-memory segments for fast NRT. Lucene 4 has implemented
> this by default.

Nice. I'm reading up on the details about this. Do you know if these
in-memory segments are immediately being used for search? Or do the new
docs only become available after the segments are flushed to disk?

Last Friday I also heard about some of the performance improvements being
worked on for Elasticsearch 1.3 and 1.4. It sounds like steps are already
being taken to improve realtime search.

Nico

On Thursday, June 26, 2014 1:20:10 PM UTC+2, Jörg Prante wrote:



(Ivan Brusic) #4

GET requests use both the Lucene index and the transaction log to retrieve
documents. Search requests will use only Lucene since the inverted index is
not updated until the transaction log is flushed. I haven't paid too much
attention to the distributed aspects of the code in a while, but this
behavior was used prior to 1.0.

Cheers,

Ivan

On Mon, Jun 30, 2014 at 3:37 AM, Nico Krijnen nkr@woodwing.com wrote:



(Ivan Brusic) #5

Hit reply too soon. The new segments should be available for search, but
these new segments are not created until the transaction log is flushed.

Even LinkedIn moved on from Zoie. The SNA group had many great projects,
but none of them got any traction.

--
Ivan

On Tue, Jul 1, 2014 at 8:02 AM, Ivan Brusic ivan@brusic.com wrote:



(Adam Zell) #6

LinkedIn's unified search offering is described
at https://engineering.linkedin.com/search/did-you-mean-galene. Relevant
snippet:

"Our professional graph evolves in real time, and our search results have
to remain current with these changes. Lucene supports changes to entities
by deleting the existing version of the entity, and then adding the new
version. However, when only a single inverted index term changes in an
entity, we need to obtain all the other inverted index terms that map to
this entity in order to create the new version of the entity.
Unfortunately, we cannot obtain this information from Lucene. We therefore
built a system called the Search Content Store to maintain all inverted
index terms keyed by the entity. Live updates are sent to the Search
Content Store, which first updates itself, and then performs the
corresponding removal and addition operations on the Lucene index.

Lucene had (until recently) another limitation with live updates – the
changes to the index have to be committed before they are visible to
readers of the index. The commit process is an expensive operation and can
only be performed occasionally. To address this, we built (and open
sourced) Zoie – which maintains an in-memory copy of the uncommitted
portions of the index. This can be used for reading until the
corresponding data has been committed in the Lucene index."
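
The delete-then-add update described in the first quoted paragraph can be sketched as follows. This is a toy model under stated assumptions, not LinkedIn's Search Content Store: `ToyContentStore`, its methods, and the plain dictionary standing in for the Lucene index are all hypothetical.

```python
from collections import defaultdict


class ToyContentStore:
    """Toy store that keeps all index terms keyed by entity."""

    def __init__(self):
        self._terms_by_entity = {}         # entity_id -> set of terms
        self._postings = defaultdict(set)  # term -> entity ids (stand-in index)

    def put(self, entity_id, terms):
        """Store the full term set, then remove and re-add it in the index."""
        old_terms = self._terms_by_entity.get(entity_id, set())
        new_terms = set(terms)
        self._terms_by_entity[entity_id] = new_terms  # store updates itself first
        for term in old_terms:                        # removal of the old version
            self._postings[term].discard(entity_id)
        for term in new_terms:                        # addition of the new version
            self._postings[term].add(entity_id)

    def replace_term(self, entity_id, old, new):
        """Change a single term; the store supplies all the other terms."""
        terms = set(self._terms_by_entity[entity_id])
        terms.discard(old)
        terms.add(new)
        self.put(entity_id, terms)

    def search(self, term):
        return sorted(self._postings.get(term, ()))


if __name__ == "__main__":
    store = ToyContentStore()
    store.put("member-1", {"engineer", "amsterdam"})
    store.replace_term("member-1", "amsterdam", "berlin")
    print(store.search("berlin"), store.search("amsterdam"))
```

The point of keeping terms keyed by entity is visible in `replace_term`: the caller names only the one term that changed, and the store rebuilds the complete new version from what it already holds.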

On Tuesday, July 1, 2014 8:10:16 AM UTC-7, Ivan Brusic wrote:

Hit reply too soon. The new segments should be available for search, but
these new segments are not created until the transaction log is flushed.

Even LinkedIn moved on from Zoie. The SNA group had many great projects,
but none of them got any traction.

--
Ivan

On Tue, Jul 1, 2014 at 8:02 AM, Ivan Brusic <iv...@brusic.com
<javascript:>> wrote:

GET requests use both the Lucene index and the transaction log to
retrieve documents. Search requests will use only Lucene since the inverted
index is not updated until the transaction log is flushed. I haven't paid
too much attention to the distributed aspects of the code in a while, but
this behavior was used prior to 1.0.

Cheers,

Ivan

On Mon, Jun 30, 2014 at 3:37 AM, Nico Krijnen <n...@woodwing.com
<javascript:>> wrote:

Zoie is not for distributed search.

We know, that's why we replaced our search layer with Elastic Search.
Zoie and Sensei do not have as much users as Elastic Search and as such
have much less traction, which made Elastic Search an obvious choice for
handling our distributed search needs.

You mention the in-memory segments for fast NRT. Lucene 4 has
implemented this by default.

Nice. I'm reading up on the details about this. Do you know if these
in-memory segments are immediately being used for search? Or do the new
docs only become available after the segments are flushed to disk?

Last friday I also heard about some of the performance improvement being
worked at for ElasticSearch 1.3 and 1.4, sounds like steps are already
being taken to improve realtime search.

Nico

On Thursday, June 26, 2014 1:20:10 PM UTC+2, Jörg Prante wrote:

Zoie is not for distributed search. If you want to analyze the LinkedIn
developments for this area with Lucene, you should look at Sensei

There was also a BalancedSegmentMergePolicy donated to Lucene 2.x from
the Zoie project

https://issues.apache.org/jira/browse/LUCENE-1924

but there was not enough energy for maintaining it. Now Lucene is at
version 4, with vast improvements in the area of segment merging.

You mention the in-memory segments for fast NRT. Lucene 4 has
implemented this by default, plus Elasticsearch has some more improvements
for distributed NRT get.

Note, not all searches can be candidates for NRT. If you use mlockall
and index store type mmapfs, you can move almost all your ES/Lucene data
and files to RAM (if you can spend enough hardware). Modifying data in the
index always means to invalidate fielddata cache and maybe filter/facet
caches, and creation of new cache generations, which is expensive and
destroys performance. There is a tradeoff, balancing must be done very
carefully to avoid stale results. This is hard when not much is known about
the typical search workload of an application. ES allows to cache filters
and to clear caches explicitly. Maybe this is an area to experiment with.
But it always depends.

Jörg

On Thu, Jun 26, 2014 at 11:25 AM, Nico Krijnen n...@woodwing.com
wrote:





#7

One thing you can consider is calling refresh() after indexing, which I
think has the effect you are looking for.
There are probably some performance considerations that others here can
comment on better than I can.
In any case, calling refresh() is what we do.
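
As a rough illustration of the "refresh after indexing" approach, here is a
minimal sketch that calls the refresh endpoint of one index over HTTP. The
node address, the index name "files" and the helper names are assumptions
made for illustration only. Since the original post notes that frequent
refreshes slow down indexing, it is probably better to refresh once after a
batch of writes than after every single document.

```python
import json
import urllib.request

# Assumption: a node reachable on the default HTTP port.
ES_URL = "http://localhost:9200"


def build_refresh_request(index, base_url=ES_URL):
    """Build (but do not send) a POST to the index's _refresh endpoint."""
    url = f"{base_url}/{index}/_refresh"
    return urllib.request.Request(url, method="POST")


def refresh(index, base_url=ES_URL):
    """Send the refresh request and return the decoded JSON response."""
    request = build_refresh_request(index, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running cluster you would index the user's documents first and then
call refresh("files") before running the user's next search.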

On Thursday, 26 June 2014 10:25:12 UTC+1, Nico Krijnen wrote:




(system) #8