Migration from Solr to ElasticSearch


(Diegito) #1

Hello all,

I'm testing the ES environment to see if a migration from Solr could bring
benefits to our system. We are considering a complete renovation of our
service, taking it from Java to Python plus a lot of new enhancements.

Currently we use Solr for indexing purposes. We store webpages from
customers and index them using Solr. Within a Solr document we have a
dozen fields to keep track of the data. The data itself is indexed in
Solr in a *content* field, which is set (in the schema.xml) to
indexed="true" stored="false". So I can do a text search on it, but I
cannot retrieve the whole field (obviously).

The actual content is saved on our server and it is a massive 22TB of data.
You'll understand that we cannot reindex the whole thing just for testing
purposes. We're considering using a subset of it, but this is also time
consuming.

I was wondering whether there is any way to transfer the indexed but unstored
*content* field directly from Solr to Elasticsearch.

On another topic, when I shut down the ES engine and turn it on again, I
noticed that the documents are not all available at once; they take
time to load.
Is that expected behavior, or is there a way (a configuration option?) to
have all the documents available right away? For instance, if
I have to update the engine or add some more options, or for whatever reason
I need to shut the engine down and start it again, do I need to wait for
all the documents to be loaded into the system?
With Solr I see all of them available immediately after the search engine
has been launched.

Thank you,
Diego

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8c23e11d-74fd-48c0-98b0-4d75514a6a33%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Otis Gospodnetić) #2

Hi,

You could migrate from Solr to ES without reindexing because at the end of
the day it is Lucene that writes the data to the index.
You'd want to make sure your ES mappings match your Solr schema.
You'd want to create the matching number of shards and replicas you had in
Solr(Cloud?).
You'd manually copy Lucene indexes from Solr to ES and pray.
I'm sure I'm skipping over about a dozen details you can trip over, though.
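
A minimal sketch of what "make sure your ES mappings match your Solr schema" might look like in Python. Only the *content* field settings (indexed="true" stored="false") come from this thread; the schema fragment, the `id` field, the `text_general` type name, the `TYPE_MAP` table and the attribute defaults are assumptions for illustration. The output uses the 1.x-era mapping vocabulary (`string`, `analyzed`), which newer Elasticsearch versions have since replaced.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema.xml fragment. Only the "content" field settings
# (indexed="true" stored="false") are taken from the thread.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="content" type="text_general" indexed="true" stored="false"/>
  </fields>
</schema>
"""

# Assumed translation from Solr field types to ES 1.x-style (type, index) pairs.
TYPE_MAP = {
    "string": ("string", "not_analyzed"),
    "text_general": ("string", "analyzed"),
}


def solr_fields_to_es_properties(schema_xml):
    """Build an ES mapping 'properties' dict from the <field> elements of a Solr schema."""
    root = ET.fromstring(schema_xml)
    properties = {}
    for field in root.iter("field"):
        es_type, index_mode = TYPE_MAP[field.get("type")]
        # Assumption: a missing attribute is treated as indexed="true" / stored="false".
        indexed = field.get("indexed", "true") == "true"
        stored = field.get("stored", "false") == "true"
        properties[field.get("name")] = {
            "type": es_type,
            "index": index_mode if indexed else "no",
            "store": stored,
        }
    return properties


properties = solr_fields_to_es_properties(SCHEMA_XML)
```

Note that Elasticsearch also keeps the original JSON in `_source` by default, so `"store": False` alone would not reproduce Solr's unstored behavior; disabling `_source` in the mapping would be a separate decision.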

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Monday, June 2, 2014 3:33:38 PM UTC-4, Diego Marchi wrote:



(Diegito) #3

Hi, thanks for the answer!

Since they both share Lucene as the common underlying engine, this could be a
starting point. But are we sure that both engines store and structure
the information in the same way? If so, the port should be pretty
easy.

Can you suggest a guide or handbook on how the data
in ES and Solr is structured?

Thank you
Diego

On Monday, June 2, 2014 20:54:44 UTC-7, Otis Gospodnetic wrote:



(Jörg Prante) #4

If you have indexed the data in Solr, you should consider a tool that can
traverse the Lucene index and reconstruct the documents. This is not a
straightforward process, as you already know, because analyzed fields look
different from the original input. The reconstruction may not recover the
original input, but it could be used as input to Elasticsearch once
transformed to JSON. It depends heavily on the Solr analyzers you used.

As you know, an Elasticsearch index is sharded, so you have to
reindex the documents in order to take advantage of ES sharding.

What time intervals are you seeing at ES startup? When shutting
down ES, you should use the _shutdown endpoint for a clean shutdown. A
clean shutdown writes checksums to disk for fast startup. When starting
with valid checksums, ES is available within a few seconds and turns to
state "green". Otherwise it performs index recovery. After an unclean
shutdown, once all shards have responded (how long this takes depends on
shard sizes and disk I/O speed), an ES cluster usually starts within 30
seconds to 1 minute. It cannot start much faster after an unclean shutdown
because of the index recovery. The recovery, like indexing and search,
depends on the overall power of your ES cluster. There are tunables to
increase recovery speed, at the cost of search/index performance during
the recovery.
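
A rough Python sketch of the clean-shutdown and startup checks described above, assuming a test node at `http://localhost:9200` (an assumption, not from the thread). It only builds the HTTP requests; sending them is left to the caller. The `_shutdown` endpoint belongs to the Elasticsearch 1.x line current at the time of this thread and is not available in later major versions. `indices.recovery.max_bytes_per_sec` is shown as one example of a recovery tunable; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed address of a local test node


def shutdown_request(base_url=ES_URL):
    """POST to the _shutdown endpoint (ES 1.x era API) to stop the nodes cleanly."""
    return Request(base_url + "/_shutdown", data=b"", method="POST")


def wait_for_green_request(base_url=ES_URL, timeout="60s"):
    """GET cluster health, blocking until status is green or the timeout expires."""
    query = urlencode({"wait_for_status": "green", "timeout": timeout})
    return Request(base_url + "/_cluster/health?" + query, method="GET")


def recovery_throttle_request(max_bytes_per_sec, base_url=ES_URL):
    """PUT a transient cluster setting that raises or lowers the recovery throttle."""
    body = {"transient": {"indices.recovery.max_bytes_per_sec": max_bytes_per_sec}}
    return Request(
        base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Each request can be sent with `urllib.request.urlopen(request)`. After a restart, the health request returns once the cluster reports "green" or the timeout passes, which gives a simple way to measure how long recovery actually takes.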

Jörg



(Diegito) #5

Thank you Jörg,

I'll start with the second question: thanks! My problem was that I didn't
know about the _shutdown option, so I was simply killing the process
and thereby forcing the system to recover the indices.

As far as the migration from Solr to Elasticsearch is concerned, I
basically want the indexed/analyzed but unstored field to be transferred
from Solr to ES, so I can perform a full-text search on it.
So are there tools that allow me to copy the Lucene indexes over to
Elasticsearch and keep the same functionality?

To retrieve the actual document, I'll simply take the id and retrieve the
document from the storage. This is how the system was built and how
I have to test it: indexed but unstored fields are kept inside Solr, which
is queried for full-text searches. The actual documents are kept in a
separate filesystem. The query results are then used to retrieve the
actual documents from this filesystem.
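
The search-then-fetch flow described above could be sketched in Python roughly as follows. The `content` field name comes from this thread; the one-file-per-id layout, the `.html` extension, the storage root and the helper names are hypothetical. The response shape (`hits.hits[]._id`) is the standard Elasticsearch search response.

```python
import os


def search_body(text, size=10):
    """Full-text query on the 'content' field; only the ids are needed back."""
    return {"query": {"match": {"content": text}}, "_source": False, "size": size}


def hit_ids(response):
    """Extract the document ids from a parsed search response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def document_paths(ids, storage_root):
    """Hypothetical layout: one file per document id under storage_root."""
    return [os.path.join(storage_root, doc_id + ".html") for doc_id in ids]


# A hand-written response, shaped like what a search would return.
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "a1", "_score": 1.2},
            {"_id": "b7", "_score": 0.8},
        ],
    }
}

ids = hit_ids(sample_response)
paths = document_paths(ids, "/archive")
```

The search engine only has to return ids and scores here, which is why the field can stay unstored.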

If we decide to move to ES, then we could change the approach, store
everything inside ES, and reindex our full archive.

Thanks for the sharding advice. I realize I cannot use sharding with the
current configuration. The current Solr system has just 1 collection
with 1 core and 1 instance.

We are comparing the performance of ES and Solr multicore on a
distributed system (not cloud, but simply several instances with the load
balanced by a custom algorithm, to have more control over where
the data goes), and after this we'll decide where we should go.

Thanks

On Tuesday, June 3, 2014 09:55:21 UTC-7, Jörg Prante wrote:



(Jörg Prante) #6

If you can iterate over the Solr index doc ids and fetch the source docs
from a secondary storage, you should consider doing this first. This is the
most straightforward method for reindexing.
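
A minimal sketch of that straightforward route in Python, assuming the doc ids have already been listed from Solr and that a `load_document` callable (hypothetical) reads the original text from the secondary storage. It builds payloads in the newline-delimited format of the Elasticsearch `_bulk` API. The index name, the `page` type and the batch size are assumptions; the `_type` entry reflects the 1.x-era API and is not used by recent versions.

```python
import json


def bulk_lines(doc_ids, load_document, index_name, doc_type="page"):
    """Yield the action line and the source line for each document id."""
    for doc_id in doc_ids:
        action = {"index": {"_index": index_name, "_type": doc_type, "_id": doc_id}}
        source = {"content": load_document(doc_id)}
        yield json.dumps(action, sort_keys=True)
        yield json.dumps(source, sort_keys=True)


def bulk_payload(lines):
    """Join the lines into a _bulk request body; the trailing newline is required."""
    return "\n".join(lines) + "\n"


def batches(items, size):
    """Split an iterable of ids into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Stand-in for the real secondary storage.
fake_storage = {"a1": "first page text", "b7": "second page text"}
payload = bulk_payload(bulk_lines(["a1", "b7"], fake_storage.__getitem__, "webpages"))
```

Each payload would be sent with a POST to the `_bulk` endpoint. Working in batches keeps memory bounded and allows a small subset to be tested before committing to the full 22TB.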

Otherwise, if you cannot access the filesystem storage for the docs (for
whatever reason), the idea would be to create a more complex tool, maybe
with the help of https://github.com/DmitryKey/luke/. The Luke code should be
useful for document reconstruction, but I am not aware of any code for
reindexing the results into Elasticsearch. Such a reconstructor should also
take the Solr schema as input. But as I said, such a tool depends heavily on
the Solr analyzers, so you must first evaluate whether the Solr index is
usable at all for reindexing.

Jörg

On Tue, Jun 3, 2014 at 7:24 PM, Diego Marchi diego.marchi.ud@gmail.com
wrote:


