Converting schema.xml from solr to ES


(Bernd Fehling) #1

Is there a guide for converting a schema.xml from solr to ES?

e.g. in solr I have a fieldType of class solr.TextField with
positionIncrementGap of 100.
How is the setting for this in ES?

How to set the precisionStep for date, long or float?

Regards


(Otis Gospodnetić) #2

Have a look at SearchSchemer - https://github.com/sematext/SearchSchemer - a
multi-directional schema converter for Solr, ElasticSearch, and Sensei.
If something is not supported, patches/pull requests are very welcome, of
course!

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html


(simonw-2) #3

On Wednesday, July 25, 2012 3:14:53 PM UTC+2, Bernd Fehling wrote:

Is there a guide for converting a schema.xml from solr to ES?

e.g. in solr I have a fieldType of class solr.TextField with
positionIncrementGap of 100.
How is the setting for this in ES?

use "position_offset_gap" : 100 in your custom analyzer configuration
see this for details:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/custom-analyzer.html

How to set the precisionStep for date, long or float?

you can set a "precision_step" : 666 in your mapping
see: http://www.elasticsearch.org/guide/reference/mapping/core-types.html
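A minimal sketch of how the two settings above might fit together, written as Python dictionaries that serialize to the JSON bodies ES expects. The analyzer name "text_general", the type name "doc" and the field names "body" and "published" are made up for illustration, and the precision_step value of 8 is only a placeholder; position_offset_gap and precision_step are the settings named above.

```python
import json

# Index settings: a custom analyzer carrying the position gap
# (the counterpart of positionIncrementGap="100" in Solr).
# "text_general" is a hypothetical analyzer name.
index_settings = {
    "analysis": {
        "analyzer": {
            "text_general": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase"],
                "position_offset_gap": 100,
            }
        }
    }
}

# Mapping: precision_step is set on the numeric or date field itself.
# "body" and "published" are hypothetical field names.
mapping = {
    "properties": {
        "body": {"type": "string", "analyzer": "text_general"},
        "published": {"type": "date", "precision_step": 8},
    }
}

# "doc" is a hypothetical type name.
print(json.dumps({"settings": index_settings, "mappings": {"doc": mapping}}, indent=2))
```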

simon


(Bernd Fehling) #4

Hi Otis,

thanks for the tool.
After getting it into Eclipse, fiddling with the POM configuration and finally
packaging a jar with dependencies, I got my first conversion.
And surprise: there is no TextField. ES has only "string".
So why does Solr have TextField and "string" where ES can handle everything
just with "string"?

Maybe ES can't handle paragraphs and sentences because everything is just
one "string"?


(Bernd Fehling) #5

Hi Simon,

it looks like the documentation is not up to date, at least I can't find
anything about "position_offset_gap".

Ah OK, Google says it was introduced with issue #1812.


(Clinton Gormley) #6

On Thu, 2012-07-26 at 03:28 -0700, Bernd Fehling wrote:

Hi Otis,

thanks for the tool.
After getting it into Eclipse, fiddling with the POM configuration and
finally packaging a jar with dependencies, I got my first conversion.
And surprise: there is no TextField. ES has only "string".
So why does Solr have TextField and "string" where ES can handle
everything just with "string"?

Maybe ES can't handle paragraphs and sentences because everything is
just one "string"?

In ES you would use:

{ type: "string", index: "analyzed" } # default
{ type: "string", index: "not_analyzed" }
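A small sketch of how the two variants above could be produced when converting a Solr schema, assuming solr.TextField fields become analyzed strings and solr.StrField fields become not_analyzed ones. The function name solr_field_to_es and its optional analyzer arguments are illustrative only and not part of any tool mentioned in this thread.

```python
def solr_field_to_es(solr_class, index_analyzer=None, search_analyzer=None):
    """Translate a Solr fieldType class into an ES "string" mapping (sketch)."""
    if solr_class == "solr.StrField":
        # Exact-match field: stored as a single, unanalyzed term.
        return {"type": "string", "index": "not_analyzed"}
    if solr_class == "solr.TextField":
        # Full-text field: run through an analyzer (the ES default).
        mapping = {"type": "string", "index": "analyzed"}
        if index_analyzer:
            mapping["index_analyzer"] = index_analyzer
        if search_analyzer:
            mapping["search_analyzer"] = search_analyzer
        return mapping
    raise ValueError("no rule for Solr class: %s" % solr_class)


print(solr_field_to_es("solr.StrField"))
print(solr_field_to_es("solr.TextField", index_analyzer="standard"))
```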

clint


(Jörg Prante) #7

Hi Bernd,

solr.StrField corresponds to { "type": "string", "index": "not_analyzed"
} in Elasticsearch, whereas solr.TextField corresponds to { "type": "string",
"index" : "analyzed", "index_analyzer" : "...", "search_analyzer" : "..." }

Best regards,

Jörg


(Bernd Fehling) #8

Hi Jörg, how is it going?

I already figured out how to convert my schema.xml from Solr to
Elasticsearch. It still needs some manual work.
I've reached 210 GB index size and am now looking into splitting the index,
which has to be done before it reaches 250 GB.
The question now is: "Solr Cloud" or Elasticsearch?
Any suggestions?

Biggest problem so far: ES can only load JSON, unbelievable!!!
Just for testing ES I have to either write an XML2JSON river or convert my
test data to JSON.

I don't know, maybe I will contact you by phone and we can discuss this
misery.

Regards,
Bernd


(simonw-2) #9

hey Bernd,

On Friday, July 27, 2012 8:51:02 PM UTC+2, Bernd Fehling wrote:

Hi Jörg, how is it going?

I already figured out how to convert my schema.xml from Solr to
Elasticsearch. It still needs some manual work.
I've reached 210 GB index size and am now looking into splitting the index,
which has to be done before it reaches 250 GB.
The question now is: "Solr Cloud" or Elasticsearch?
Any suggestions?

If you are going into a distributed environment, I highly recommend
ES. You can simply create an index with N shards and ES will
distribute them over your machines running an ES daemon with the same cluster
name. Searching a single shard is no different from searching N shards
from an API perspective, so nothing needs to change along those lines. I'd
highly recommend ES over Solr Cloud at this stage, given the maturity of
distributed search in ES compared to Solr Cloud.
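As a rough illustration of "create an index with N shards", the sketch below builds the JSON body that would be sent once when the index is created. The function name and the index name "myindex" are hypothetical; number_of_shards and number_of_replicas are the index settings ES uses for this.

```python
import json


def create_index_body(shards, replicas=1):
    """Build the JSON body for creating an index with a fixed shard count."""
    if shards < 1:
        raise ValueError("an index needs at least one shard")
    return json.dumps({
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    })


# The body would be sent at index creation, e.g. PUT /myindex
body = create_index_body(shards=5)
print(body)
```

Search requests look the same whether the index has one shard or five, which is the point made above.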

Biggest problem so far: ES can only load JSON, unbelievable!!!
Just for testing ES I have to either write an XML2JSON river or convert my
test data to JSON.

This is a major design decision. You will never fit all needs: one user
needs XML, the next needs YAML or Python objects, etc. From that perspective,
treating everybody equally and offering one communication format is a
straightforward and reasonable design decision. If your XML has the right format
(logically), converting it to ES JSON is very straightforward
in almost any language. If not, you'd need to transform it anyway, no? If you
are in Java, here is a very simple three-liner to convert XML to
JSON: https://github.com/tobrien/sample-json-parsing
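For readers not on Java, here is a minimal sketch of the same idea in Python, assuming the test data uses the usual Solr update format with <doc> and <field name="..."> elements. The function name and the sample record are invented for illustration, and this is not the code behind the link above.

```python
import json
import xml.etree.ElementTree as ET


def solr_doc_to_json(doc_xml):
    """Convert one Solr <doc> element (given as a string) into a JSON document."""
    doc = ET.fromstring(doc_xml)
    result = {}
    for field in doc.findall("field"):
        name = field.get("name")
        value = field.text or ""
        if name in result:
            # A repeated field name becomes a JSON array (multi-valued field).
            if not isinstance(result[name], list):
                result[name] = [result[name]]
            result[name].append(value)
        else:
            result[name] = value
    return json.dumps(result, ensure_ascii=False, sort_keys=True)


sample = (
    '<doc><field name="id">42</field>'
    '<field name="lang">de</field><field name="lang">en</field></doc>'
)
print(solr_doc_to_json(sample))
```

All values stay strings here; converting numbers or dates would depend on the schema.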

I don't know, maybe I will contact you by phone and we can discuss this
misery.

maybe you can share your problems here so everybody can benefit from your
problems and their solutions.

simon


(David Pilato) #10

I strongly agree with Simon.
Using JSON was the best design decision for ES.

BTW, if you use Jackson, you can easily serialize and deserialize Java beans to and from XML and JSON.

Let me troll a little: I don't like XML! I don't like constraints :wink:


(Bernd Fehling) #11

Hi Simon,

On Saturday, July 28, 2012 11:23:45 UTC+2, simonw wrote:

hey Bernd,

If you are going into a distributed environment, I highly recommend
ES. You can simply create an index with N shards and ES will
distribute them over your machines running an ES daemon with the same cluster
name. Searching a single shard is no different from searching N shards
from an API perspective, so nothing needs to change along those lines. I'd
highly recommend ES over Solr Cloud at this stage, given the maturity of
distributed search in ES compared to Solr Cloud.

Thanks for your advice. The first tests impressed me very much, as did the
work done by other users, e.g. bigdesk and ES-head.
The easy installation of plugins is great too, and there is much more.

Biggest problem so far: ES can only load JSON, unbelievable!!!
Just for testing ES I have to either write an XML2JSON river or convert my
test data to JSON.

This is a major design decision. You will never fit all needs: one user
needs XML, the next needs YAML or Python objects, etc. From that perspective,
treating everybody equally and offering one communication format is a
straightforward and reasonable design decision. If your XML has the right format
(logically), converting it to ES JSON is very straightforward
in almost any language. If not, you'd need to transform it anyway, no? If you
are in Java, here is a very simple three-liner to convert XML to JSON:
https://github.com/tobrien/sample-json-parsing

I will look into this.
We currently use XML for historical reasons: we started years ago with FAST
Search and switched to Solr two years ago. This needed a lot of work,
especially for some functionality which was already in FAST but not in
Solr. FAST was Python, so I had to rewrite several modules in Java for
Solr.
Nevertheless, if we switch to ES there will again be a lot of work, even
though ES and Solr are both based on Lucene.

I don't know, maybe I will contact you by phone and we can discuss this
misery.

maybe you can share your problems here so everybody can benefit from your
problems and their solutions.

Share my problems? Here we go with the biggest ones:

  • FAST Search has a filetraverser which can easily add/modify/delete
    documents from the file system to the index. This has been partially
    rebuilt using DIH for Solr. So there is no DIH or anything similar in ES.
    What do you suggest?

  • We had a multilingual thesaurus via a synonym dictionary hooked up to
    query processing with FAST. This was rewritten and ported to Solr but
    needed a lot of tricks to get it working as it did with FAST. The problem
    is handling multilingual multiword synonyms during query processing.
    I have to look into the ES sources to see what is possible. Is ES as
    flexible as Solr in the area of query components?

Bernd


(David Pilato) #12

Hi Bernd,

There's a Filesystem river that can help with this:

FAST Search has a filetraverser which can easily add/modify/delete documents from the file system to the index. This has been partially rebuilt using DIH for Solr. So there is no DIH or anything similar in ES.

http://www.pilato.fr/fsriver/

HTH
David


(Jörg Prante) #13

Hi Bernd,

nice to see you are trying Elasticsearch in Bielefeld!

Since here in Cologne we also moved from FAST to Elasticsearch I hope I can
give some hints.

On Friday, July 27, 2012 8:51:02 PM UTC+2, Bernd Fehling wrote:

Hi Jörg, how is it going?

I already figured out how to convert my schema.xml from Solr to
Elasticsearch. It still needs some manual work.
I've reached 210 GB index size and am now looking into splitting the index,
which has to be done before it reaches 250 GB.
The question now is: "Solr Cloud" or Elasticsearch?

Elasticsearch is easy to install and full of nice cloud features. I have
never regretted the decision I made in early 2010, when Elasticsearch was at
0.5.1.

From what I have read, SolrCloud is making progress but still needs much
love in configuration and administration. We all look forward to
Lucene/Solr 4!

Any suggestions?

Biggest problem so far: ES can only load JSON, unbelievable!!!
Just for testing ES I have to either write an XML2JSON river or convert my
test data to JSON.

I don't know, maybe I will contact you by phone and we can discuss this
misery.

Would indeed be nice, yes! Maybe that's an opportunity for a
Bielefeld/Cologne search technology meeting again?

XML input was never a real problem here, since I always used an abstraction
layer to process bibliographic data, even back in FAST ESP times (it is
based on a resource/property model, very close to RDF, but without SPARQL,
I'm not using Jena or OpenRDF).

An XML river would be an idea! But as XML is just a syntax for "data in a
container format", such a river is mostly useless without the feature of
custom processing extensions for the data (similar to the XML pipeline
processing in FAST). Maybe by scripting XML to JSON? Do you have preference
for a JVM scripting language? Groovy would be a straightforward option,
since I am integrating Groovy scripts into my MAB/MARC converter.

Best regards,

Jörg


(Jörg Prante) #14

(continued)

On Saturday, July 28, 2012 2:25:50 PM UTC+2, Bernd Fehling wrote:

We currently use XML for historical reasons: we started years ago with FAST
Search and switched to Solr two years ago. This needed a lot of work,
especially for some functionality which was already in FAST but not in
Solr. FAST was Python, so I had to rewrite several modules in Java for
Solr.
Nevertheless, if we switch to ES there will again be a lot of work, even
though ES and Solr are both based on Lucene.

How do you instrument Solr, maybe with solr4j? There is a tool that tries
to emulate solr4j for Elasticsearch:

Share my problems? here we go with the biggest ones:

  • FAST Search has a filetraverser which can easily add/modify/delete
    documents from the file system to the index. This has been partially
    rebuilt using DIH for Solr. So there is no DIH or anything similar in ES.
    What do you suggest?

David's Filesystem river harvests documents on disk and indexes
them with the help of the mapper attachment plugin, using Tika.

For input documents of highly structured metadata (like we use in our
applications for libraries), additional processing is required before the
docs can be passed on to the index.

  • We had a multilingual thesaurus via a synonym dictionary hooked up to
    query processing with FAST. This was rewritten and ported to Solr but
    needed a lot of tricks to get it working as it did with FAST. The problem
    is handling multilingual multiword synonyms during query processing.
    I have to look into the ES sources to see what is possible. Is ES as
    flexible as Solr in the area of query components?

ES can use synonyms from text files,
see http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html

With the multitude of possible settings for analyzers, ES is very flexible.
There are Lucene/Solr features that may not be implemented, but moving them
to ES is usually not hard.

For multilingual synonyms, ES would need to know the "intended language the
query was given in", mostly the user interface language. If the index is
organized into many fields that correspond to the user interface languages,
a single analyzer can be wired to each language, for example an analyzer
using a language-specific synonym file. This approach is not my favorite
one since it does not scale, it needs work each time a new language is
added. With a single field for multiple languages, there are some other
tricks to think of (combo analyzer), but the synonym list will not have the
context of the language given.
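To make the per-language variant concrete, here is a hedged sketch that generates one synonym filter and one analyzer per language. The analyzer and filter names and the file layout "analysis/synonyms_<lang>.txt" are assumptions for illustration; "synonym" and synonyms_path are the filter type and option described in the synonym token filter documentation linked above.

```python
import json


def synonym_analysis_settings(languages, synonyms_dir="analysis"):
    """Build one synonym filter and one custom analyzer per language (sketch)."""
    filters, analyzers = {}, {}
    for lang in languages:
        filter_name = "synonyms_%s" % lang
        filters[filter_name] = {
            "type": "synonym",
            # Hypothetical file layout: one synonym file per language.
            "synonyms_path": "%s/synonyms_%s.txt" % (synonyms_dir, lang),
        }
        analyzers["text_%s" % lang] = {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", filter_name],
        }
    return {"analysis": {"filter": filters, "analyzer": analyzers}}


settings = synonym_analysis_settings(["de", "en", "fr"])
print(json.dumps(settings, indent=2, sort_keys=True))
```

The scaling concern is visible in the loop: every new language means another synonym file, another filter, another analyzer and another field wired to it.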

Best regards,

Jörg


(Bernd Fehling) #15

Hi David,

thanks, I will check that out.

Bernd


(Bernd Fehling) #16

Hi Jörg,

On Saturday, July 28, 2012 16:22:33 UTC+2, Jörg Prante wrote:

...
An XML river would be an idea! But as XML is just a syntax for "data in a
container format", such a river is mostly useless without the feature of
custom processing extensions for the data (similar to the XML pipeline
processing in FAST). Maybe by scripting XML to JSON? Do you have preference
for a JVM scripting language? Groovy would be a straightforward option,
since I am integrating Groovy scripts into my MAB/MARC converter.

I never looked too deeply into JSON, I just used it somehow.
XML has the advantage that it can be validated before or while loading,
especially if you work with full Unicode via UTF-8.
This also means Unicode above the Basic Multilingual Plane.
Is this also covered by JSON?

My idea of an XML river is:

  • taking XML records from the file system
  • validating them
  • reporting invalid records and dropping them from the queue
  • packaging records into batches of size X
  • sending the batches to the index (in parallel, if ES supports this)
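The steps above can be sketched roughly as follows. This is not an actual river plugin: it only checks well-formedness rather than validating against a schema, it takes records as strings instead of walking the file system, and the index name "test" and type name "record" are placeholders. The batch is formatted the way the bulk API expects it, one action line followed by one source line per document.

```python
import json
import xml.etree.ElementTree as ET


def parse_records(raw_records):
    """Yield (doc, None) for well-formed records and (None, error) otherwise."""
    for raw in raw_records:
        try:
            root = ET.fromstring(raw)
        except ET.ParseError as exc:
            yield None, "%s: %r" % (exc, raw[:40])
            continue
        yield {child.tag: (child.text or "") for child in root}, None


def batches(items, size):
    """Group items into lists of at most `size` elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(docs, index, doc_type):
    """Format a batch as a newline-delimited bulk request body."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def run(raw_records, size):
    """Return (bulk bodies for valid records, error reports for invalid ones)."""
    valid, invalid = [], []
    for doc, error in parse_records(raw_records):
        if error is None:
            valid.append(doc)
        else:
            invalid.append(error)
    return [bulk_body(b, "test", "record") for b in batches(valid, size)], invalid
```

Calling run(records, 500) would give the bodies to send and a list of rejected records to report.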

Can ES handle indexing from multiple threads?

Regards,
Bernd


(Bernd Fehling) #17

On Saturday, July 28, 2012 16:50:08 UTC+2, Jörg Prante wrote:

(continued)
For multilingual synonyms, ES would need to know the "intended language
the query was given in", mostly the user interface language. If the index
is organized into many fields that correspond to the user interface
languages, a single analyzer can be wired to each language, for example an
analyzer using a language-specific synonym file. This approach is not my
favorite one since it does not scale, it needs work each time a new
language is added. With a single field for multiple languages, there are
some other tricks to think of (combo analyzer), but the synonym list will
not have the context of the language given.

I disagree, you are thinking too complex :slight_smile:
I've developed it for our FAST installation and ported it to Solr. Within
European languages you don't need to know the language; you only need it if
you add Asian, Arabic, ...
So there is just one single field for the special multilingual thesaurus
treatment. This is basically just a query-side synonym expansion which gets
its multilingual functionality from the multilingual content of the field and
the multilingual thesaurus. The special part of the solution is that it
handles multiword terms.
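To show what query-side multiword expansion can look like in principle, here is a toy sketch. It is not the FAST or Solr implementation described above; the function name, the longest-match-first strategy and the thesaurus entries are all invented for illustration.

```python
def expand_query(query, thesaurus, max_words=3):
    """Expand a query with synonyms, trying the longest phrases first."""
    words = query.lower().split()
    expanded, i = [], 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in thesaurus:
                # Keep the original phrase and add all its synonyms.
                expanded.append([phrase] + thesaurus[phrase])
                i += n
                break
        else:
            # No thesaurus entry starts here; keep the word unchanged.
            expanded.append([words[i]])
            i += 1
    return expanded


# Hypothetical multilingual entries in a single thesaurus.
thesaurus = {
    "climate change": ["klimawandel", "changement climatique"],
    "policy": ["politik", "politique"],
}
result = expand_query("Climate change policy report", thesaurus)
print(result)
```

Each inner list is a group of alternatives that a query builder could combine with OR, while the groups themselves keep the original query order.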

Bernd


(simonw-2) #18

On Sunday, July 29, 2012 11:51:02 AM UTC+2, Bernd Fehling wrote:

Hi Jörg,

On Saturday, July 28, 2012 16:22:33 UTC+2, Jörg Prante wrote:

...
An XML river would be an idea! But as XML is just a syntax for "data in a
container format", such a river is mostly useless without the feature of
custom processing extensions for the data (similar to the XML pipeline
processing in FAST). Maybe by scripting XML to JSON? Do you have preference
for a JVM scripting language? Groovy would be a straightforward option,
since I am integrating Groovy scripts into my MAB/MARC converter.

I never looked too deeply into JSON, I just used it somehow.
XML has the advantage that it can be validated before or while loading,
especially if you work with full Unicode via UTF-8.
This also means Unicode above the Basic Multilingual Plane.

If you are using Java, you have been able to encode non-BMP characters since
Java 1.5. Yet this has nothing to do with XML or JSON. JSON is recommended to
be UTF-8, and if you decide so, you just pass the right character encoding to
your JSON generator. The validation you refer to with XML is implicit in JSON
for the types. JSON encodes numbers, booleans and character sequences
explicitly (binary data has to travel as an encoded string, e.g. Base64), and
your reading code should validate your JSON document. No need for a schema or
something like that (there is such a thing, but I am not sure it is used
much).
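A quick check of the non-BMP point, sketched in Python 3 rather than Java: a character above the Basic Multilingual Plane survives a JSON round trip both as an ASCII escape and as raw UTF-8. The sample character and the key name "symbol" are arbitrary choices.

```python
import json

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP

# ASCII-only output: the character is written as a \u surrogate-pair escape.
escaped = json.dumps({"symbol": clef})

# ensure_ascii=False: the character is emitted as-is, then encoded as UTF-8.
raw = json.dumps({"symbol": clef}, ensure_ascii=False).encode("utf-8")

# Both forms decode back to the same single character.
assert json.loads(escaped)["symbol"] == clef
assert json.loads(raw.decode("utf-8"))["symbol"] == clef

# In UTF-8 the character occupies four bytes.
assert len(clef.encode("utf-8")) == 4
```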

Is this also covered with JSON?

My idea of an XML river is:

  • taking XML records from the file system
  • validating them
  • reporting invalid records and dropping them from the queue
  • packaging records into batches of size X
  • sending the batches to the index (in parallel, if ES supports this)

Can ES handle indexing from multiple threads?

Yes, it's thread-safe; you can just throw documents at it concurrently.
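On the client side, "throwing documents at it concurrently" could look like the sketch below. The send function is a stand-in that only counts documents; in a real setup it would issue the HTTP bulk request. The function names and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def index_concurrently(batches, send_batch, workers=4):
    """Send every batch through `send_batch` using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input batches.
        return list(pool.map(send_batch, batches))


def fake_send(batch):
    # Stand-in for an HTTP bulk request; returns the number of docs "sent".
    return len(batch)


sent = index_concurrently([[1, 2], [3], [4, 5, 6]], fake_send)
print(sent)
```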

simon
