Converting schema.xml from Solr to ES

On Sunday, July 29, 2012 11:51:02 AM UTC+2, Bernd Fehling wrote:

Hi Jörg,

On Saturday, July 28, 2012 4:22:33 PM UTC+2, Jörg Prante wrote:

...
An XML river would be an idea! But as XML is just a syntax for "data in a
container format", such a river is mostly useless without the feature of
custom processing extensions for the data (similar to the XML pipeline
processing in FAST). Maybe by scripting XML to JSON? Do you have a preference
for a JVM scripting language? Groovy would be a straightforward option,
since I am integrating Groovy scripts into my MAB/MARC converter.

I never looked too deeply into JSON, just used it somehow.
XML has the advantage that it can be validated before/while loading,
especially if you work with full Unicode via UTF-8.
This also means Unicode above the Basic Multilingual Plane.

If you are using Java, you have been able to encode non-BMP characters since
Java 1.5. Yet this has nothing to do with XML or JSON. JSON is recommended to
be UTF-8, and if you decide so, you just pass the right character encoding to
your JSON generator. The validation you refer to with XML is implicit in JSON
for the types. JSON encodes numbers, booleans, and character sequences
explicitly (binary data has to go in as an encoded string, e.g. Base64), and
your reading code should validate your JSON document. No need for a schema or
something like that (there is such a thing, but I am not sure if it's used
much).
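
To illustrate the point about non-BMP characters, here is a minimal sketch in
plain Java (standard library only, no JSON library). The class name, the field
name "symbol", and the choice of U+1D11E are assumptions made up for this
example. It only shows that a character above the Basic Multilingual Plane
takes two UTF-16 units in a Java String and four bytes in UTF-8, and that it
survives the round trip:

```java
import java.nio.charset.StandardCharsets;

// Sketch: a character above the Basic Multilingual Plane in a UTF-8 JSON document.
class NonBmpUtf8Demo {

    // U+1D11E MUSICAL SYMBOL G CLEF, picked only as an example of a non-BMP code point.
    static final int G_CLEF = 0x1D11E;

    // Builds a tiny JSON document by hand; the field name "symbol" is hypothetical.
    static String jsonWithClef() {
        String clef = new String(Character.toChars(G_CLEF)); // surrogate pair, 2 chars
        return "{\"symbol\":\"" + clef + "\"}";
    }

    // Encode the document as UTF-8 bytes, as it would go over the wire.
    static byte[] encode(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a Java String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = jsonWithClef();
        byte[] utf8 = encode(json);
        System.out.println("UTF-16 units: " + json.length());
        System.out.println("code points:  " + json.codePointCount(0, json.length()));
        System.out.println("UTF-8 bytes:  " + utf8.length);
        System.out.println("round trip ok: " + json.equals(decode(utf8)));
    }
}
```

A real JSON generator would do the escaping and encoding for you; the only
thing to get right is the character encoding you hand to it.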

Is this also covered with JSON?

My idea of an XML river is:

  • taking XML records from file system
  • validating
  • reporting invalid records and dropping from queue
  • packaging records to batches of size X
  • sending batches to the index (if possible in parallel, if ES supports
    this)
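
The steps above could be sketched roughly like this in plain Java. This is
only an outline under several assumptions: records are held as in-memory
strings instead of being read from the file system, "validating" means a
well-formedness check with the JDK SAX parser (no schema or DTD), and
BatchSink is a hypothetical stand-in for the index, not an Elasticsearch API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XmlRiverSketch {

    // Hypothetical sink standing in for the index; must be safe to call concurrently.
    interface BatchSink {
        void send(List<String> batch);
    }

    // Well-formedness check only; schema validation would need extra setup.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // parse error: the record is not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Split the records into consecutive batches of at most batchSize entries.
    static List<List<String>> toBatches(List<String> records, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            int end = Math.min(i + batchSize, records.size());
            batches.add(new ArrayList<>(records.subList(i, end)));
        }
        return batches;
    }

    // Validate, report and drop invalid records, then send batches in parallel.
    // Returns the number of dropped records.
    static int run(List<String> records, int batchSize, int threads, BatchSink sink) {
        List<String> valid = new ArrayList<>();
        int dropped = 0;
        for (String record : records) {
            if (isWellFormed(record)) {
                valid.add(record);
            } else {
                System.err.println("dropping invalid record: " + record);
                dropped++;
            }
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<String> batch : toBatches(valid, batchSize)) {
            pool.execute(() -> sink.send(batch));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return dropped;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "<doc><title>a</title></doc>",
                "<doc><title>broken</doc>", // mismatched tag, gets dropped
                "<doc><title>b</title></doc>");
        int dropped = run(records, 2, 2,
                batch -> System.out.println("sending batch of " + batch.size()));
        System.out.println("dropped: " + dropped);
    }
}
```

The batch size and thread count are plain parameters here; sensible values
depend on record size and on how much load the index can take.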

Is indexing in ES aware of multithreading?

Yes, it's thread-safe; you can just throw documents against it concurrently.

simon

Regards,
Bernd