Planned project/Compass replacement


(Stefan Fußenegger) #1

Hi all,

I am one of the poor guys who is stuck on Compass. Like most others,
I'm desperately waiting for a similar framework built on
elasticsearch. But now I've finally decided to stop waiting and start
building instead.

Currently, I'm investigating existing projects, efforts and
approaches. What I'm mainly looking for is object to JSON mapping
(jackson, elasticsearch-osem), Hibernate integration and reusable
index rebuilding code. Current efforts seem to be driven by single
developers without much coordination (am I wrong?). It might therefore
be valuable to simply join forces towards a common goal.

Anyway, I'm in the planning and design phase, so I'd be happy for input
of any sort (existing projects, features, requirements, ...),
additional contributors, testers, helpers or pure and simple interest
in the topic. I hope to come up with some ideas during the holidays -
using sources of inspiration like slopes, mountains and the obligatory
consumption of alcoholic beverages during family gatherings :wink:

Cheers, Stefan


(Karussell) #2

I'm not aware of a more recent component for Java.

Hibernate integration

jdbc:

and reusable index rebuilding code

This is simple if you have the _source field: use a scan search to read
the stored documents back and index them again.

Peter.

PS: not sure if this helps:

http://cloudscale.blogspot.com/2011/06/elasticsearch-json-and-json-schema.html

Also worth checking out:



(Stefan Fußenegger) #3

and reusable index rebuilding code

this is simple if you have the _source field: use a scan search.

I meant rebuilding the index from a primary data store, e.g. a database
or CSV files. That would include fetching the data, submitting it to ES
and, most importantly, doing this transparently while the application is
still running. It must therefore be possible to keep searching the
previous index and to submit/queue additional indexing operations in the
meantime. That makes the whole process much more complicated and the
code definitely worth reusing. Does anybody know of a project that
provides this? I have just heard that Andrew Regan's elasticsearch-osem
fork (https://github.com/poblish/elasticsearch-osem) provides a
Compass fork on top of elasticsearch-osem that he uses in production.
I'll have to take another look, though.
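
A minimal sketch of the "rebuild while live" idea described above, using plain in-memory maps as a stand-in for the old and new index. All names here (LiveReindexer, startRebuild, rebuildIndex, finishRebuild) are hypothetical and not taken from any existing project. With Elasticsearch itself, the final swap could presumably be done by re-pointing an index alias instead of replacing a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a search index: document id -> JSON string. */
class LiveReindexer {
    private Map<String, String> live = new HashMap<>();        // serves searches
    private Map<String, String> rebuilding = null;              // filled by the rebuild job
    private final List<String[]> pending = new ArrayList<>();   // writes seen during a rebuild

    synchronized String get(String id) {
        return live.get(id);
    }

    /** Regular write path of the application; a null json means "delete". */
    synchronized void index(String id, String json) {
        apply(live, id, json);                       // the old index stays up to date
        if (rebuilding != null) {
            pending.add(new String[] {id, json});    // remember it for the new index
        }
    }

    synchronized void startRebuild() {
        rebuilding = new HashMap<>();
        pending.clear();
    }

    /** Called by the rebuild job for every row read from the primary store. */
    synchronized void rebuildIndex(String id, String json) {
        rebuilding.put(id, json);
    }

    /** Replays the queued writes onto the new index, then swaps it in. */
    synchronized void finishRebuild() {
        for (String[] op : pending) {
            apply(rebuilding, op[0], op[1]);
        }
        live = rebuilding;
        rebuilding = null;
        pending.clear();
    }

    private static void apply(Map<String, String> target, String id, String json) {
        if (json == null) {
            target.remove(id);
        } else {
            target.put(id, json);
        }
    }
}
```

Replaying the queued writes last means that a document changed during the rebuild wins over the possibly stale row the rebuild job read earlier.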

Cheers


(David Pilato) #4

Hi Stefan,

Let me share my brief experience with Hibernate and ES.
At first, I tried to work with Alois's OSEM annotations and with Hibernate
listeners to push every entity that is @Searchable to ES.
But listeners, which is the way Hibernate Search does it, are not the right
place for this, because when you update a child entity (one that is not
annotated with @Searchable), the update is not pushed to ES... (That's the
short story.)

So, as it's really easy to create JSON documents from Hibernate objects
using Jackson, I now manage my push requests to ES directly from the service
layer.
As I'm using Spring, I simply inject an ES Node into my service class and use
it every time I need to create, update or delete a document in ES.

I tried to work on a Maven plugin for ES, based on the OSEM annotations, to
generate mapping files, but I'm stuck on classloading problems, so I
can't release it :frowning:

Let me also say that I now use an ActiveMQ queue to make everything
asynchronous.
So when I want to reindex my whole PostgreSQL database, I simply read all my
entities with Hibernate and push them to the ActiveMQ queue. The queue
listener takes each entity, builds the JSON content (and, by the way,
computes some data, creates a PDF file, sends some stuff to a CouchDB
database, ...) and finally pushes it to ES.
This process runs while the application is live for users.

If you only have to push your entities to ES, you can build a very simple
batch job that fetches all your entities, builds the JSON and pushes it to
ES. Depending on the complexity of your entities (the read time in
PostgreSQL can be very high), you can get nice bulk load times.
On a "single Windows computer", I indexed 5 to 10 entities per second for
very complex entities (many collections, ...) and more than 300 entities
per second for simple entities. So the main problem is really the YesSQL
read time.
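
A rough sketch of such a batch job, assuming the entities can be read page by page and the documents are sent in bulk. PageReader and BulkSink are hypothetical interfaces standing in for a Hibernate query and an ES bulk request; they are not part of either library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical paged reader over the primary store (e.g. a Hibernate query). */
interface PageReader<T> {
    List<T> read(int offset, int limit);
}

/** Hypothetical sink that sends one bulk request per call. */
interface BulkSink {
    void send(List<String> jsonDocs);
}

class BatchReindexer {
    /** Reads all entities page by page, serializes them and sends one bulk per page. */
    static <T> int reindex(PageReader<T> reader, Function<T, String> toJson,
                           BulkSink sink, int pageSize) {
        int total = 0;
        for (int offset = 0; ; offset += pageSize) {
            List<T> page = reader.read(offset, pageSize);
            if (page.isEmpty()) {
                break;                                   // nothing left to read
            }
            List<String> docs = new ArrayList<>(page.size());
            for (T entity : page) {
                docs.add(toJson.apply(entity));          // e.g. a Jackson serializer
            }
            sink.send(docs);
            total += page.size();
            if (page.size() < pageSize) {
                break;                                   // short page: this was the last one
            }
        }
        return total;
    }
}
```

Since the read time dominates, the page size and the cost of the query behind PageReader are probably the first things to tune.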

I'm not sure I answered your questions, and I hope you can understand my
bad English :frowning:

David



(Stefan Fußenegger) #5

Hi David,

Let me share my brief experience with Hibernate and ES.
At first, I tried to work with Alois's OSEM annotations and with Hibernate
listeners to push every entity that is @Searchable to ES.
But listeners, which is the way Hibernate Search does it, are not the right
place for this, because when you update a child entity (one that is not
annotated with @Searchable), the update is not pushed to ES... (That's the
short story.)

I know this restriction, but it hasn't really affected us so far.
Nevertheless, it would be possible to fix this limitation, e.g. with
@Parent(update = true) on a nested component's property, or with a DB
query for 1:n or m:n relations (though that could become really heavy).
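
To illustrate the @Parent(update = true) idea: a minimal sketch in which a changed child object is followed up to the @Searchable object that should be reindexed. Both annotations are defined here only for illustration and are hypothetical; they are not the Compass or elasticsearch-osem annotations, and the sketch does not guard against cyclic parent references.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

/** Hypothetical marker for classes that are indexed as documents. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Searchable {}

/** Hypothetical back-reference from a child to the object that embeds it. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Parent {
    boolean update() default false;
}

class ReindexResolver {
    /** Returns the @Searchable object to reindex after 'changed' was written, or null. */
    static Object resolve(Object changed) {
        Object current = changed;
        while (current != null) {
            if (current.getClass().isAnnotationPresent(Searchable.class)) {
                return current;
            }
            current = parentOf(current);   // walk up one level
        }
        return null;
    }

    private static Object parentOf(Object child) {
        for (Field field : child.getClass().getDeclaredFields()) {
            Parent parent = field.getAnnotation(Parent.class);
            if (parent != null && parent.update()) {
                try {
                    field.setAccessible(true);
                    return field.get(child);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return null;
    }
}

@Searchable
class User {
    String name;
}

class Address {
    @Parent(update = true)
    User owner;
    String city;
}
```

A Hibernate listener could call resolve(...) for every written entity and index whatever comes back, so an Address update would trigger reindexing of its User.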

So, as it's really easy to create JSON documents from Hibernate objects
using Jackson,

What's your preferred way to configure Jackson?

I now manage my push requests to ES directly from the service
layer.

I really hope to avoid that. I'd be really scared that somebody could
do DB writes without calling ES. Watching what gets written to the DB
seems to be the better choice. One might think about mixing both ways
though - automatic where possible, manual where needed.

As I'm using Spring, I simply inject an ES Node into my service class and use
it every time I need to create, update or delete a document in ES.

I tried to work on a Maven plugin for ES, based on the OSEM annotations, to
generate mapping files, but I'm stuck on classloading problems, so I
can't release it :frowning:

Sounds interesting. But wouldn't it be easier to generate them as part
of the application itself? How do you configure attributes like
term_vector, boost, analyzer, etc with Jackson?

Let me also say that I now use an ActiveMQ queue to make everything
asynchronous.
So when I want to reindex my whole PostgreSQL database, I simply read all my
entities with Hibernate and push them to the ActiveMQ queue. The queue
listener takes each entity, builds the JSON content (and, by the way,
computes some data, creates a PDF file, sends some stuff to a CouchDB
database, ...) and finally pushes it to ES.
This process runs while the application is live for users.

I thought about a similar approach myself. Not particularly ActiveMQ
but different pluggable queue implementations (I'm thinking of a
persistent Hazelcast queue, maybe with a Hazelcast river?). I was
thinking of queuing only entity name and id (e.g. {'User', 42}) inside
the Hibernate listener thread and do the JSON generation and indexing
asynchronously.
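
A small sketch of that idea with a plain java.util.concurrent queue standing in for the pluggable queue implementation. The listener thread only enqueues the entity name and id; a worker later loads the entity, builds the JSON and indexes it. EntityRef, AsyncIndexer and the loadAsJson callback are hypothetical names, and the map is only a stand-in for ES.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiFunction;

/** What gets queued: only the entity name and id, e.g. {"User", 42}. */
final class EntityRef {
    final String type;
    final long id;

    EntityRef(String type, long id) {
        this.type = type;
        this.id = id;
    }
}

class AsyncIndexer implements Runnable {
    private static final EntityRef STOP = new EntityRef("", -1);   // shutdown marker

    private final BlockingQueue<EntityRef> queue = new LinkedBlockingQueue<>();
    // Hypothetical callback: loads the entity and serializes it; null means "deleted".
    private final BiFunction<String, Long, String> loadAsJson;
    final Map<String, String> index = new ConcurrentHashMap<>();   // stand-in for ES

    AsyncIndexer(BiFunction<String, Long, String> loadAsJson) {
        this.loadAsJson = loadAsJson;
    }

    /** Called from the Hibernate listener thread: cheap, no JSON work here. */
    void enqueue(String type, long id) {
        queue.add(new EntityRef(type, id));
    }

    void stop() {
        queue.add(STOP);
    }

    /** Worker loop: JSON generation and indexing happen here, off the listener thread. */
    @Override
    public void run() {
        try {
            for (EntityRef ref = queue.take(); ref != STOP; ref = queue.take()) {
                String key = ref.type + "#" + ref.id;
                String json = loadAsJson.apply(ref.type, ref.id);
                if (json == null) {
                    index.remove(key);
                } else {
                    index.put(key, json);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the entity is loaded when the queue entry is processed rather than when it is enqueued, the worker indexes the state that is current at processing time.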

If you only have to push your entities to ES, you can build a very simple
batch job that fetches all your entities, builds the JSON and pushes it to
ES. Depending on the complexity of your entities (the read time in
PostgreSQL can be very high), you can get nice bulk load times.
On a "single Windows computer", I indexed 5 to 10 entities per second for
very complex entities (many collections, ...) and more than 300 entities
per second for simple entities. So the main problem is really the YesSQL
read time.

We're currently doing this with Hibernate/MySQL/Compass/Lucene anyway.
I think it currently takes less than 2 hours for several million small
entities. I certainly hope that this won't increase with my new
approach.

I'm not sure I answered your questions, and I hope you can understand my
bad English :frowning:

I currently have so many questions that it's hard not to answer at least
one of them :wink: Thanks for your input!

David

Stefan


(David Pilato) #6

Hi,

I know this restriction, but it hasn't really affected us so far.
Nevertheless, it would be possible to fix this limitation, e.g. with
@Parent(update = true) on a nested component's property, or with a DB
query for 1:n or m:n relations (though that could become really heavy).

Thanks for the tip. I wasn't aware of it.

What's your preferred way to configure Jackson?

I'm using Jackson annotations on my entities, and I only serialize the
annotated properties to JSON.

I really hope to avoid that. I'd be really scared that somebody could
do DB writes without calling ES. Watching what gets written to the DB
seems to be the better choice. One might think about mixing both ways
though - automatic where possible, manual where needed.

Yes. As I said to my developers: "never trust a developer!" :wink:

At first, I wrote an abstract CRUD DAO class based on Hibernate and injected
an ES DAO into it.
So each time I persist, update or delete an entity, I check whether it's an
ES-annotated class and send it to ES if needed.
But if developers don't use this abstract class, I'm stuck, just as with the
service layer!
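
A minimal sketch of such an abstract DAO, with the Hibernate part reduced to a map so that only the ES hook is visible. The @EsIndexed marker, the SearchDao interface and the example entities are hypothetical names, not the classes actually used in the setup described above.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical marker for entities that should also live in ES. */
@Retention(RetentionPolicy.RUNTIME)
@interface EsIndexed {}

/** Hypothetical ES-side DAO that gets injected into every CRUD DAO. */
interface SearchDao {
    void index(Object id, Object entity);
    void delete(Object id, Class<?> type);
}

abstract class AbstractCrudDao<T> {
    private final Map<Object, T> store = new HashMap<>();   // stand-in for the Hibernate session
    private final SearchDao searchDao;

    protected AbstractCrudDao(SearchDao searchDao) {
        this.searchDao = searchDao;
    }

    protected abstract Object idOf(T entity);

    /** Persists or updates the entity, then mirrors it to ES if it is annotated. */
    public void save(T entity) {
        store.put(idOf(entity), entity);
        if (isIndexed(entity)) {
            searchDao.index(idOf(entity), entity);
        }
    }

    /** Deletes the entity, then removes the document from ES if it is annotated. */
    public void delete(T entity) {
        store.remove(idOf(entity));
        if (isIndexed(entity)) {
            searchDao.delete(idOf(entity), entity.getClass());
        }
    }

    private static boolean isIndexed(Object entity) {
        return entity.getClass().isAnnotationPresent(EsIndexed.class);
    }
}

@EsIndexed
class Article {
    long id;
}

class AuditLog {
    long id;
}
```

The weakness mentioned above still applies: any write that bypasses the abstract class never reaches ES.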

Perhaps the best thing is to register an ES listener in Hibernate and use
the @Parent annotation?

I stopped looking for other solutions because:

  1. it's working fine for me and has been stable after 5 months in production
  2. I'm seriously thinking of dropping all the Hibernate/YesSQL stuff and
    using a NoSQL database (CouchDB)
  3. with CouchDB, I can simply add a river to ES and only manage the
    creation, update and deletion of my documents (entities). The river will
    transport all changes to ES

I tried to work on a Maven plugin for ES, based on the OSEM annotations, to
generate mapping files, but I'm stuck on classloading problems, so I
can't release it :frowning:

Sounds interesting. But wouldn't it be easier to generate them as part
of the application itself?

What do you mean? Creating the mapping live each time the application starts?
The Maven plugin would generate .json files when packaging the project,
for example in the target/es/mapping directory.
It's based on the OSEM annotations.
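
To make the mapping generation concrete, here is a rough sketch that builds a mapping JSON string from field annotations via reflection. The @IndexedField annotation and its attributes are hypothetical and are not the elasticsearch-osem annotations; the "string" type and the {"type":{"properties":{...}}} layout are assumptions based on the ES mapping format of that time. The same code could run inside a build plugin or at application start.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Locale;
import java.util.StringJoiner;

/** Hypothetical field-level mapping annotation. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface IndexedField {
    String type() default "string";
    String analyzer() default "";
    double boost() default 1.0;
}

class MappingGenerator {
    /** Builds {"<type>":{"properties":{...}}} from the annotated fields of a class. */
    static String mappingFor(Class<?> type) {
        StringJoiner properties = new StringJoiner(",");
        for (Field field : type.getDeclaredFields()) {
            IndexedField meta = field.getAnnotation(IndexedField.class);
            if (meta == null) {
                continue;                           // only annotated fields are mapped
            }
            StringBuilder property = new StringBuilder();
            property.append('"').append(field.getName()).append("\":{\"type\":\"")
                    .append(meta.type()).append('"');
            if (!meta.analyzer().isEmpty()) {
                property.append(",\"analyzer\":\"").append(meta.analyzer()).append('"');
            }
            if (meta.boost() != 1.0) {
                property.append(",\"boost\":").append(meta.boost());
            }
            properties.add(property.append('}'));
        }
        String typeName = type.getSimpleName().toLowerCase(Locale.ROOT);
        return "{\"" + typeName + "\":{\"properties\":{" + properties + "}}}";
    }
}

class Product {
    @IndexedField(analyzer = "snowball", boost = 2.0)
    String title;

    @IndexedField(type = "long")
    long price;

    String internalNote;   // not annotated, so it does not appear in the mapping
}
```

Writing the returned string to a .json file at packaging time would give the build-time variant; calling it at startup and sending the result to ES would give the in-application variant.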

How do you configure attributes like
term_vector, boost, analyzer, etc with Jackson?

There are annotations for that.
Just see:
https://github.com/aloiscochard/elasticsearch-osem/blob/master/src/main/java/org/elasticsearch/osem/annotations/Indexable.java

So it's quite simple to use.
My work is here: https://github.com/dadoonet/esmavenplugin
My problem is that the Maven classloader can see the classes of the
project that uses the plugin, but the annotations are not found...

I thought about a similar approach myself. Not particularly ActiveMQ
but different pluggable queue implementations (I'm thinking of a
persistent Hazelcast queue, maybe with a Hazelcast river?). I was
thinking of queuing only entity name and id (e.g. {'User', 42}) inside
the Hibernate listener thread and do the JSON generation and indexing
asynchronously.

Yes. I think RabbitMQ is a nice implementation too. There is already a river
for it: http://www.elasticsearch.org/guide/reference/river/rabbitmq.html

I certainly hope that this won't increase with my new
approach.

I don't know how fast Compass was, but I'm quite sure that you will not have
problems with ES.
Add more nodes, with 1 shard per node for example, and you will get very nice
response times!

Cheers
David

