Map CouchDB document types to ElasticSearch types


(Chase) #1

Hi folks,

Scenario: you have a single CouchDB database that holds all the
documents for your app. All documents have a "type" field. Let's say
there are two possible types: "user" and "article".

How do I set up a CouchDB River in ElasticSearch so that documents are
indexed with the correct type -- in this example, so that Users would
be searchable at localhost:9200/my_index/user and Articles at
/my_index/article?

The Couch River guide page[1] shows that you can set the index and
type when creating a river, but it will slurp all Couch documents
into that ES type indiscriminately unless I set up a filter on the
Couch side for each type, and specify that filter in my river config.
Is that what I have to do? Set up a separate filter for each type? And
then create separate rivers for each of them? Seems kind of unwieldy.

I feel like you should be able to tell ES that, for example, the type
field in your database is "doc_type", and whenever it sees a document
with a doc_type of "blah_blah", it should store it with an ES type of
"blah_blah". (It can already be set to do something similarly magical
with _boost.)

-Chase

PS. I don't mind being told to RTFM if you have a link to the page I
need! I checked, I couldn't find it.

[1] http://www.elasticsearch.org/guide/reference/river/couchdb.html


(Shay Banon) #2

Yes, you will need to specify two rivers in this case, with two different filters. I guess this could be added to the river; I just need to check whether you get the doc back on a deleted event in the _changes stream (otherwise, there is no way to tell which type to delete it from).
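
The two-river setup described above can be sketched as follows. This is only an illustration: the field names follow the CouchDB river guide linked in the first post as best I recall it, so check them against that page. The river names, the "<type>/all" filter naming scheme, and the host and port are assumptions made up for the example. Each filter must already exist as a filter function in a CouchDB design document that passes only documents of the matching type. The script only builds and prints the request bodies; it does not send anything.

```python
import json


def river_definition(couch_db, es_index, doc_type, host="localhost", port=5984):
    """Build the _meta body for one CouchDB river restricted to one document type."""
    return {
        "type": "couchdb",
        "couchdb": {
            "host": host,
            "port": port,
            "db": couch_db,
            # Assumed naming scheme: a design doc named after the type,
            # holding a filter function called "all".
            "filter": f"{doc_type}/all",
        },
        # Every document that passes the filter lands in this index and type.
        "index": {"index": es_index, "type": doc_type},
    }


def rivers_for(couch_db, es_index, doc_types):
    """Return {river_name: definition}, one river per document type."""
    return {
        f"{couch_db}_{doc_type}": river_definition(couch_db, es_index, doc_type)
        for doc_type in doc_types
    }


if __name__ == "__main__":
    rivers = rivers_for("my_couch_db", "my_index", ["user", "article"])
    for name, body in rivers.items():
        # Each body would be PUT to this path to register the river.
        print(f"PUT /_river/{name}/_meta")
        print(json.dumps(body, indent=2))
```

Generating the definitions from a list of types keeps the per-type repetition in one place, which takes some of the sting out of needing one river per type.
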
On Saturday, May 14, 2011 at 5:57 AM, Chase wrote:



(Chase) #3

Thanks for the quick response, Shay.

I checked the behavior of the _changes feed. Evidently, even with
include_docs[1] turned on, you only get the _id and _rev back for
deleted docs. You could set up a filter for each of your types in
CouchDB and then query the _changes feed with that filter (something
like /my_couch_db/_changes?filter=User/all), but then you haven't
saved yourself any work compared to setting up multiple rivers, and
you might as well be doing that.
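
To make the deletion problem concrete, here is a small sketch that routes _changes rows by their type field. The sample rows are hand-written to match the behavior described above (a deleted row carries only _id and _rev, plus the deletion marker); they are not captured output, and route_changes is a hypothetical helper, not part of the river.

```python
def route_changes(rows, type_field="type"):
    """Turn _changes rows into (action, es_type, doc_id) tuples."""
    actions = []
    for row in rows:
        doc = row.get("doc") or {}
        if row.get("deleted"):
            # The deleted stub has no body, so the type cannot be recovered here.
            actions.append(("delete", None, row["id"]))
        else:
            actions.append(("index", doc.get(type_field), row["id"]))
    return actions


# Illustrative rows only, shaped like a _changes response with include_docs=true.
SAMPLE_ROWS = [
    {"seq": 1, "id": "u1", "changes": [{"rev": "1-a"}],
     "doc": {"_id": "u1", "_rev": "1-a", "type": "user", "name": "Chase"}},
    {"seq": 2, "id": "a1", "changes": [{"rev": "1-b"}],
     "doc": {"_id": "a1", "_rev": "1-b", "type": "article", "title": "Hello"}},
    {"seq": 3, "id": "u1", "changes": [{"rev": "2-c"}], "deleted": True,
     "doc": {"_id": "u1", "_rev": "2-c", "_deleted": True}},
]

if __name__ == "__main__":
    for action in route_changes(SAMPLE_ROWS):
        print(action)
```

The last row comes out with a type of None: without the original document, a consumer of the feed has no way to know whether to delete u1 from the user type or the article type.
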

In case anyone's interested, I've decided to let my app take care of
pushing changes to ElasticSearch (Ruby on Rails with the Tire gem)
instead of using the CouchDB River. Aside from solving this type
problem easily, it lets me keep all my mappings inside my app's models,
where it feels like they belong.

How are you supposed to keep track of your mappings, anyway? Pushing
the settings to ES with curl in the terminal feels very fleeting, and
keeping copies in a text file that I occasionally paste into the
terminal seems awkward. Do I just not grok the workflow? Should I be
doing something differently?

Anyway, thanks again!

-Chase

[1] http://wiki.apache.org/couchdb/HTTP_database_API#Changes

On May 14, 8:06 am, Shay Banon shay.ba...@elasticsearch.com wrote:



(Shay Banon) #4

Heya,

There is no reason why you wouldn't be able to use Tire to manage the rivers, index and mappings, no? The APIs for creating an index and putting mappings are the same, and the API to create and delete a river is the same as "creating and deleting a document".
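
As a rough sketch of that point: index creation, mapping updates, and river registration are all plain HTTP requests, so they can be generated from definitions kept under version control instead of pasted into a terminal. The URL paths below are the ones I believe ElasticSearch used at the time of this thread. The node address, the mapping contents, and the river name are assumptions for illustration. The functions only build a list of requests; sending them is left to whatever HTTP client the app already uses.

```python
ES = "http://localhost:9200"  # assumed local node, as in the original question

# Illustrative definitions; in practice these would live in files in the repo.
MAPPINGS = {
    "user": {"properties": {"name": {"type": "string"}}},
    "article": {"properties": {"title": {"type": "string"}}},
}
RIVERS = {
    "my_couch_db_user": {
        "type": "couchdb",
        "couchdb": {"host": "localhost", "port": 5984,
                    "db": "my_couch_db", "filter": "user/all"},
        "index": {"index": "my_index", "type": "user"},
    },
}


def setup_requests(index, mappings, rivers):
    """Return (method, url, body) tuples for a full setup; nothing is sent."""
    requests = [("PUT", f"{ES}/{index}", None)]  # create the index
    for es_type, mapping in mappings.items():
        # Put one mapping per type.
        requests.append(("PUT", f"{ES}/{index}/{es_type}/_mapping", {es_type: mapping}))
    for name, meta in rivers.items():
        # Creating a river looks like indexing a document named _meta.
        requests.append(("PUT", f"{ES}/_river/{name}/_meta", meta))
    return requests


def teardown_requests(rivers):
    """Deleting a river looks like deleting a document type."""
    return [("DELETE", f"{ES}/_river/{name}", None) for name in rivers]


if __name__ == "__main__":
    for method, url, _body in setup_requests("my_index", MAPPINGS, RIVERS):
        print(method, url)
    for method, url, _body in teardown_requests(RIVERS):
        print(method, url)
```
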
On Sunday, May 15, 2011 at 6:02 PM, Chase wrote:



(Karel Minarik) #5

Hi!

1/ I'd rather keep the CouchDB _river slim -- creating a filter
function, and passing the filtered feed to ES is perfectly reasonable
from CouchDB's perspective. That's how things work there.

2/ The less "magic", the better. Why doc_type, and not
document_type or type? I hate having to peek into the docs all the
time to get the naming right.

3/ In our case (CouchDB + Rails + Tire) we have decided against _river
for the same reasons -- individual updates are super-sonic fast in
ES, and when pushing loads of data, we use the Tire import/bulk_store
support.

4/ In Rails' context, the best place for a mapping is definitely in
the model definition. In general, I think index templates in ES are
an awesome idea, and they'd keep your settings/mappings neatly
stored, and consistent across cluster nodes.

5/ There's no support for aliases in Tire, yet. It's definitely
something I'd like to add soon, because for lots of use-cases
(reindexing, shifting windows, ...), aliases are another awesome
idea.
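
For readers not using Tire, the app-side approach from point 3 can be sketched in a few lines. This is not Tire's API (Tire is a Ruby gem); it only shows the shape of a bulk request body in which each document's own type field picks the ES type, which is the routing the original question asked for. The function name, the sample documents, and the choice to strip underscore-prefixed CouchDB fields from the source are assumptions for illustration.

```python
import json


def bulk_body(docs, index, type_field="type"):
    """Build a newline-delimited bulk body, routing each doc by its type field."""
    lines = []
    for doc in docs:
        # Assumption: drop CouchDB bookkeeping fields such as _id and _rev.
        source = {k: v for k, v in doc.items() if not k.startswith("_")}
        action = {
            "index": {"_index": index, "_type": doc[type_field], "_id": doc["_id"]}
        }
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(source, sort_keys=True))
    # The bulk endpoint expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [
        {"_id": "u1", "_rev": "1-a", "type": "user", "name": "Chase"},
        {"_id": "a1", "_rev": "1-b", "type": "article", "title": "Hello"},
    ]
    # The result would be POSTed to the _bulk endpoint of the ES node.
    print(bulk_body(docs, "my_index"), end="")
```

Because the app knows each document's type at the moment it saves or deletes it, the deletion problem discussed earlier in the thread does not arise on this path.
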

I'd like to hear any feedback on Tire here or at Github issues
[https://github.com/karmi/tire/issues]. The gem still needs lots of
feedback and work.

Best!,

Karel

On May 15, 5:07 pm, Shay Banon shay.ba...@elasticsearch.com wrote:


