Indexing different types from a CouchDB database


(Mark Huang-2) #1

Hi,

Reading http://www.elasticsearch.org/blog/2010/02/12/yourdatayoursearch.html,
I realized my problem is exactly the same as the one in the article except for
one huge difference: instead of indexing one book or CD at a time, I already
have a bunch of data in my CouchDB database. I'm adding Elasticsearch
capabilities to my project now, and I'm having a ton of problems figuring out
how to index my CouchDB database by type.

The documentation says that one way to index ALL documents is like the
following:

curl -XPUT 'localhost:9200/_river/amazon/_meta' -d '{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "amazon",
    "filter" : null
  },
  "index" : {
    "index" : "amazon",
    "type" : "book",
    "bulk_size" : "100",
    "bulk_timeout" : "10ms"
  }
}'

Here I specified amazon as the index and book as the type, but this still indexes ALL the documents in CouchDB! How do I make it index ONLY the documents of type book? Is this what the filter is for? Or can this be accomplished with the script field? I tried ctx.type = "book", but that didn't work either.

Thanks in advance!

--


(David Pilato) #2

I think you can set ctx.ignore = true to ignore some documents.

Read here: https://github.com/elasticsearch/elasticsearch-river-couchdb

BTW, I wrote support for CouchDB views, but it's not merged:
https://github.com/elasticsearch/elasticsearch-river-couchdb/pull/2

David

--


(Mark Huang-2) #3

Are you saying it should be like this:

curl -XPUT 'localhost:9200/_river/amazon/_meta' -d '{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "amazon",
    "filter" : null,
    "script" : "ctx.ignore = true; ctx.doc.name = ctx.doc.content.name;"
  },
  "index" : {
    "index" : "amazon",
    "type" : "book",
    "bulk_size" : "100",
    "bulk_timeout" : "10ms"
  }
}'

The documentation says: "Also, if ctx.ignore is set to true, the change
seq will be ignored and not applied."
Wouldn't this mean that the book documents will not be applied either? Can I
add if/else statements, such as: if (ctx.type !== 'book') ctx.ignore = true; ?

--


(Dan Everton) #4

I'm assuming you have a CouchDB database that has multiple types of
documents in it and you only want to index one of those types? You can do
that in two ways:

  1. A filter function on the CouchDB side.
  2. A filter script on the Elasticsearch side.

The first way requires you to add a filter function to a design document in
the CouchDB database. So in your design document you'll have something like:

{
  "_id" : "_design/books",
  "filters" : {
    "only" : "function(doc, req) { return doc.type == 'book' }"
  }
}

And then your river definition looks like this:

{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "amazon",
    "filter" : "books/only"
  },
  "index" : {
    "index" : "amazon",
    "type" : "book"
  }
}
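
To see what the books/only filter lets through, the function can be exercised outside CouchDB. This is only an illustrative sketch: the sample documents and the designDoc variable are made up, and Function.prototype.toString is used merely as one convenient way to get the function source into the design document's string field.

```javascript
// The filter from the design document, written as a plain function.
// CouchDB calls it as filter(doc, req) for each change and passes on
// only the changes for which it returns true.
function only(doc, req) {
  return doc.type == 'book';
}

// Hypothetical sample documents, invented for this illustration.
const docs = [
  { _id: 'b1', type: 'book', title: 'Some book' },
  { _id: 'c1', type: 'cd', title: 'Some cd' },
  { _id: 'x1', title: 'No type field' },
];

// Apply the filter to every document (req is unused by this filter).
const passed = docs.filter((doc) => only(doc, {})).map((doc) => doc._id);

// Design documents store filter functions as strings, so the source
// is serialised before the document is saved to CouchDB.
const designDoc = {
  _id: '_design/books',
  filters: { only: only.toString() },
};

console.log(passed);
console.log(JSON.stringify(designDoc, null, 2));
```

Only the document whose type is 'book' passes; documents of other types, or without a type field, are dropped before the river ever sees them.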

The second way is to use a script on the Elasticsearch side. This is
essentially the same as the first way, but it puts the script execution
burden on Elasticsearch rather than CouchDB. Your river definition in this
case looks like:

{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "amazon",
    "script" : "ctx.ignore = ctx.doc.type != 'book'"
  },
  "index" : {
    "index" : "amazon",
    "type" : "book"
  }
}

Note that this river definition requires the JavaScript plugin to be added
to your Elasticsearch install.
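
The effect of the ctx.ignore script can be checked locally before creating the river. In the sketch below, runScript is a hypothetical stand-in for the river's script execution (it is not part of the river or of Elasticsearch), and the ctx object is simplified to just the doc and ignore properties.

```javascript
// The script string exactly as it appears in the river definition.
const script = "ctx.ignore = ctx.doc.type != 'book'";

// Hypothetical helper: evaluates the script against a fresh ctx object,
// roughly mimicking what would happen once per CouchDB change.
function runScript(source, doc) {
  const ctx = { doc: doc, ignore: false };
  new Function('ctx', source)(ctx);
  return ctx;
}

// Invented sample changes.
const book = runScript(script, { _id: 'b1', type: 'book' });
const cd = runScript(script, { _id: 'c1', type: 'cd' });

console.log(book.ignore); // false: the book would be indexed
console.log(cd.ignore);   // true: the change would be skipped
```

Because the comparison is evaluated separately for each change, no explicit if/else is needed: ctx.ignore ends up true only for documents that are not books.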

Cheers,
Dan

--


(Mark Huang-2) #5

Thanks a million, Dan! You have taken me from complete despair to being full
of hope!

The Elasticsearch CouchDB river is great, but (no offense) I feel it lacks
some proper usage examples. Maybe it's just me, but I doubt any n00b can
understand what to do with ctx.ignore just by reading the documentation.

One FINAL thing:

Is it possible to save the JavaScript "script" field value into a file and
then reference it from the river definition?
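
Whatever the river itself supports, one client-side option is to keep the script in its own file and splice it into the river definition just before sending it. The sketch below assumes Node.js; the file name river-script.js, the temporary directory, and the buildRiverDefinition helper are all hypothetical, and the example writes the script file itself only so that it can run on its own.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: reads a script file and embeds its contents
// in a CouchDB river definition object.
function buildRiverDefinition(scriptPath) {
  // JSON.stringify escapes newlines later, so the script can stay multi-line.
  const script = fs.readFileSync(scriptPath, 'utf8').trim();
  return {
    type: 'couchdb',
    couchdb: { host: 'localhost', port: 5984, db: 'amazon', script: script },
    index: { index: 'amazon', type: 'book' },
  };
}

// For a self-contained run, create the script file in a temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'river-'));
const scriptPath = path.join(dir, 'river-script.js');
fs.writeFileSync(scriptPath, "ctx.ignore =\n  ctx.doc.type != 'book';\n");

const definition = buildRiverDefinition(scriptPath);

// This JSON is what would be sent as the body of the PUT to
// localhost:9200/_river/amazon/_meta (for example with curl -d @file).
const body = JSON.stringify(definition, null, 2);
console.log(body);
```

The river still receives the script inline; only the authoring step moves to a separate file.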

--


(Mark Huang-2) #6

What if, after indexing the book type, I now want to index another type
called "cd"? Do I just create a second _river definition? I tried
creating a second _river definition to index cd, but nothing seemed to
happen. Help?

--


(Dan Everton) #7

On Tuesday, August 14, 2012 2:37:58 PM UTC+10, Mark Huang wrote:

What if after indexing the book type, I now want to index another type
called "cd". Do I just create a second _river definition? I tried
creating a second _river definition to index cd but nothing seemed to
happen. Help?

If everything's in the same CouchDB database, you can use a single river
definition with a script on the Elasticsearch side that sets the type. Say
your CouchDB documents have a "type" field whose value is also the type you
want to use in Elasticsearch. Your river definition would then look like
this:

{
  "type" : "couchdb",
  "couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "amazon",
    "script" : "ctx._type = ctx.doc.type"
  },
  "index" : {
    "index" : "amazon"
  }
}

Setting ctx._type tells Elasticsearch which type in the index to use. So if
doc.type for a CouchDB changeset is "book", Elasticsearch will index it
as the "book" type. You can make the script more sophisticated if your
types don't map one to one. Unfortunately, I'm not aware of any way to
externalise that script.
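
As a sketch of the "more sophisticated" case, suppose the CouchDB type values do not match the Elasticsearch type names. The mapping table below (paperback, hardcover, album) is invented for illustration, and runScript is a hypothetical local stand-in for the river's script execution, not part of the river. Unknown types are skipped with ctx.ignore, combining both techniques from this thread.

```javascript
// Script source as it would be placed in the "script" field.
const script = [
  "var map = { paperback: 'book', hardcover: 'book', album: 'cd' };",
  "var known = map.hasOwnProperty(ctx.doc.type);",
  "if (known) { ctx._type = map[ctx.doc.type]; } else { ctx.ignore = true; }",
].join(' ');

// Hypothetical helper: evaluates the script against a fresh ctx object.
function runScript(source, doc) {
  const ctx = { doc: doc, ignore: false };
  new Function('ctx', source)(ctx);
  return ctx;
}

// Invented sample changes.
const hardcover = runScript(script, { _id: 'h1', type: 'hardcover' });
const album = runScript(script, { _id: 'a1', type: 'album' });
const dvd = runScript(script, { _id: 'd1', type: 'dvd' });

console.log(hardcover._type, hardcover.ignore); // book false
console.log(album._type, album.ignore);         // cd false
console.log(dvd._type, dvd.ignore);             // undefined true

// The same script embedded in a single river definition.
const river = {
  type: 'couchdb',
  couchdb: { host: 'localhost', port: 5984, db: 'amazon', script: script },
  index: { index: 'amazon' },
};
console.log(JSON.stringify(river, null, 2));
```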

Cheers,
Dan

--


(Mark Huang-2) #8

Damn... I feel so stupid for not realizing that this could be done. I read
about the Elasticsearch _type, but wasn't sure what to make of it.

Thank you once again.

--


(Mark Huang-2) #9

So I had this requirement, which David Pilato solved for me:
https://groups.google.com/d/msg/elasticsearch/KRPldny3wUk/nrfJtkOsv-oJ

As you can see, a book may have different output fields compared with a cd,
and hence I wanted a way to index the different types separately (matching
my CouchDB type with the Elasticsearch type). If I were to use:

script: ctx._type = ctx.doc.type

I wouldn't be able to set different output fields per document type. Someone
else told me that I could use partial fields to accomplish this. Is this
right?

--


(David Pilato) #10

Hi Dan,

Nice and complete answer.

May I suggest that you send a pull request adding that full example to the CouchDB river README file?
It could help many other users.

David

--


(Dan Everton) #11

On Tuesday, August 14, 2012 6:36:33 PM UTC+10, David Pilato wrote:

May I suggest that you send a pull request with that full example on the
couchDb river (README file)?
It could help many other users.

Hrm, good idea.
See https://github.com/elasticsearch/elasticsearch-river-couchdb/pull/15

Cheers,
Dan

--

