How to avoid duplication of data while indexing

I need to index data for 4 million users in Elasticsearch using the Java API.

For each object, these are the fields being indexed:

Name
Address

I need to add a check at indexing time. If another JSON object with the
same name has already been indexed, then either discard the new object or
overwrite the previous one.

Can someone help me out here?

--

Hello,

You can use the name as an ID:

Then, if you want to overwrite, you can simply index your documents, and
the version field will show how many times a document has been overwritten.
Some more information about versioning here:

If you want to discard the new object, you can specify op_type=create while
indexing:
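
To make the difference between the two modes concrete, here is a minimal sketch that models them with an in-memory map. TinyIndex and its methods are hypothetical names for illustration only, not the Elasticsearch client: a plain index call overwrites and bumps the version, while a create call leaves the existing document alone. In this sketch create simply returns false; a real cluster rejects the request instead.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for an index keyed by document ID.
// Hypothetical model for illustration; this is NOT the Elasticsearch client.
class TinyIndex {
    // One stored document: its source plus a version counter.
    static final class Doc {
        final String source;
        final long version;

        Doc(String source, long version) {
            this.source = source;
            this.version = version;
        }
    }

    private final Map<String, Doc> docs = new HashMap<>();

    // "index" behaviour: store the document, overwriting any existing one
    // with the same ID and incrementing its version. Returns the new version.
    long index(String id, String source) {
        Doc old = docs.get(id);
        long version = (old == null) ? 1 : old.version + 1;
        docs.put(id, new Doc(source, version));
        return version;
    }

    // "create" behaviour: store only if the ID is unused, otherwise discard
    // the new document. Returns true if the document was stored.
    boolean create(String id, String source) {
        if (docs.containsKey(id)) {
            return false;
        }
        docs.put(id, new Doc(source, 1));
        return true;
    }

    // Returns the stored source for an ID, or null if there is none.
    String source(String id) {
        Doc d = docs.get(id);
        return d == null ? null : d.source;
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        System.out.println(idx.index("alice", "first address"));   // stored as a new document
        System.out.println(idx.index("alice", "second address"));  // overwrites the first
        System.out.println(idx.create("alice", "third address"));  // discarded, ID already used
        System.out.println(idx.source("alice"));
    }
}
```

Which mode fits depends on whether the newest or the oldest record for a name should win.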

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

You can use a hash of both as the document ID
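
One way this could look, as a rough sketch: hash the name and address together with SHA-256 and use the hex digest as the ID. The class name DocId, the choice of SHA-256, and the zero-byte separator are all assumptions made for the example, not something prescribed by Elasticsearch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a deterministic document ID from name + address.
class DocId {
    static String of(String name, String address) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(name.getBytes(StandardCharsets.UTF_8));
            // Separator byte so that ("ab", "c") and ("a", "bc") hash differently.
            md.update((byte) 0);
            md.update(address.getBytes(StandardCharsets.UTF_8));

            // Hex-encode the digest so the ID only contains [0-9a-f].
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Same input always yields the same ID, so re-indexing hits the same document.
        System.out.println(DocId.of("xyz", "1 Main St"));
        System.out.println(DocId.of("xyz", "1 Main St"));
    }
}
```

Note that hashing both fields treats two records with the same name but different addresses as different documents. If duplicates should be detected by name alone, hash only the name.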

--

I am new to the concept of mappings and don't know how or where to define one.

I am indexing documents like this, where met.xb is the JSON builder that
contains the fields to be indexed:

IndexResponse response = client.prepareIndex("nod",
"rel").setSource(met.xb).execute().actionGet();

I don't see any example in the Client API documentation regarding mapping.

How do I make the value set by met.xb.field("name", name) the default ID?

--

Use something like

IndexResponse response = client.prepareIndex("nod", "rel",
id).setSource(met.xb).execute().actionGet();

where id is your name (or better, as suggested by Itamar, a hash of your name).

HTH
David.

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

@David,

So, let's say I define a new string:

String newid = abc.getName(); // abc.getName() returns the name

IndexResponse response = client.prepareIndex("nod", "rel",
newid).setSource(met.xb).execute().actionGet();

Will this work?

--

If I understand what you are after (indexing with a given ID): yes.
But if the name is a string like "Hello I'm happy to meet you!", it will not work.
So you have to encode your ID.

To encode it, you can use something like:
https://github.com/scrutmydocs/scrutmydocs/blob/master/src/main/java/org/scrutmydocs/webapp/util/SignTool.java

Or perhaps org.elasticsearch.common.UUID.fromString(name) will help.
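
As a rough sketch of the encoding idea, assuming the goal is a stable ID derived from the name: the example below swaps in the JDK's java.util.UUID.nameUUIDFromBytes, which builds a name-based UUID from arbitrary bytes. This is a different call from the fromString mentioned above (in java.util.UUID, fromString expects text that is already in UUID format). The class name NameId is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper: turns any name into a fixed-format ID.
class NameId {
    static String fromName(String name) {
        // Name-based (type 3) UUID: the same bytes always give the same UUID.
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // Spaces and punctuation in the name do not appear in the resulting ID.
        System.out.println(NameId.fromName("Hello I'm happy to meet you!"));
        System.out.println(NameId.fromName("xyz"));
    }
}
```

The resulting string can then be passed as the third argument of prepareIndex in place of the raw name.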

David

--

Yes, I want to index with a given ID.

So if the name is a single word such as "xyz", it should work.

Or, let's say I replace the name with a particular ID (my own ID, not one
generated by Elasticsearch):

String newid = abc.getID();

IndexResponse response = client.prepareIndex("nod", "rel",
newid).setSource(met.xb).execute().actionGet();

Will this work?

--

Yes...

I would say: "just test it!" It's so easy to start an ES node in Java.

Have a look at the elasticsearchfr/hands-on repository on GitHub (Hands On Lab).

It will help you get started with ES in Java.

HTH
David.

--