Duplicate data in ES


(Anurag Phadke) #1

Hello all,
As I understand it, ES maintains one copy of each JSON document for a given
index/type. I ran the following search query:
http://<es_server>/logs/builds/_search?pretty=true&q=mozilla-central-linux-debug-1292432368-mochitest-other.gz

and it returned the following JSON:
{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.5457389,
    "hits" : [ {
      "_index" : "logs",
      "_type" : "builds",
      "_id" : "qiZEn2fMQBSKHAbA2UDa7A",
      "_score" : 1.5457389,
      "_source" : { "builds" : {
        "machine": "talos-r3-fed-049",
        "buildtype": "debug",
        "testrun_count": 4,
        "buildid": "20101215085928",
        "builder": "mozilla-central_fedora-debug_test-mochitest-other",
        "filename": "mozilla-central-linux-debug-1292432368-mochitest-other.gz",
        "repo": "mozilla-central",
        "platform": "linux",
        "buildurl": "http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux-debug/1292432368/firefox-4.0b9pre.en-US.linux-i686.tar.bz2",
        "starttime": "1292436935",
        "revision": "d0234003e042"
      } }
    }, {
      "_index" : "logs",
      "_type" : "builds",
      "_id" : "gNBDktUJTieNNPvO2Jn_qg",
      "_score" : 1.4803665,
      "_source" : { "builds" : {
        "machine": "talos-r3-fed-049",
        "buildtype": "debug",
        "testrun_count": 4,
        "buildid": "20101215085928",
        "builder": "mozilla-central_fedora-debug_test-mochitest-other",
        "filename": "mozilla-central-linux-debug-1292432368-mochitest-other.gz",
        "repo": "mozilla-central",
        "platform": "linux",
        "buildurl": "http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux-debug/1292432368/firefox-4.0b9pre.en-US.linux-i686.tar.bz2",
        "starttime": "1292436935",
        "revision": "d0234003e042"
      } }
    } ]
  }
}

Both JSON documents look the same (to me at least). Any idea why this might
be happening?

-anurag


(Shay Banon) #2

ES will only update a doc if one with the same id already exists. If you index
into elasticsearch without specifying an id, one is created automatically and
the document is treated as a new doc. In your case, there were probably
two docs indexed with this content.

If you have something that can uniquely identify the document, you can
provide it as the id, and the document will then be updated if it already
exists.
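
The difference between the two indexing styles described above can be sketched as follows. This is only an illustration, not anything from the original setup: it assumes the typed URL layout used in this thread (/index/type/id), a placeholder host of localhost:9200, and a hypothetical helper named build_index_request. Using the filename field as the id is also just an assumption about what might be unique in this data. The helper builds the request without sending it, so it can be inspected without a running cluster.

```python
import json
import urllib.parse
import urllib.request


def build_index_request(base_url, index, doc_type, doc, doc_id=None):
    """Build (but do not send) an HTTP request that indexes `doc`.

    With doc_id:    PUT  /{index}/{type}/{id}  (same id overwrites the old doc)
    Without doc_id: POST /{index}/{type}       (ES generates a new id each time)
    """
    body = json.dumps(doc).encode("utf-8")
    if doc_id is None:
        url = f"{base_url}/{index}/{doc_type}"
        method = "POST"
    else:
        # Quote the id so characters such as "/" cannot change the URL path.
        url = f"{base_url}/{index}/{doc_type}/{urllib.parse.quote(doc_id, safe='')}"
        method = "PUT"
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


doc = {"filename": "mozilla-central-linux-debug-1292432368-mochitest-other.gz"}
base = "http://localhost:9200"  # placeholder host, not from the thread

# No id: every call creates a new document, so repeats become duplicates.
auto = build_index_request(base, "logs", "builds", doc)

# Explicit id (here, hypothetically, the filename): repeats overwrite.
fixed = build_index_request(base, "logs", "builds", doc, doc_id=doc["filename"])

print(auto.get_method(), auto.full_url)
print(fixed.get_method(), fixed.full_url)
```

Passing either request to urllib.request.urlopen would actually send it; that step is left out here on purpose.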


(Anurag Phadke) #3

Shay,
The data is indexed via multiple puts without an id (from our end). We don't have a way of maintaining an id internally. Would it be possible to have a getAndPut call that compares the contents of the whole JSON and only inserts a new doc into the index if it is unique?
We could probably do this with two API calls for an individual JSON doc, but I'm not sure how it would work for multiple inserts made via one or two API calls.

-anurag


(Shay Banon) #4

There is no way to get a document by matching on its source content. You can
compute an MD5 hash of the document and check against that. A get-then-index
strategy is problematic, since it will really slow down indexing, and the
near-real-time aspect will also be a problem.
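
The reply does not spell out how to "check against" the hash. One possible reading, combined with the earlier advice about ids, is to derive the document id from the content itself, so that indexing identical content twice overwrites rather than duplicates. The sketch below assumes that reading. The names content_id and dedupe_batch are hypothetical, and serializing with sorted keys is my own choice so that key order does not change the hash.

```python
import hashlib
import json


def content_id(doc):
    """Return the MD5 hex digest of the document's canonical JSON form."""
    # sort_keys and fixed separators give the same bytes for the same content,
    # regardless of the order in which the keys were inserted.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def dedupe_batch(docs):
    """Map content hash -> document, keeping the first doc seen per hash."""
    unique = {}
    for doc in docs:
        unique.setdefault(content_id(doc), doc)
    return unique


a = {"machine": "talos-r3-fed-049", "buildid": "20101215085928"}
b = {"buildid": "20101215085928", "machine": "talos-r3-fed-049"}  # same content as a
c = {"machine": "talos-r3-fed-049", "buildid": "20101215085929"}  # different buildid

batch = dedupe_batch([a, b, c])
for doc_id, doc in batch.items():
    print(doc_id, doc)
print(len(batch))  # 2
```

Each key in the returned mapping could then be supplied as the document id when indexing, which also covers the multiple-insert case without a separate get per document. MD5 is used here only because the reply names it; any stable hash would serve the same purpose.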

