All ids


(Albin Stigo) #1

Hi,

What is the easiest/most efficient way of getting all document ids in an index?

Backstory:
I have a CSV file where I want to index each (unique) line. I do a SHA-1
on each line and use that as the id when I create a document.
Periodically the csv file is updated and then I plan to rehash all
lines and compare the set of ids from the file with the set of ids in
the index, and in that way know which documents to delete and which to
add.

--Albin


(Clinton Gormley) #2

Hi Albin

What is the easiest/most efficient way of getting all document ids in an index?


How many IDs are we talking about? 100, 1000, 10 million?

If small numbers, then you could use search_type=scan and scroll to
retrieve the ids for all your docs, eg:

curl -XGET 'http://127.0.0.1:9200/_all/_search?scroll=5m&search_type=scan' -d '
{
    "fields" : [],
    "query" : {
        "match_all" : {}
    },
    "size" : 100
}
'

The above would return 100 records from each shard at a time. You would
need to get the scroll id from the response to the above query, and then
retrieve 'tranches' of records using:

curl -XGET 'http://127.0.0.1:9200/_search/scroll?scroll=5m&scroll_id=xxx'

On every request, you will get a new scroll_id which you need to pass to
the next scroll request.
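The scroll loop can be sketched like this; `fetch_scroll_page` stands in for the HTTP GET to `/_search/scroll` and is stubbed here for illustration:

```python
def scroll_all_ids(first_scroll_id, fetch_scroll_page):
    """Follow the scroll chain, collecting document ids until a page is empty.

    fetch_scroll_page(scroll_id) must return (next_scroll_id, list_of_ids),
    mirroring GET /_search/scroll?scroll=5m&scroll_id=<id>.
    """
    ids, scroll_id = [], first_scroll_id
    while True:
        # Each response carries a fresh scroll_id for the next request.
        scroll_id, page = fetch_scroll_page(scroll_id)
        if not page:
            break
        ids.extend(page)
    return ids

# Stubbed pages standing in for real scroll responses:
pages = {"s1": ("s2", ["a", "b"]), "s2": ("s3", ["c"]), "s3": ("s4", [])}
all_ids = scroll_all_ids("s1", lambda sid: pages[sid])
```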

If we're talking about lots of records, then you probably don't want to
retrieve all IDs at the same time, in which case you should probably
read (eg) 1000 IDs from the CSV and search for those IDs.
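Reading the CSV ids in fixed-size batches is just a chunking helper (a generic sketch, not from the original reply):

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 2500 ids split into batches of 1000 -> 1000, 1000, 500
ids = [f"id{i}" for i in range(2500)]
groups = list(batches(ids, 1000))
```

Each batch would then go into one `ids` query against the index.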

By default, since 0.16, the _id is no longer indexed, which means that
you can't retrieve it with eg { term => { _id => 123 }}

You can change that when you create the mapping by setting the _id field
to { index: "not_analyzed" } (instead of the default "no")

However, if you know the _type of your documents, then you can use the
"ids" query instead, eg:

{ query: { ids: { type: "mydoc", values: [123,124,125] }}}

That doesn't help you with deleting IDs for records that no longer exist
in the CSV file, because you need some way of marking an existing doc as
'seen', which requires reindexing that doc.

Which raises the question: might it not be easier to just reindex all your
data to a new index and then use an alias to point 'myindex' to
'myindex_2011_05_11'. That way you can reindex every day, repoint the
alias, and just delete the old index.
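The alias swap is a single atomic call to the `_aliases` endpoint. A sketch of building that request body in Python (index names are illustrative):

```python
import json

def alias_swap_body(alias, old_index, new_index):
    """Build the _aliases request that atomically repoints `alias`
    from old_index to new_index (POST the result to /_aliases)."""
    return json.dumps({"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add":    {"index": new_index, "alias": alias}},
    ]})

body = alias_swap_body("myindex", "myindex_2011_05_10", "myindex_2011_05_11")
```

Because both actions run in one request, searches against `myindex` never see a window where the alias points nowhere.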

hth

clint


(Albin Stigo) #3

Thanks for a very good answer.

We're talking about 20,000 very small documents. It's basically a
dictionary with codes and their definitions.

Right now I'm leaning towards your conclusion of recreating and using
an alias. All in line with the KISS principle :slight_smile:

Another idea I had was to use CouchDB and the Elasticsearch river
plugin to keep them in sync... since it's very easy to get a list of
all keys in CouchDB. Any thoughts on that?

--Albin

On Fri, May 20, 2011 at 7:18 PM, Clinton Gormley
clinton@iannounce.co.uk wrote:



(system) #4