Export to file


(Adam Estrada) #1

I have read a lot of posts in this group about the best method of
getting data back out of ES, but no one method seems to be definitive.
Scan seems to be the fastest way, but I have not tried it. I have been using
query_then_fetch to grab data out of my index. I am grabbing 1000 records
at a time, but oh man is it slow. My index has 100 million records in it, and
I need to write them to a file based on a date range query.

At a high level, it looks like this:
http://.../_search?sort=created_at:asc&search_type=query_then_fetch&from=0&size=1000
I have the records returning in ascending order, which is what makes the
paging work, right? The results are then written to a flat text file that I use
in another process. I would like to try scan, but according to the
documentation it doesn't sort, so I am confused about how it knows how to
grab unique data in each batch of 1000.

Thoughts?
Adam

--


(Derry O' Sullivan) #2

Hi Adam,

We use scan/scroll to run through a number of indexes and pull back all the
values for some client-side processing (not the same volume as you,
though!). We do it via the Java API with code similar to:
http://www.elasticsearch.org/guide/reference/java-api/search.html

I noticed that you are able to add sort/search-type criteria in the call,
e.g. (index name and size here are just placeholders):

client.prepareSearch("my_index")
    .setQuery(matchAllQuery())
    .addSort("created_at", SortOrder.ASC)
    .setSize(1000)
    .execute().actionGet();

Maybe worth testing that out? (If you have a date range, you could change
the query to return only the required rows, which may be a much smaller set
than 100m records.)

Also worth noting that sorting loads the sort field's values into memory (bottom of page):
http://www.elasticsearch.org/guide/reference/api/search/sort.html
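On the "how does scan stay unique without a sort" question: each scroll
response hands back a scroll_id, which acts as a cursor marking where the
next batch resumes, so no ordering is needed for batches to be disjoint. A
toy Python simulation of that idea (a plain list stands in for the index;
the cursor bookkeeping is purely illustrative, not the real protocol):

```python
# Toy simulation of how scroll stays unique without a sort: each scroll_id
# is an opaque cursor recording where the next batch starts. (In real ES the
# scroll also pins a point-in-time view so concurrent writes don't shift
# batches; that part is not modelled here.)

docs = [f"doc-{i}" for i in range(10)]   # stand-in for the 100M-record index
cursors = {}                             # scroll_id -> next position

def open_scroll():
    scroll_id = f"scroll-{len(cursors)}"
    cursors[scroll_id] = 0
    return scroll_id

def scroll(scroll_id, batch_size):
    """Return the next batch and advance this scroll's cursor."""
    pos = cursors[scroll_id]
    cursors[scroll_id] = pos + batch_size
    return docs[pos:pos + batch_size]

sid = open_scroll()
seen = []
while True:
    batch = scroll(sid, 3)
    if not batch:
        break
    seen.extend(batch)

assert seen == docs   # every record exactly once, no sort involved
```

The key point is that the server, not the client, tracks position, so each
batch of 1000 is disjoint from the last without any sort being applied.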

On Monday, 24 September 2012 03:43:31 UTC+1, Adam Estrada wrote:



(BillyEm) #3

Hi Adam and Derry,

I'm still getting up to speed on ES, but with quite a bit of Lucene
experience in the past, I'm wondering whether the ES sort parameter will do
what you need. In Lucene, a sort specification pulls only the specified
fields into the priority queue (the result set as it accumulates). Lucene
3.x appears to have added the option to keep the score sortable when this
is done, but you are still stuck with a pretty heavyweight object flowing
through the system.

Many of the shops I've worked at in fact maintain the canonical version of
the document source in a CMS, precisely for the need to reindex or
retrieve the full record as fast as possible based on a docid. Obviously
this requires a field in the indexed view that uniquely identifies the CMS
version. All kind of convoluted, but not inconsistent with the traditional
notion of a search engine as an index rather than a document store.

If you can afford to use MMapDirectory for the index on each shard, the
cost of at least creating that full sort-field result object might be
acceptable. And of course it's not really necessary to use a CMS: if you
use the filesystem and explicitly mmap the collections of original
documents, you might get better performance than your current "fetch"
operation, which instantiates the document from Lucene segments.
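To make that filesystem variant concrete, here is a minimal Python sketch:
original documents appended to one flat file, a small map from docid to
(offset, length), and mmap for retrieval. The file layout and names are
illustrative only, not any particular store's format:

```python
import mmap
import os
import tempfile

# Append each source document to one flat file and remember its
# (offset, length); retrieval is then a single mmap slice per docid,
# with no per-record fetch phase against the search engine.

records = {"doc1": b'{"created_at": "2012-09-24"}',
           "doc2": b'{"created_at": "2012-09-25"}'}

path = os.path.join(tempfile.mkdtemp(), "docs.bin")
index = {}  # docid -> (offset, length)
with open(path, "wb") as f:
    for docid, blob in records.items():
        index[docid] = (f.tell(), len(blob))
        f.write(blob)

# Random access by docid via mmap: the OS pages in only what is touched.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    off, length = index["doc2"]
    doc = mm[off:off + length]
    mm.close()

assert doc == records["doc2"]
```

The (offset, length) index plays the role of the "field in the indexed view
that uniquely identifies" the external copy.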

I wonder if you can use the "logical" (as I call it) _source field to get
better performance than requesting the document itself, though they may
just be aliases for each other.

On Tuesday, September 25, 2012 3:59:16 AM UTC-4, Derry O' Sullivan wrote:



(Adam Estrada) #4

Thanks for the feedback, guys! I tried query_and_fetch, query_then_fetch,
and scan. I have another issue in that I am also streaming tweets in via
the river, and this is happening at the same time as I am trying to query
out my data. The JVM is blowing up on me there, which is not cool. I am
trying to work around that now too...

Adam

On Tue, Sep 25, 2012 at 9:09 PM, BillyEm wmartinusa@gmail.com wrote:



(Derry O' Sullivan) #5

Best of luck! We use David Pilato's RSS river(s) in our system and do
sorting while loads are ongoing, but again, our data volume is not the
same as yours.

I think that BillyEm has a point: your data load is pretty high to do any
large-scale sorting on it. The simplest solution is to try to pull back
the ids and sort values via scan/scroll calls, but for 100m records this
could be pretty impractical.
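A quick sketch of that pull-ids-and-sort-client-side idea in Python. The
hard-coded batches stand in for scroll responses; at 100m records you would
stream the (created_at, id) pairs to disk and merge-sort them rather than
sort in RAM as done here:

```python
# Sketch: collect (created_at, id) pairs from scan/scroll batches, then
# sort client-side. Batches are stand-ins for scroll responses; only the
# small sort keys travel over the wire, not the full documents.

batches = [
    [("2012-09-25", "doc3"), ("2012-09-24", "doc1")],
    [("2012-09-26", "doc4"), ("2012-09-24", "doc2")],
]

pairs = []
for batch in batches:          # one iteration per scroll call
    pairs.extend(batch)

pairs.sort()                   # orders by created_at, then by id
ordered_ids = [docid for _, docid in pairs]
# ordered_ids is now ["doc1", "doc2", "doc3", "doc4"]
```

With the ids in date order, the full records could then be fetched (or
multi-gotten) in that order and streamed to the output file.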

BillyEm, could Adam just use real-time GET instead of mapping to a
CMS-like system? (I'm not clear on the advantage of maintaining the doc
externally, apart from reindexing support.)

On Wednesday, 26 September 2012 02:18:55 UTC+1, Adam Estrada wrote:


