Number of Docs Returned

All,

I am doing a search using the pyes GeoDistanceFilter and I am only
getting the top 10 results at a time. That is by design, right? I want
to be able to retrieve all of the docs at once and write them to a
file. There are about 48k of them. Am I missing anything in the
following? You can see that I am getting the total and iterating
by the number of docs found, but it's writing the same records over and
over again. How can I get the next set of records in the series?

from pyes import ES
from pyes import GeoBoundingBoxFilter, GeoDistanceFilter, \
    GeoPolygonFilter, FilteredQuery, MatchAllQuery

conn = ES('localhost:9200')

def make_corpus():
    try:
        gq = GeoDistanceFilter("geometry.coordinates", [72, 31], "400km")
        q = FilteredQuery(MatchAllQuery(), gq)
        rs = conn.search(query=q, indices=["getdata"])
        hits = rs['hits']
        total = hits['total']

        file = open('c:\\Temp\\corpus.txt', 'wb')
        for x in xrange(total):
            recs = hits['hits']
            for x in recs:
                source = x['_source']
                str = source['properties']['translated']
                text = str.split()
                final = ' '.join(text)
                file.writelines(final + '\n')
        file.close()

    except Exception as err:
        print err

Thanks,
Adam

You need to page through the results using from and size. The defaults are
from = 0 and size = 10, so by default you get the first 10 results; to get
the next 10 results you set from = 10 and leave size alone. I've never used
pyes before, but just glancing at it you will need to get a search object and
then set the from parameter on that. Don't return 48k results in a single
call.

gq = GeoDistanceFilter("geometry.coordinates", [72, 31], "400km")
q = FilteredQuery(MatchAllQuery(), gq)
qs = q.search()

rs = conn.search(query=qs, indices=["getdata"])
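The from/size windowing described above can be sketched without pyes at all; `page_offsets` here is an invented helper, not part of any library, just to show how the windows partition a result set:

```python
def page_offsets(total, size=10):
    """Yield (from, size) window pairs covering `total` hits."""
    for start in range(0, total, size):
        yield start, size

# 25 hits with the default size of 10 take three requests:
windows = list(page_offsets(25))
print(windows)  # [(0, 10), (10, 10), (20, 10)]
```

Each pair is one search call: the first request uses from=0, the next from=10, and so on until from reaches the total.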

Thanks,
Matt Weber

On Wed, Sep 7, 2011 at 11:16 AM, Adam Estrada estrada.adam@gmail.com wrote:


Sent too soon. After your first search you will know the total hit count;
divide this by size and then start executing more searches:

totalreq = total / 10
for i in range(1, totalreq):
    qs.start = i * 10
    rs = conn.search(query=qs, indices=["getdata"])

Something like that should work.
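One caveat with the integer division above: it drops any final partial page, so a handful of trailing hits can be missed. Rounding up avoids that; a minimal sketch (`num_requests` is an illustrative helper, not a pyes function):

```python
import math

def num_requests(total, size=10):
    """How many from/size requests are needed to cover `total` hits."""
    return math.ceil(total / size)

print(num_requests(48000))  # 4800
print(num_requests(48003))  # 4801 -- the last request returns just 3 hits
```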

Thanks,
Matt Weber

On Wed, Sep 7, 2011 at 11:35 AM, Matt Weber matt@mattweber.org wrote:


w00t! Thanks a lot! Corpus time!!! Actually, I need to do this to work
with Mahout, so if anyone has any tips for extracting topics out of
there, please let me know...

A

def make_corpus():
    try:
        gq = GeoDistanceFilter("geometry.coordinates", [72, 31], "400km")
        q = FilteredQuery(MatchAllQuery(), gq)
        qs = q.search()
        rs = conn.search(query=q, indices=["getdata"])
        total = rs['hits']['total']

        file = open('c:\\Temp\\corpus.csv', 'wb')
        print total
        totalReq = total / 10
        file.write('id' + ',' + 'text' + '\n')
        for i in xrange(1, totalReq):
            qs.start = i * 10
            rslt = conn.search(query=qs, indices=["getdata"])
            hits = rslt['hits']['hits']

            for x in hits:
                id = x['_id']
                source = x['_source']
                str = source['properties']['translated']
                text = str.split()
                final = ' '.join(text).strip()
                #file.writelines(final + '\n')
                file.writelines(id + ',' + final + '\n')
        file.close()

    except Exception as err:
        print err

On Sep 7, 2:39 pm, Matt Weber m...@mattweber.org wrote:


Actually, with that code you are skipping the first 10 results; make sure you
process the hits from the first conn.search call as well.
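That off-by-one is easy to show without pyes; `fake_search` below is an invented stand-in for `conn.search`, returning one from/size window of hit ids, and the loop starts at offset 0 so page zero is processed too:

```python
def fake_search(offset, size=10, total=25):
    """Invented stand-in for conn.search: one from/size window of hit ids."""
    return list(range(offset, min(offset + size, total)))

def all_hits(size=10, total=25):
    hits, offset = [], 0
    while True:
        page = fake_search(offset, size, total)
        if not page:
            break
        hits.extend(page)  # page 0 is processed too, not skipped
        offset += size
    return hits

print(len(all_hits()))  # 25 -- nothing skipped, nothing repeated
```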

Thanks,
Matt Weber

On Wed, Sep 7, 2011 at 12:56 PM, Adam Estrada estrada.adam@gmail.com wrote:


Good catch! How the heck do I increase the size of the fetch? I have
increased it to 1000 but it still fetches 10 at a time.

adam

On Sep 7, 5:10 pm, Matt Weber m...@mattweber.org wrote:


I dug through the sources and found that you can set the size directly
in the search:

rs = conn.search(query=q, indices=["getdata"], size=1000)

Cool!
A

On Sep 8, 11:31 am, Adam Estrada estrada.a...@gmail.com wrote:


By the way, a more performant way is to use the scan search type here; I'm not
sure how it's exposed in pyes.
http://www.elasticsearch.org/guide/reference/api/search/search-type.html.
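Whatever the client, scan/scroll boils down to: open a scroll, then keep requesting with the scroll id until an empty page comes back. `MockScroll` below is a toy stand-in for the server-side cursor, invented purely to show the loop shape, not a real client API:

```python
class MockScroll:
    """Toy stand-in for an Elasticsearch scroll cursor (not a real client)."""
    def __init__(self, docs, page_size):
        self._docs = docs
        self._page_size = page_size
        self._pos = 0
        self.scroll_id = "fake-scroll-id"  # the server would issue this

    def next_page(self):
        page = self._docs[self._pos:self._pos + self._page_size]
        self._pos += self._page_size
        return page

def drain(scroll):
    docs = []
    while True:
        page = scroll.next_page()
        if not page:  # an empty page signals the scroll is exhausted
            break
        docs.extend(page)
    return docs

print(len(drain(MockScroll(list(range(23)), page_size=10))))  # 23
```

Unlike from/size paging, a scroll never re-sorts or re-scores between requests, which is why it handles deep result sets like 48k docs much more cheaply.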

On Thu, Sep 8, 2011 at 6:59 PM, Adam Estrada estrada.adam@gmail.com wrote:


To use scan you can simply pass the scan type:

    resultset = conn.search(Search(q, size=1000),
                            "getdata",
                            search_type='scan',
                            scroll="10m")

    for obj in resultset:
        str = obj['properties']['translated']
        ....

The resultset fires the scan on the first call and iterates over your results
on the following calls (hiding the scroll id iteration).

Hi,
Alberto

On 08 Sep 2011, at 19:49, Shay Banon wrote:
