Number of Docs Returned

All,

I am doing a search using the pyes GeoDistanceFilter and I am only
getting the top 10 results at a time. That is by design, right? I want
to be able to retrieve all of the docs at once and write them to a
file. There are about 48k of them. Am I missing anything in the
following? You can see that I am getting the total and iterating
by the number of docs found, but it's writing the same records over and
over again. How can I get the next set of records in the series?

from pyes import ES
from pyes import GeoBoundingBoxFilter, GeoDistanceFilter, \
    GeoPolygonFilter, FilteredQuery, MatchAllQuery

conn = ES('localhost:9200')

def make_corpus():
    try:
        gq = GeoDistanceFilter("geometry.coordinates", [72, 31], "400km")
        q = FilteredQuery(MatchAllQuery(), gq)
        rs = conn.search(query=q, indices=["getdata"])
        hits = rs['hits']
        total = hits['total']

        file = open('c:\\Temp\\corpus.txt', 'wb')
        for x in xrange(total):
            recs = hits['hits']
            for x in recs:
                source = x['_source']
                str = source['properties']['translated']
                text = str.split()
                final = ' '.join(text)
                file.writelines(final + '\n')
        file.close()

    except Exception as err:
        print err

Thanks,
Adam

You need to page through the results using from and size. The defaults are
from = 0 and size = 10, so by default you get the first 10 results; to get
the next 10 results you set from = 10 and leave size alone. I've never used
pyes before, but just glancing at it you will need to get a search object and
then set the from parameter on that. Don't return 48k results in a single
call.

gq = GeoDistanceFilter("geometry.coordinates", [72, 31], "400km")
q = FilteredQuery(MatchAllQuery(), gq)
qs = q.search()

rs = conn.search(query=qs, indices=["getdata"])
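The from/size windowing described above can be sketched without pyes at all; `page_offsets` here is an invented helper, not part of any library, just to show how the windows partition a result set:

```python
def page_offsets(total, size=10):
    """Yield (from, size) window pairs covering `total` hits."""
    for start in range(0, total, size):
        yield start, size

# 25 hits with the default size of 10 take three requests:
windows = list(page_offsets(25))
print(windows)  # [(0, 10), (10, 10), (20, 10)]
```

Each pair is one search call: the first request uses from=0, the next from=10, and so on until from reaches the total.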

Thanks,
Matt Weber

On Wed, Sep 7, 2011 at 11:16 AM, Adam Estrada estrada.adam@gmail.com wrote:


Sent too soon. After your first search you will know the total hit count;
divide this by size and then start executing more searches:

totalreq = total / 10
for i in range(1, totalreq):
    qs.start = i * 10
    rs = conn.search(query=qs, indices=["getdata"])

Something like that should work.
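One caveat with the integer division above: it drops any final partial page, so a handful of trailing hits can be missed. Rounding up avoids that; a minimal sketch (`num_requests` is an illustrative helper, not a pyes function):

```python
import math

def num_requests(total, size=10):
    """How many from/size requests are needed to cover `total` hits."""
    return math.ceil(total / size)

print(num_requests(48000))  # 4800
print(num_requests(48003))  # 4801 -- the last request returns just 3 hits
```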

Thanks,
Matt Weber

On Wed, Sep 7, 2011 at 11:35 AM, Matt Weber matt@mattweber.org wrote:


w00t! Thanks a lot! Corpus time!!! Actually, I need to do this to work
with Mahout, so if anyone has any tips for extracting topics out of
there, please let me know...

A

def make_corpus():
    try:
        gq = GeoDistanceFilter("geometry.coordinates", [72, 31], "400km")
        q = FilteredQuery(MatchAllQuery(), gq)
        qs = q.search()
        rs = conn.search(query=q, indices=["getdata"])
        total = rs['hits']['total']

        file = open('c:\\Temp\\corpus.csv', 'wb')
        print total
        totalReq = total / 10
        file.write('id' + ',' + 'text' + '\n')
        for i in xrange(1, totalReq):
            qs.start = i * 10
            rslt = conn.search(query=qs, indices=["getdata"])
            hits = rslt['hits']['hits']

            for x in hits:
                id = x['_id']
                source = x['_source']
                str = source['properties']['translated']
                text = str.split()
                final = ' '.join(text).strip()
                #file.writelines(final + '\n')
                file.writelines(id + ',' + final + '\n')
        file.close()

    except Exception as err:
        print err

On Sep 7, 2:39 pm, Matt Weber m...@mattweber.org wrote:


Actually, with that code you are skipping the first 10 results; make sure you
process the hits from the first conn.search call as well.
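That off-by-one is easy to show without pyes; `fake_search` below is an invented stand-in for `conn.search`, returning one from/size window of hit ids, and the loop starts at offset 0 so page zero is processed too:

```python
def fake_search(offset, size=10, total=25):
    """Invented stand-in for conn.search: one from/size window of hit ids."""
    return list(range(offset, min(offset + size, total)))

def all_hits(size=10, total=25):
    hits, offset = [], 0
    while True:
        page = fake_search(offset, size, total)
        if not page:
            break
        hits.extend(page)  # page 0 is processed too, not skipped
        offset += size
    return hits

print(len(all_hits()))  # 25 -- nothing skipped, nothing repeated
```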

Thanks,
Matt Weber

On Wed, Sep 7, 2011 at 12:56 PM, Adam Estrada estrada.adam@gmail.com wrote:


Good catch! How the heck do I increase the size of the fetch? I have
increased it to 1000 but it still fetches 10 at a time.

adam

On Sep 7, 5:10 pm, Matt Weber m...@mattweber.org wrote:


I dug through the sources and found that you can set the size directly
in the search:

rs = conn.search(query=q, indices=["getdata"], size=1000)

Cool!
A

On Sep 8, 11:31 am, Adam Estrada estrada.a...@gmail.com wrote:


By the way, a more performant way is to use the scan search type here; I'm not
sure how it's exposed in pyes.
http://www.elasticsearch.org/guide/reference/api/search/search-type.html.
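Whatever the client, scan/scroll boils down to: open a scroll, then keep requesting with the scroll id until an empty page comes back. `MockScroll` below is a toy stand-in for the server-side cursor, invented purely to show the loop shape, not a real client API:

```python
class MockScroll:
    """Toy stand-in for an Elasticsearch scroll cursor (not a real client)."""
    def __init__(self, docs, page_size):
        self._docs = docs
        self._page_size = page_size
        self._pos = 0
        self.scroll_id = "fake-scroll-id"  # the server would issue this

    def next_page(self):
        page = self._docs[self._pos:self._pos + self._page_size]
        self._pos += self._page_size
        return page

def drain(scroll):
    docs = []
    while True:
        page = scroll.next_page()
        if not page:  # an empty page signals the scroll is exhausted
            break
        docs.extend(page)
    return docs

print(len(drain(MockScroll(list(range(23)), page_size=10))))  # 23
```

Unlike from/size paging, a scroll never re-sorts or re-scores between requests, which is why it handles deep result sets like 48k docs much more cheaply.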

On Thu, Sep 8, 2011 at 6:59 PM, Adam Estrada estrada.adam@gmail.com wrote:


To use scan you can simply pass the scan type:

    resultset = conn.search(Search(q, size=1000),
                            "getdata",
                            search_type='scan',
                            scroll="10m")

    for obj in resultset:
        str = obj['properties']['translated']
        ....

The resultset fires the scan on the first call and iterates over your results
on the following calls (hiding the scroll id iteration).

Hi,
Alberto

On 08 Sep 2011, at 19:49, Shay Banon wrote:
