Wikipedia River: Index only title

Matt_Arkin · June 25, 2013, 12:35am

I'm using the Wikipedia River. Is there a way to index only the
title of each Wikipedia article?

If I do this, can I still access the entire article text through
Elasticsearch using the standard /index/type/id?

Thanks so much,

Matt

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ivan · June 25, 2013, 12:43am

IIRC, the Wikipedia river does not provide a mapping at startup, so you
can supply your own mapping with only one field and dynamic set to false:

http://www.elasticsearch.org/guide/reference/mapping/object-type/

--
Ivan
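
To make the suggestion above concrete, here is a minimal sketch of a title-only mapping. It assumes an Elasticsearch release from the era of this thread (where text fields use the `string` type) and assumes the river stores the article title in a field named `title`. `ES_URL`, `INDEX_NAME`, and the function name are hypothetical; the type name `page` is taken from the command quoted in this thread. With `dynamic` set to `false`, unmapped fields are not indexed or searchable, but they stay in the stored `_source`, so fetching a document by /index/type/id should still return the full article text as long as `_source` is left enabled.

```shell
# Hypothetical settings; adjust to your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX_NAME="${INDEX_NAME:-ptwikiss}"

# Only "title" is mapped. "dynamic": false tells Elasticsearch to ignore
# (not index) any field that is absent from this mapping.
# Assumption: the river writes the article title to a field called "title".
TITLE_ONLY_MAPPING='{
  "mappings": {
    "page": {
      "dynamic": false,
      "properties": {
        "title": { "type": "string" }
      }
    }
  }
}'

# Creates the index with the mapping above. Nothing is sent until this is called.
create_title_only_index() {
  curl -XPUT "$ES_URL/$INDEX_NAME" -d "$TITLE_ONLY_MAPPING"
}
```

Running `create_title_only_index` sends the request; the index has to be created this way before the river is registered.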


Matt_Arkin · June 25, 2013, 12:50am

Probably a very stupid question (it's my second day with Elasticsearch): can
I supply a mapping when I make this curl request, or do I need to create the
river first, create a mapping, and then make the call below?

curl -XPUT localhost:9200/ptwikiss/page/_meta -d '
{
  "type" : "wikipedia",
  "wikipedia" : {
    "url" : "http://dumps.wikimedia.org/ptwiki/20130622/ptwiki-20130622-pages-articles.xml.bz2"
  }
}
'

Thanks,
Matt


Ivan · June 25, 2013, 1:24am

Create the index with the correct mapping first, then create the river
using the same index name. The river will attempt to create the index
itself if it is missing, so make sure the index already exists with your
mapping before the river starts.
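
The ordering described above can be sketched as two requests, index first and river second. This is only a sketch under assumptions: the river is registered under `_river/<name>/_meta`, which was the usual convention for rivers (the command quoted in this thread targets a different path), the title field is assumed to be named `title`, and `ES_URL`, `RIVER_NAME`, and the function names are hypothetical.

```shell
# Hypothetical settings; the same name is used for the index and the river,
# because the river defaults its index name to its own name.
ES_URL="${ES_URL:-http://localhost:9200}"
RIVER_NAME="${RIVER_NAME:-ptwikiss}"
DUMP_URL="http://dumps.wikimedia.org/ptwiki/20130622/ptwiki-20130622-pages-articles.xml.bz2"

# Step 1: create the index with a title-only mapping
# (assumes the river stores the title in a field named "title").
create_index() {
  curl -sS -f -XPUT "$ES_URL/$RIVER_NAME" -d '{
    "mappings": {
      "page": {
        "dynamic": false,
        "properties": { "title": { "type": "string" } }
      }
    }
  }'
}

# Step 2: register the river (assumed endpoint: _river/<name>/_meta).
create_river() {
  curl -sS -f -XPUT "$ES_URL/_river/$RIVER_NAME/_meta" -d "{
    \"type\": \"wikipedia\",
    \"wikipedia\": { \"url\": \"$DUMP_URL\" }
  }"
}

# Register the river only if the index was created successfully.
setup_title_only_river() {
  create_index && create_river
}
```

Calling `setup_title_only_river` runs both steps in order; because of `curl -f`, an HTTP error on the first request stops the second from being sent.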

I have never used the Wikipedia river, but that is what I can gather by
looking at the source:

https://github.com/elastic/elasticsearch-river-wikipedia/blob/master/src/main/java/org/elasticsearch/river/wikipedia/WikipediaRiver.java#L107


      
              this.bulkSize = XContentMapValues.nodeIntegerValue(indexSettings.get("bulk_size"), 100);
              this.bulkFlushInterval = TimeValue.parseTimeValue(XContentMapValues.nodeStringValue(
                      indexSettings.get("flush_interval"), "5s"), TimeValue.timeValueSeconds(5));
              this.maxConcurrentBulk = XContentMapValues.nodeIntegerValue(indexSettings.get("max_concurrent_bulk"), 1);
          } else {
              this.indexName = riverName.name();
              this.typeName = "page";
              this.bulkSize = 100;
              this.maxConcurrentBulk = 1;
              this.bulkFlushInterval = TimeValue.timeValueSeconds(5);
          }
          
          WikiXMLParser xmlParser = WikiXMLParserFactory.getSAXParser(this.url);
          try {
              xmlParser.setPageCallback(new PageCallback());
          } catch (Exception e) {
              logger.error("failed to create xmlParser", e);
              return;
          }
          parser = new Parser(xmlParser);
          thread = EsExecutors.daemonThreadFactory(settings.globalSettings(), "wikipedia_slurper").newThread(parser);
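
For reference, the settings read in the excerpt above could be spelled out explicitly in the river definition. The sketch below uses the default values visible in the excerpt (`bulk_size` 100, `flush_interval` "5s", `max_concurrent_bulk` 1, type "page"). Several things here are assumptions, since the excerpt is cut off before them: that these settings live under a top-level `"index"` object, that the index and type names are given by `"index"` and `"type"` keys inside it, and that the river is registered under `_river/<name>/_meta`. `ES_URL` and the function name are hypothetical.

```shell
# River definition with the index settings written out explicitly.
# Assumed layout: settings sit in a top-level "index" object, and the
# "index"/"type" keys name the target index and type.
RIVER_META='{
  "type": "wikipedia",
  "wikipedia": {
    "url": "http://dumps.wikimedia.org/ptwiki/20130622/ptwiki-20130622-pages-articles.xml.bz2"
  },
  "index": {
    "index": "ptwikiss",
    "type": "page",
    "bulk_size": 100,
    "flush_interval": "5s",
    "max_concurrent_bulk": 1
  }
}'

# Registers the river (assumed endpoint). Nothing is sent until this is called.
register_river() {
  curl -sS -f -XPUT "${ES_URL:-http://localhost:9200}/_river/ptwikiss/_meta" -d "$RIVER_META"
}
```

Whichever index name is used, it should match the index that was created beforehand with the restricted mapping.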

