Convert into JSON for PERL module

Jerome · May 15, 2012, 8:32am

Hi!

I have a GenBank flat file (example: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)
and I want to index it.
My problem is the FEATURES section: as you can see, it has two levels, and
its contents differ from file to file. So I can't use the index
function for it.

I use index() for the data that are present in every document:

    $result = $es->index(
        index => $index,
        type => $type,
        id => $num_acc,
        data => {
            ACC => $num_acc,
            DESC => $desc,
            VERSION => $version,
            GI => $gi,
            ORGANISME => $espece,
            CLASSIFICATION => $classification,
            SEQUENCE => $seq
        },
    );

For the FEATURES I use the update function:

    loop_for_the_primaryTag {
        loop_for_the_tag_and_value_from_this_primaryTag {
            $result = $es->update(
                index => $index,
                type => $type,
                id => $num_acc,
                script => "if(ctx._source.$primary_tag){if(ctx._source.$primary_tag\
                ["$tag"] == null){ctx._source.$primary_tag["$tag"] = "$value"}
                else {ctx._source.$primary_tag["$tag"] += " $value"}}else
                {ctx._source.$primary_tag["$tag"] = "$value"}",
            );
        }
    }

I have written the script out in readable form at the bottom.
The goal is a document like this:

    {
        primary_tag : {
            tag : value,
            tag : value
        },
        primary_tag2 : {
            tag : value
        }
    }

But the script doesn't work. My first question is: where are the errors in
the script? Is it possible to do this at all?

I think I could index the data as JSON without this problem, but I tried
and failed to convert my file to JSON. My second and most important
question is: how can I convert this file to JSON (I read the docs on
CPAN but...) and how can I index it from Perl?

The script, pretty-printed:

    if(ctx._source.$primary_tag){
        if(ctx._source.$primary_tag["$tag"] == null){
            ctx._source.$primary_tag["$tag"] = "$value"
        }
        else {
            ctx._source.$primary_tag["$tag"] += " $value"
        }
    }
    else {
        ctx._source.$primary_tag["$tag"] = "$value"
    }

Thanks for your help!

I'll send an SOS to the World
I hope that someone gets my
Message in a forum

Clinton_Gormley · May 15, 2012, 9:41am

Hi Jerome

I have a GenBank flat file (example: GenBank Sample Record)
and I want to index it.
My problem is the FEATURES section: as you can see, it has two levels, and
its contents differ from file to file. So I can't use the index
function for it.

I'm afraid I don't really understand your post - rather difficult
without real data.

But... I think what you're trying to do may be a lot easier using Perl
directly, rather than trying to use the update() method.

I think you may be under the impression that you can only update docs in
Elasticsearch via update(). This is incorrect. You can retrieve the
doc using get(), change it, and reindex it using index().

You could just do something like:

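    my %doc = get_next_doc_from_flat_file_or_from_elasticsearch();
    while (my ($tag,$value) = get_next_tag()) {
        push @{$doc{$tag}}, $value;
    }
    # pass a reference to the hash; a bare %doc would be flattened into the argument list
    $es->index( index => $index, type => $type, id => $num_acc, data => \%doc );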

(this is of course pseudo-code, because I have no idea how you read the
raw data).
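
To make the idea a little more concrete, here is a minimal sketch of building the two-level FEATURES structure in plain Perl and turning it into JSON. It assumes the flat file has already been parsed into [primary_tag, tag, value] triples; the @qualifiers list, its values and the build_features() helper are made up for illustration. JSON::PP ships with Perl, and repeated tags are joined with a space, as the update script in the question tried to do.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical parsed input: one [primary_tag, tag, value] triple per qualifier.
# The values are illustrative only.
my @qualifiers = (
    [ 'source', 'organism', 'Saccharomyces cerevisiae' ],
    [ 'source', 'db_xref',  'taxon:4932' ],
    [ 'CDS',    'gene',     'AXL2' ],
    [ 'CDS',    'note',     'plasma membrane glycoprotein' ],
    [ 'CDS',    'note',     'second note' ],
);

# Hypothetical helper: fold the triples into { primary_tag => { tag => value } }.
sub build_features {
    my (@triples) = @_;
    my %features;
    for my $triple (@triples) {
        my ($primary_tag, $tag, $value) = @$triple;
        # The inner hash is created on first use (autovivification).
        if (defined $features{$primary_tag}{$tag}) {
            $features{$primary_tag}{$tag} .= " $value";   # repeated tag: append
        }
        else {
            $features{$primary_tag}{$tag} = $value;       # first occurrence
        }
    }
    return \%features;
}

my $features = build_features(@qualifiers);

# canonical() sorts the keys so the output is stable; pretty() adds indentation.
my $json = JSON::PP->new->canonical->pretty->encode($features);
print $json;
```

The resulting hash reference could be merged into the hash passed as data to index(), so the whole document is sent in one call and no update script is needed.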

clint

Jerome · May 15, 2012, 10:32am

Ah, sorry for the difficult post.

You said I can update my docs using "get" and "index".
If I re-index my documents, must I provide ALL the
document data, or can I provide only the new data?

This is why I used the update() function: it lets me update a document by
sending only the new data.

In your example, if I understand correctly, you load the document that is
present in Elasticsearch into a hash. Then you get the tag/value
pairs and add them before reindexing the hash.

I think that is a hash with only one level of imbrication, and my document has
more than one. My real problem is this second imbrication level. I
managed to index it by putting everything on one level, but that does
not make searching practical. I am also looking for a way to automatically
index data on several levels; these data are different each time,
making it impossible to define them by hand.

You can see the script here if you need it :

Clinton_Gormley · May 15, 2012, 11:30am

Hi Jerome

You said I can update my docs using "get" and "index".
If I re-index my documents, must I provide ALL the
document data, or can I provide only the new data?

This is why I used the update() function: it lets me update a document by
sending only the new data.

You can get() your existing doc, which will include all of the data
already in the doc, then add your new data, and reindex it.

This is essentially the same thing that update() does internally.

The advantage of using get() plus index() is that you can do it all
in a language you are familiar with, as opposed to trying to debug MVEL.
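
As a rough sketch of that read, modify, write cycle: the %store hash below is only an in-memory stand-in for the cluster so that the example runs without a server, and fetch_doc() and store_doc() are hypothetical wrappers around what would be $es->get() and $es->index() in real code. The document ID and field values are illustrative.

```perl
use strict;
use warnings;

# In-memory stand-in for the index, so the sketch runs without a server.
my %store = (
    U49845 => { ACC => 'U49845', DESC => 'example record' },
);

# Hypothetical helpers: in real code these would wrap $es->get() and $es->index().
sub fetch_doc {
    my ($id) = @_;
    my %copy = %{ $store{$id} };    # shallow copy of the stored document
    return \%copy;
}

sub store_doc {
    my ($id, $doc) = @_;
    $store{$id} = $doc;             # replaces the whole stored document
    return;
}

my $id  = 'U49845';
my $doc = fetch_doc($id);           # 1. read the whole document
$doc->{CDS}{gene} = 'AXL2';         # 2. change it in plain Perl, nested keys included
store_doc($id, $doc);               # 3. write the whole document back
```

The existing fields survive because the full document is written back, which is why there is no need to send "only the new data".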

In your example, if I understand correctly, you load the document that is
present in Elasticsearch into a hash. Then you get the tag/value
pairs and add them before reindexing the hash.

I think that is a hash with only one level of imbrication, and my document has
more than one. My real problem is this second imbrication level. I
managed to index it by putting everything on one level, but that does
not make searching practical. I am also looking for a way to automatically
index data on several levels; these data are different each time,
making it impossible to define them by hand.

I'm not sure what imbrication means, but I assume you're talking about a
structure like this:

    $doc = {
        name => 'Foo',
        one => {
            two => {
                three => {
                    tags => ['foo','bar','baz'],
                }
            }
        }
    };

This is easy to do in Perl. For instance, I could add a new tag to
'tags' with:

    push @{ $doc->{one}{two}{three}{tags} }, $new_tag;

You don't need to pre-create that structure. If your $doc looked like
this:

    $doc = { name => 'Foo' };

and you did this:

    push @{ $doc->{one}{two}{three}{tags} }, $new_tag;

then you'd end up with this:

    $doc = {
        name => 'Foo',
        one => {
            two => {
                three => {
                    tags => [$new_tag],
                }
            }
        }
    };
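
Here is a small runnable sketch of that behaviour (autovivification). It starts from a document with only a name, pushes two tags onto an array under the 'tags' key, and prints the result as JSON using JSON::PP, which ships with Perl. The tag values are just examples.

```perl
use strict;
use warnings;
use JSON::PP;

# Start with a flat document; none of the nested keys exist yet.
my $doc = { name => 'Foo' };

# Each missing level (one, two, three, tags) is created automatically
# the first time it is dereferenced for writing.
push @{ $doc->{one}{two}{three}{tags} }, 'foo';
push @{ $doc->{one}{two}{three}{tags} }, 'bar';

# canonical() sorts hash keys so the output order is predictable.
print JSON::PP->new->canonical->encode($doc), "\n";
# prints {"name":"Foo","one":{"two":{"three":{"tags":["foo","bar"]}}}}
```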

But this has nothing to do with Elasticsearch - it's basic Perl
references. Perhaps you should read 'perlreftut':

http://perldoc.perl.org/perlreftut.html

clint

Jerome · May 15, 2012, 12:43pm

Thanks.

I'm not sure what imbrication means, but I assume you're talking about a
structure

Yes, I was talking about that kind of structure.

I had never used a hash like that; I learned something today. ^^

I tried it and it works!

Thank you very much for this help !
