Convert into JSON for PERL module


(Jérome) #1

Hi!

I have GenBank flat files (example: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)
and I want to index them.
My problem is the FEATURES section: as you can see, the features are on
two levels and they differ from one file to another, so I can't use the
index function for them.

I use it to index the data that is present in every document:

$result = $es->index(
    index => $index,
    type  => $type,
    id    => $num_acc,
    data  => {
        ACC            => $num_acc,
        DESC           => $desc,
        VERSION        => $version,
        GI             => $gi,
        ORGANISME      => $espece,
        CLASSIFICATION => $classification,
        SEQUENCE       => $seq
    },
);

For the FEATURES I use the update function:

loop_for_the_primaryTag {
    loop_for_the_tag_and_value_from_this_primaryTag {
        $result = $es->update(
            index  => $index,
            type   => $type,
            id     => $num_acc,
            script => "if(ctx._source.$primary_tag){if(ctx._source.$primary_tag["$tag"] == null){ctx._source.$primary_tag["$tag"] = "$value"}
                else {ctx._source.$primary_tag["$tag"] += " $value"}}else
                {ctx._source.$primary_tag["$tag"] = "$value"}",
        );
    }
}

I have written the script out in readable form at the bottom.
The goal is a document like this:
{
    primary_tag : {
        tag : value,
        tag : value
    },
    primary_tag2 : {
        tag : value
    }
}

But the script doesn't work. My first question is: where are the errors
in the script? Is it even possible to do this?

I think I could index the data as JSON and avoid this problem, but I
tried and failed to convert my file to JSON. My second and most important
question is: how can I convert this file to JSON (I read the docs on
CPAN, but...) and how can I index it from Perl?


The script, pretty-printed:


if(ctx._source.$primary_tag){
    if(ctx._source.$primary_tag["$tag"] == null){
        ctx._source.$primary_tag["$tag"] = "$value"
    }
    else {
        ctx._source.$primary_tag["$tag"] += " $value"
    }
}
else {
    ctx._source.$primary_tag["$tag"] = "$value"
}

Thanks for your help!

I'll send an SOS to the World
I hope that someone gets my
Message in a forum


(Clinton Gormley) #2

Hi Jerome

I have GenBank flat files (example: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)
and I want to index them.
My problem is the FEATURES section: as you can see, the features are on
two levels and they differ from one file to another, so I can't use the
index function for them.

I'm afraid I don't really understand your post - rather difficult
without real data.

But... I think what you're trying to do may be a lot easier using Perl
directly, rather than trying to use the update() method.

I think you may be under the impression that you can only update docs in
ElasticSearch via update(). This is incorrect. You can retrieve the
doc using get(), change it, and reindex it using index().

You could just do something like:

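my %doc = get_next_doc_from_flat_file_or_from_elasticsearch();
while ( my ($tag, $value) = get_next_tag() ) {
    push @{ $doc{$tag} }, $value;
}
# data expects a hash reference, so pass \%doc rather than %doc
$es->index( index => $index, type => $type, id => $num_acc, data => \%doc );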

(this is of course pseudo-code, because I have no idea how you read the
raw data).
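
To make the pseudo-code above concrete, here is a minimal sketch of building the two-level FEATURES structure in plain Perl and turning it into JSON. It assumes the parser yields (primary_tag, tag, value) triples; the build_doc helper and the sample values are illustrative, not taken from Jérome's script. JSON::PP ships with Perl, so nothing extra needs installing.

```perl
use strict;
use warnings;
use JSON::PP;

# Illustrative sample data: (primary_tag, tag, value) triples as they might
# come out of a GenBank FEATURES parser. The parser itself is not shown.
my @features = (
    [ source => organism => 'Saccharomyces cerevisiae' ],
    [ source => db_xref  => 'taxon:4932' ],
    [ CDS    => gene     => 'AXL2' ],
    [ CDS    => gene     => 'REV7' ],
);

# build_doc is a hypothetical helper name, not part of any module.
sub build_doc {
    my ($num_acc, @triples) = @_;
    my %doc = ( ACC => $num_acc );
    for my $t (@triples) {
        my ($primary_tag, $tag, $value) = @$t;
        # Perl creates $doc{$primary_tag} and the inner array on first use.
        push @{ $doc{$primary_tag}{$tag} }, $value;
    }
    return \%doc;
}

my $doc  = build_doc('U49845', @features);

# canonical() sorts keys so the output is stable; pretty() adds indentation.
my $json = JSON::PP->new->canonical->pretty->encode($doc);
print $json;
```

The hash reference in $doc is the kind of value that would be passed as data => $doc to index(); the JSON string is only needed if you want to inspect or store the document yourself.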

clint



(Jérome) #3

Ah, sorry for the confusing post.

You said I can update my docs using "get" and "index".
If I re-index a document, do I have to provide ALL of the document's
data, or can I provide only the new data?

That is why I used the update() function: it lets me update a document
by sending only the new data.

In your example, if I understand correctly, you load the document that
is already in ElasticSearch into a hash. Then you get the tag/value
pairs and add them to the hash before reindexing it.

I think that is a hash with only one level of imbrication, and my
document has more than one. My real problem is this second imbrication
level: I managed to index everything by putting it all on one level, but
that is not practical for searching. I am also looking for a way to
automatically index data on several levels; the data is different each
time, so I cannot define the structure by hand.

You can see the script here if you need it:
http://dl.free.fr/jyVhHesXY



(Clinton Gormley) #4

Hi Jerome

You said I can update my docs using "get" and "index".
If I re-index a document, do I have to provide ALL of the document's
data, or can I provide only the new data?

That is why I used the update() function: it lets me update a document
by sending only the new data.

You can get() your existing doc, which will include all of the data
already in the doc, then add your new data, and reindex it.

This is essentially the same thing that update() does internally.

The advantage of using get() plus index() is that you can do it all
in a language you are familiar with, as opposed to trying to debug MVEL.
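
A rough sketch of that get(), modify, index() cycle follows. So that it runs without a server, it uses FakeStore, a hypothetical in-memory stand-in; it is not the ElasticSearch module. Only the get()/index() names and the index/type/id/data parameters mirror the calls used in this thread, and the _source key in the get() result is an assumption about the response shape. The index and type names are made up.

```perl
use strict;
use warnings;

# FakeStore: hypothetical in-memory stand-in for the real client.
package FakeStore;
use Storable qw(dclone);

sub new {
    my $class = shift;
    return bless( { docs => {} }, $class );
}

# Store the document under its id, replacing any previous version.
sub index {
    my ($self, %p) = @_;
    $self->{docs}{ $p{id} } = dclone( $p{data} );
    return 1;
}

# Return a copy, as a server would: changing it does not change the store.
sub get {
    my ($self, %p) = @_;
    return { _source => dclone( $self->{docs}{ $p{id} } ) };
}

package main;

my $es    = FakeStore->new;
my %where = ( index => 'genbank', type => 'record', id => 'U49845' );

# First pass: index the top-level fields only.
$es->index( %where, data => { ACC => 'U49845', DESC => 'example record' } );

# Later: fetch the whole document, add nested data in plain Perl, reindex.
my $doc = $es->get(%where)->{_source};
push @{ $doc->{CDS}{gene} }, 'AXL2';
$es->index( %where, data => $doc );

my $stored = $es->get(%where)->{_source};
print join( ', ', sort keys %$stored ), "\n";
```

The point of the sketch is that the reindexed document carries both the old fields and the new nested ones, because the full document was fetched before it was changed.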

In your example, if I understand correctly, you load the document that
is already in ElasticSearch into a hash. Then you get the tag/value
pairs and add them to the hash before reindexing it.

I think that is a hash with only one level of imbrication, and my
document has more than one. My real problem is this second imbrication
level: I managed to index everything by putting it all on one level, but
that is not practical for searching. I am also looking for a way to
automatically index data on several levels; the data is different each
time, so I cannot define the structure by hand.

I'm not sure what imbrication means, but I assume you're talking about a
structure like this:

$doc = {
    name => 'Foo',
    one  => {
        two => {
            three => {
                tags => ['foo','bar','baz'],
            }
        }
    }
};

This is easy to do in Perl. For instance, I could add a new tag to
'tags' with:

push @{ $doc->{one}{two}{three}{tags} }, $new_tag;

You don't need to pre-create that structure. If your $doc looked like
this:

$doc = { name => 'Foo' };

and you did this:

push @{ $doc->{one}{two}{three}{tags} }, $new_tag;

then you'd end up with this:

$doc = {
    name => 'Foo',
    one  => {
        two => {
            three => {
                tags => [$new_tag],
            }
        }
    }
};

But this has nothing to do with ElasticSearch - it's basic Perl
references. Perhaps you should read 'perlreftut':

http://perldoc.perl.org/perlreftut.html
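
As a small runnable illustration of the behaviour described above (Perl calls it autovivification), here is a sketch using made-up keys. It also shows one caveat worth knowing: simply reading a deep path creates the intermediate levels too.

```perl
use strict;
use warnings;

my $doc = { name => 'Foo' };

# None of one, two, three or tags exists yet; push creates them all.
push @{ $doc->{one}{two}{three}{tags} }, 'foo';
push @{ $doc->{one}{two}{three}{tags} }, 'bar';

# Caveat: reading a deep path also creates the intermediate level.
my $probe = $doc->{x}{y};    # $doc->{x} now exists as an empty hash
my $has_x = exists $doc->{x} ? 1 : 0;

# Checking level by level avoids that side effect: the second exists is
# never evaluated when the first one is false.
my $has_b     = ( exists $doc->{a} && exists $doc->{a}{b} ) ? 1 : 0;
my $a_created = exists $doc->{a} ? 1 : 0;

print "tags: @{ $doc->{one}{two}{three}{tags} }\n";
print "x created by read: $has_x, a created by check: $a_created\n";
```

For building a document before indexing, the write-side behaviour is exactly what you want; the read-side caveat mostly matters if stray empty hashes would end up in the indexed data.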

clint


(Jérome) #5

Thanks.

I'm not sure what imbrication means, but I assume you're talking about a
structure

Yes, I was talking about that kind of structure.

I had never used hashes like that; I learned something today. ^^

I tried it and it works!

Thank you very much for your help!


