Gossamer Forum
hashes of 2D array

Hi,

I'm trying to read data in from a file which has many lines, each with 5 different fields. I'm trying to group the lines using Field1 as a hash key. As each value, I want to store a reference to a 2D array.

For example, if my file is:

Line1: Field1:A Field2:40 Field3:45 Field4:red Field5:blue
Line2: Field1:B Field2:34 Field3:87 Field4:green Field5:black
Line3: Field1:A Field2:33 Field3:44 Field4:blue Field5:green

then I want to make a hash with keys 'A' and 'B'. The value for $hash{A} is a reference to a 2D array, which in this case is an array of 2 arrays, the first holding all the elements of Line1, the second holding all the elements of Line3.

I'm dealing with hashes, references, and multidimensional data structures for the first time, and am having a bit of difficulty with this problem. Can anyone help? Basically, I'm trying to read the data in from a file format similar to the one above, and then access the elements in the 2D array. I think I would be able to do this if each step were separate, but putting it all together is proving quite difficult.

Thanks for any help...

Re: [tintin1978] hashes of 2D array
I'm totally confused but is this anywhere near?

Code:
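use strict;
use warnings;

my %hash;

open my $fh, '<', 'blah.txt' or die "Cannot open blah.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    # Splitting on the "FieldN:" labels leaves the "LineN:" prefix as the first element.
    my @row = split /\s*Field\d+:/, $line;
    shift @row;                          # discard the "LineN:" prefix
    next unless @row;                    # skip lines with no fields
    push @{ $hash{ $row[0] } }, [@row];  # append this row to the 2D array for its Field1 key
}
close $fh;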

After that you lost me a bit.

Last edited by: Paul, Aug 15, 2002, 8:31 AM

Re: [Paul] hashes of 2D array
Basically, I'm trying to look not at the whole row but at the specific elements of each row, to see whether they are similar (the number fields). So I need to separate each row into an array, so that I can access each element.

Let's say we're checking whether Field2 is similar in each case:

$value1 = $ref_to_AoA->[0][1]; # I think this is how you dereference, right? Field 2 in line 1

$value2 = $ref_to_AoA->[2][1]; # field 2 in line 3

then compare $value1 and $value2.
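
A minimal sketch of that access pattern, assuming the rows are grouped under a hash named %hash (a hypothetical name) so that $hash{A} is the reference to the 2D array. Within $hash{A} the rows are numbered from 0 in the order they were added, so Line3 is row 1 of key 'A', not row 2. The "within 10" test for similarity is only an assumed definition for illustration.

```perl
use strict;
use warnings;

# Hypothetical data: the rows from the example file, grouped by Field1.
# Each row keeps Field1 at index 0, so Field2 is at index 1.
my %hash = (
    A => [ [ 'A', 40, 45, 'red',   'blue'  ],
           [ 'A', 33, 44, 'blue',  'green' ] ],
    B => [ [ 'B', 34, 87, 'green', 'black' ] ],
);

my $ref_to_AoA = $hash{A};          # reference to the 2D array for key 'A'
my $value1 = $ref_to_AoA->[0][1];   # Field2 of the first 'A' row (Line1)
my $value2 = $ref_to_AoA->[1][1];   # Field2 of the second 'A' row (Line3)

# Assumed definition of "similar": the two values differ by at most 10.
my $similar = abs($value1 - $value2) <= 10 ? 1 : 0;
print "Field2 values: $value1 and $value2; similar: $similar\n";
```

The arrow between subscripts is optional, so $hash{A}[1][1] reads the same element without the intermediate variable.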

I have huge files, and want to remove all the lines where Field1 is identical and Field2 is similar, hence making a hash key for each 'Field1'.

Again, I'm sorry if I'm not very clear - this is really confusing for me as I'm just learning on my own. Thanks very much for your help...

Cheers

Re: [tintin1978] hashes of 2D array
How exactly is the file delimited? You don't actually have Field1:, Field2:, etc. all through your file, do you?

Last edited by: Paul, Aug 15, 2002, 8:56 AM

Re: [Paul] hashes of 2D array
No, I have the file delimited already. I was just using a hypothetical example. The actual file is tab-delimited (although it doesn't show here).

file.txt:
----------------------------------------------------------------------------------------
0.0 3 981 343 1389 ENSP00000242839
0.0 3 960 343 1366 ENSP00000242839
0.0 2 966 377 1432 ENSP00000262788
0.0 3 999 343 1415 ENSP00000242839
1e-141 5 961 345 1367 ENSP00000242839
1e-136 2 961 377 1427 ENSP00000297957
1e-136 2 961 377 1427 ENSP00000262788
----------------------------------------------------------------------------------------

The file is output from a protein sequence comparison program. Each line is a hit from comparing my query protein against a database of other proteins.
Field 1: score
Field 2: start position of hit on query
Field 3: end position of hit on query
Field 4: start position of hit on database match
Field 5: end position of hit on database match
Field 6: unique identifier for protein in database that matches my protein.
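
As a rough sketch of how lines in this layout could be grouped, the snippet below splits each tab-delimited line into its six fields and uses Field 6 as the hash key, with each value being a reference to a 2D array of that protein's hits. The name %hits is hypothetical, and the lines are held in an array rather than read from file.txt so that the example is self-contained.

```perl
use strict;
use warnings;

# A few sample lines from file.txt, rebuilt with tab separators.
my @lines = (
    "0.0\t3\t981\t343\t1389\tENSP00000242839\n",
    "0.0\t3\t960\t343\t1366\tENSP00000242839\n",
    "0.0\t2\t966\t377\t1432\tENSP00000262788\n",
    "1e-141\t5\t961\t345\t1367\tENSP00000242839\n",
);

my %hits;   # hypothetical name: protein identifier => reference to a 2D array
for my $line (@lines) {
    chomp(my $copy = $line);
    my @fields = split /\t/, $copy;
    next unless @fields == 6;                  # skip malformed lines
    push @{ $hits{ $fields[5] } }, \@fields;   # one inner array per hit
}

# Report the number of hits per protein and the query range of the first hit.
for my $id (sort keys %hits) {
    my $rows = $hits{$id};
    printf "%s: %d hit(s), first query start-end %d-%d\n",
        $id, scalar @$rows, $rows->[0][1], $rows->[0][2];
}
```

Because @fields is declared inside the loop, each iteration stores a reference to a fresh array, so the rows do not overwrite one another.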


If you look at the first and second lines, you see that there is a hit on the same protein, in roughly the same, but slightly different, positions. This also happens with the fourth and fifth lines, and throughout the files, giving lots of redundancy in my results. I want to remove the second (almost identical) hit from the file.

Any ideas?

Cheers

Re: [tintin1978] hashes of 2D array
This looks rather messy.

Is there a lot of variability in the ENSP000000xxxxx values? Meaning, are there lots of different proteins hit compared to the number of positions they show up at? Or do you have more positions showing up per protein?

As well, for the guesstimation of matches, would it be alright to chop off one sig fig?

Re: [Aki] hashes of 2D array
Yeah, a search of one yeast protein has resulted in high-scoring hits to lots of different human proteins, implying that they are all related (e.g. all part of one protein family).

The scoring method is what causes multiple hits which vary slightly. Maybe there is a slight difference in alignment, or maybe an insertion or deletion is causing it. Either way, I've ended up with anywhere from 1 to 10 copies of each human protein hit by the yeast protein, and I want to get rid of all but the longest one (I've made a 'length' element in the array as the difference between the start and end of the hit).
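
One way the "keep only the longest hit per protein" step could look is sketched below. It assumes that length means the end position minus the start position on the query (Field 3 minus Field 2), which is one reading of the description above, and the names %longest and @order are hypothetical. A single pass is enough, because only the best row seen so far for each identifier has to be remembered.

```perl
use strict;
use warnings;

# The sample lines from file.txt, rebuilt with tab separators.
my @lines = (
    "0.0\t3\t981\t343\t1389\tENSP00000242839\n",
    "0.0\t3\t960\t343\t1366\tENSP00000242839\n",
    "0.0\t2\t966\t377\t1432\tENSP00000262788\n",
    "0.0\t3\t999\t343\t1415\tENSP00000242839\n",
    "1e-141\t5\t961\t345\t1367\tENSP00000242839\n",
    "1e-136\t2\t961\t377\t1427\tENSP00000297957\n",
    "1e-136\t2\t961\t377\t1427\tENSP00000262788\n",
);

my %longest;   # protein identifier => [ length, reference to the fields ]
my @order;     # identifiers in order of first appearance
for my $line (@lines) {
    chomp(my $copy = $line);
    my @f = split /\t/, $copy;
    next unless @f == 6;                       # skip malformed lines
    my ($start, $end, $id) = @f[1, 2, 5];
    my $length = $end - $start;                # assumed definition of hit length
    push @order, $id unless exists $longest{$id};
    if (!exists $longest{$id} || $length > $longest{$id}[0]) {
        $longest{$id} = [ $length, \@f ];      # remember the longest hit so far
    }
}

# Print the surviving hit for each protein, in the original order.
print join("\t", @{ $longest{$_}[1] }), "\n" for @order;
```

With roughly 350 lines per file this stays fast, since each line is examined once and only one row per identifier is held in memory.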

I'm not too sure what you mean by guesstimation of matches, and chopping off one sig fig... can you explain a little more?

cheers

Re: [tintin1978] hashes of 2D array
How many points of data do you have to sift through? (implications on algo)

Code:
0.0 3 981 343 1389 ENSP00000242839
0.0 3 800 343 1366 ENSP00000242839

Would a difference as drastic as that be considered two different entries, or is that something that shouldn't happen?

Re: [Aki] hashes of 2D array
There are, on average, about 350 lines/entries to consider, thus 350 arrays created...

In the one test file I'm using, there are 64 different unique identifiers (ENSP.....nnnnnnn), but in some files there are hundreds.

In the example you give, I'd want to include the first entry and remove the second. The second is shorter in length, and also has many more gaps in the alignment - it would need these gaps to fill in the difference between the query-sequence start-end points (797) and the match-sequence start-end points (1023).
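
The two figures quoted above can be checked with a short calculation. The variable names are hypothetical, and treating the difference between the two spans as a rough indication of how much gap the alignment needs is an assumption, not something the comparison program reports.

```perl
use strict;
use warnings;

# The second entry from the example: query 3-800, database match 343-1366.
my ($q_start, $q_end, $m_start, $m_end) = (3, 800, 343, 1366);

my $query_span = $q_end - $q_start;             # 800 - 3    = 797
my $match_span = $m_end - $m_start;             # 1366 - 343 = 1023
my $difference = $match_span - $query_span;     # 1023 - 797 = 226

print "query span $query_span, match span $match_span, difference $difference\n";
```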

Re: [tintin1978] hashes of 2D array
Hello tintin1978,

Would it not be easier to use an SQL database?

That is, read each line from the text file and compute the differences you need.

That way you can store the difference as a field as well.

Create a single table, or a one-to-many relationship keyed on protein.

Then use SQL queries to get your required results.

Thanks

Kode

Re: [kode] hashes of 2D array
Yeah, I thought about that, but each query was taking too long, and as I have 1000s of files to consider, the total run time would have been measured in weeks...

Thanks though.