Wednesday, June 17, 2015

UniPirate: Downloading and Sorting Big Data (Protein Database)

I started downloading more of the protein codes. Rather than write a separate regex script to mess with them afterwards (deleting HTML, JS, and bank data), I decided to do the cleanup while the file is still in the program's RAM (ha) and delete the junk right then using Replace(string,"X","Y").

Here I am just checking the packets from the server in the background, looking for the signature of the mess that appears when the P-number, i.e. http://www.(baseurl)/P99314, does not exist in the database. That page has a distinct error marker in its HTML, though (note <title>Error</title>), so a

// Empty slot: the server sent its error page, so start over with a new number.
if (WebBrowser.DocumentText.Contains("Error")) {
    main();
}

is a simple way to just jump over empty slots. The database is so huge that we really don't need to go through it systematically. In fact, the empty slots are not all at the large numbers; they appear to be scattered throughout, so a shotgun blast (i.e. bad aim, through a random number generator) is the easiest method, and it should give a reasonably fair spread across multiple organism types. Downloading the .fasta provides the sequence along with some other information about the string:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content="text/html; charset=windows-1252" http-equiv=Content-Type></HEAD>
<BODY><PRE>&gt;sp|P42889|COL_MYOCO Colipase OS=Myocastor coypus GN=CLPS PE=1 SV=1
MEKVLALVLLTLAVAYAAPDPRGLIINLDNGELCLNSAQCKSQCCQHDSPLGLARCADKA
RENSGCSPQTIYGIYYLCPCERGLTCDGDKSIIGAITNTNYGICQDPQSKK
</PRE></BODY></HTML>
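
The steps so far (pick a random number, skip the error pages, keep only the record inside the page) can be sketched roughly as follows. This is Python rather than the VB/C# used in this post, and it is only a sketch under assumptions: the accession shape ("P" plus five digits, like P42889), the helper names, and the `fetch` callable that stands in for the actual download are all made up for illustration.

```python
import html
import random
import re

# The record sits inside <PRE>...</PRE>, as in the sample page above.
PRE_BLOCK = re.compile(r"<pre>(.*?)</pre>", re.IGNORECASE | re.DOTALL)


def random_accession(rng):
    """Assumed accession shape: 'P' followed by five digits, e.g. P42889."""
    return "P{:05d}".format(rng.randrange(100000))


def is_error_page(page):
    """Empty slots come back as a page titled 'Error'."""
    return "<title>error</title>" in page.lower()


def extract_record(page):
    """Return the FASTA text from a page, or None for an empty slot."""
    if is_error_page(page):
        return None
    match = PRE_BLOCK.search(page)
    if match is None:
        return None
    # &gt; and friends are turned back into plain characters.
    return html.unescape(match.group(1)).strip()


def sample_records(fetch, count, rng, max_tries=10000):
    """Collect up to `count` records from random accessions.

    `fetch` is a hypothetical callable: accession in, page text out.
    """
    records = {}
    for _ in range(max_tries):
        if len(records) >= count:
            break
        accession = random_accession(rng)
        if accession in records:
            continue
        record = extract_record(fetch(accession))
        if record is not None:
            records[accession] = record
    return records
```

Passing the downloader in as `fetch` keeps the cleanup testable without touching the server, and the `max_tries` cap stops the loop if the random numbers keep landing on empty slots.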

So basically we have to remove all this nonsense (the HTML to begin with), which is simple enough.

       
        rtf.Text = Replace(rtf.Text, "bothersometext", "")
so on and so forth for whatever else might be in the way. Then we need to organize the files by animal, which is super easy. The likelihood of an amino acid sequence spelling MOUSE is pretty low: there are some 20 amino acids, so, treating every letter as equally likely, that's 0.05^5 or about 0.0000003125 at any given position. A low probability doesn't stop it from happening, of course, but the strings are short too, so just take my word that it is unlikely. You can avoid the problem altogether if you keep the Regex case-sensitive (this is the default in VB, C#, etc.), since the sequences are all uppercase while the majority of the OS (organism) names are spelt like:

Mouse
Human

and usually Latin names are included, so you can really cover your bases if you like, or truly guarantee it by using a different alphabet, and by that I mean numbers (the organism identifier number string).
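
To put numbers on that, here is a small Python sketch (Python rather than the VB/C# used in this post). The function names are made up for illustration, and the probability assumes all 20 letters are equally likely and independent, which real sequences are not.

```python
import re


def chance_of_word(word, alphabet_size=20):
    """Chance that a given position starts `word`, assuming uniform letters."""
    return (1.0 / alphabet_size) ** len(word)


def mentions(text, name, ignore_case=False):
    """True if `name` occurs in `text`; case-sensitive unless told otherwise."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(re.escape(name), text, flags) is not None


header = "sp|P42889|COL_MYOCO Colipase OS=Myocastor coypus GN=CLPS PE=1 SV=1"
sequence = "MEKVMOUSEKK"  # made-up sequence that happens to spell MOUSE

print(chance_of_word("MOUSE"))
print(mentions(header, "Myocastor coypus"))
print(mentions(sequence, "Mouse"))
print(mentions(sequence, "Mouse", ignore_case=True))
```

With a case-sensitive search, the all-uppercase sequence cannot produce a false hit for a name written as "Mouse"; it is only when case is ignored that the accidental MOUSE matches.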


In order to sort the files:
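
A rough Python sketch of the same idea follows. It is not doghunter.bat itself: the `.fasta` extension, the folder names, and the function names are assumptions for illustration, and it expects each file to start with a header line carrying an OS= field like the sample above.

```python
import re
import shutil
from pathlib import Path

# Organism name: everything after OS= up to the next two-letter field (GN=, PE=, ...).
OS_FIELD = re.compile(r"OS=(.+?)(?=\s+[A-Z]{2}=|\s*$)")


def organism_folder(header):
    """Turn 'OS=Myocastor coypus' into a folder name like 'Myocastor_coypus'."""
    match = OS_FIELD.search(header)
    if match is None:
        return None
    return re.sub(r"[^A-Za-z0-9]+", "_", match.group(1)).strip("_")


def sort_by_organism(src_dir, dest_root):
    """Copy (not move) every .fasta file into a folder named after its organism."""
    copied = []
    for path in sorted(Path(src_dir).glob("*.fasta")):
        # Only the header line is read, so the sequence can never match by accident.
        with path.open(encoding="utf-8", errors="replace") as handle:
            header = handle.readline()
        folder = organism_folder(header)
        if folder is None:
            continue  # no OS= field: leave the file where it is
        target_dir = Path(dest_root) / folder
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(path, target_dir / path.name)))
    return copied
```

Because it copies, the downloaded originals stay untouched in `src_dir`.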



Now for performance, it may be best to run each findstr instance as its own separate program or process, so as to spread the work over the multiple cores of the CPU. It really depends on the list of organisms and the size of the data.

I will be writing up a renaming program at some point so that:

|P42889|COL_MYOCO Colipase OS=Myocastor coypus GN=CLPS PE=1 SV=1
MEKVLALVLLTLAVAYAAPDPRGLIINLDNGELCLNSAQCKSQCCQHDSPLGLARCADKA
RENSGCSPQTIYGIYYLCPCERGLTCDGDKSIIGAITNTNYGICQDPQSKK

is saved as Colipase in C:\Myocastor_coypus (the latter of the two operations is already performed by doghunter.bat, which is the program shown above in the Gvim window).
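
Until that program exists, here is a minimal Python sketch of how the target name could be worked out from the header line. Everything in it is an assumption for illustration: the function name, the header pattern (based only on the sample above), and the choice to swap characters Windows won't allow in file names for underscores.

```python
import re
from pathlib import PureWindowsPath

# Pattern based on the sample header; the leading ">sp" is optional.
HEADER = re.compile(
    r"^>?\w*\|(?P<accession>[^|]+)\|(?P<entry>\S+)\s+"
    r"(?P<protein>.+?)\s+OS=(?P<organism>.+?)(?=\s+[A-Z]{2}=|\s*$)"
)

# Characters that are not allowed in Windows file names.
ILLEGAL = re.compile(r'[<>:"/\\|?*]')


def target_path(header, root="C:\\"):
    """Map a header line to <root>\\<Organism_name>\\<Protein name>."""
    match = HEADER.match(header.strip())
    if match is None:
        return None
    folder = ILLEGAL.sub("_", match.group("organism")).replace(" ", "_")
    name = ILLEGAL.sub("_", match.group("protein"))
    return PureWindowsPath(root) / folder / name


print(target_path("|P42889|COL_MYOCO Colipase OS=Myocastor coypus GN=CLPS PE=1 SV=1"))
```

It only builds the path; the actual copy or rename is left out, and two proteins with the same name in one organism would collide, so a real version would need to handle that.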



Oh, and to answer questions before they present themselves: yes, the operation above is a copy operation and not a cut. I do this to preserve the originals of the files, so there is a fallback in case something goes terribly wrong. <program|original|end> is better than <download|program|original> followed by another <program|original|end>. It is just for safety, and good practice when working with a database the size of all the proteins ever sequenced and publicly stored.
