Wednesday, August 17, 2016

PhyloclassTalk was used to solve a homicide

PhyloclassTalk, an open-source phylogeographic text-mining system based on BioSmalltalk, was used in veterinary forensics to solve a homicide! The September 2016 issue of Legal Medicine includes an article describing the case in detail. PhyloclassTalk was used to narrow down BLASTed sequences of the species (Canis familiaris) and to extract the proper metadata (breed names) from NCBI's GenBank. A hand-crafted database of dog breeds was built and integrated into a PhyloclassTalk repository to classify the sequences by breed name and single out the ones located in Argentina, where the sample of an individual was found at a crime scene. Finally, it was also used to build the results and export them in Arlequin format. The PhyloclassTalk paper is almost complete; meanwhile, a beta release of the software can be downloaded from its web site.

Thursday, August 20, 2015

Browsing +1.2 million formal scientific names from the NCBI Taxonomy Database

The contents of this post do not require loading or installing BioSmalltalk or PhyloclassTalk; a plain Pharo image with the FastTable package is enough.

As part of the PhyloclassTalk project I wanted to add a feature to browse all the formal scientific names found in the full NCBI Taxonomy database. The FastTable package, recently announced on the Pharo mailing list, made me wonder how well it would perform opening a Morphic window with all those names. You can also download the taxonomy dump list I used for this experiment; I filtered the original file (taxdmp.zip) to remove "noise" (synonyms, authorities). On a Sony Vaio i3 at 2.40 GHz it takes just 4 seconds, and you get a fully scrollable list, without pagination and without lags. First we open the FastTable widget with a basic benchmark:
Smalltalk garbageCollect.
[ 
| speciesDumpReader speciesDumpList | 
speciesDumpReader := 'scientific_names.dmp' asFileReference readStream.
speciesDumpList := speciesDumpReader contents lines.
FTEasyListMorph new
  extent: 300@550;
  elements: speciesDumpList;
  openInWindow
] timeToRun. 
 "0:00:00:03.968" "0:00:00:04.249" "
Now let's go for a more functional species "browser" by adding a callback to open the Google results for the selected taxon:
| speciesDumpReader speciesDumpList |
speciesDumpReader := 'scientific_names.dmp' asFileReference readStream.
speciesDumpList := speciesDumpReader contents lines.
FTEasyListMorph new
 header: 'NCBI Taxonomy Database List';
 extent: 300 @ 550;
 elements: speciesDumpList;
 menu: [ : taxaName |
  MenuMorph new
   add: 'Open External Browser'
    target: NBWin32Shell
    selector: #shellBrowse:
    argument: 'https://www.google.com/?gws_rd=ssl#q=' , taxaName;
   yourself ];
 openInWindow
and of course, a screenshot:
I hope to see more of these cool packages coming to the Pharo Smalltalk community. Enjoy!

Friday, March 20, 2015

BioSmalltalk now available through GitHub

I have created the BioSmalltalk repository in GitHub so you can clone and contribute from there.

I hope this will make it easy for interested parties to contribute to this code or to specialize it to their own needs. Regular distributions will still be made at Google Code (for now) but if you want the absolute latest changes, GitHub will be the place to go.

If you are interested, please feel free to get involved.

Link: http://github.com/hernanmd/BioSmalltalk

Regards,

Hernán

Monday, December 22, 2014

Download a human chromosome in one line of code

Let's write plain Smalltalk code to download the human chromosome 22 FASTA file from the NCBI servers (about 9.6 MB, gzip-compressed):
| client fileName fStream |

fileName := 'hs_alt_HuRef_chr22.fa.gz'.
"Log into the NCBI FTP server and fetch the file in binary mode"
[ client := (FTPClient openOnHostNamed: 'ftp.ncbi.nlm.nih.gov')
                loginUser: 'anonymous' password: '';
                binary;
                changeDirectoryTo: 'genomes/H_sapiens/CHR_22'.
(FileStream newFileNamed: fileName)
        binary;
        nextPutAll: (client getFileNamed: fileName);
        close ]
on: NetworkError, LoginFailedException
do: [ : ex | self error: 'Connection failed' ].

"Read the downloaded file back and serialize its raw contents with Fuel"
fStream := fileName asFileReference readStream.
(ByteArray streamContents: [ : stream |
    FLSerializer serialize: fStream binary contents on: stream ]) storeString.
That seems like a lot of typing for a bioinformatics library in the Smalltalk tradition. That's why I wrote a genome downloader class which makes it really easy to download the latest build:
BioHSapiensGD new downloadChromosome: 22.
If you don't want the download to block your image, you can easily run it in the background by forking at a lower process priority:
[ BioHSapiensGD new downloadChromosome: 22 ] 
 forkAt: Processor userBackgroundPriority 
 named: 'Downloading Human Chromosome...'.
Results will be downloaded into the directory where the virtual .image and .changes files are located. But why stop at the human race? There are subclasses for Bos taurus (from the UMD, Center for Bioinformatics and Computational Biology, University of Maryland, and The Bovine Genome Sequencing Consortium), Gallus gallus (International Chicken Genome Sequencing Consortium) and Mus musculus (Celera Genomics and Genome Reference Consortium), and others can be built by specializing very few methods. We can download any available assembled genome with just one line of code. Enjoy.
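As a sketch of what such a specialization might look like, suppose the abstract downloader asks its subclasses for the FTP directory and the file name pattern. Both the superclass name and the selectors below are illustrative placeholders, not necessarily the actual BioSmalltalk API:
BioGenomeDownloader subclass: #BioSScrofaGD
 instanceVariableNames: ''
 classVariableNames: ''
 category: 'BioSmalltalk-Downloads'.

"Answer the species directory on the NCBI FTP server"
BioSScrofaGD >> remoteDirectory
 ^ 'genomes/Sus_scrofa'

"Answer the remote file name for a given chromosome number"
BioSScrofaGD >> chromosomeFileNameFor: aNumber
 ^ 'ssc_ref_chr' , aNumber asString , '.fa.gz'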

Tuesday, January 14, 2014

Arlequin format writer

Introduction


Arlequin is a popular program for population genetics data analysis. Its file format is well documented in the Arlequin manual, so I will not duplicate that information here. Writing an Arlequin file basically consists of generating a customized INI-style file with both a Profile and a Samples section.
Now you can use the API provided in BioSmalltalk to write Arlequin files programmatically. The API pattern in its most naive form looks like this:
arlequinFile := BioArlequinFile new.
arlequinFile profileSection
        addTitle: 'Sample Title';
        " ... profile configuration messages ... ".

arlequinFile samplesSection
        addSampleName: '"SAMPLE1"';
        addSampleSize: '8';
        addSampleData: " ... sample data 1 ... ";

        addSampleName: '"SAMPLE2"';
        addSampleSize: '8';
        addSampleData: " ... sample data 2 ... ";       

        " ... you guessed it ... "

It seems pretty simple, but in practice you will not type in the hundreds of samples of a typical Arlequin data set by hand; you would rather iterate over your input data.
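For reference, the file this pattern generates follows Arlequin's INI-like layout. Abridged, and based on the Arlequin manual rather than on the exact BioSmalltalk output, it looks roughly like this:
[Profile]
  Title="Sample Title"
  NbSamples=2
  DataType=DNA
  GenotypicData=0

[Data]
  [[Samples]]
    SampleName="SAMPLE1"
    SampleSize=8
    SampleData= { ... }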

Building the Samples Collection
If you observe the pattern above, each sample contains three pieces of information: sample name, sample size, and sample data. Basically you have two input layouts. In the first, each piece of information comes in a separate collection, i.e.:
| arlequinFile samplesSection samplesCollection idCollection frqCollection |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

idCollection := #('OT1' 'B1' 'A1' 'CH1' 'J1' 'USA1' 'OT2' 'OT3' 'B2' 'A2' 'A3' 'A4' 'USA2' 'USA3' 'USA4' 'USA5' 'USA6' 'USA7' 'OT4' 'B3' 'B4' 'B5' 'A5' 'J2' 'J3' 'USA8' 'USA9' 'USA10' 'USA11' 'USA12' 'USA13' 'B6' 'C1' 'J4' 'USA14' 'OT5' 'OT6' 'B7' 'CH2' 'CH3' 'A6' 'CH4' 'A7').
frqCollection := #(5 5 6 3 2 11 1 2 1 1 1 1 1 2 1 1 1 1 5 2 1 1 1 1 1 1 1 4 1 1 1 3 1 1 2 4 3 1 1 1 1 1 1).
samplesCollection := #('ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAACACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTGTCGTCACCGATT' 'ATCTAGCAATACTGCTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTATTTTGTCTTCTATCGTCACCCATT' 'ATCTGGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTATTTTGTCTTCTATCATCACCCATT' 'ATCTAGCAATATTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'ATCTAACAATACTGTCTTGTCTTCTATCGTCACCCTTT' 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCATCTATT' 'ACCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATTCTGTCTTATCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTATGTTTTATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'ATCTAGCAATACTGTCTCATTTTTTATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'ATCTAGTAATACTGCCTTATCTTTTATCGTCGCCCATT' 'ATCTAGCAATACTGCCCCATCTTTTATCGTCACCCATT' 'ATCTAACAACACTGCCTTATCTTTTATCGTCACCCATT' 'ATCTAGCTGTACTGCCTTACCTTTTATCGTCACCCATT' 'ATCCAGCAATACTGCCTCATCTTTTATCGTCACCCATT' 'ATCTAGCAATACCATCTTATCTTTCATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTGTCGTCACCCACT' 'ATCTAGCAATACTGTCTTACCCTTTATCGTCACCCATT' 'GTCTAGCAATACTGTCTTACCTTTTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTATCTTTTATCGTCACCCGTT' 'ATTTAGTAATACCGTCTTATCTTTTATCGTCACCCATT' 'ATCTAGCTATACTGTCTTATCTCTCATCGTTACCCATT' 'ATCTAACAATACTGCCTTATCTTTTATCGTCACCCACT' 'ACCTAGCAATACTGTCTTATCTTTTATCGTCATTCATT' 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCTATT' 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCCATG' 'ATCTAGCGATACTGTCTTATCTCTTATCACCACCTATT' 'ATCTAACAACACTGTCCTATCTTTTATCGTCACCCACT' 'ATTTAACAATACTGTCCTATCTTTTATCGTCACCCACT' 'ATTTAGCAATACTCTCCTATCTTTTACCGTCACCCACT' 'ATTTAGCAATACTGTCCTATCTCTTATCGTCACCTACT' 'ATTTAGCAATGCTGTCCCATCTTTTATTGTCACCCACT').
 
samplesSection addSamples: (BioA31SampleCollection forDNA
 identifiers: idCollection;
 frequencies: frqCollection;
 sequences: samplesCollection;
 yourself).

" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
Or the population data comes as triplets. This could be the case after you have grouped your input by alignment and calculated the frequencies. In that case you may use #triplesDo: to take the populations three elements at a time and build your Arlequin file like this:
| arlequinFile samplesSection populations |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

populations := #('OT1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT' 'B1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'A1' 6 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'CH1' 3 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'J1' 2 'ATCTAGCAACACTGTTTTGTCTTCTATCGTCACCCATT' 'USA1' 11 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'OT2' 1 'ATCTAGCAATACTGTTTTGTCTTCTGTCGTCACCGATT' 'OT3' 2 'ATCTAGCAATACTGCTTTGTCTTCTATCGTCACCCATT' 'B2' 1 'ATCTAGCAATACTATTTTGTCTTCTATCGTCACCCATT' 'A2' 1 'ATCTGGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'A3' 1 'ATCTAGCAATACTATTTTGTCTTCTATCATCACCCATT' 'A4' 1 'ATCTAGCAATATTGTTTTGTCTTCTATCGTCACCCATT' 'USA2' 1 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'USA3' 2 'ATCTAACAATACTGTCTTGTCTTCTATCGTCACCCTTT' 'USA4' 1 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCATCTATT' 'USA5' 1 'ACCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'USA6' 1 'ATCTAGCAATTCTGTCTTATCTTCTATCGTCACCCATT' 'USA7' 1 'ATCTAGCAATACTGTCTTATGTTTTATCGTCACCCATT' 'OT4' 5 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'B3' 2 'ATCTAGCAATACTGTCTCATTTTTTATCGTCACCCATT' 'B4' 1 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'B5' 1 'ATCTAGTAATACTGCCTTATCTTTTATCGTCGCCCATT' 'A5' 1 'ATCTAGCAATACTGCCCCATCTTTTATCGTCACCCATT' 'J2' 1 'ATCTAACAACACTGCCTTATCTTTTATCGTCACCCATT' 'J3' 1 'ATCTAGCTGTACTGCCTTACCTTTTATCGTCACCCATT' 'USA8' 1 'ATCCAGCAATACTGCCTCATCTTTTATCGTCACCCATT' 'USA9' 1 'ATCTAGCAATACCATCTTATCTTTCATCGTCACCCATT' 'USA10' 4 'ATCTAGCAATACTGCCTTATCTTTTGTCGTCACCCACT' 'USA11' 1 'ATCTAGCAATACTGTCTTACCCTTTATCGTCACCCATT' 'USA12' 1 'GTCTAGCAATACTGTCTTACCTTTTATCGTCACCCATT' 'USA13' 1 'ATCTAGCAATACTGTCTTATCTTTTATCGTCACCCGTT' 'B6' 3 'ATTTAGTAATACCGTCTTATCTTTTATCGTCACCCATT' 'C1' 1 'ATCTAGCTATACTGTCTTATCTCTCATCGTTACCCATT' 'J4' 1 'ATCTAACAATACTGCCTTATCTTTTATCGTCACCCACT' 'USA14' 2 'ACCTAGCAATACTGTCTTATCTTTTATCGTCATTCATT' 'OT5' 4 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCTATT' 'OT6' 3 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCCATG' 'B7' 1 'ATCTAGCGATACTGTCTTATCTCTTATCACCACCTATT').
populations triplesDo: [ : id : freq : seq |
 samplesSection
  addSampleName: id;
  addSampleSize: freq;
  addSampleData: seq ].
" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
Don't forget to check the BioArlequinFile convenience methods for building profiles for different data types:

#buildHaplotypicDataDNAProfileTitle: aString groups: aNumber missingData: missingCharacter
#buildHaplotypicDataFrequencyProfileTitle: aString groups: aNumber missingData: missingCharacter

And let me know of any suggestions for improving the Arlequin API.
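As a parting example, a quick sketch of calling the first of those; the argument values here are made up for illustration:
arlequinFile
 buildHaplotypicDataDNAProfileTitle: 'mtDNA control region'
 groups: 1
 missingData: $?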

Saturday, February 23, 2013

PhyloclassTalk preview

In this post I want to present a preview of PhyloclassTalk, an application for phylogenetics analysis built with the BioSmalltalk environment on Pharo 1.4. The main parts are presented through the Metro-style UI popularized by Windows 8. The following screenshot shows the main application window:


Excepting the icons, the layout was generated programmatically with simple, plain Morphic objects. The "Territory Builder" uses a wizard library called Merlin, and is based on a Territorial library which is basically a Composite pattern implementation for building complex territorial objects. I integrated the Help in just one hour, based on the HelpSystem, without any previous knowledge of that library.
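To give just the flavor of that Composite, here is a toy sketch; the class and selector names are illustrative, not Territorial's actual API. An atomic territory answers a query directly, while a composite one recurses over its children:
"Atomic territory: matches by its own name"
TRCountry >> includes: aLocationName
 ^ self name = aLocationName

"Composite territory: delegates the query to its sub-territories"
TRRegion >> includes: aLocationName
 ^ self territories anySatisfy: [ : each | each includes: aLocationName ]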

The main module window is a "Case Study Browser" implemented with the OmniBrowser framework. From the browser one can create and associate several kinds of phylogenetic data with a species case study, classify them according to defined territories, and then export the results into formats like Arlequin, Google Fusion Tables or Fluxus Network.

The following screenshot shows the "Blast Query Builder", which enables the dynamic generation and execution of queries over BLAST XML results, producing filtered objects which can later be loaded into the case study browser for further associations. Filtered results can be cumulative, meaning that each new execution is applied to the previous results.



Detailed features such as the rule engine protocol and the post-curation of classified data are going to be described in the upcoming paper. I will also provide new posts on this front as I prepare a release, so stay tuned.


Saturday, January 26, 2013

BioSmalltalk 0.4 Release

Time to talk about the new 0.4 release. BioSmalltalk virtual images include basic development libraries for doing bioinformatics, for example XML APIs, OS services, HTTP requests, serialization, refactoring, etc. Both BioPharo and BioSqueak are editions of BioSmalltalk and include platform-specific packages, as each Smalltalk platform evolves on its own.

There are separate downloads for different OSes: Windows, Linux and Mac OS X. And there are different "editions", which currently are Pharo 1.4 and Squeak 4.4. Main development is happening in Pharo 1.4; the Squeak edition is experimental and does not contain menu entries for targeted browsing and updating.


Please feel free to comment or register on the mailing list provided by the Google Code hosting.




Tuesday, July 17, 2012

How to do a BLAST from Smalltalk

Provided by the National Center for Biotechnology Information (NCBI), BLAST is a sequence homology search program that finds all the sequences in a database matching a given query sequence. It is considered the most fundamental tool in bioinformatics today, although there are many other similar programs for different cases. I'm not going to review the huge literature on sequence/string matching here; a simple query in any search engine will get you plenty of resources to start.

For remote queries, you may access the free NCBI BLAST service in two ways: from a web browser, or programmatically through the BLAST URL API, previously known as QBlast. In this post you may read how to use the API from Smalltalk, downloading a file with results in XML format for later processing:
| search |
 
search := BioNCBIWWWBlastClient new nucleotide
   query: 'CCCTCAAACAT...TTTGAGGAG';
   hitListSize: 150;
   filterLowComplexity;
   expectValue: 10;
   wordSize: 11;
   blastn;
   blastPlainService;
   alignmentViewFlatQueryAnchored;
   formatTypeXML;   
   fetch.
search outputToFile: 'blast-query-result.xml' contents: search result.
In this script we first set up the BLAST client with the database and the query sequence. Notice that at this stage you may serialize the BioNCBIWWWBlastClient instance without further parameters. The second set of messages is sent "in cascade": in Smalltalk you sometimes need to send one object several consecutive messages, and in this case they configure the search and send the requests on #fetch. Each command is implemented as a subclass of BioNCBICommand because each one (like GET or PUT) accepts specific parameter types, but you don't need to worry about the valid parameters, as they are validated internally, for example when setting the database. Each message to the BLAST client builds the corresponding URL, for example a GET URL:

http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&RID=954517013-7639-11119
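As an aside, since the client is a plain object, here is a minimal sketch of saving a configured client to disk for later reuse with Fuel (assuming, as the cascade above suggests, that #nucleotide answers the configured client; the file name is arbitrary):

| client |
client := BioNCBIWWWBlastClient new nucleotide.
client query: 'CCCTCAAACAT...TTTGAGGAG'.
"Snapshot the configured client without sending #fetch"
FileStream forceNewFileNamed: 'blast-client.fuel' do: [ : aStream |
 FLSerializer newDefault serialize: client on: aStream binary ].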

During execution you may open a Transcript and watch the request progress:

17 July 2012 6:58:51 am Fetching NCBI BLAST request identifier...
17 July 2012 6:58:53 am Parsing RTOE...
17 July 2012 6:58:53 am New RTOE = 18
17 July 2012 6:58:53 am RTOE = 18
17 July 2012 6:58:53 am RID = 0ADUV9FM016
17 July 2012 6:58:53 am Executing...BioNCBIFetch
17 July 2012 6:58:53 am QBLAST: Delaying retry for 18 seconds
17 July 2012 6:59:11 am Sending another NCBI BLAST request...
17 July 2012 6:59:14 am NCBI BLAST request sent, waiting for results...
17 July 2012 6:59:15 am Finished successfully

Monday, July 16, 2012

Parsing and profiling BLAST results


For my application I wanted to display BLAST summaries, because we track some history data in the Nuccore database. If you, as a good bioinformatician, have downloaded your BLAST XML files, you may find a summary of each BLAST in its header.

As we may have several BLAST files, processing each one entirely just to read the header is a waste of time, so we need to parse the XML only until a particular node is found. This is provided by the BioSmalltalk API, using a StAX-style pull parser (XMLPullParser) as the backend.


Now, compare the time spent reading a whole XML file versus stopping right after the header node is found.

BioObject requestXmlFile
 ifNotNilDo: [ : fileLocation |
  | blastReader timeToReadFile timeToProcessNode output |
  timeToReadFile := [ blastReader := BioNCBIBlastReader newFromXML: fileLocation ] timeToRun.
  timeToProcessNode := [ output := blastReader blastProgramName ] timeToRun.
  timeToReadFile := (Duration milliSeconds: timeToReadFile) asSeconds.
  timeToProcessNode := (Duration milliSeconds: timeToProcessNode) asSeconds.
  Transcript
   cr;
   show: 'Seconds to read ' , fileLocation , ' = ';
   show: timeToReadFile asString;
   cr;
   show: 'Seconds to process node = ' , timeToProcessNode asString;
   cr;
   show: output ]
Wonder how to read a whole group of BLAST XML files contained in a directory? This script would do it:
| time blastReaders blastHits |

Smalltalk garbageCollect.
Transcript clear.
time := [ blastReaders := BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\'.
  Transcript show: blastReaders size asString , ' blast XML files were read'; cr.
  blastHits := blastReaders parsedContents ] timeToRun.
blastHits explore.
Transcript
 cr;
 show: 'Seconds to read NCBI BLAST XML''s = ';
 show: (Duration milliSeconds: time) asSeconds asString

But wrapping everything you want to measure in a #timeToRun block isn't that fun, and you also want to get rid of that bunch of temporary variables. In Pharo you may profile the task in a beautiful one-liner:

( BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\' ) parsedContents.

Just right-click the selected line and choose "Profile it" as shown in the screenshot.

 

After execution, the Time Profiler opens with detailed information about the time spent in each method, so you can see where to look for bottlenecks.
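If you prefer triggering the profiler programmatically rather than through the menu, the classic MessageTally (available in both Pharo and Squeak) does the same job as a one-liner:

MessageTally spyOn: [ (BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\') parsedContents ].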

 

Custom serialization of big DNA or proteins

Lately I've been experimenting with serialization engines in Pharo. Besides the "traditional" alternative (SmartRefStream) I took the chance to evaluate Fuel, a new serializer which is nicely documented and supported; in fact Mariano Martinez Peck and Martin Dias (the Fuel developers) answered my private questions and requests very quickly, so thanks to them I can show you an interesting feature now in BioSmalltalk.
A typical serialization in bioinformatics involves a huge group of sequences or a big XML tree, so one of my requirements was to customize the serialization strategy to save precious memory. This means changing the serializer on the fly when a particular object is found in a graph of objects: specifically, if a DNA or protein sequence above a particular size threshold is found, you certainly would like to zip it. Here follows an example serializing an Array with a random object ('hello') and chromosome 28 of the chicken:
| objectToSerialize threshold aSerializer |
objectToSerialize := Array
    with: 'hello'
    with: (FileStream readOnlyFileNamed: 'GGA28.fa') contents.
threshold := 1000.

FileStream forceNewFileNamed: 'demo.fuel' do: [ :aStream |
    aSerializer := FLSerializer newDefault.
    "Replace any DNA sequence bigger than the threshold by its zipped version"
    aSerializer analyzer
        when: [ :o | o isString and: [ o size > threshold and: [ (BioParser tokenizeFasta: o) second isDNASequence ] ] ]
        substituteBy: [ :o | o zipped ].
    aSerializer
        serialize: objectToSerialize
        on: aStream binary ].
And of course, the corresponding materialization:
| result aMaterialization zippedStrings unzippedStrings |
result := FileStream oldFileNamed: 'demo.fuel' do: [ :aStream |
 aMaterialization := FLMaterializer newDefault materializeFrom: aStream binary.
 "Swap the zipped strings with their unzipped versions before answering the root"
 zippedStrings := aMaterialization objects select: [ :o | o isString and: [ o isDNASequence ] ].
 unzippedStrings := zippedStrings collect: [ :o | o unzipped ].
 zippedStrings elementsExchangeIdentityWith: unzippedStrings.
 aMaterialization root ].
Looking at the possibilities, many of the custom DNA compression algorithms (or even XML ones) could be plugged in and used if saving space becomes an issue in your bioinformatics experiments.
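For instance, swapping the zip step for a codec of your own is just a matter of changing the substitution block. BioDnaCodec below is a hypothetical class standing for whatever compressor you implement, not an existing BioSmalltalk one:

"Substitute big DNA strings by a custom compressed representation"
aSerializer analyzer
 when: [ :o | o isString and: [ o size > threshold and: [ o isDNASequence ] ] ]
 substituteBy: [ :o | BioDnaCodec compress: o ].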