This article describes how to run CloudBlast wisely. OmicsBox allows you to perform NCBI Blast searches using a cloud system for high-performance computing. This system runs jobs in parallel and autoscales depending on demand. To control usage, we use so-called CloudUnits. These units represent computation time (CPU seconds) on the AWS infrastructure. At the moment, only Blast and InterProScan consume units in OmicsBox, as they represent over 80% of the total cloud costs.
All OmicsBox subscriptions include a certain amount of units that can easily be recharged. However, to consume these resources wisely, you should take several aspects into consideration when performing Blast searches. Please remember that the consumption of units does not depend on the overall time it takes to Blast your dataset but on the amount of computational resources used. Be cautious when analyzing big sequence datasets for the first time, and please read the following recommendations:
Reduce your search space:
Use a taxonomic filter. This saves time, computational resources, and units. Of course, this is only a recommendation; the final decision depends on you, your research requirements, and your budget. Normally, if you are looking for potential homologous sequences for a plant species, for example, you may want to consider only plant species and exclude the large share of sequences from bacterial genomes (>50%) in the NR protein database.
Adjust search sensitivity:
We recommend using the blastx-fast configuration, which increases the word size of the alignments from 3 to 6. According to the NCBI, this should not alter your homology search results significantly but should improve performance. This configuration is described in more detail by Shiryev et al. (2007): http://www.ncbi.nlm.nih.gov/pubmed/17921491
Check on your Cloud Usage:
Use the “Cloud Usage” tab to review your unit consumption from within OmicsBox (from the “View” menu). Please see image below.
If you are new to CloudBlast and OmicsBox, we recommend that you try your final Blast parameters on a smaller dataset first. This avoids surprises. Use a small subset of your dataset (for example 1000 sequences) to estimate the unit consumption, and then review and compare a few of your alignment results and the amount of consumed units.
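One way to prepare such a test subset outside of OmicsBox is to draw sequences at random from your FASTA file. The following is only a minimal sketch in plain Python, not an OmicsBox feature: the function names, the file names, and the fixed random seed are assumptions made for illustration.

```python
import random


def read_fasta(path):
    """Return a list of (header, sequence) tuples read from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there is one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records, n=1000, seed=42):
    """Pick up to n records at random; the seed makes the pick repeatable."""
    if len(records) <= n:
        return list(records)
    return random.Random(seed).sample(records, n)


def write_fasta(records, path, width=60):
    """Write (header, sequence) tuples to a FASTA file, wrapping lines."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")


# Hypothetical usage (file names are placeholders):
# subset = sample_records(read_fasta("my_contigs.fasta"), n=1000)
# write_fasta(subset, "my_contigs_subset.fasta")
```

A random sample is usually more representative than the first 1000 entries of a file, because assemblers often sort contigs by length, and longer sequences tend to need more computation.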
Example:
Blasting 1000 nucleotide sequences (contigs, CDS) against the NR database with blastx (without the “fast” option) and without a filter on a specific taxonomy group means that you are searching all 6 reading frames of every sequence against the world’s largest protein sequence collection with great sensitivity. Compared to that, a Blast search against a plant subset with blastx-fast consumes more than 10 times fewer units. The numbers below are from September 30, 2020.
- 1000 nucleotide sequences with blastx-fast against the NCBI non-redundant database: 46 min. and 99173 CloudUnits
- 1000 nucleotide sequences with blastx-fast against the NCBI non-redundant viridiplantae subset: 26 min. and 7880 CloudUnits
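The arithmetic behind these two runs can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the extrapolation assumes that unit consumption grows linearly with the number of sequences, which is a simplification, since real consumption also depends on sequence length and database content. The figure of 50000 sequences is a hypothetical dataset size, not a measured value.

```python
def units_ratio(units_a, units_b):
    """How many times more units run A consumed than run B."""
    return units_a / units_b


def extrapolate_units(pilot_units, pilot_sequences, total_sequences):
    """Rough estimate for a full dataset, ASSUMING linear scaling."""
    return round(pilot_units * total_sequences / pilot_sequences)


# Figures from the two runs listed above (1000 sequences each).
nr_units, nr_minutes = 99173, 46
plant_units, plant_minutes = 7880, 26

# Units differ by a factor of about 12.6 ...
print(f"unit ratio: {units_ratio(nr_units, plant_units):.1f}")
# ... while the run time differs only by a factor of about 1.8.
print(f"time ratio: {nr_minutes / plant_minutes:.1f}")

# Hypothetical full dataset of 50000 sequences (assumed, linear scaling).
print(extrapolate_units(plant_units, 1000, 50000))  # 394000
print(extrapolate_units(nr_units, 1000, 50000))     # 4958650
```

The gap between the two ratios illustrates the point made earlier: unit consumption follows the computational resources used, not the overall time a search takes.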
Disclaimer:
BioBam is not interested in selling any extra cloud units. Units are pass-through items for the sole purpose of covering cloud expenses.