Large Scale Microbiome Profiling in the Cloud

Scalable Metagenomics Analyses

Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large reference collection of bacterial genomes. Our implementation relies on distributing the alignment of millions of sequencing reads across worker machines. The genome collection is partitioned so that it can be spread across those workers, which allows the use of large collections of reference genomes. We use the Bowtie2 aligner under the hood on the worker nodes, and are able to maintain fast alignment rates without loss of accuracy.
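The partition-then-align idea can be illustrated with a small, self-contained sketch. This is not Flint's code: it uses plain Python instead of Spark, and a toy exact-substring match stands in for Bowtie2. The function names (`partition_reference`, `align_to_shard`, `profile`) and the example sequences are hypothetical and exist only for illustration.

```python
from collections import Counter


def partition_reference(genomes, num_shards):
    """Split a {genome_id: sequence} dict into round-robin shards."""
    shards = [dict() for _ in range(num_shards)]
    for i, (genome_id, sequence) in enumerate(sorted(genomes.items())):
        shards[i % num_shards][genome_id] = sequence
    return shards


def align_to_shard(reads, shard):
    """Toy stand-in for an aligner (assumption: exact substring match).

    Every read is checked against every genome in this shard only.
    """
    hits = []
    for read in reads:
        for genome_id, sequence in shard.items():
            if read in sequence:
                hits.append((read, genome_id))
    return hits


def profile(reads, genomes, num_shards):
    """Map: align all reads against each shard. Reduce: count hits per genome."""
    shards = partition_reference(genomes, num_shards)
    per_shard_hits = [align_to_shard(reads, shard) for shard in shards]
    counts = Counter()
    for hits in per_shard_hits:
        counts.update(genome_id for _, genome_id in hits)
    return dict(counts)


# Made-up example data, not real genomes or reads.
genomes = {
    "genome_A": "ACGTACGTTTGA",
    "genome_B": "GGGCCCAAATTT",
    "genome_C": "TTGACCGGTACA",
}
reads = ["ACGT", "CCCA", "TTGA", "GGTA"]
result = profile(reads, genomes, num_shards=2)
print(result)
```

Because every read is compared against every shard, the per-genome counts do not depend on how many shards the reference is split into; only the amount of reference data each worker has to hold changes.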

Our computational framework is primarily implemented using Spark’s MapReduce model, and is deployed in a cluster launched with the Elastic MapReduce (EMR) service offered by Amazon Web Services (AWS). The cluster configuration (as of Spring 2019) consists of multiple commodity worker machines (computational nodes), each with 15 GB of RAM, 8 vCPUs (each vCPU is a hyperthread of a single Intel Xeon core), and 100 GB of EBS disk storage. The worker nodes operate in parallel, each aligning the input DNA sequencing reads against its own shard of the partitioned reference database.
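As a rough sketch, a cluster of this shape could be described as a request for the `run_job_flow` call of the Boto3 EMR client. Several values here are assumptions and not taken from Flint: the `c4.2xlarge` instance type (chosen only because it offers 8 vCPUs and 15 GiB of RAM), the `gp2` volume type, the cluster and group names, the default EMR role names, and the worker count, which the text above does not specify.

```python
def build_cluster_request(worker_count, instance_type="c4.2xlarge"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Assumptions (not from the Flint docs): instance type, gp2 volumes,
    names, and the default EMR role names.
    """
    ebs = {
        "EbsBlockDeviceConfigs": [
            {
                # 100 GB of EBS storage per node, as described above.
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 100},
                "VolumesPerInstance": 1,
            }
        ]
    }
    master_group = {
        "Name": "flint-master",  # hypothetical name
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
        "EbsConfiguration": ebs,
    }
    worker_group = {
        "Name": "flint-workers",  # hypothetical name
        "InstanceRole": "CORE",
        "InstanceType": instance_type,
        "InstanceCount": worker_count,
        "EbsConfiguration": ebs,
    }
    return {
        "Name": "flint-cluster",  # hypothetical name
        "ReleaseLabel": "emr-5.22.0",  # EMR version listed under Requirements
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [master_group, worker_group],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(worker_count=4)

# Launching requires AWS credentials and incurs cost, so it is left commented out:
# import boto3
# boto3.client("emr").run_job_flow(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect or adjust before anything is launched. The full documentation remains the authoritative source for the actual deployment steps.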

You can read the Flint publication to learn more.

Flint is a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast profiling of metagenomic samples against a large collection of reference genomes.


Resources 🗄

Documentation 📗

Publication 🏷

Flint can be referenced by using the citation:


Releases

Latest: RC3, build B20190715


• What's New in Release Candidate 3

Flint RC 3 is a maintenance update that refactors some of the internal functions to improve efficiency and reporting.

Changes, Additions, and Fixes

  • Input Configuration. The JSON configuration used as input has been refactored. The examples/ directory contains the new format.
  • Output reports. New format for the output files.

Requirements 🚩

  • EMR 5.22.0
  • Hadoop 2.8.5
  • Spark 2.0+
  • Python 2.7
  • Boto3
  • Pandas
  • For a full set of requirements, see the full documentation.

• What's New in Release Candidate (RC) 1

Flint RC 1 is the initial public release of the Flint pipeline. RC 1 is a focused release that contains refinements and enhancements to the existing features of Beta 2. As part of this release we are also making available the necessary indices to deploy in your cluster.

Changes, Additions, and Fixes

  • Initial release.

Requirements 🚩

  • EMR 5.22.0
  • Hadoop 2.8.5
  • Spark 2.0+
  • Python 2.7
  • Boto3
  • Pandas
  • For a full set of requirements, see the full documentation.

Source Code 🖥

GitHub Repo

Flint is open source software written in Python and Spark; it is available under the MIT License (MIT). The source code can be obtained at this GitHub repository.

If you see a bug 🐞, please file a bug report.


Main Contributors 🤓

Miami, Fl. 🌴🐬