Cloud Case Study Precis

Introduction

This page concisely describes public cloud use for research computing and data science at the University of Washington.

We track close to 100 research projects that use the public cloud for computing, following each across access, time, compute power, storage, data management, cost and other issues. The summary below is organized by domain. Our work in consulting on and advocating for use of the public cloud is integrated with the mission and operation of the UW eScience Institute.

Admonitions

  • Contact us regarding updates to this material
  • The focus here is on topics; we try to preserve a degree of anonymity

General Systems and Tools

  • SQLShare: A system for managing, sharing and manipulating research data
  • Myria: A distributed, shared-nothing Big Data management system and Cloud service from the University of Washington
  • IoT based on Arduino Yun leaf technology and cloud IoT endpoint services
  • Geohackweek and Neurohackweek: Hosting intensive workshops for learning and developing cloud-based tools and methods at UW

Student-driven research

The UW Student High Performance Computing Club has begun making cloud computing available to its members. This includes training and consulting on implementation as well as careful cost management and tracking. The following is a partial list of projects undertaken by students during the pilot phase of this program, spring 2017.

  • Epigenome imputation across a nucleotide-protein-cell tensor (Successfully completed)
  • Design of a high-reliability micropump for cooling high-heat semiconductors (In progress)
  • Novel peptide characterization of marine organic matter: insights into carbon cycling (In progress)
  • Characterizing the progression of three pathologies in ER electronic medical records (In progress)
  • Quora question pair intent comparison (In progress)
  • Scheduler development and benchmarking for containerized bioinformatics workflows (UW Tacoma; in progress)
  • Empirical Studies of Docker Orchestration Tools for The Analyses of Big Biomedical Data (UW Tacoma; in progress)
  • Predictive models to optimize cloud computing using genomics data (UW Tacoma; in progress)
  • A Dynamic Scaling Engine in the Cloud (CSE; in progress)
  • LaraDB Experiments for the DARPA Graph Challenge (CSE; in progress)
  • Learning multiple outcomes with predictive coding (CSE; in progress)

Medical research

  • Laboratory Medicine: Cloud-based system for genome analysis and clinical annotation (oncology and related)
  • Crossing the clinical-to-research data barrier
  • Data access and tool access for MRI- and EEG-based research
  • Gut biome metagenomics (Children’s Hospital)
  • Patterns in unexpected in-hospital mortality
  • Deep learning for patient behavior prediction: EEG data in relation to A/V transcripts of patient behavior
  • Canine longitudinal aging studies
  • Biostatistics
  • Light-sheet microscope for rapid-turnaround biopsy analysis
  • Neuroimaging: Functional MRI
  • Neuroimaging: Visual cortex studies

Hydrology and Geochemistry

  • GDS: Geometabolomics Data System, a community library and reproducible workflow environment for molecular spectral analysis applied to naturally occurring Dissolved Organic Matter (DOM).
  • HiMAT (NASA): Atmosphere-land coupled analysis of the hydrological state and future of high mountain Asia
    • Hydrological studies and human impacts drawing from in situ, remote sensing, model, re-analysis and assimilation data and methods.
  • Dynamic Information Framework (DIF) (World Bank): Scientific hydrological expertise transferred into public information
    • In the resource management and public safety domains, the incorporation of scientific modeling is not well developed.
    • This program provides localized information, building on a reproducible model of free and open access.

Genomics and Biochemistry

  • Genetic architecture of autism
  • Metagenomics of methane-consuming microbial communities
  • Enzyme inhibition by molecular structure
  • Peptide scaffolding enumeration and design: Large-scale computing using the Rosetta protein folding toolkit

Library science: With Suzzallo Library

  • LIDARY: A pilot study providing geospatial LIDAR data as an open, curated digital resource
  • Proof-of-concept curation of an epigenome imputation engine (see above: student projects)
  • Migration to the cloud of BrainInfo.org including NeuroNames

Ocean science

  • LiveOcean: Ocean modeling forecast
  • Marine microbial ecology
  • Mesoscale eddy structure and correlation to marine life

Computer Science

  • Analysis of code fault detection: Student project
  • IoT: A design pattern and tutorial for using cloud-based support of Internet of Things implementations (NSF: Campus Cyberinfrastructure)
  • Data security on the cloud: A generic data system with automated and human protocols for working on sensitive data including elements of compliance with oversight regulations (NSF: Campus Cyberinfrastructure)
  • Scale on the cloud: See the protein folding case study under Genomics and Biochemistry (NSF: Campus Cyberinfrastructure)
  • Collaboration on the cloud: See the case studies herein on GeoServer/THREDDS, LIDAR, Dynamic Information Frameworks and HiMAT. Thematically, these are lightweight geospatial data systems with the underlying theme of ‘access to data through pre-built frameworks, data APIs and minimal (non-redundant) software engineering’ (NSF: Campus Cyberinfrastructure)

Mechanical and Civil Engineering

  • Computational fluid dynamics of hydrogen and methane combustion

Astronomy

  • Identifying stellar composition through spectral model superposition in nearby galaxies
  • Large Synoptic Survey Telescope (LSST) toolchain development

Geospatial

  • Implementation of GeoServer and a THREDDS server on the public cloud
  • Various data archival projects: Using the cloud for many 9s of reliability

Stubs and pending

  • IoT
  • Power consumption