Organic chemistry case study on Azure

Geo Data System

Introduction

This page describes a data system for organic chemistry research, specifically for physical and spectral analysis of dissolved organic matter (DOM) obtained from samples of the earth’s hydrosphere. This work was prototyped on the Microsoft Azure cloud and is currently on hiatus pending further development funding and corresponding cycles from the collaboration group.

Warnings

Overview

In 2010 one of us (RF) began working with a group of geochemists (AS, RS, TD, JN, RS, HM) to build a system on the Microsoft technology stack that would advance the state of the art of research in the earth’s global carbon cycle. Specifically, the scientists recognized two important functions that a centralized, standardized data system could provide: routine data processing and a large collective pool of queryable samples.

To provide a sense of the research we present a brief scientific digression.

The earth’s hydrosphere is all of the water on earth in its many locations from the bottom of the Antarctic ice sheet to water vapor suspended in the upper reaches of the atmosphere to the oceans to the permafrost of Siberia to the mud on the banks of the Great Grey Greasy Limpopo River (all set about with fever trees) to the glaciers of the Himalayas to the damp soil of the Hoh rainforest… and on and on. All of this water bears the fingerprints of its history in the form of dissolved carbon-bearing molecules. These molecules generally reflect metabolic processes of life on earth at all scales from microbial to macroscopic as well as physical effects such as degradation due to the sun’s ultraviolet rays. The common denominator in this picture is the storage and transport of carbon in what is called the global carbon cycle. Carbon is stored in the earth system in reservoirs that include soil, oceans, sediment, rock, atmosphere, surface water and ice. It is transported and transformed between reservoirs by physical, chemical and biological processes at various rates. The aggregate of medium-to-small carbon molecules with life-based or organic structure is referred to as DOM, for Dissolved Organic Matter.

The current state of the art in analysis of DOM found in water samples involves spectral methods, three in particular in the work described here. By ‘spectral analysis method’ we mean a procedure that generates from one water sample a series of values associated with one or more spectral parameters such as wavelength. As an example, if one were studying rain rather than DOM one might measure the intensity of a rainbow across all of its colors, that is, as a function of wavelength.
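To make the rainbow analogy concrete, here is a minimal Python sketch of how one spectral measurement might be held in memory: a series of intensity values indexed by wavelength. The `Spectrum` class, its field names and the numbers are all hypothetical illustrations, not part of the system described on this page.

```python
from dataclasses import dataclass


@dataclass
class Spectrum:
    """One spectral measurement of one sample (hypothetical structure)."""
    sample_id: str
    method: str                  # name of the spectral method used
    wavelengths_nm: list[float]  # the spectral parameter
    intensities: list[float]     # one measured value per wavelength

    def __post_init__(self):
        # Each intensity must be paired with exactly one wavelength.
        if len(self.wavelengths_nm) != len(self.intensities):
            raise ValueError("wavelengths and intensities differ in length")

    def peak(self):
        """Return the (wavelength, intensity) pair with the highest intensity."""
        i = max(range(len(self.intensities)), key=self.intensities.__getitem__)
        return self.wavelengths_nm[i], self.intensities[i]


# Made-up values standing in for the intensity of a rainbow across its colors.
rainbow = Spectrum(
    sample_id="rain-001",
    method="visible-intensity",
    wavelengths_nm=[400.0, 450.0, 500.0, 550.0, 600.0, 650.0, 700.0],
    intensities=[0.2, 0.5, 0.9, 1.0, 0.8, 0.6, 0.3],
)
```

Calling `rainbow.peak()` returns the wavelength at which the made-up signal is strongest. A method with more than one spectral parameter would need a two-dimensional layout rather than two parallel lists.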

Let’s consider spectral DOM data generated in the course of a single research project. If a particular field campaign produced one thousand water samples and each of these samples yielded 400,000 data values across three spectral methods then the project will have produced 400 million data values that must be stored and analyzed. Typical reduction methods might be applied to these data iteratively, many times over the course of two or three years, resulting in three or four research papers whose conclusions would emerge from patterns and structures discovered in that data.
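The arithmetic above can be checked with a short Python sketch. The sample and value counts come from the text; the storage size per value is our assumption (a 64-bit float), so the byte total is only a rough, uncompressed estimate.

```python
# Back-of-the-envelope sizing for one field campaign.
SAMPLES = 1_000               # water samples in one campaign (from the text)
VALUES_PER_SAMPLE = 400_000   # values per sample across three spectral methods (from the text)
BYTES_PER_VALUE = 8           # assumption: each value is stored as a 64-bit float

total_values = SAMPLES * VALUES_PER_SAMPLE
total_bytes = total_values * BYTES_PER_VALUE

print(f"{total_values:,} values")                  # 400,000,000 values
print(f"{total_bytes / 1e9:.1f} GB uncompressed")  # 3.2 GB uncompressed
```

Under that assumption a single campaign is a few gigabytes of raw numbers, before counting metadata or the intermediate products of repeated reduction runs.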

The problem is this: The scientists are extremely adept at imagining forms of data analysis that have not yet been implemented. They are able to see how other scientists in related fields could make use of this data. They would like those scientists to have direct, unfettered access to the spectral DOM data without any human intervention being necessary. And finally our scientists are able to envision the power of combining datasets from multiple sampling projects to create a deeper body of data to analyze. But none of these imagined pathways can be implemented because the data are simply too complex and unwieldy in their traditional form to support these advances in DOM-driven research.

The task therefore falls to technologists (including these same scientists) to build a data management system that makes all of these imagined advances a reality. This, simply put, is the objective of our DOM data system project.

Introduction

Links

Warnings

Overview

User Interface

API

Query

Data Egress

Metadata

MATLAB Functions

EEMs