Building a secure compute environment, with examples

Introduction

This document presents a secure data management system, specifically using the motivation of managing Protected Health Information (PHI) under HIPAA regulations on the public cloud. This template applies more generally to secure data management in research computing.

Framework

In a simplified view, HIPAA regulations require that data be encrypted both at rest (on a storage device) and in transit (moving from one location to another). In addition, all transactions involving PHI must be logged in a traceable manner to support the assertion that only authorized individuals have had access to the PHI.

Putting a conforming technical system in place is straightforward; the steps are given here, although this resource is not exhaustive. Our objective is to provide the technical basis for a scalable research solution under HIPAA guidelines. The organizational component of this process is called the Shared Responsibility model: AWS provides the necessary underlying security measures, and the practitioners build and operate the environment according to established practices. Hence using HIPAA-aligned technologies from AWS is not by itself equivalent to complying with HIPAA.

We proceed along the following program.

  • Define some User Stories to motivate the structure of what follows
  • Provide instructions for building a secure compute environment (SCE) on the AWS cloud
    • Constructing a Virtual Private Cloud
    • Constructing storage, compute and data management elements
  • Describe implementation 1 of the SCE: Simple, includes a very small synthetic dataset
  • Describe implementation 2 of the SCE: OMOP

While this study uses AWS our group is also building comparable structures on other cloud platforms, notably on the Microsoft Azure public cloud.

Admonitions

  • AWS has more than twelve HIPAA-aligned technologies. Only these technologies may come into contact with PHI. Other technologies can be used provided there is no such contact.
  • HIPAA compliance is an obligation of the data system builder, the medical researcher and the parent organization(s).
  • The public cloud is very secure: physically and technologically. Compromise is most likely to be caused by human error either in design or in operation.
  • File names may not include PHI. They may include identifier strings that could be indexed in a secure table.
  • AWS has many virtual machine / instance / EC2 types, including the general-purpose m-type instances. These are fine to use in an SCE. There is also a lightweight type called the t-type: do not use these. A t-type instance will not connect to a Dedicated Tenancy VPC; this is simply not supported. It is no great loss, just a guideline to follow.

User story

  • A scientist K receives approval from the IRB to work with PHI data.
    • Intent: Analyze these data in a Secure Compute Environment (SCE)
  • K contacts IT professional J for the SCE as a working environment
    • SCE built by J as described below
    • Data uploaded from a secure data warehouse to an encrypted S3 bucket
  • K provided with access
    • pem file and ip address of a bastion server B
    • login and password to a private subnet EC2 instance
    • Notice no IAM User account: Console access not needed
    • K logs on to the SCE, carries out analysis over time
      • The system logs all activity
    • Patient-held devices contribute ongoing data via phone app using the API Gateway
      • These data supplement the research
  • The study concludes, data preserved, log files preserved, SCE deleted

Supplemental ideas from the researcher

  • Scientist writes, anticipating a SCE:
    • Early use cases will involve external devices (phones, sensors)
      • Data direct to cloud: Authenticate, validate data, match to patient
      • Standard format (HL7? CCD?)
    • Subsequent
      • Bidirectional data sharing
        • Device, cloud, EHR
    • End result: Allowed persons (care / research) can get to all of this data
      • progress in stages
      • On the cloud: tools for analysis
        • ML, visualizations out, comparative w/r/t analytic datasets etc

Supplemental ideas from the IT professional

  • Constraints? AWS Services available within the VPC? VPC required?
    • Currently 13: API Gateway, Direct Connect, Snowball, DynamoDB, EBS, EC2, EMR, ELB, Glacier, RDS, Aurora, Redshift, S3
      • API Gateway would be the mechanism for having a phone report my bp or something
    • Yes? Expanded how often? Yearly? Extend to entire service platform?
      • SA “essentially yes” (I believe the BAA expands automatically as the list does)

Intermezzo Kilroy stack (a list of must-fix details)

  • Questions from Dogfooding on March 23 2017
    • Create VPC…
      • Does my VPC need a CIDR that doesn’t overlap other ones in my account? 10.0.0.0 is very popular…
        • Peered VPCs must not have collisions; otherwise fine
      • Stipulate ‘Default Tenancy’:
        • Shared versus dedicated: Shared not allowed; so this must be Dedicated Tenancy, not Default
      • Creating subnets: AZ: VPC is regional
        • Subnets are associated with AZs and should be intentionally designated
        • Using multiple AZs (multiple subnets across AZs) will make the VPC “present” in those AZs
          • for SCE it is more controlled to be in just one AZ
        • Going multi-AZ would be a high availability strategy which is a compute-heavy idea
  • AB: How do we determine that the Account has been registered at AWS as PHI/HIPAA-active?
    • Email aws-hipaa@amazon.com and include the account number
    • Is there a DLT component?
  • Generate and incorporate a scientist workflow diagram with extensive caption per User Story
  • Generate and incorporate a complete SCE architecture diagram near the top of this
  • Create a sub-topic around L: In coffee shop, in a data center (out of Med Research), etcetera
    • The ‘in transit / at rest’ component leaves a sparking wire in L: Section on risk added
    • While a coffee shop to S3 encrypted is not a foul on AWS it is a foul on me and my organization
  • Need explanation of public/private subnet collision avoidance
    • Public subnet elements also have private subnet addresses
      • 10.0.0.x as the Spublic address space means that we get two bonus things for the bastion at 10.0.0.9:
        • That is: 10.0.0.x is actually a private ip address for a public-facing instance
        • We get a public ip address 52.x.y.z etc whatever
        • We get a DNS entry ec2-blahblahblah
          • Inside the VPC the latter always resolves to the private 10.0.0.9 so that traffic stays inside the VPC
          • But do not hard code an ec2-blah DNS entry because this can change when the machine bounces
    • Verify protocol is that new resources auto-generated will be on the private subnet
      • AB How is this done? We got as far as ‘no default’ for public and private both
    • Verify automated allocation will choose free ip addresses on the private subnet: Correct
    • Verify protocol is to not create new resources on the public subnet: Yes, as policy
    • This is tremendously important so that a new Wi is not publicly visible
    • Look at the VPC Routing Table “default subnet” column (verify this is correct)
  • Key management story
    • Rename K2 as K_Bastion and describe where it lives on L, add to Risk
      • It may make sense to describe the chain of custody of K_Bastion
    • J needs to give K a pem file and an ip address: To get to B plus creds to the Sprivate from the bastion
      • Risk: This makes B a single point of failure if the private EC2 keys are left on the Bastion: It gets compromised, game over
      • Password-protected is a way forward
    • Scale up to her group: Giving everyone the same login is not really an option
  • Free up and assign B an Elastic IP address that will persist
    • Verify that this is publicly discoverable and place this fact in Risk: True
      • Lock down the SG to restrict network access
    • Rename K3 as K_EC2
    • Chain of custody of K_EC2
    • How do keys work in view of an AMI source?
  • EBS Encryption Keys should be a Risk sub-section
    • EBS encryption select has an account default key: Details are listed below
    • Notice that publishing these details is not technically bad for pedagogy…
      • but I have a bit of a nagging doubt here: Are these “Details” global/permanent?
      • The screencap redacts to be on the safe side; stay with that for now
  • Risk due diligence: Check the DLT account T&C; “who is responsible for the risk?”
  • Risk entry: We do not encrypt the boot volume
  • Risk entry: EC2 swap space: How does this touch on PHI?
  • How are keys / Roles managed when spinning up / shutting down Wi?
    • For now we allow that EC2 is the only game in town
  • Look at the JS SK scenario and build out the IAM component for SK
    • e.g. turning machines on and off
    • do with console? do with CLI?
    • AB: Help needed!
  • Differentiate SQS and SNS
    • https://aws.amazon.com/blogs/aws/s3-event-notification/
    • http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
    • PubSub concept
    • SQS: Things sit in a queue and that queue is polled to trigger processes
      • e.g. CANVAS the LMS: As events happen
  • What is Ansible and what does it get us?
    • Configuration management
    • cf Chef and Puppet
  • Can/should the NAT Gateway be used to pull updates from GitHub? Generalize.
  • DDil on elastic IP: NAT Gateway? Bastion? IG?
  • Fix the RT / subnet editing section… gotta get that right
  • What are the tradeoffs in building B and M and Wi from AMIs?
    • Config management tool: Option 1 (Ansible Chef Puppet)
    • Manually take a stale AMI and update it and re-save it (and update the hot machine)
      • “You still have to manage the OS” is part of Shared Responsibility
  • Vestigial details from earlier notes to incorporate
    • Logging: CloudWatch and CloudTrail are AWS logging services; their output is frequently parsed using Splunk
    • Intrusion detection! Jon Skelton (Berkeley AWS Working Group) reviewed use of Suricata (mentions ‘Snort’ also)
    • Include an encryption path for importing clinical data
    • Include a full story on access key management
    • The IOT import will – I think – be a poll action: The secure VM is polling for new data
    • This system should include a very explicit writeup of how the human in the loop can break the system
    • Acceptable for data on an encrypted drive to move through an encrypted link to another encrypted drive?
      • To clarify: Must the data be further encrypted at rest first?
    • Filenames may not include PHI. Hence there is an obligation on the MRs to follow this and/or build it into file generation.
    • CISO approval hinges on IT, Admin and Research approvals.
    • Lambda
  • Make sure we return to ‘Default subnet’ referred to in the text
  • In a route table does 0.0.0.0/0 refer to the public internet? Where does this come up???
    • It would be excellent to explain how the smaller CIDR block does not conflict with the “wide open internet” sense of the second CIDR block
  • Bastion server inbound ip range should match UW / UW Med / etcetera
    • Also differentiate the UW VPN CIDR block
  • Could not give my NAT Gateway NG a PIT name!!!
  • Bastion and Sprivate worker: Need more details on the configuration steps!
    • Enable cloudwatch checkbox? Yes
  • Missing instructions on setting up S3 buckets: For FlowLog and for DataIn
  • Let’s be clear that subnet CIDR blocks always describe the private address space. Making the subnet a public subnet means that there is a second set of (public: on the internet) ip addresses that map to those private subnet resources. The private 10.x.y.z addresses are not globally unique (every VPC can reuse them), so only the public addresses draw on the global IPv4 space of 2^32, a mere 4 billion addresses.
  • Hit these terms in a glossary
    • Ansible
    • Regions and Availability Zones on AWS
    • Bastion Server
    • Suricata / Snort
    • Direct tools like ssh and third party apps like Cloudberry
    • Dedicated Instance
    • Lambda Service
    • NAT Gateway
    • HIPAA-aligned tech at AWS (original 9; need to add with link the new ones)
      • S3 storage
      • EC2 compute instances (VMs)
      • EBS elastic block storage: Attached filesystem
      • RDS relational database service
      • DynamoDB database
      • EMR elastic map reduce: Hadoop/Spark engine support
      • ELB elastic load balancer
      • Glacier archival storage
      • Redshift data warehouse
  • “How can you be using Lambda? It is not on the list…?”
    • Tools that do not come in contact with PHI can be thought of as ‘triggers and orchestration’.
    • Services that may come in contact with PHI can be described as ‘data and compute’
  • Encryption
    • HIPAA requires data be encrypted at rest, i.e. on a storage device
    • HIPAA requires data be encrypted in transit, e.g. moving from one storage device to another

Plan of action for this SCE Proof of Concept

  1. System architecture and diagram
    • Anticipate pipeline, pipeline
    • EMR / data warehouse / REDCap survey system / new clinical data / sensors
    • Destinations are RDS, S3, EBS
    • AWS big box, VPC inside, Public and Private subnets inside VPC
    • NAT gateway inside the Public subnet box
    • Internet Gateway on boundary of VPC
    • S3 Endpoint on boundary of VPC
  2. Artificial data 1
    • static historical synthetic data (from EMR)
    • IOT stream
    • anticipate study-to-clinical pipeline
  3. Complete system including analytical tools
    • include R, Python, Jupyter
  4. Artificial data 2 see KS

Build the SCE

Notes on format and preliminary steps

Notes in passing

We punctuate the procedural steps needed to build the Secure Compute Environment (SCE) with short notes on rationale and how things fit together. We also make extensive use of very short abbreviations (just about every entity gets one, indicated by boldface) and obsessive re-naming of everything using the Project Identifier Tag (PIT).

Part 1: Getting started

Zero’th Required Action

As the SCE builder, your absolute First Priority Step 0 Must Do is to designate to AWS and DLT that the account you are using will involve PII/PHI/HIPAA data.

Cost tracking and cost reduction

  • This will be a multi-day effort. Shut down instances to save money at the end of the day.
  • Tags… kilroy

Objectives

Our main objective (see Figure below) is to use a Laptop or other cloud-external data source to feed data into an SCE wherein we operate on that data. The data are assumed to be Protected Health Information (PHI) or Personally Identifiable Information (PII).

PIT means Project Identifier Tag (an informal term)

  • Write down or obtain a Project Identifier Tag (PIT) to use in naming/tagging everything
    • This is a handy string of characters
    • In our example here PIT = ‘hipaa’. Short, easy to read = better

Source computer

  • Identify our source computer as L, a Laptop sitting in a coffee shop
    • This is intentionally a non-secure source
    • L does not have a PIT
    • L could also be a secure resource operated by Med Research / IRB / data warehouse

We are starting with a data source and will return to encryption later. The first burst of activity will be the creation of a Virtual Private Cloud (VPC) on AWS per the diagram above. We assume you have done Step 0 above and are acting in the capacity of a system builder; but you may not be an experienced IT professional. That is: We assume that you are building this environment and that you may or may not be doing research once it is built; but someone will.

VPC via Wizard versus manual build

The easiest way to create a VPC is using the console Wizard. That method is covered in a section below and it can automate many of the steps we describe manually. We describe the manual method to illuminate the components.

CIDR block specification

The CIDR block syntax uses a specification like 10.0.0.0/16. This has two components: A ‘low end of range’ ip address w.x.y.z and a width parameter /N. w, x, y and z are integers from 0 to 255, in total providing 32 bits of address space.

N determines an addressable space of size s = 2^(32 - N). For example N = 24 produces s = 2^8 or s = 256 available addresses, starting at w.x.y.z. Hence z (and possibly y) would increase to span the available address space.

Another example: Suppose we specify 10.0.0.0/16. Then s = 2^16 so 65536 addresses are available: 10.0.0.0, 10.0.0.1, 10.0.0.2, …, 10.0.1.0, 10.0.1.1, …, 10.0.255.255. y and z together span the address space.

These ip addresses are defined in the VPC, contextually local within the VPC.

Any subnets we place within the VPC will be limited by this address space.
In fact we proceed by defining subnets within the VPC with respective CIDR ranges, subranges of the VPC CIDR block. In our case the first subnet will have CIDR = 10.0.0.0/24 with 256 addresses available: 10.0.0.0, 10.0.0.1, …, 10.0.0.255. Five of these are appropriated by AWS machinery. The second subnet will be non-overlapping with CIDR range = 10.0.1.0/24.

Since AWS appropriates five ip addresses for internal use (.0, .1, .2, .3, and .255) we should look for ways of making ip address assignment automatic.
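
As a quick sanity check of this arithmetic, here is a small Python snippet (standard library only, a minimal sketch) that confirms the sizes of the CIDR blocks used below and that the two subnets do not overlap.

import ipaddress

vpc     = ipaddress.ip_network('10.0.0.0/16')
public  = ipaddress.ip_network('10.0.0.0/24')
private = ipaddress.ip_network('10.0.1.0/24')

print(vpc.num_addresses)         # 65536 addresses in the VPC block
print(public.num_addresses)      # 256 addresses in each /24 subnet
print(public.subnet_of(vpc))     # True: the subnet falls inside the VPC block
print(public.overlaps(private))  # False: the two subnets do not collide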

Creating a Virtual Private Cloud

Build the VPC

Here we abbreviate elements with boldface type. In most cases the entity we create can be named, so as a reminder: for consistency we have come up with a Project Identifier Tag such as ‘hipaa’ so that each entity can be given a PIT name: ‘hipaa_vpc’ and so on. Naming associated S3 buckets may be harder because a bucket name must be an allowed DNS name that does not conflict with any existing S3 bucket across the entire AWS cloud.

Intermezzo: 30 Screencaps: Need to be interspersed (kilroy)


To continue…

  • From the console create a new VPC V

    • Give V a PIT name

    • V will not use IPv6.

    • V will have a CIDR block defining an ip address space
      • We use 10.0.0.0/16: 65536 (minus a few) available addresses
    • V automatically has a routing table RT
      • Select Routing Tables, sort by VPC and give RT a PIT name
        • ‘hipaa_routingtable’
        • The routing table is a logical mapping of ip addresses to destinations
    • V is automatically given a security group SG
      • Select Security Groups, sort by VPC and give SG a PIT name
        • ‘hipaa_securitygroup’
    • Create an associated Flow Log FL
      • In March 2017 the AWS console UI was a little tetchy so be prepared to go around twice
      • On the console view of the VPC: Click the Create Flow Log button
        • Assuming permissions are not set: Click on the Hypertext to Set up permissions
          • Because: We need to define the proper Role
          • On the Role creation page: Give the Role a PIT name; Create new; Allow
          • You now have an IAM Role for FlowLogs
            • This gives the account the necessary AWS permissions to work with Flow Logs
            • In so doing we fell out of the Create Flow Log dialog so… back around
        • Return to the VPC in the console
        • Click on Create Flow Log
          • Filter = All is required (not “accepted” and not “rejected” traffic)
          • Role: Select the role we just created above
          • Destination log group: Give it a PIT name
            • Example: hipaa_loggroup
    • Create subnets Spublic and Sprivate
      • The private subnet Sprivate is where work on PHI proceeds
        • CIDR block 10.0.1.0/24
        • Sprivate will be firewalled behind a NAT gateway
          • This prevents traffic in (such as ssh)
      • The public subnet Spublic is for internet access
        • CIDR block 10.0.0.0/24
        • Spublic connects with the internet via an Internet Gateway
        • Spublic will be home to a Bastion server B
        • Spublic will be home to the NAT Gateway NG mentioned above
          • B and NG are on the public subnet but also have private subnet ip addresses
            • That is: Everything on the public subnet also has a private ip address in the VPC.
            • This will use the private ip address space
            • Public names will resolve to private addresses within the VPC at need.
    • Create an Internet Gateway IG
      • Give a PIT name as in ‘hipaa_internetgateway’
      • Attach hipaa_internetgateway to V
    • Create a NAT Gateway NG
      • Give it a PIT name
      • Elastic IP assignment may come into play here
    • Create a route table RTpublic
      • Give it a PIT name: ‘hipaa_publicroutes’
      • This will supersede the V routing table RT
      • Select the Subnet Associations tab
        • Edit subnet association to be Spublic
      • Select the Routes tab
        • Edit (under Routes) and add 0.0.0.0/0 pointing to IG

Note: The console column for subnets shows “Auto-assign Public IP” and this should be set to Yes for Spublic. Note the column title includes the term Public IP. The Private subnet should have this set to No. If necessary change these entries using the Subnet Actions button.

Note: In the subnet table find a “Default Subnet” column. In this work-through both Spublic and Sprivate have this set to No: There is no default subnet. This will be modified later via the route table RT in V.

RT reads:

10.0.0.0/16         VPC "local" 
0.0.0.0/0           NAT gateway 

RTpublic (hipaa_publicroutes) has

10.0.0.0/16         VPC "local" 
0.0.0.0/0      Internet Gateway

We now have two route tables: hipaa_routingtable (the default for the VPC in general) and hipaa_publicroutes (for Spublic). Notice that RT operates by default and RTpublic supersedes it on Spublic. This means that new resources on Sprivate will by default route through NG, which is what we want.

The 0.0.0.0/0 entry in the V route table RT points at NG: all internet-bound traffic will route through NG. NG does not accept inbound traffic; this is the default behavior. (Analogy: a home router.)

Spublic has the custom RTpublic, which routes non-local traffic to the IG, i.e. the internet. This path does accept inbound traffic, allowing us to ssh in. This is the exception to the default.
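
For readers who prefer to script the build, the following is a minimal boto3 sketch of the manual steps above. It is a sketch under assumptions, not a turnkey script: the region, availability zone, and the Flow Log role ARN are placeholders, and error handling and waits (for example, waiting for the NAT gateway to become available) are omitted.

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')   # region is a placeholder

# VPC with Dedicated tenancy and the PIT name
vpc_id = ec2.create_vpc(CidrBlock='10.0.0.0/16',
                        InstanceTenancy='dedicated')['Vpc']['VpcId']
ec2.create_tags(Resources=[vpc_id], Tags=[{'Key': 'Name', 'Value': 'hipaa_vpc'}])

# One public and one private subnet, kept in a single AZ
pub_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.0.0/24',
                           AvailabilityZone='us-west-2a')['Subnet']['SubnetId']
prv_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.1.0/24',
                           AvailabilityZone='us-west-2a')['Subnet']['SubnetId']

# Internet Gateway attached to the VPC
igw_id = ec2.create_internet_gateway()['InternetGateway']['InternetGatewayId']
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# NAT Gateway on the public subnet, with a new Elastic IP
eip = ec2.allocate_address(Domain='vpc')
nat_id = ec2.create_nat_gateway(SubnetId=pub_id,
                                AllocationId=eip['AllocationId'])['NatGateway']['NatGatewayId']

# RTpublic: 0.0.0.0/0 -> IG, associated with Spublic only
rt_pub = ec2.create_route_table(VpcId=vpc_id)['RouteTable']['RouteTableId']
ec2.create_route(RouteTableId=rt_pub, DestinationCidrBlock='0.0.0.0/0', GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=rt_pub, SubnetId=pub_id)

# Main route table RT: 0.0.0.0/0 -> NAT gateway (wait for the NAT gateway first)
main_rt = ec2.describe_route_tables(
    Filters=[{'Name': 'vpc-id', 'Values': [vpc_id]},
             {'Name': 'association.main', 'Values': ['true']}])['RouteTables'][0]['RouteTableId']
ec2.create_route(RouteTableId=main_rt, DestinationCidrBlock='0.0.0.0/0', NatGatewayId=nat_id)

# VPC Flow Log to CloudWatch Logs; the role ARN is a placeholder
ec2.create_flow_logs(ResourceIds=[vpc_id], ResourceType='VPC', TrafficType='ALL',
                     LogGroupName='hipaa_loggroup',
                     DeliverLogsPermissionArn='arn:aws:iam::123456789012:role/hipaa_flowlog_role')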

Part 3: Adding EC2 and S3 resources to the VPC

S3 buckets

S3 Encryption policy

We create new S3 buckets associated with projects and assign them a Policy to ensure that Server-side encryption is requested by anyone attempting to upload data. This ensures the data will be encrypted when it comes to rest in the bucket.

AWS link for S3 server-side encryption policy for copy-paste

(AWS HIPAA encryption bucket policy screencap)

S3 Endpoints

An S3 Endpoint is routing information associated with the VPC. S3 access from the VPC should not go through the public internet; and this routing information ensures that. The S3 Endpoint is not subsequently invoked; it is simply infrastructure. For example an EC2 instance might access an S3 bucket via the AWS CLI as in

% aws s3 cp s3://any-bucket-name/content local_filename 
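
Creating the endpoint itself can also be scripted. Here is a hedged boto3 sketch: the VPC id, route table id, and the region embedded in the service name are placeholders. The endpoint is attached to the route table(s) that should reach S3 privately.

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Gateway endpoint for S3, attached to the VPC's main route table
ec2.create_vpc_endpoint(VpcId='vpc-0123456789abcdef0',           # placeholder
                        ServiceName='com.amazonaws.us-west-2.s3',
                        RouteTableIds=['rtb-0123456789abcdef0'])  # placeholder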

Building a Bastion Server B

  • On V create a public-facing Bastion Server B
    • B has only port 22 open (to support ssh)
    • B uses Security Groups on AWS to limit inbound access to a restricted set of source IP ranges
    • B can be a modest general-purpose machine such as an m4.large
      • In this example we choose an m4.large running AWS Linux
      • DO NOT USE T instances
        • They will not connect to our Dedicated Tenancy VPC: This is not supported.
    • Go through all config steps: Memory, tags;
      • Security group is important
        • Create a new security group
        • PIT name: hipaa_bastion_ssh_securitygroup
        • Description = allow ssh from anywhere
        • Notice that in the config table “Source” is 0.0.0.0/0 which is “anywhere”
          • Best practice is to restrict the inbound range
          • Consider differentiating UW from the UW-VPN
            • This would allow someone to log in from anywhere VIA the UW VPN
      • Key pair use a PIT as ‘hipaa_bastion_keypair’
        • Generate new, download to someplace safe on L
      • Launch instance

Building a work environment EC2 instance on Sprivate

  • On Sprivate install a small Dedicated EC2 instance E (a launch sketch for B and E follows below)
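
The console steps above can be mirrored with boto3. The sketch below is illustrative only: the AMI id, VPC id, subnet ids and the campus CIDR range are placeholders, and the key pairs are assumed to already exist under the PIT names used earlier.

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Security group for the bastion: ssh only, from a restricted source range.
# 198.51.100.0/24 is a placeholder for the UW / UW Medicine / VPN range.
sg = ec2.create_security_group(GroupName='hipaa_bastion_ssh_securitygroup',
                               Description='allow ssh from the campus range only',
                               VpcId='vpc-0123456789abcdef0')
ec2.authorize_security_group_ingress(GroupId=sg['GroupId'], IpProtocol='tcp',
                                     FromPort=22, ToPort=22,
                                     CidrIp='198.51.100.0/24')

# Bastion B on the public subnet (public IP comes from the subnet setting)
ec2.run_instances(ImageId='ami-0123456789abcdef0',      # an AWS Linux AMI
                  InstanceType='m4.large',               # m-type, never t-type
                  KeyName='hipaa_bastion_keypair',
                  SubnetId='subnet-public-placeholder',
                  SecurityGroupIds=[sg['GroupId']],
                  MinCount=1, MaxCount=1)

# Worker E on the private subnet: dedicated tenancy, no public address
ec2.run_instances(ImageId='ami-0123456789abcdef0',
                  InstanceType='m4.large',
                  KeyName='hipaa_ec2_private_keypair',
                  SubnetId='subnet-private-placeholder',
                  Placement={'Tenancy': 'dedicated'},
                  MinCount=1, MaxCount=1)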

We are now configuring the S3 Endpoint. In doing so we selected the VPC route table RT (not the public subnet route table, though that could be used) because we are concerned with S3 traffic from the private subnet. The S3 Endpoint is a type of Gateway that gives the private subnet direct access to S3 without any traverse of the public internet.

WARNING (kilroy): After I added the S3 Endpoint the NAT gateway entry had vanished from my VPC routing table. This is bad. For now the procedure is: after adding the S3 Endpoint, examine RT and re-add the NAT gateway route if it is not present. Holy cow!

Created S3 bucket

Part 4: Ancillary components

This section describes the use of automated services, database tables, IOT endpoints and other AWS features to augment the SCE. Such ancillary components may or may not touch directly on PHI, an important differentiator as only HIPAA-aligned technologies are permitted to do so.

  • Set up a DynamoDB table to track names of uploaded files
  • Set up a Lambda service
    • Triggered by a new object arriving in the S3 input bucket
    • This Lambda service is managed using a role
  • Set up an SQS Simple Queue Service Q
  • Create an SNS topic to notify me when interesting things happen (a sketch of the Lambda piece follows below)
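
As an illustration of how these pieces fit together, here is a hypothetical Lambda handler in Python. The DynamoDB table name, its key attribute ('filename'), and the SNS topic ARN are assumptions for this sketch; the event structure is the standard S3 notification format.

import os
import boto3

dynamodb = boto3.resource('dynamodb')
sns = boto3.client('sns')

TABLE_NAME = os.environ.get('TRACKING_TABLE', 'hipaa_uploaded_files')  # placeholder
TOPIC_ARN = os.environ.get('TOPIC_ARN')                                # placeholder

def handler(event, context):
    table = dynamodb.Table(TABLE_NAME)
    records = event.get('Records', [])
    for record in records:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        # Remember: object keys (filenames) must not themselves contain PHI.
        table.put_item(Item={'filename': key,
                             'bucket': bucket,
                             'event_time': record['eventTime']})
        if TOPIC_ARN:
            sns.publish(TopicArn=TOPIC_ARN,
                        Subject='New object in input bucket',
                        Message='{}/{}'.format(bucket, key))
    return {'processed': len(records)}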

Part 5: Encryption

Suppose on EC2 we create an EBS volume for the PHI. Is the EBS volume that “comes with” the instance encrypted? No. Therefore: keep data on the /hipaa volume. Volume type = General Purpose; the size does affect IOPS. I went for 64 GB.

For costing and performance include this link: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html

This gets back into optimization; it does not affect our HIPAA story

Notice the EBS volume has an AZ which must match the target EC2 where it is going

Notice the volume can be encrypted with a check box; and then there is a Master Key question.

Using the default key technically encrypts this data at rest. So that’s done.

Working with your own set of keys would be part of risk management (what to do should the keys be compromised, and so on). It brings some additional hassle but also some potential risk-management benefit.

The question is: Do we want to use one key across multiple environments (the default) or create new keys for each new environment? How much hassle, etc.

ok done: hipaa_ebs_ec2_private
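
The same volume can be created programmatically; here is a hedged boto3 sketch. The AZ must match the target EC2 instance, and the instance id and device name below are placeholders. Encrypted=True uses the account's default EBS key unless a KmsKeyId is supplied.

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

vol = ec2.create_volume(
    AvailabilityZone='us-west-2a',   # must match the target instance's AZ
    Size=64,                         # GiB, General Purpose SSD
    VolumeType='gp2',
    Encrypted=True,                  # default account key unless KmsKeyId is given
    TagSpecifications=[{'ResourceType': 'volume',
                        'Tags': [{'Key': 'Name', 'Value': 'hipaa_ebs_ec2_private'}]}])

ec2.get_waiter('volume_available').wait(VolumeIds=[vol['VolumeId']])
ec2.attach_volume(VolumeId=vol['VolumeId'],
                  InstanceId='i-0123456789abcdef0',   # placeholder
                  Device='/dev/sdf')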

Next log in to EC2 on the private subnet:

  • log in to the bastion server
  • use copy-paste to create a file on the bastion called ec2_private_keypair.pem
    • I did a bad paste and was missing the ‘-----BEGIN’ header line
    • run chmod 600 on this pem file
    • ssh -i ec2_private_keypair.pem ec2-user@10.0.1.248
      • notice this ip address is in the console Description tab when highlighting the private subnet EC2 instance
    • Notice that the pem file on the bastion would immediately compromise the EC2 on the private subnet
      • That sure sucks for you

Using (sudo) fdisk to create the partition on the raw block device (fast: writing the partition table) will blow away anything already there.

Now we did the simple

% aws s3 cp s3://bucket-name/keyname .

The keyname that I used was the filename that I uploaded to this S3 bucket from L. That file was pushed from L using the console but can also be done using the CLI.

We are intentionally not going to encrypt the boot volume. It can be done; goes on the DD pile.

We implement server side encryption on S3 next.

  • The file may be unencrypted on L
  • We upload it to S3 and stipulate “encrypt this when it comes to rest in S3”
  • S3 manages this
  • We create an associated policy that only allows this type of upload
    • Therefore a not-encrypted-at-rest request will be denied

The best way is to follow “S3 AWS encryption” links to http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingEncryption.html, click Server-Side Encryption, then click Amazon S3-Managed Encryption Keys in the left bar to get to http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html

Copy the code box contents. In the console go to the S3 bucket, Permissions tab, Bucket Policy button, and paste. Replace the placeholder with the actual bucket name in two places.

The policy denies any PutObject request whose server-side-encryption header is not set to AES256 (the name of the encryption algorithm) and also denies any PutObject request in which the server-side-encryption header is missing altogether.

What about the inbound files? They must be “encrypted in transit”; we get that by using the https endpoint to S3. Done.
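
For completeness, here is a sketch of the same policy applied with boto3, followed by an upload that satisfies it. The bucket name and local file are placeholders; the two Deny statements mirror the AWS-documented sample referenced above.

import json
import boto3

s3 = boto3.client('s3')
bucket = 'hipaa-data-in'    # placeholder bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "DenyWrongEncryptionHeader",
         "Effect": "Deny",
         "Principal": "*",
         "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::{}/*".format(bucket),
         "Condition": {"StringNotEquals":
                       {"s3:x-amz-server-side-encryption": "AES256"}}},
        {"Sid": "DenyMissingEncryptionHeader",
         "Effect": "Deny",
         "Principal": "*",
         "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::{}/*".format(bucket),
         "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}}}]}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

# An upload that satisfies the policy: encrypted at rest (AES256) on arrival,
# and encrypted in transit because the transfer goes over HTTPS.
s3.upload_file('localfile.csv', bucket, 'datain/localfile.csv',
               ExtraArgs={'ServerSideEncryption': 'AES256'})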

Transferring to the instance: scp

Last encryption note: SSL is used by the CLI by default, so our EC2-private command ‘aws s3 …’ is covered (see cli/latest/reference for the link on this). This means that S3 to EC2-private traffic is encrypted in transit. The EBS /hipaa volume is encrypted at rest. Done.

Part 6: Auditing

SCE activity must be logged in such a manner as to permit subsequent tracing of PHI data access and ongoing monitoring of the security state of the system.

CloudTrail and CloudWatch

CloudTrail logs the API calls made against my account. Whether through the console, the CLI, or the APIs directly: all of it is logged in CloudTrail once it is enabled. These paths all use the same APIs, so logging at the API level captures everything. Logs go to S3.

Here is the key thing: enable CloudTrail, which creates a destination S3 bucket where all of the logging will go.

Best practice is to turn on CloudTrail in all of your regions so you are not blind. “Either you are logging or it is gone.”

CloudWatch is more for monitoring: performance metrics.

In both cases there is never PHI in the logs, provided SSNs (for example) are not in the filenames.

And S3 has its own internal logging mechanism as well.

  • Create a new bucket hipaa-s3-access-logs with vanilla settings
  • Locate the existing inbound data bucket > Properties > Logging > enable > stipulate the access logs bucket; add a tag…

This will contain http traffic details: where requests were coming from, for example. (A scripted sketch of enabling CloudTrail and S3 access logging follows below.)
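
Here is a hedged boto3 sketch of enabling CloudTrail and S3 server access logging. The bucket names are placeholders following the PIT convention; note that the CloudTrail destination bucket must carry a bucket policy allowing CloudTrail to write to it, and the access-log bucket must grant log delivery permission, both omitted here.

import boto3

trail_bucket = 'hipaa-cloudtrail-logs'        # placeholder
access_log_bucket = 'hipaa-s3-access-logs'    # placeholder
data_bucket = 'hipaa-data-in'                 # placeholder

cloudtrail = boto3.client('cloudtrail', region_name='us-west-2')
s3 = boto3.client('s3')

# A multi-region trail so that activity in every region is captured
cloudtrail.create_trail(Name='hipaa_trail',
                        S3BucketName=trail_bucket,
                        IsMultiRegionTrail=True)
cloudtrail.start_logging(Name='hipaa_trail')

# S3 server access logging: the inbound data bucket logs to the access-log bucket
s3.put_bucket_logging(
    Bucket=data_bucket,
    BucketLoggingStatus={'LoggingEnabled': {'TargetBucket': access_log_bucket,
                                            'TargetPrefix': 'datain/'}})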

AWS Config

CloudTrail tells you what is happening in terms of the API; AWS Config tells you what changed.

Use them together to see whether something of concern is happening or has happened.

A high-level setup is described here; a more detailed configuration is possible.

Console > Management > Config

Think of Config (like CloudTrail) as account-wide, not “per resource”, so tagging with the PIT is not appropriate.

My process was pretty default.

So now we have CloudTrail, S3 logging and Config operating on this account. As we get more sophisticated we could dig into CloudWatch.

Part 7: Disaster Recovery

Indicate awareness; up to CISO to provide hurdles

Part 8: Generating Pseudo-Data

The following is a snapshot of some Python code that generates three CSV files with imaginary health data. The data are non-trivial insofar as three of the recorded values are derived from the vital signs.

# 100 people (all named John) live in a small town with one doctor. They appeared in this town
#   on October 30, 1938 all at once. Every day or two one of them at random visits the doctor
#   where his vital signs are recorded: Blood pressure (two numbers), respiration rate, heart rate,
#   blood oxygen saturation, body temperature (deg F) and weight (pounds). The doctor also asks
#   'Since your last visit how many albums have you purchased by Count Basie? By T-Pain? By the 
#    Dead Kennedys?' This process generates a 10,000 row time-series database over 41 years. 
#
# The code below builds 3 tables (each being a list of lists) and these are written to CSV files.
#   The following are some notes on these tables and related parameters.
#
#  Tstudy is the first day of study data, morning of October 30 1938
#  Tborn is the latest possible date of birth, Dec 31 1922 (so all participants are 16 or older)
#    However all participants give their DOB as August 26 1920.
#
# The Python random number generator is given a fixed seed so that this code always generates the 
#   same patient history.
#
#  1. patient table p: Surname, Given Name, DOB, (height feet, inches), patient ID
#  2. patient parameters pp: for internal use
#  3. time series data ts: Across all patients
#

import datetime
import random as r
import csv

Tborn = datetime.datetime(1922,12,31,0,0,0)
Tstudy = datetime.datetime(1938,10,30,0,0,0)

pName = 'patients.csv'
ppName = 'patientParameters.csv'
tsName = 'timeseries.csv'

# Keep results reproducible using a fixed random number seed
r.seed(31415)

lc = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
uc = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']

def RandLastName():
    return uc[r.randint(0,25)]+lc[r.randint(0,25)]+lc[r.randint(0,25)]+lc[r.randint(0,25)]

def cmtoftin(ncm):
    if ncm < 50*2.54: return (5, 8)   # fallback for out-of-range input: return a tuple, not a set
    nin = ncm / 2.54
    nft = int(nin/12)
    nin = int(nin - 12*nft)
    return (nft, nin)

def ftintocm(height):
    height_in = height[0]*12 + height[1]
    return int(height_in*2.54)

def someheight():
    hgtLow = 157
    hgtHi = 198     # from 5 foot 2 to 6 foot 6
    return cmtoftin(r.randint(hgtLow, hgtHi))

sysMin = 90
sysMax = 129
diaMin = 60
diaMax = 84
rrMin = 10
rrMax = 20
hrMin = 40
hrMax = 100
osMin = 95.0
osMax = 99.0
btMin = 97.0
btMax = 99.1
mMin = 98
mMax = 280

def ModelBaseE(hr, sys, dia, bmi):
    if hr > 58 and hr < 62 and sys-dia > 30 and bmi < 22: return 1
    return 0

def ModelTPain(rr, os, height): 
    return float(rr) + 2.0*(os-95.0)+0.2*(ftintocm(height)-157)

def ModelDK(weight, bt):
    if weight < 150 and bt < 97.4: return 1
    return 0

def RandomSystolic(a,b):
    return r.randint(a,b)

def RandomDiastolic(a,b):
    return r.randint(a,b)

def RandomRespirationRate(a,b):
    return r.randint(a,b) 

def RandomHeartRate(a,b):
    return r.randint(a,b) 

def RandomOxygenSaturation(a,b):
    return r.uniform(a,b)

def RandomBodyTemperature(a,b):
    return r.uniform(a,b)

def RandomMass(a,b):
    return r.randint(a,b)

def BMI(height,mass):
    height_in = height[0]*12+height[1]
    return 703.0*mass/(height_in*height_in)

nPatients = 100
bornDay = datetime.datetime(1920,8,26,8,30,12)
p=[]
pHdr = []
pp=[]
ppHdr=[]

pSurnameIndex = 0
pFirstNameIndex = 1
pDOBIndex = 2
pHeightIndex = 3
pIDIndex = 4

pHdr.append(['Surname','Given name', 'DOB', 'height', 'ID'])
p.append(['Bigboote','John',bornDay,someheight(),0])
p.append(['Yaya','John',bornDay,someheight(),1])
p.append(['Smallberries','John',bornDay,someheight(),2])
p.append(['Parker','John',bornDay,someheight(),3])
p.append(['Whorfin','John',bornDay,someheight(),4])
p.append(['Valuk','John',bornDay,someheight(),5])
p.append(['Gomez','John',bornDay,someheight(),6])
p.append(['OConnor','John',bornDay,someheight(),7])
p.append(['Emdall','John',bornDay,someheight(),8])
p.append(['Gant','John',bornDay,someheight(),9])
p.append(['Manyjars','John',bornDay,someheight(),10])
p.append(['Milton','John',bornDay,someheight(),11])

pDone = len(p)
for i in range(nPatients-pDone):
    while True:
        nextName = RandLastName()
        # p holds full records, so check the surname field for uniqueness
        if not any(nextName == rec[pSurnameIndex] for rec in p):
            p.append([nextName,'John',bornDay,someheight(),i+pDone])
            break

with open(pName,'w',newline='') as patientFile:   # text mode for the Python 3 csv writer
    csvWriter = csv.writer(patientFile, dialect='excel')
    csvWriter.writerows(pHdr)
    csvWriter.writerows(p)

# The next table is pp for 'patient parameters' and it requires some explanation
# It will be used for two purposes: To keep static range parameters for the patient's vital
#   signs (used to generate the time series data on a per-patient basis) and it will also
#   be used to track cumulative values for the three 'effect' categories BaseE, TPain and DK.
#
#   In passing: These are meaningless categories. Their similarity in sound to musical acts 
#   is coincidental. 
#
ppIDidx=0
ppS0idx=1
ppS1idx=2
ppD0idx=3
ppD1idx=4
ppRR0idx=5
ppRR1idx=6
ppHR0idx=7
ppHR1idx=8
ppOS0idx=9
ppOS1idx=10
ppBT0idx=11
ppBT1idx=12
ppM0idx=13
ppM1idx=14
ppSUMBidx=15
ppSUMTidx=16
ppSUMDKidx=17

ppHdr.append(['ID','S0','S1','D0','D1','RR0','RR1','HR0','HR1','OS0','OS1',\
           'BT0','BT1','M0','M1','SUMB','SUMT','SUMDK'])


for i in range(nPatients):
    s0 = r.randint(sysMin, sysMax)
    s1 = s0 + r.randint(4,20)       # a range of systolic pressures (mmHg)
    d0 = r.randint(diaMin, diaMax)
    d1 = d0 + r.randint(4,10)       # a range of diastolic pressures
    rr0 = r.randint(rrMin, rrMax)
    rr1 = rr0 + r.randint(2,8)
    hr0 = r.randint(hrMin, hrMax)
    hr1 = hr0 + r.randint(10,20)
    os0 = r.uniform(osMin, osMax)
    os1 = os0 + r.uniform(1.0,2.0)
    bt0 = r.uniform(btMin, btMax)    # body temperature uses btMin/btMax, not the systolic range
    bt1 = bt0 + r.uniform(0.3, 1.0)  # a small per-patient temperature range (deg F)
    m0 = r.randint(mMin, mMax)
    m1 = m0 + r.randint(3,40)    
    pp.append([i,s0, s1, d0, d1, rr0, rr1, hr0, hr1, os0, os1, \
               bt0, bt1, m0, m1, 0, 0, 0])

    
# Study begins on October 30 1938, generates 10,000 records and continues for about 41 years
ts = []
tsHdr = []
tsHdr.append(['date','ID','systol','diastol','resp.rate','heart rate','OSat','temp',\
           'weight','BMI','BaseE','TPain','DK','sum BaseE','sum TPain','sum DK'])

Time = Tstudy
for i in range(10000):
    # generate this timestamp
    thisID = r.randint(0,99)
    
    # for each of the following vitals we allow a bit of correcting goofy values
    
    # blood pressure
    thisSys = RandomSystolic(pp[thisID][ppS0idx], pp[thisID][ppS1idx])
    thisDia = RandomDiastolic(pp[thisID][ppD0idx], pp[thisID][ppD1idx])
    if thisDia >= thisSys - 3: thisDia = thisSys - 4
    
    # Respiration and Heart rates
    thisRR = RandomRespirationRate(pp[thisID][ppRR0idx], pp[thisID][ppRR1idx])
    thisHR = RandomHeartRate(pp[thisID][ppHR0idx], pp[thisID][ppHR1idx])
    
    # blood oxygen saturation
    thisOS = RandomOxygenSaturation(pp[thisID][ppOS0idx], pp[thisID][ppOS1idx])
    if thisOS > 99.4: thisOS = 99.4
     
    # body temperature 
    thisBT = RandomBodyTemperature(pp[thisID][ppBT0idx], pp[thisID][ppBT1idx])
    
    # body weight and BMI
    thisMass = RandomMass(pp[thisID][ppM0idx], pp[thisID][ppM1idx])
    thisBMI = BMI(p[thisID][pHeightIndex], thisMass)
    
    # three 'diagnostic observations'
    thisBaseE = ModelBaseE(thisHR, thisSys, thisDia, thisBMI)
    thisTPain = ModelTPain(thisRR, thisOS, p[thisID][pHeightIndex])
    thisDK = ModelDK(thisMass, thisBT)   # ModelDK expects body weight (pounds), not BMI
    
    # track the cumulatives of the diagnostics
    pp[thisID][ppSUMBidx]+=thisBaseE
    pp[thisID][ppSUMTidx]+=thisTPain
    pp[thisID][ppSUMDKidx]+=thisDK
    thisSumBaseE = pp[thisID][ppSUMBidx]
    thisSumTPain = pp[thisID][ppSUMTidx]
    thisSumDK = pp[thisID][ppSUMDKidx]

    # create a new record in the time series
    ts.append([Time, thisID, thisSys, thisDia, thisRR, thisHR, thisOS, \
               thisBT, thisMass, thisBMI, thisBaseE, thisTPain, thisDK, \
               thisSumBaseE, thisSumTPain, thisSumDK])
    
    # add a random number of days to the time
    Time += datetime.timedelta(days=r.randint(1,2))

with open(tsName,'w',newline='') as timeseriesFile:
    csvWriter = csv.writer(timeseriesFile, dialect='excel')
    csvWriter.writerows(tsHdr)
    csvWriter.writerows(ts)

# Write the patient parameters pp[] at the end to record cumulatives on BaseE, TPain, DK
with open(ppName,'w',newline='') as patientParameterFile:
    csvWriter = csv.writer(patientParameterFile, dialect='excel')
    csvWriter.writerows(ppHdr)
    csvWriter.writerows(pp)

Part 9: System operation

Part 10: Concluding remarks

  • Set up Ansible-assisted process for configuring and running jobs on EC2 instances
  • Pushing data to S3
    • Console does not seem like a good mechanism
    • Third party apps such as Cloudberry are possible…
    • AWS CLI with scripts: Probably the most direct method
  • Compute scale test: Involves setting up some substantial processing power
    • Implication is that the SCE can intrinsically fire up EC2 instances as needed
    • Launch W x 5 Dedicated instances, call these Wi
    • Assign S3 access role
    • Encrypted volumes
    • S/w pre-installed (e.g. genomics pipelines)
    • Update issue: Pipeline changes, etcetera;
  • Wi can be pre-populated with reference data: Sheena Todhunter operational scenario
    • Assumes that a SCE exists in perpetuity to perform some perfunctory pipeline processing
    • On B
      • Create SQS queue of objects in S3
      • Start a Wi for each message in queue…
    • Go
      • Latest pipeline… EBS Genomes… chew
      • If last instance running: Consolidate / clean-up
    • SNS topic notifies me when last instance shuts down.
      • Run Ansible script to configure Wi (patch, get data file names from DynamoDB table, etcetera)
      • Get Ws the Key from E
      • The Ws send an Alert through the NAT gateway to Simple Notification Service (SNS)
      • Which uses something called SES to send an email to the effect that the system is working with PHI data
        • Ws pull data from S3 using VPC Endpoint; thanks to the Route table
        • Ws decrypt data using HomerKey
        • Ws process their data into result files: Encrypted EBS volume.
        • Optionally the result files are encrypted in place in the EBS volume.
    • Through S3EP_O the results are moved to S3.
    • Wi sends an Alert through the NAT gateway to SNS
      • which uses something called SES to send another email: Done
    • Wi evaporates completely leaving no trace
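
To make the queue-driven part of this scenario concrete, here is a minimal boto3 sketch of the SQS/SNS plumbing. The bucket name, queue name and topic ARN are placeholders, and the processing step itself is elided.

import boto3

sqs = boto3.client('sqs', region_name='us-west-2')
sns = boto3.client('sns', region_name='us-west-2')
s3 = boto3.client('s3')

queue_url = sqs.create_queue(QueueName='hipaa_work_queue')['QueueUrl']
topic_arn = 'arn:aws:sns:us-west-2:123456789012:hipaa_notify'   # placeholder

# On B: queue one message per input object in the data bucket
for obj in s3.list_objects_v2(Bucket='hipaa-data-in').get('Contents', []):
    sqs.send_message(QueueUrl=queue_url, MessageBody=obj['Key'])

# On each worker Wi: poll for work, process, then notify via SNS
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                           WaitTimeSeconds=20)
for msg in resp.get('Messages', []):
    key = msg['Body']
    # ... pull s3://hipaa-data-in/<key> through the S3 Endpoint, process it,
    #     write encrypted results, push them to the output bucket ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])
    sns.publish(TopicArn=topic_arn, Message='finished {}'.format(key))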

Procedure Log

Create a VPC V

pic0001

hipaa0002

  • Use the name ‘hipaa’; should be unique.
  • CIDR as shown is typical.
  • Dedicated Instance means: Nobody else allowed here.

hipaa0003

  • I added a tag indicating that I originated the VPC.

Create a subnet

hipaa0004

hipaa0005

Public subnet addresses will be of the form: 10.0.0.2, .3, .4, … .254

hipaa0006

Take note of the AZ:

hipaa0007

We could do multiple public subnets by creating more than one in multiple AZs; that is a big-time concept.

Now the private one:

hipaa0007

hipaa0008

hipaa0009

hipaa0010

Attach it to the VPC:

hipaa0011

hipaa0012

Now for the NAT Gateway

hipaa0013

Choose the public one in czarhipaa

hipaa0014

hipaa0015

Notice we Created a New EIP

hipaa0016

Now the Route Table

hipaa0017

hipaa0018

hipaa0019

Here they are (including the default main one):

hipaa0020

hipaa0021

Edit and modify as shown:

hipaa0022

Save

And then under Subnet Associations tab: Edit:

hipaa0023

Now let us go back to the Route Table selector

hipaa0024

hipaa0025

hipaa0026

Subnet Associations tab:

hipaa0027

hipaa0028

Now for the Endpoint

hipaa0029

Notice that this has Full Access; we will restrict access at a later step.

hipaa0030

end as of Jan 27 2017.

Risk

This section identifies points of risk and their severity. Severity is described both ‘when protocol is properly observed’ and ‘when protocol is not observed’. That is: We provide examples of how failure to follow protocols could result in the compromise of PHI. There is in all of this a notion of diminishing returns: A tremendous amount of additional effort might be incorporated into building an SCE that provides only a small reduction of risk.

Protocols described in this document

VPC creation: Manual versus Wizard

Options outside the purview of this document

Dedicated instances

We drive cost up using dedicated instances on Sprivate. It is technically feasible to not do this but there is an attendant cost in time and risk.

Extended key management strategy

Encryption keys here are taken to be default keys associated with the AWS account. It is possible on setting up the SCE to create an entire structure around management of newly-generated keys. This is a diminishing-returns risk mitigation procedure: It may create a profusion of complexity that is itself a risk.

One open question is whether a single AWS account should / could / will be used to provision multiple independent research projects, each with one or more respective data systems.

  • Log in to B and move K to E
    • Observe that this material is encrypted
    • Maybe instead we should be tunneling directly to E from L
  • Use an AMI to create a processing EC2 W

    • Also with ENC
  • Create S3 buckets
    • S3D for data
    • S3O for output
    • S3L for logging
    • S3A for ancillary purposes (non-PHI is the intention)

    • Such S3 buckets only accept http PUT; not GET or LIST
  • Create an S3 bucket S3O for output
  • Create an S3 bucket S3L for logging
  • Create an S3 Endpoint S3EP_D in V
  • Create an S3 Endpoint S3EP_O in V
  • Create an S3 Endpoint S3EP_L in V
    • “S3 buckets have a VPC Endpoint included… ensure this terminates inside the VPC”
  • Create role R allowing W to read data at S3EP from S3