U19 Computational Support and Data Description | Integrative Analysis of Lung Cancer Etiology and Risk

High Performance Computing at Baylor College of Medicine

Conducting Data Analysis at Baylor College of Medicine (BCM)

Introduction

This documentation has two sections: computation resources and data access.

Section I: Computation Resources

If you have an approved Onco Array Proposal, and plan to conduct data analysis at BCM, please carefully read these items - some may be highly relevant to you and your group:

1. Do you have a BCM VPN account (or, do you have an ECA)?
Note: Your login credentials on the Onco Array website do not automatically grant you a BCM VPN account.
- Yes, and it's working just fine: You can skip to step 2 below to apply for an HPC account.
  But wait, before you go: Your BCM VPN account gives you an ECA and an E-Mail account. If you DO NOT wish to check your BCM E-Mail regularly, please do a one-time setup to redirect all your incoming E-Mails to an alternative destination (Microsoft documentation is available). We use Microsoft Exchange Server and the standard client is Outlook. You can also access your E-Mail via a web browser; this does not require you to establish a VPN connection first.
- Yes, but not working: Please call the BCM IT Support line, 1-713-798-8737, during normal business hours (8 AM to 5 PM Central Time, Monday through Friday).
- Yes, but none of the above helped me: Please contact the Onco Array Support Team to explain. We may have to re-submit your sponsorship paperwork.
- No, I never had a BCM VPN account: Please E-Mail the Onco Array Support Team and explain your intentions:
  - Approved proposal(s) you intend to work on (proposal title, its ID number aka pCode, Contact PI)
  - Your full name and E-Mail address
  - We will contact you soon with paperwork. There will be some forms to fill out in order for you to be vetted with BCM. Once approved, come back to this page and go straight to step 2 below.
2. Your Linux account @ chemo.dldcc.bcm.edu, the High Performance Computing Cluster (HPC account)
We will sponsor an account for you at BCM HPC. Please read the introduction to our Computational Environment (VPN connection required).
In principle, no interactive jobs on the login node will be tolerated; submit your work as batch jobs instead. Googling "pbs qsub tutorial" will also give you plenty of examples.
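
As a rough illustration of submitting work as a batch job rather than running it on the login node, the sketch below writes a minimal PBS job script. The job name, resource requests, and the Torque-style `nodes=1:ppn=1` syntax are assumptions for illustration only; the values the BCM cluster actually expects may differ, so check the Computational Environment introduction before using them.

```shell
# Write a minimal PBS job script. All names and resource values are placeholders.
cat > onco_job.pbs <<'EOF'
#!/bin/bash
#PBS -N onco_example
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -j oe

# PBS starts jobs in the home directory; move to the directory qsub was run from.
cd "$PBS_O_WORKDIR"

# Placeholder workload: replace with your own analysis command.
echo "Running on $(hostname)"
EOF

# Submit from the login node; the work itself then runs on a compute node.
# qsub onco_job.pbs
# qstat -u "$USER"
```

The two commented commands at the end are the usual way to submit the script and check its status once you are on the cluster.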

Section II: Data Access

Once you are logged in to discovery.dartmouth.edu, you will find all data files under /mount/ictr1/Onco. Please note that this is a read-only folder. You should set up symbolic links to it from your working areas, such as /mount/amos1/home/yourECA. Temporary space may also be set up in certain scratch areas. Please try very hard not to copy files; use symbolic links instead, such as

ln -s /mount/ictr1/Onco .

This example creates a shortcut named Onco in your working area that behaves just like a folder.

**Data and Files (Marker Description and Sample Description)**
File	Type	Explanation	Access
samples-53600-top-tped.txt (129GB)	Genotype	There are 533631 markers in Oncoarray.	ln -s /bmds/data/Onco/samples-53600-top-tped.txt .
samples-57775-top-tped.txt (139GB)	Genotype	Samples with plink IDs 1-57775 in the phenotype file: the 53600 samples above plus 4175 new samples.	ln -s /bmds/data/Onco/samples-57775-top-tped.txt .
probs_impute2_51655_samples_563_chunks (747GB compressed)	Imputed Probs	There are about 21,000,000 markers in the imputed files. Genotypes were imputed using the 1000G phase 3 panel. Files are named chunk-chr.impute2.gz; samples are "random" 1-51655 in the phenotype file.	ln -s /bmds/data/Onco/probs_impute2_51655_samples_563_chunks .
probs_impute2_3577_samples_563_chunks (564GB compressed)	Imputed Probs	Files are named chunk-chr.impute2; samples are "random" 51656-55232 in the phenotype file.	ln -s /bmds/data/Onco/probs_impute2_3577_samples_563_chunks .
probs_impute2_CHRX_41187_samples_31_chunks	Imputed Probs	Files are named chunk-chr.impute2; samples are "chrx_random" 1-41187 in the phenotype file.	ln -s /bmds/data/Onco/probs_impute2_CHRX_41187_samples_31_chunks .
updated_phenotype_MARCH_06_2017.txt	Phenotype	PCA1, PCA2, and PCA3 were added to the phenotype file. This phenotype file is based on the latest file received from Xandra, onco_chris20160527, and CHINA samples were updated based on the descriptions listed in the Excel file. The status variable gives the possible QC status: "----call1" stands for low call rate, "-CALL1----" stands for low call rate, "---SEX--" is for problematic gender, "--NOCAU---" is for non-white samples, "DUP-----" is for duplication, and "SIB-----" is for related samples. "-9" means that the sample is not in any of these categories, but still based on availability of other phenotypes. The variable "caret" marks samples of caret source: "1" is YES, "-9" is NO. There are 57775 samples in the Oncoarray project: 53600 plus 4175 new samples. A unique ID is assigned as plink 1-57775. The main keys: plink links phenotype and genotype (tped plink format, samples as columns, from 1-53600 or 1-57775); random links phenotype and imputed probs (impute2 output format, samples as columns, three columns per sample, from 1-51655 or 51656-55232 (new samples)); chrx_random links phenotype and ChrX imputed probs (impute2 output format, samples as columns, three columns per sample, from 1-41187).	ln -s /bmds/data/Onco/updated_phenotype_MARCH_06_2017.txt .
G1000_imputation_map.txt (3.9GB)	Marker	Annotated marker file for the approximately 21,000,000 markers in the imputed files. Note that the marker file can be used to retrieve both genotypes and imputed probs. You can query this file using conditions.	ln -s /bmds/data/Onco/G1000_imputation_map.txt .
G1000_CHRX_imputation_map.txt	Marker	Annotated marker file for the 677,303 ChrX markers in the imputed files. Note that the marker file can be used to retrieve both genotypes and imputed probs. You can query this file using conditions.	ln -s /bmds/data/Onco/G1000_CHRX_imputation_map.txt .
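
The phenotype description above says each sample occupies three columns in the imputed files and is located through its "random" (or "chrx_random") value. A minimal sketch of that lookup follows. It assumes the standard IMPUTE2 layout of five leading marker columns (SNP ID, rsID, position, allele A, allele B) before the per-sample probabilities, which should be checked against a real chunk file first. The file `demo.impute2` and the index `k` are made up for illustration. For the 3577-sample chunks, the index would presumably need to be offset by 51655, which is also an assumption.

```shell
# Hypothetical stand-in for one chunk file: 2 markers, 3 samples.
printf '%s\n' \
  '1 rs1 1000 A G 1 0 0 0 1 0 0 0 1' \
  '1 rs2 2000 C T 0.9 0.1 0 0.2 0.8 0 0 0.3 0.7' > demo.impute2

k=2   # value of the "random" column for the sample of interest

# Assumes 5 leading marker columns, then 3 probability columns per sample.
awk -v k="$k" '{
  first = 5 + 3 * (k - 1) + 1    # column of the first probability for sample k
  print $2, $3, $first, $(first + 1), $(first + 2)
}' demo.impute2 > sample_k.txt

cat sample_k.txt
```

For a compressed chunk, the same awk program can read from `gzip -dc` on the chunk file instead of a plain file, which avoids writing an uncompressed copy.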