Conducting Data Analysis at Baylor College of Medicine (BCM)
Introduction
This documentation has two basic sections: computation resources and data access.
Section I: Computation Resources
If you have an approved Onco Array Proposal, and plan to conduct data analysis at BCM, please carefully read these items - some may be highly relevant to you and your group:
- Do you have a BCM VPN account (or, do you have an ECA)?
Note: Your login credentials on the Onco Array website do not automatically grant you a BCM VPN account.
- Yes, and it's working just fine: You can skip to step 2 below to apply for an HPC account.
But wait - before you go: Your BCM VPN gives you an ECA AND an E-Mail account. If you DO NOT wish to check your BCM E-Mail regularly, please do a one-time setup to redirect all your incoming E-Mail to an alternative destination: Microsoft Documentation is available. We use Microsoft Exchange Server and the standard client is Outlook. You can also access your E-Mail via a web browser; this does not require you to establish a VPN connection first.
- Yes, but not working: Please call the BCM IT Support line at 1-713-798-8737 during normal business hours, i.e., 8-5 Central Time, M-F.
- Yes, but none of the above helped me: Please contact the Onco Array Support Team to explain. We may have to re-submit your sponsorship paperwork.
- No, I never had a BCM VPN account:
Please E-Mail the Onco Array Support Team and explain your intentions:
- Approved proposal(s) you intend to work on (proposal title, its ID number aka pCode, Contact PI)
- Your full name and E-Mail address
- We will contact you soon with paperwork. There will be some forms to fill out in order for you to be vetted with BCM. Once approved, come back to this page and go straight to step 2 below.
- Your Linux Account @ chemo.dldcc.bcm.edu, the High Performance Computing Cluster (HPC account)
We will sponsor an account for you at BCM HPC. Please read the introduction to our Computational Environment (VPN connection required).
In principle, interactive jobs on the login node will not be tolerated; submit your work as batch jobs instead. Searching for "pbs qsub tutorial" will also give you plenty of examples.
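Since work is expected to go through the scheduler rather than run on the login node, a batch job script is the usual route. The following is a minimal sketch of a PBS job script using Torque-style directives. The job name, resource request, and walltime are placeholder assumptions, not values confirmed for this cluster, so check the Computational Environment introduction for the actual queues and limits.

```shell
#!/bin/bash
# Hypothetical PBS job script (saved as, e.g., my_job.pbs).
# Every #PBS value below is a placeholder assumption.
#PBS -N my_job
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -j oe

# PBS sets PBS_O_WORKDIR to the directory qsub was called from;
# fall back to the current directory when run outside the scheduler.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

# Placeholder workload: report where the job ran.
echo "Job started on $(uname -n) in $workdir"
```

Such a script would typically be submitted from the login node with `qsub my_job.pbs`, and `qstat -u yourECA` should list your queued and running jobs.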
Section II: Data Access
Once you are logged in to discovery.dartmouth.edu, all data files are stored at /mount/ictr1/Onco. Please note that this is a read-only folder. You should set up symbolic links to this location from your working areas, such as /mount/amos1/home/yourECA. Temporary space may also be set up in certain scratch areas. Again, please try very hard not to copy files; use symbolic links instead, such as
ln -s /mount/ictr1/Onco .
This example gives you a shortcut named Onco in your working area that acts just like a folder.
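As a slightly fuller sketch of the same idea, the snippet below creates a working directory, links the data folder into it once, and then shows where the link points. The directory name `my_analysis` and the `DATA_DIR`/`WORK_DIR` variables are hypothetical names introduced for illustration; only the data path comes from the text above.

```shell
# Data location from the text above; override DATA_DIR to try this elsewhere.
DATA_DIR="${DATA_DIR:-/mount/ictr1/Onco}"
# Hypothetical working area; "my_analysis" is a made-up directory name.
WORK_DIR="${WORK_DIR:-$HOME/my_analysis}"

mkdir -p "$WORK_DIR"
cd "$WORK_DIR" || exit 1

# Create the shortcut only if nothing named Onco exists here yet
# (-L also catches a link whose target is currently unreachable).
if [ ! -e Onco ] && [ ! -L Onco ]; then
    ln -s "$DATA_DIR" Onco
fi

# Show where the link points; no data is copied.
ls -ld Onco
```

Because the link is only a pointer, it uses no extra disk space, and removing it with `rm Onco` leaves the read-only data untouched.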
File | Type | Explanation | Access |
---|---|---|---|
samples-53600-top-tped.txt (129GB) | Genotype | There are 533631 markers in Oncoarray. | ln -s /bmds/data/Onco/samples-53600-top-tped.txt . |
samples-57775-top-tped.txt (139GB) | Genotype | Samples from plink 1-57775 in phenotype file, with another 4175 new samples. | ln -s /bmds/data/Onco/samples-57775-top-tped.txt . |
probs_impute2_51655_samples_563_chunks (747GB compressed) | Imputed Probs | There are about 21,000,000 markers in imputed files. Genotypes were imputed using the phase 3 1000G panel. Find files chunk*-chr*.impute2.gz, samples from "random" 1-51655 in phenotype file. | ln -s /bmds/data/Onco/probs_impute2_51655_samples_563_chunks . |
probs_impute2_3577_samples_563_chunks (564GB compressed) | Imputed Probs | Find files chunk*-chr*.impute2, samples from "random" 51656-55232 in phenotype file. | ln -s /bmds/data/Onco/probs_impute2_3577_samples_563_chunks . |
probs_impute2_CHRX_41187_samples_31_chunks | Imputed Probs | Find files chunk*-chr*.impute2, samples from "chrx_random" 1-41187 in phenotype file. | ln -s /bmds/data/Onco/probs_impute2_CHRX_41187_samples_31_chunks . |
updated_phenotype_MARCH_06_2017.txt | Phenotype | PCA1, PCA2, PCA3 were added to the phenotype file. This phenotype is based on the latest file received from Xandra, onco_chris20160527, and CHINA samples were updated based on descriptions listed in the Excel file. The status variable gives the possible QC status: "----call1" stands for low call rate, "-CALL1----" stands for low call rate, "---SEX--" is for problematic gender, "--NOCAU---" is for non-white samples, "DUP-----" is for duplication, "SIB-----" is for related samples. "-9" means that samples are not in any category, but still based on availability of other phenotypes. The variable "caret" flags samples from the caret source: "1" is YES, "-9" is NO. There are 57775 samples in the Oncoarray project: 53600 plus 4175 new samples. A unique ID is assigned as plink 1-57775. The main keys: plink can be used to link phenotype and genotype (tped plink format, samples as columns from 1-53600, or 1-57775); random can be used to link phenotype and imputed probs (impute2 output format, samples as columns, each sample has 3 columns, from 1-51655, or 51656-55232 (new samples)); chrx_random can be used to link phenotype and imputed probs (impute2 output format, samples as columns, each sample has 3 columns, from 1-41187). | ln -s /bmds/data/Onco/updated_phenotype_MARCH_06_2017.txt . |
G1000_imputation_map.txt (3.9GB) | Marker | Annotated marker file for 21,000,000 markers in imputed files. Note that the marker file can be used to retrieve both genotypes and imputed probs. You can query this file using conditions. | ln -s /bmds/data/Onco/G1000_imputation_map.txt . |
G1000_CHRX_imputation_map.txt | Marker | Annotated marker file for 677,303 markers of ChrX in imputed files. Note that the marker file can be used to retrieve both genotypes and imputed probs. You can query this file using conditions. | ln -s /bmds/data/Onco/G1000_CHRX_imputation_map.txt . |
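The table notes that the imputed files hold samples as columns, three columns per sample. As a rough sketch of the column arithmetic, the hypothetical helper `sample_probs` below pulls one sample's three probabilities out of a chunk file. It assumes the standard IMPUTE2 output layout of five leading marker columns (SNP ID, rsID, position, allele A, allele B), which has not been verified against these particular files. It also assumes the sample number is relative to the first sample in the file, so for the 3577-sample set you would subtract 51655 from the "random" value first.

```shell
# Print the five marker columns plus the three probability columns of one sample.
# Assumption: standard IMPUTE2 layout, i.e. 5 leading marker columns followed
# by 3 columns per sample, space-separated.
sample_probs() {
    file=$1   # a chunk*-chr*.impute2 or chunk*-chr*.impute2.gz file
    idx=$2    # 1-based sample number relative to the first sample in the file
    case $file in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac | awk -v i="$idx" '{
        first = 5 + 3 * (i - 1) + 1   # first of the sample'"'"'s three columns
        print $1, $2, $3, $4, $5, $first, $(first + 1), $(first + 2)
    }'
}

# Tiny made-up file with two samples, only to demonstrate the arithmetic.
demo=$(mktemp)
printf '%s\n' '--- rs1 100 A G 1 0 0 0 1 0' \
              '--- rs2 200 C T 0 0 1 0.9 0.1 0' > "$demo"
sample_probs "$demo" 2
rm -f "$demo"
```

For sample 2 the helper reads columns 9 to 11, so the demo prints `--- rs1 100 A G 0 1 0` and `--- rs2 200 C T 0.9 0.1 0`. On the real chunk files this kind of scan is best run inside a batch job rather than on the login node.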