Project Overview: Statistical Analysis on Biochemical Recurrence Prostate Cancer Among Patients in 3 Locations

Project Highlights

Computed Patient Baseline Characteristics table and Kaplan-Meier Curve graphs regarding 3 locations for prostate cancer patients
Proved probabilities of freedom from BCR among three locations were significantly different after stratification by conducting Stratified Logrank Test
built a AIC Stepwise Cox’s proportional hazards regression model with imputation and interaction terms, obtaining which demographic factors and genes have effects on survival
Found which factors and genes have greater effects on survival using Random Survival Forest method

Code

Data Source: 3 datasets are from built-in libraries: library(prostateCancerCamcap), library(prostateCancerTaylor), library(prostateCancerStockholm)
Language Used: R
Libraries Used: ggplot2, tidyr, devtools, dplyr, party, survival, tcltk, tcltk2, forcats, compareGroups, rms, npsurv, MASS, survminer, randomForestSRC

Project Objective

To investigate whether the target genes as well as demographic factors have effects on biochemical recurrence(BCR) free time among prostate cancer patients in three public datasets, Taylor, Cambridge and Stockholm (see Section 4)
To test if BCR free times are significantly different among three datasets (see Section 2 and Section 3)
To investigate which factors and genes have greater effects on survival using Random Survival Forests (Section 7)

Project Background

Biochemical recurrence(BCR) continues to occur in a large proportion of prostate cancer patients. Yet, established clinical predictors(PSA, Gleason score) often provide poor prognosis. It’s noticeable that more research is now focusing on the effect of genes such as integrated genomics and gene signatures on prostate cancer BCR.

Data Cleaning

Some main cleaning I did as follows:

Cleaned 3 datasets (3 locations) respectively to prepare them in the same format for horizontal join such as
- extracting Gene expressions, survival and demographic variables
- making variables the same names with the same data types for the 3 datasets
- dropping the rows having NAs in Time and Event indicator
Selected variables in the same order for all datasets and fully joined them horizontally
Created new variables such as 'dataset' with 3 levels (Cam, Taylor and stoch) to distinguish data from 3 different locations in the fully joined dataset
Cleaned categorical variables with multiple levels such as combining values with the same meaning to one level

After cleaning, we learned the following from the final joined dataset:

Columns included:
- Age: patient’s age
- Time: patient’s biochemical recurrence(BRC) free time
- PSA: patient’s PSA(prostate-specific antigen) level
- Event: two levels, 0 (right censoring), 1 (relapse)
- Stage: it measures how far the cancer has spread. Nine levels
- iCluster: seven levels
- Gleason: Gleason Score. Six levels
- ECE: extra-capsular extension. Two levels, Yes and No
- 23 columns for 23 target genes: SLC2A1-14, HK1-3, GCK, OAX1, ENO2, NCAM1, SYP, CHGA, FOLH1, KLK3
Total number of patients (rows): 343 patients. 111 from Cam dataset, 92 from Stoch dataset and 140 from Taylor dataset.
Factors that were missing or contained missing values, which were imputed later when fitting some Cox's proportional hazards regression models in Section 4:
- Factor—“Age” was unavailable in both Stoch and Taylor datasets
- Factors—“PSA” and “ECE” were unavailable in Taylor dataset.
- 2 NAs in “Stage”, 14 NAs in “iCluster” and 4 NAs in “Gleason”. Also, we assumed the deleted missing values in “Time” and “Event” were random distributed.

Section 1: Conducting univariate analysis among three datasets

After cleaning and joining the three datasets(Taylor, Cambridge and Stockholm), I conducted univariate analysis among three datasets as a table (Table 1) using compareGroups library. Below is part of the table.

Section 2: Kaplan-Meier Curves

KM curves among three datasets (Graph 1)

It’s clear from Graph 1 that patients in dataset—Stoch have the highest survival rate before BCR free time turns 90 months. The patients in dataset—Cam have the second highest survival rate until BCR free time turns 60 months, after which the survival rate plumps to 0 when the BCR free time is about 67 months. The patients in dataset—Taylor have the lowest survival time at the beginning and becomes the second in the end. Even though the curves have 95% CIs, it’s still hard to tell if they are significantly different among three datasets.

KM curves among different Gleason Scores (Graph 2)

From Graph 2, it’s uneasy to tell if the survival rates of patients having different Gleason Scores are significantly different. Also, not enough observations for Gleason Scores are 5 and 10.

Section 3: Stratified Logrank Test among three datasets

It’s uneasy to see whether the statistically difference exists among three datasets from Graph 1 since the KM curves cross each other and so on, so conducting a logrank test is a good way to investigate if the probabilities of freedom from BCR among three datasets are significantly different.

Also, a cofounder might exist. After investigation, I found “iCluster” was a cofounder. Firstly, “iCluster” was noticed to be related to the survival under a logrank test. Secondly, it was associated with “dataset” under the Pearson's Chi-squared test.

Therefore, I conducted a logrank test stratified by "iCluster" among the three datasets. From the R output below, we can see the survival rates among these three datasets are significantly different after being stratified by “iCluster”.

survdiff(formula = Surv(Time, Event) ~ dataset + strata(iCluster), 
    data = datacamtaylorstock, rho = 0)

n=329, 14 observations deleted due to missingness.

                 N Observed Expected (O-E)^2/E (O-E)^2/V
dataset=Cam    104       18     13.1     1.840      2.83
dataset=stoch   85       41     51.1     2.010      7.73
dataset=taylor 140       36     30.8     0.889      2.21

 Chisq= 8  on 2 degrees of freedom, p= 0.02

Section 4: Cox’s proportional hazards regression model

Imputation of missing values:

Since “Age”, “PSA” and “ECE” were unavailable in all three datasets, I deleted them. Besides, there were still 2 missing values in “Stage”, 14 Missing values in “iCluster” and 4 missing values in “Gleason”. I imputed all the NA’s in “iCluster” and “Gleason” using Random Forest algorithm and delete the 2 rows with NA’s in “Stage”.

In order to know which demographic factors and genes have influence on BCR free time, I fit several Cox’s proportional hazards regression models using AIC Stepwise method such as the 2 models below:

Model 1: AIC Stepwise Cox’s model using the pooled dataset with imputation: Likelihood ratio test=114.8 on 20 df, p=3e-15

Surv(Time, Event, type = "right") ~ iCluster + Gleason + ENO2 + SYP + CHGA + SLC2A1 + SLC2A6 + SLC2A12 + SLC2A13  + SLC2A14 + HK1 # the model after AIC stepwise
Likelihood ratio test=114.8  on 20 df, p=3e-15
n= 341, number of events= 98

Model 2: adding interaction terms between “dataset” and each gene: Likelihood ratio test=182.5 on 55 df, p=1e-15

 Surv(Time, Event, type = "right") ~ iCluster + Gleason + dataset + ENO2 + NCAM1 + SYP + CHGA + FOLH1 + SLC2A1 +  SLC2A3 + SLC2A4 + SLC2A5 + SLC2A6 + SLC2A7 +SLC2A8 + SLC2A10 + SLC2A12 + GCK + HK1 + dataset:NCAM1 + dataset:SYP + dataset:CHGA + dataset:FOLH1 + dataset:SLC2A1 + dataset:SLC2A3 + dataset:SLC2A5 +dataset:SLC2A6 + dataset:SLC2A7 + dataset:SLC2A8 + dataset:SLC2A12 + dataset:GCK + dataset:HK1 #26(13*2) interaction terms stay in the model
 Likelihood ratio test=182.5  on 55 df, p=1e-15
 n= 341, number of events= 98

Section 7: Investigating which factors and genes have greater effects using Random Survival Forests

I used Random Survival Forest technique to find which factors and genes have greater effects on survival since it applies to right-censored data. I grew 400 bootstrapped trees and implemented log-rank splitting.

Graph below shows the Leave-one-out cross validation error drops fast and then stables at around 29% as the number of trees increases.

We can see the importance of each variable with ranking below. “Gleason” has the greatest effect on survival followed by genes—FOLH1, SLC2A1, SYP, SLC2A8 and so forth.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Project_on_BCR_prostate_cancer.pages		Project_on_BCR_prostate_cancer.pages
Project_on_BCR_prostate_cancer.pdf		Project_on_BCR_prostate_cancer.pdf
R code-BCR prostate cancer project.R		R code-BCR prostate cancer project.R
README.md		README.md
Random Survival Forests-importance.png		Random Survival Forests-importance.png
Random Survival Forests-trees.png		Random Survival Forests-trees.png
graph1.png		graph1.png
graph2.png		graph2.png
table1-continued.png		table1-continued.png
table1.png		table1.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Project_on_BCR_prostate_cancer.pages

Project_on_BCR_prostate_cancer.pages

Project_on_BCR_prostate_cancer.pdf

Project_on_BCR_prostate_cancer.pdf

R code-BCR prostate cancer project.R

R code-BCR prostate cancer project.R

README.md

README.md

Random Survival Forests-importance.png

Random Survival Forests-importance.png

Random Survival Forests-trees.png

Random Survival Forests-trees.png

graph1.png

graph1.png

graph2.png

graph2.png

table1-continued.png

table1-continued.png

table1.png

table1.png

Repository files navigation

Project Overview: Statistical Analysis on Biochemical Recurrence Prostate Cancer Among Patients in 3 Locations

Project Highlights

Code

Project Objective

Project Background

Data Cleaning

Section 1: Conducting univariate analysis among three datasets

Section 2: Kaplan-Meier Curves

Section 3: Stratified Logrank Test among three datasets

Section 4: Cox’s proportional hazards regression model

Section 7: Investigating which factors and genes have greater effects using Random Survival Forests

About

Releases

Packages

Languages

ensembles4612/Statistical-Analysis-on-Biochemical-Recurrence-Prostate-Cancer-Among-Patients-in-3-Locations

Folders and files

Latest commit

History

Repository files navigation

Project Overview: Statistical Analysis on Biochemical Recurrence Prostate Cancer Among Patients in 3 Locations

Project Highlights

Code

Project Objective

Project Background

Data Cleaning

Section 1: Conducting univariate analysis among three datasets

Section 2: Kaplan-Meier Curves

Section 3: Stratified Logrank Test among three datasets

Section 4: Cox’s proportional hazards regression model

Section 7: Investigating which factors and genes have greater effects using Random Survival Forests

About

Topics

Resources

Stars

Watchers

Forks

Languages