Skip to content
Snippets Groups Projects
Commit 77297901 authored by Jacques Rougemont's avatar Jacques Rougemont
Browse files

exercises for week1

parents
No related branches found
No related tags found
No related merge requests found
---
title: "1. Introduction to R"
author: "EPFL - SV - BIO-463"
date: 02/18/2025
date-format: long
eval: false
code-fold: false
format:
html:
embed-resources: true
pdf:
papersize: a4
fig-width: 6
fig-height: 4
format-links: false
---
## The R Programming Language
[R is a programming language](https://www.r-project.org/) used for statistical analysis and data manipulation, widely used by several scientific communities which have contributed a large number of libraries.
R is free (GNU license) and can be used to produce publication-ready graphics.
R binaries and additional packages can be downloaded from several sites:
* [Installation instructions](https://stat.ethz.ch/CRAN/) on the CRAN site.
* [List of CRAN packages](https://stat.ethz.ch/CRAN/web/packages/available_packages_by_name.html)
* [The Bioconductor project](http://www.bioconductor.org/) (bioinformatics software)
* [The rdocumentation site](https://www.rdocumentation.org/)
It can be used on the command-line in a terminal:
```{verbatim}
bash$ R
R version 4.4.2 (2024-10-31) -- "Pile of Leaves"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin20
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> 1+1
[1] 2
> y = c(1, 4, 9, 16, 25)
> sqrt(y)
[1] 1 2 3 4 5
> myFunc = function(x) {1+x^2}
> myFunc(1:10)
[1] 2 5 10 17 26 37 50 65 82 101
> ?log
> q()
```
However we recommend using the RStudio Desktop, an IDE for R available at [posit.co](https://posit.co/downloads/).
![](rstudio-windows.png)
### Installing packages
R comes with many additional packages that provide statistical methods for various types of data analysis.
To install them you use either
* **install.packages** (if it comes from [CRAN](https://stat.ethz.ch/CRAN/))
* **BiocManager::install** (if it comes from [Bioconductor](http://www.bioconductor.org/)):
```{r}
#| label: install packages
install.packages(c("BiocManager", "quarto"))
BiocManager::install("pheatmap")
```
Once a package is installed, you need to load it into your session with the command **library**:
```{r}
#| label: load packages
BiocManager
library(BiocManager)
```
{{< pagebreak >}}
## Exercise 1
The purpose of this exercise is to observe the effect of some common operations in R,
and familiarize yourself with the language and the interface.
Try to change some of the commands and see the effect.
1. Open RStudio.
2. Alternatively you can clone the [same gitlab repository](https://gitlab.epfl.ch/genomics-and-bioinformatics/course-data-2025.git) into your working directory and open the directory from RStudio.
2. Alternatively you can clone the [same gitlab repository](https://gitlab.epfl.ch/genomics-and-bioinformatics/course-data-2025.git) into your working directory and open the directory from RStudio.
3. Open the file [ExercisesWeek1.qmd](https://gitlab.epfl.ch/genomics-and-bioinformatics/course-data-2025/-/blob/main/week1/ExercisesWeek1.qmd) in RStudio (this is the file used to generate the document you are currently reading...)
4. Run the following code blocks and understand what they are doing.
Read the data from the tab-delimited file *GeneExpressionData.txt* (open the file as well to have a look at its content):
```{r}
#| label: load data
data = read.delim("GeneExpressionData.txt", row.names=1)
```
If the file is not found, check your path and use **setwd()** to change to your working directory:
```{r}
#| label: path functions
getwd()
## setwd("/YOUR/PATH/TO/GITLAB/REPO")
dir()
```
First look at the data (notice that rows and columns have names!):
```{r}
#| label: data check
dim(data)
head(data)
data[1:4, ]
data$id
data$C1[1]
data$C2[3:10]
data["ATP2A3",]
vector = data$C1
vector[4]
```
Compute some basic statistics:
```{r}
#| label: summary stats
summary(data)
summary(data$C1)
mean(data$C2)
median(log(data$C2))
sapply(data[1:10,], min)
sapply(data, max)
apply(data, 1, mean)
apply(data, 2, sd)
?sd
```
Elementary data transformation (are all ratios well-defined?):
```{r}
#| label: data manips
any(data$C2==0)
which(data$C2==0)
ratios = log2(data$C1/data$C2)
geomMeans = sqrt(data$C1*data$C2)
```
Plot the data
```{r}
#| label: plots
plot(data$C1, data$C2, log='xy', pch=20, main='', xlab='C1', ylab='C2')
h1 = hist(log2(data$C1), breaks=30, main='', xlab='log2 values')
hist(log2(data$C2), br=h1$breaks, add=T, col=2)
```
If you would like to learn more about R, we suggest two online courses that are focused on Bioinformatics:
* [UCDavis Introduction to R](https://ucdavis-bioinformatics-training.github.io/2021-March-Introduction-to-R-for-Bioinformatics/R/Intro2R_main)
* [SIB first steps with R](https://github.com/sib-swiss/first-steps-with-R-training)
{{< pagebreak >}}
## Exercise 2
In this exercise we will perform a typical gene expression analysis based on a dataset from Leukemia cells:
1. Load the dataset *leukemiaExpressionSubset.rds* (it is in compressed [RDS format](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS)):
```{r}
#| label: load leukemia
library(pheatmap)
data = readRDS("leukemiaExpressionSubset.rds")
```
2. In the file, samples (table columns) are named according to cell type and experiment number.
Let us create an annotation table by splitting the sample type and the sample number in different columns:
```{r}
#| label: extract sample type
colnames(data)
annotations = data.frame(
LeukemiaType = substr(colnames(data),1,3),
row.names = substr(colnames(data),10,13))
colnames(data) = rownames(annotations)
```
3. Log-transform the data, generate scatter plots of sample pairs and a boxplot of the distribution of gene expression values:
```{r}
#| label: pairs and box plots
logdata = log2(data)
## calculate the median per column (dimension no 2)
meddata = apply(logdata, 2, median)
## next subtract this median from each column
logdata = sweep(logdata, 2, meddata, "-")
## choose a color for each cell type
typeCols = c("ALL"='red', "AML"='magenta', "CLL"='blue', "CML"='cyan', "NoL"='gray')
## set some graphical parameters and plot the data
par(las=1, cex=1.1, lwd=2, lty=1, pch=20)
pairs(logdata[,1:5])
boxplot(logdata, las=2, lty=1, lwd=2, col=typeCols[annotations$LeukemiaType], pch=20)
```
4. Create a clustered "heatmap" of the data:
```{r}
#| label: cluster and heatmap
pheatmap(logdata, show_rownames=F, annotation_col=annotations, scale='none',
clustering_distance_cols='correlation', clustering_method='complete',
annotation_colors=list(LeukemiaType=typeCols))
```
5. Save the transformed data to a tab-delimited text file:
```{r}
#| label: save to file
write.table(logdata, file = "testoutput.txt", sep="\t", quote=F)
```
{{< pagebreak >}}
## Exercise 3
In this exercise we will seek information from genomic data portals mentionned in the lecture, and use them to inform our analysis.
We are interested in the gene *BCL2A1* because it has been implicated in many cancers, including Leukemia.
We would like to see if it displays some interesting signal in our data.
Go to the [Ensembl](https://www.ensembl.org/Homo_sapiens)
site and search for the identifier of this gene (should replace the string **ENSG00000XXXXXX** in the code).
Use this identifier to extract the corresponding row from the log-data matrix, and show that it is disregulated in *acute leukemia (ALL, AML)*:
```{r}
#| label: genome browsers
geneid = "ENSG00000XXXXXX"
bcl2a1_expression = as.numeric(logdata[geneid,])
boxplot(bcl2a1_expression~annotations$LeukemiaType)
```
Use the [UCSC genome browser](https://genome-euro.ucsc.edu/) to answer the following questions:
1. Which strand is human BCL2A1 on?
2. How many splice variants (isoforms) exist according to the *NCBI RefSeq* and to *GENCODE*?
3. What is the next protein-coding gene upstream of BCL2A1? and downstream?
4. Can you find a binding site for [NFKB1](https://www.genecards.org/cgi-bin/carddisp.pl?gene=NFKB1) (nuclear factor kappa B subunit 1, a transcription factor) less than 10kb upstream of BCL2A1, within a Dnase-1 hypersensitive site bearing an H3K27ac mark? Hint: look at the tracks *ReMap ChIP-seq* and *Encode Regulation*
id C1 C2
RNF14 0.076370757306 0.076370757306
HIF3A 1.29543988057 0.0791993038729
UBE2Q2 0.0480852916371 0.0480852916371
RNF10 1.23604040267 0.0197998259682
RNF11 6.68818524759 1.23604040267
GPAA1 0.181026980281 0.181026980281
REM1 0.0141427328344 0.0141427328344
REM2 0.0141427328344 0.0141427328344
IMP4 1.32372534624 0.107484769542
C16orf11 0.00848563970067 0.00848563970067
RNF17 0.155570061179 0.155570061179
MZT2A 0.0282854656689 0.0282854656689
MZT2B 0.0339425588027 0.0339425588027
NDP 0.0169712794013 0.0169712794013
PMM2 1.23886894924 1.23886894924
UBE2F 0.132941688644 0.132941688644
PMM1 0.0933420367073 0.0933420367073
ASS1 0.121627502376 1.33786807908
MLLT11 0.00565709313378 1.22189766984
FHIT 0.076370757306 0.076370757306
ZNF709 0.0169712794013 0.0169712794013
ZNF708 0.0113141862676 1.22755476297
ZNF879 0.0282854656689 0.0282854656689
ZNF878 0.00565709313378 0.00565709313378
LRRTM4 0.0480852916371 0.0480852916371
CAMK1 0.0905134901404 0.0905134901404
ZNF701 0.0113141862676 0.0113141862676
ZNF700 2.45228097937 4.06991503737
ZNF707 0.197998259682 0.197998259682
CEACAM3 0.0226283725351 0.0226283725351
ZNF704 0.0650565710384 0.0650565710384
ABHD12B 0.0395996519364 0.0395996519364
AC092118.1 0.00848563970067 0.00848563970067
RNF114 4.49668514472 0.0226283725351
RNF115 0.025456919102 0.025456919102
ZC3H15 0.0537423847709 2.48622353817
ZC3H14 1.29261133401 4.94133306411
SPN 0.0169712794013 0.0169712794013
RNF111 0.050913838204 1.26715441491
IQCA1 0.076370757306 0.076370757306
ZC3H18 1.25301168207 0.0367711053696
GRIN1 0.050913838204 0.050913838204
AC145676.2 0.00282854656689 0.00282854656689
SP8 0.0141427328344 0.0141427328344
DHX9 11.4267579755 11.7983360891
CBX2 2.63524148962 3.1139530947
LRRTM3 0.0311140122358 0.0311140122358
NUP98 0.223455178784 11.5710138504
XPC 0.12728459551 2.55976574891
SP1 4.47733858507 5.14183924696
XPA 0.0424281985033 0.0424281985033
SP3 2.47208080534 5.71337898774
SP4 2.47773789847 0.0452567450702
SP5 0.0113141862676 0.0113141862676
SP6 0.00565709313378 0.00565709313378
CAMKV 5.02053236798 2.58805121458
TACSTD2 0.00282854656689 0.00282854656689
GOLIM4 0.0480852916371 0.0480852916371
OPA3 0.00848563970067 1.2247262164
OPA1 4.98093271605 1.33221098594
RAB40C 0.0169712794013 2.0420288851
RAB40B 0.0169712794013 0.0169712794013
RAB40A 0.0113141862676 0.0113141862676
AC016722.1 0.025456919102 0.025456919102
RP11-190A12.7 0.0 0.0
COL7A1 0.616623151582 0.616623151582
MDP1 0.0735422107391 0.0735422107391
GTSE1 0.0424281985033 6.12363108201
FAM183A 0.0480852916371 0.0480852916371
OVCH2 0.0791993038729 0.0791993038729
OVCH1 0.0791993038729 0.0791993038729
FAM183B 0.00565709313378 0.00565709313378
AKR1D1 0.0452567450702 0.0452567450702
PSAP 18.4408528213 16.1941607248
MYO3A 0.121627502376 0.121627502376
ITGA9 2.52865173668 1.31241115998
UGCG 1.27846860117 1.27846860117
MYO3B 0.0197998259682 0.0197998259682
RP4-559A3.7 0.0 0.0
ATP2A1 0.076370757306 0.076370757306
ATP2A2 5.06206554989 1.22755476297
ATP2A3 0.0197998259682 0.0197998259682
AL844853.1 0.0 0.0
GCG 0.0311140122358 0.0311140122358
ITGA2 0.189512619982 2.62199377338
ITGA3 1.31241115998 0.0961705832742
ITGA4 0.0933420367073 0.0933420367073
ITGA5 0.115970409242 0.115970409242
RIT1 0.0226283725351 0.0226283725351
ITGA7 2.52582319011 0.0933420367073
TRHR 0.0141427328344 0.0141427328344
DENND4A 0.0961705832742 0.0961705832742
DENND4B 0.166884247446 2.59936540085
RP11-303O9.2 0.0 0.0
ELL3 0.0735422107391 0.0735422107391
SWAP70 3.33464021911 1.30958261341
RARRES1 0.0339425588027 0.0339425588027
PHLDA1 0.0113141862676 1.22755476297
PHLDA3 5.2918127072 0.025456919102
File added
week1/rstudio-windows.png

139 KiB

0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment