exercises for week1

77297901 · Jacques Rougemont · 77297901 · 77297901 · 77297901 · 77297901
Commit 77297901 authored 3 months ago by Jacques Rougemont
--- a/week1/ExercisesWeek1.qmd
+++ b/week1/ExercisesWeek1.qmd
+---
+title: "1. Introduction to R"
+author: "EPFL - SV - BIO-463"
+date: 02/18/2025
+date-format: long
+eval: false
+code-fold: false
+format:
+  html:
+    embed-resources: true
+  pdf:
+    papersize: a4
+    fig-width: 6
+    fig-height: 4
+format-links: false
+---
+
+
+## The R Programming Language
+
+[R is a programming language](https://www.r-project.org/) used for statistical analysis and data manipulation, widely used by several scientific communities which have contributed a large number of libraries.
+R is free (GNU license) and can be used to produce publication-ready graphics.
+
+R binaries and additional packages can be downloaded from several sites:
+
+* [Installation instructions](https://stat.ethz.ch/CRAN/) on the CRAN site.
+* [List of CRAN packages](https://stat.ethz.ch/CRAN/web/packages/available_packages_by_name.html)
+* [The Bioconductor project](http://www.bioconductor.org/) (bioinformatics software)
+* [The rdocumentation site](https://www.rdocumentation.org/)
+
+It can be used on the command-line in a terminal:
+```{verbatim}
+bash$ R
+R version 4.4.2 (2024-10-31) -- "Pile of Leaves"
+Copyright (C) 2024 The R Foundation for Statistical Computing
+Platform: x86_64-apple-darwin20
+
+R is free software and comes with ABSOLUTELY NO WARRANTY.
+You are welcome to redistribute it under certain conditions.
+Type 'license()' or 'licence()' for distribution details.
+
+  Natural language support but running in an English locale
+
+R is a collaborative project with many contributors.
+Type 'contributors()' for more information and
+'citation()' on how to cite R or R packages in publications.
+
+Type 'demo()' for some demos, 'help()' for on-line help, or
+'help.start()' for an HTML browser interface to help.
+Type 'q()' to quit R.
+
+> 1+1
+ [1] 2
+> y = c(1, 4, 9, 16, 25)
+> sqrt(y)
+ [1] 1 2 3 4 5
+> myFunc = function(x) {1+x^2}
+> myFunc(1:10)
+ [1]   2   5  10  17  26  37  50  65  82 101
+> ?log
+> q()
+```
+
+However we recommend using the RStudio Desktop, an IDE for R available at [posit.co](https://posit.co/downloads/).
+
+![](rstudio-windows.png)
+
+
+### Installing packages 
+
+R comes with many additional packages that provide statistical methods for various types of data analysis.
+To install them you use either
+
+* **install.packages** (if it comes from [CRAN](https://stat.ethz.ch/CRAN/))
+* **BiocManager::install** (if it comes from [Bioconductor](http://www.bioconductor.org/)):
+
+```{r}
+#| label: install packages
+install.packages(c("BiocManager", "quarto"))
+BiocManager::install("pheatmap")
+```
+
+Once a package is installed, you need to load it into your session with the command **library**:
+```{r}
+#| label: load packages
+BiocManager
+library(BiocManager)
+```
+
+{{< pagebreak >}}
+
+## Exercise 1
+
+The purpose of this exercise is to observe the effect of some common operations in R,
+and familiarize yourself with the language and the interface.
+
+Try to change some of the commands and see the effect.
+
+1. Open RStudio.
+2. Alternatively you can clone the [same gitlab repository](https://gitlab.epfl.ch/genomics-and-bioinformatics/course-data-2025.git) into your working directory and open the directory from RStudio.
+2. Alternatively you can clone the [same gitlab repository](https://gitlab.epfl.ch/genomics-and-bioinformatics/course-data-2025.git) into your working directory and open the directory from RStudio.
+3. Open the file [ExercisesWeek1.qmd](https://gitlab.epfl.ch/genomics-and-bioinformatics/course-data-2025/-/blob/main/week1/ExercisesWeek1.qmd) in RStudio (this is the file used to generate the document you are currently reading...)
+4. Run the following code blocks and understand what they are doing.
+
+Read the data from the tab-delimited file *GeneExpressionData.txt* (open the file as well to have a look at its content):
+```{r}
+#| label: load data
+data = read.delim("GeneExpressionData.txt", row.names=1)
+```
+If the file is not found, check your path and use **setwd()** to change to your working directory:
+```{r}
+#| label: path functions
+getwd()
+##  setwd("/YOUR/PATH/TO/GITLAB/REPO")
+dir()
+```
+First look at the data (notice that rows and columns have names!):
+```{r}
+#| label: data check
+dim(data)
+head(data)
+data[1:4, ]
+data$id 
+data$C1[1]
+data$C2[3:10]
+data["ATP2A3",]
+vector = data$C1
+vector[4]
+```
+Compute some basic statistics:
+```{r}
+#| label: summary stats
+summary(data)
+summary(data$C1)
+mean(data$C2)
+median(log(data$C2))
+sapply(data[1:10,], min)
+sapply(data, max)
+apply(data, 1, mean)
+apply(data, 2, sd)
+
+?sd
+```
+Elementary data transformation (are all ratios well-defined?):
+```{r}
+#| label: data manips
+any(data$C2==0) 
+which(data$C2==0)
+ratios = log2(data$C1/data$C2)
+geomMeans = sqrt(data$C1*data$C2)
+```
+Plot the data
+```{r}
+#| label: plots
+plot(data$C1, data$C2, log='xy', pch=20, main='', xlab='C1', ylab='C2')
+h1 = hist(log2(data$C1), breaks=30, main='', xlab='log2 values')
+hist(log2(data$C2), br=h1$breaks, add=T, col=2)
+```
+
+If you would like to learn more about R, we suggest two online courses that are focused on Bioinformatics:
+
+ * [UCDavis Introduction to R](https://ucdavis-bioinformatics-training.github.io/2021-March-Introduction-to-R-for-Bioinformatics/R/Intro2R_main)
+ * [SIB first steps with R](https://github.com/sib-swiss/first-steps-with-R-training)
+
+{{< pagebreak >}}
+
+## Exercise 2
+
+In this exercise we will perform a typical gene expression analysis based on a dataset from Leukemia cells:
+
+1. Load the dataset *leukemiaExpressionSubset.rds* (it is in compressed [RDS format](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS)):
+```{r}
+#| label: load leukemia
+library(pheatmap)
+data = readRDS("leukemiaExpressionSubset.rds")
+```
+2. In the file, samples (table columns) are named according to cell type and experiment number.
+Let us create an annotation table by splitting the sample type and the sample number in different columns:
+```{r}
+#| label: extract sample type
+colnames(data)
+annotations = data.frame(
+            LeukemiaType = substr(colnames(data),1,3),
+            row.names = substr(colnames(data),10,13))
+colnames(data) = rownames(annotations)
+```
+3. Log-transform the data, generate scatter plots of sample pairs and a boxplot of the distribution of gene expression values:
+```{r}
+#| label: pairs and box plots
+logdata = log2(data)
+## calculate the median per column (dimension no 2)
+meddata = apply(logdata, 2, median)
+## next subtract this median from each column
+logdata = sweep(logdata, 2, meddata, "-")
+## choose a color for each cell type
+typeCols = c("ALL"='red', "AML"='magenta', "CLL"='blue', "CML"='cyan', "NoL"='gray')
+## set some graphical parameters and plot the data
+par(las=1, cex=1.1, lwd=2, lty=1, pch=20)
+pairs(logdata[,1:5])
+boxplot(logdata, las=2, lty=1, lwd=2, col=typeCols[annotations$LeukemiaType], pch=20)
+```
+4. Create a clustered "heatmap" of the data:
+```{r}
+#| label: cluster and heatmap
+pheatmap(logdata, show_rownames=F, annotation_col=annotations, scale='none', 
+         clustering_distance_cols='correlation', clustering_method='complete',
+         annotation_colors=list(LeukemiaType=typeCols))
+```
+5. Save the transformed data to a tab-delimited text file:
+```{r}
+#| label: save to file
+write.table(logdata, file = "testoutput.txt", sep="\t", quote=F)
+```
+
+{{< pagebreak >}}
+
+## Exercise 3
+
+In this exercise we will seek information from genomic data portals mentionned in the lecture, and use them to inform our analysis.
+
+We are interested in the gene *BCL2A1* because it has been implicated in many cancers, including Leukemia.
+We would like to see if it displays some interesting signal in our data.
+
+Go to the [Ensembl](https://www.ensembl.org/Homo_sapiens)
+site and search for the identifier of this gene (should replace the string **ENSG00000XXXXXX** in the code).
+Use this identifier to extract the corresponding row from the log-data matrix, and show that it is disregulated in *acute leukemia (ALL, AML)*:
+```{r}
+#| label: genome browsers
+geneid = "ENSG00000XXXXXX"
+bcl2a1_expression = as.numeric(logdata[geneid,])
+boxplot(bcl2a1_expression~annotations$LeukemiaType)
+```
+
+Use the [UCSC genome browser](https://genome-euro.ucsc.edu/) to answer the following questions:
+
+1. Which strand is human BCL2A1 on?
+2. How many splice variants (isoforms) exist according to the *NCBI RefSeq* and to *GENCODE*?
+3. What is the next protein-coding gene upstream of BCL2A1? and downstream?
+4. Can you find a binding site for [NFKB1](https://www.genecards.org/cgi-bin/carddisp.pl?gene=NFKB1) (nuclear factor kappa B subunit 1, a transcription factor) less than 10kb upstream of BCL2A1, within a Dnase-1 hypersensitive site bearing an H3K27ac mark? Hint: look at the tracks *ReMap ChIP-seq* and *Encode Regulation*
--- a/week1/GeneExpressionData.txt
+++ b/week1/GeneExpressionData.txt
+id	C1	C2
+RNF14	0.076370757306	0.076370757306
+HIF3A	1.29543988057	0.0791993038729
+UBE2Q2	0.0480852916371	0.0480852916371
+RNF10	1.23604040267	0.0197998259682
+RNF11	6.68818524759	1.23604040267
+GPAA1	0.181026980281	0.181026980281
+REM1	0.0141427328344	0.0141427328344
+REM2	0.0141427328344	0.0141427328344
+IMP4	1.32372534624	0.107484769542
+C16orf11	0.00848563970067	0.00848563970067
+RNF17	0.155570061179	0.155570061179
+MZT2A	0.0282854656689	0.0282854656689
+MZT2B	0.0339425588027	0.0339425588027
+NDP	0.0169712794013	0.0169712794013
+PMM2	1.23886894924	1.23886894924
+UBE2F	0.132941688644	0.132941688644
+PMM1	0.0933420367073	0.0933420367073
+ASS1	0.121627502376	1.33786807908
+MLLT11	0.00565709313378	1.22189766984
+FHIT	0.076370757306	0.076370757306
+ZNF709	0.0169712794013	0.0169712794013
+ZNF708	0.0113141862676	1.22755476297
+ZNF879	0.0282854656689	0.0282854656689
+ZNF878	0.00565709313378	0.00565709313378
+LRRTM4	0.0480852916371	0.0480852916371
+CAMK1	0.0905134901404	0.0905134901404
+ZNF701	0.0113141862676	0.0113141862676
+ZNF700	2.45228097937	4.06991503737
+ZNF707	0.197998259682	0.197998259682
+CEACAM3	0.0226283725351	0.0226283725351
+ZNF704	0.0650565710384	0.0650565710384
+ABHD12B	0.0395996519364	0.0395996519364
+AC092118.1	0.00848563970067	0.00848563970067
+RNF114	4.49668514472	0.0226283725351
+RNF115	0.025456919102	0.025456919102
+ZC3H15	0.0537423847709	2.48622353817
+ZC3H14	1.29261133401	4.94133306411
+SPN	0.0169712794013	0.0169712794013
+RNF111	0.050913838204	1.26715441491
+IQCA1	0.076370757306	0.076370757306
+ZC3H18	1.25301168207	0.0367711053696
+GRIN1	0.050913838204	0.050913838204
+AC145676.2	0.00282854656689	0.00282854656689
+SP8	0.0141427328344	0.0141427328344
+DHX9	11.4267579755	11.7983360891
+CBX2	2.63524148962	3.1139530947
+LRRTM3	0.0311140122358	0.0311140122358
+NUP98	0.223455178784	11.5710138504
+XPC	0.12728459551	2.55976574891
+SP1	4.47733858507	5.14183924696
+XPA	0.0424281985033	0.0424281985033
+SP3	2.47208080534	5.71337898774
+SP4	2.47773789847	0.0452567450702
+SP5	0.0113141862676	0.0113141862676
+SP6	0.00565709313378	0.00565709313378
+CAMKV	5.02053236798	2.58805121458
+TACSTD2	0.00282854656689	0.00282854656689
+GOLIM4	0.0480852916371	0.0480852916371
+OPA3	0.00848563970067	1.2247262164
+OPA1	4.98093271605	1.33221098594
+RAB40C	0.0169712794013	2.0420288851
+RAB40B	0.0169712794013	0.0169712794013
+RAB40A	0.0113141862676	0.0113141862676
+AC016722.1	0.025456919102	0.025456919102
+RP11-190A12.7	0.0	0.0
+COL7A1	0.616623151582	0.616623151582
+MDP1	0.0735422107391	0.0735422107391
+GTSE1	0.0424281985033	6.12363108201
+FAM183A	0.0480852916371	0.0480852916371
+OVCH2	0.0791993038729	0.0791993038729
+OVCH1	0.0791993038729	0.0791993038729
+FAM183B	0.00565709313378	0.00565709313378
+AKR1D1	0.0452567450702	0.0452567450702
+PSAP	18.4408528213	16.1941607248
+MYO3A	0.121627502376	0.121627502376
+ITGA9	2.52865173668	1.31241115998
+UGCG	1.27846860117	1.27846860117
+MYO3B	0.0197998259682	0.0197998259682
+RP4-559A3.7	0.0	0.0
+ATP2A1	0.076370757306	0.076370757306
+ATP2A2	5.06206554989	1.22755476297
+ATP2A3	0.0197998259682	0.0197998259682
+AL844853.1	0.0	0.0
+GCG	0.0311140122358	0.0311140122358
+ITGA2	0.189512619982	2.62199377338
+ITGA3	1.31241115998	0.0961705832742
+ITGA4	0.0933420367073	0.0933420367073
+ITGA5	0.115970409242	0.115970409242
+RIT1	0.0226283725351	0.0226283725351
+ITGA7	2.52582319011	0.0933420367073
+TRHR	0.0141427328344	0.0141427328344
+DENND4A	0.0961705832742	0.0961705832742
+DENND4B	0.166884247446	2.59936540085
+RP11-303O9.2	0.0	0.0
+ELL3	0.0735422107391	0.0735422107391
+SWAP70	3.33464021911	1.30958261341
+RARRES1	0.0339425588027	0.0339425588027
+PHLDA1	0.0113141862676	1.22755476297
+PHLDA3	5.2918127072	0.025456919102
--- a/week1/leukemiaExpressionSubset.rds
+++ b/week1/leukemiaExpressionSubset.rds
--- a/week1/rstudio-windows.png
+++ b/week1/rstudio-windows.png