diff --git a/week2/ExercisesWeek2.qmd b/week2/ExercisesWeek2.qmd new file mode 100644 index 0000000000000000000000000000000000000000..dfbb144923e6803badca638b391e96cb7b8fa566 --- /dev/null +++ b/week2/ExercisesWeek2.qmd @@ -0,0 +1,171 @@ +--- +title: "2. Genome assembly, sequence alignments" +author: "EPFL - SV - BIO-463" +date: 02/25/2025 +date-format: long +eval: false +code-fold: false +format: + html: + code-fold: false + embed-resources: true + pdf: + papersize: a4 + fig-width: 8 + fig-height: 4 +format-links: false +--- + +## Working environment + +Install R packages we will use during the course +```{r} +install.packages(c("BiocManager", "aphid", "seqHMM", "phytools", "quarto", + "gplots", "RColorBrewer", "ape", "ggplot2", "ggfortify", + "gprofiler2", "Seurat", "dplyr", "patchwork")) +BiocManager::install(c("Biostrings", "Rgraphviz", "pheatmap", "biomaRt", "limma", + "GenomicFeatures", "DESeq2", "PCAtools", "edgeR", "pwalign", + "clusterProfiler", "msigdbr", "annotables", "enrichplot")) +``` + +Load the relevant packages for today: +```{r} +library("Biostrings") +library("Rgraphviz") +library("biomaRt") +``` + +## Exercise 1: Eulerian path in a graph +### Sequencing reads + +The file *SequencingReads.txt* contains 22 sequencing reads. These have been prepared so that they are all in the same orientation and the overlap is 4 nucleotides for every pair of consecutive reads. + +* Load the data as a **[DNAStringSet](https://rdocumentation.org/packages/Biostrings/versions/2.40.2/topics/XStringSet-class)** + - the basic command **scan("filename", "character")** returns the content of a file as an array of strings (one string per line in the file) + - **DNAStringSet** then converts this array into an object that can be manipulated as a set of DNA sequences + +```{r} +#### scan() the file as a list of "character" strings +datafile = +#### check the content +head(datafile) +#### convert to a DNAStringSet +reads = DNAStringSet(datafile) +#### check the result +reads +#### nb of reads, size of reads +numreads = length(reads) +readlength = width(reads[1]) +``` + +### Constructing the graph + +In this part of the exercise, we will construct the Eulerian graph using the +**[graphAM](https://www.rdocumentation.org/packages/graph/versions/1.50.0/topics/graphAM-class) class**. + +The nodes are the overlaps (unique 4-mers from both ends of reads) and the graph edges are the reads +(directed links between the start node and the end node of each read). + +To make this construction, follow these steps: + +1. Make lists of all overlapping 4-mers, one for the start and one for the end of each read, +using the **subseq** method on a **DNAStringSet** and convert the result to simple strings with **as.character**. +2. Create the list of nodes: apply **sort** and **unique** to the combined list of starts and ends. +3. Create the edge labels (called **edgelabels** in the code below): an edge connecting 'x' to 'y' is labelled 'x~y' (see [plot.graph](https://rdocumentation.org/packages/Rgraphviz/versions/2.16.0/topics/plot-methods)). +This can be done with **paste(a, b, sep='~')** where **a**, **b** are lists of strings. + +```{r} +readstarts = as.character(subseq(.....)) +readends = as.character(subseq(.....)) +#### construct the list of graph nodes: combines starts and ends, sorted and non-redundant: +nodes = sort(unique(c(....))) +numnodes = length(nodes) +#### one graph edge for each read +edgelabels = 1:numreads +#### the graph class requires edge names in the form "acgt~tacg" for an edge from node "acgt" to node "tacg" +names(edgelabels) = paste(?,?, sep='~') +``` + +4. Create the adjacency matrix **A**: its columns and rows represent nodes (4-mers) from the **nodes** list, +and the matrix element **A[x,y]** is the number of reads connecting **x** to **y**. +```{r} +#### Adjacency matrix numnodes x numnodes, rows and cols are named with elements of "nodes" +A = matrix(0, nrow=numnodes, ncol=numnodes, dimnames=list(nodes, nodes)) +for (n in 1:numreads) { +#### read no n corresponds to a link readstarts[n] -> readends[n] +#### by construction the row and column names of A are the strings in readstarts and readends +} +``` + +5. Now display the corresponding graph as follows (you can then play with the options to improve the looks of your graph): +```{r} +#### define the graph +grEuler = graphAM(adjMat=A, values=list(weight=1), edgemode="directed") +#### customize display parameters (optional!) +grattr = getDefaultAttrs() +grattr$graph$size = c(7,10) +grattr$edge$minlen = 1.5 +grattr$edge$arrowsize=.5 +grattr$edge$labelfontsize = 18 +#### display it +edgeattr = list(label=edgelabels) +plot(grEuler, edgeAttrs=edgeattr, attrs=grattr) +``` + +### Eulerian path and contig + +1. Manually find the eulerian path in the graph picture and write down the edge numbers in the order of their visit. +Define a vector with these numbers and use it to order the reads accordingly. +2. Generate the corresponding contig (the assembled genome string) +```{r} +edgepath = c(.....) ### edge numbers +sortedreads = reads[edgepath] +### This will concatenate (paste) all reads together, but the overlap is now repeated twice at each junction: +### paste("ACTGGG", "GGGTAT") = "ACTGGGGGGTAT", it should be "ACTGGGTAT" +### Correct the code below to retain only one copy of each overlap: +contig = DNAStringSet(paste(sortedreads, collapse='')) +``` +3. Save that sequence as a **[Fasta](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp)** file +```{r} +### Give the sequence a name: will be useful later +names(contig) = "Week2Contig" +writeXStringSet(contig, "Week2Contig.fa") +``` +4. Go the the [NCBI blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi) and run the following operations: + - Choose "Nucleotide Blast" + - Upload your fasta + - Choose "refseq_rna" database + - Organism: Vertebrata + - Run "Blast" +5. Which gene of which species have you assembled? + +### Nucleotide statistics + +* Create a [barplot](https://rdocumentation.org/packages/graphics/versions/3.6.2/topics/barplot) of the sliding window base frequencies using +[letterFrequencyInSlidingView](https://rdocumentation.org/packages/Biostrings/versions/2.40.2/topics/letterFrequency) of your contig +with a window size of 20 nucleotides. The barplot should represent base frequencies (vertically) as a function of window position (horizontally). + + +## Exercise 2: Align your contig to the Chicken genome + +1. Start from the result of Exercise 1: load the assembled contig as a **DNAString** +2. Download the transcript found at **chr21:295510-295980** on the Chicken genome using +[biomaRt](http://mart.ensembl.org/info/data/biomart/biomart_r_package.html) and keep only the part between nucleotides **335** and **806** +```{r} +ensembl = useMart("ensembl", "ggallus_gene_ensembl") +#### Fill in the coordinates +ensembl_qry = getSequence(chromosome=..., start=..., end=..., type=c("uniprot_gn_symbol","start_position","strand"), seqType="cdna", mart=ensembl) +gg_seq = subseq(DNAStringSet(ensembl_qry$cdna), ..., ...) +``` +3. Align these two sequences using a match score of *+1*, mismatch score of *-1* and gap penalty of *-2* (you need to define the missing variables here: +see [scoring matrices](https://www.rdocumentation.org/packages/Biostrings/versions/2.40.2/topics/substitution.matrices)) + +```{r} +#### This should be your contig (load the file): +qry_seq = readDNAStringSet(...) +#### define scMatrix and gapPenalty +pairwiseAlignment(qry_seq[[1]], gg_seq[[1]], substitutionMatrix=scMatrix, gapOpening=0, gapExtension=gapPenalty) +``` +4. Go to [UCSC's Blat alignment page](http://www.genome.ucsc.edu/cgi-bin/hgBlat), select **chicken** genome (galGal6 release) and paste (or upload) the contig sequence. +See to which genomic region it aligns to and observe the intron/exon structure. + diff --git a/week2/SequencingReads.txt b/week2/SequencingReads.txt new file mode 100644 index 0000000000000000000000000000000000000000..8d66b59514dd7f2df3783898615b36396c921368 --- /dev/null +++ b/week2/SequencingReads.txt @@ -0,0 +1,22 @@ +GGGGTGGTAAAGATCAAGAACAAGA +TCTACACGCACACTGATCAGTCCCA +AGAAGAAAGTGTTGGACTCCTCCCG +AGTCCTTTTTCTGCAGGATCAGGGG +CTGACACGTTTGTGGCTGTGTTTTC +ACCGAAGAGTGTTTCAGCCAGATGT +AAGAAACACGTTGTTACCCCTTCCG +ATGTGGATGCAGAATGCTGCTGCTT +ATGTGATGACATACAGCATAGAAGA +AAGAGTTGGTGGCAGTTGCATCTGA +TGTATGAATATGCCCAGGTGAAGTC +CAGCATCCATTTTAAACTGTAAGAA +TAGTGCATATTTCTGAGCAGGCAGC +TCCCTCAGGATGTGAGTGTGTTCTA +TTTCCTTGCTTTCGGGACACATAGT +TCCATGTGTGTACTTCTGTCCATGT +TGGAAAGTCAAACAGCTTCTCTGTA +CTGAACACACTCCAAAAAACACTGA +CCCGGTTTGTAGAGCTGCTTGTCCC +TCCGAATTACCCCATACTTGGTCCA +CCCACCTGCCGCTTTGGAACATGGA +ACTTTTTCCAGGCTCTGAATGACCG