1 Aims

  • Create a list of mutated MHCI T-cell epitopes those appear in the COVEO database

2 Summary

  • There are 653 epitopes in the original downloaded epitope list. There are 190 out of 653 epitopes those were published more than a single paper. 184 out of 190 epitopes derived from reference sequence (this was used in this report, later the extra 6 mutated epitopes will be included too) see
  • Position of analyzed T cell epitopes on Spike reference sequence see
  • Barcode was assigned to each samples. A given barcode is basically a mutation pattern (amino acide level) of S protein. There are 346976 barcodes in the database. There are only 1972 barcodes that appeared more than 100 samples in the database (only these are included the analysis). 1784108 samples are assigned to these different 1972 barcodes.
  • The selected epitopes and barcodes were matched. A table was created that contains the 935 mutated T cell epitopes see. Downloadable table: “/v/projects/geno2pheno-kpapp/kpapp/t_cell_epitop/data/mutated_epitopes.tsv”

3 Background

  • T cell epitopes were downloaded (https://www.iedb.org/) by David: (“/v/projects/geno2pheno-kpapp/dnieuw/MHCI-Spike-epitopes_metadata.tsv”)

  • The COVEO database will be searched for these epitopes and check the presences of their mutated version

4 Analysis

library(tidyverse)
library(openxlsx)
library(DBI)
config <- config::get()
con <- dbConnect(
  drv = RPostgreSQL::PostgreSQL(),
  dbname = config$dbname,
  host = config$host,
  port = config$port,
  user = config$user,
  password = config$password,
  option = config$option
)
selected_schema <- str_remove(config$option, pattern = "-c search_path=")

4.1 T cell epitope table

The table below is the original downloaded table:

f <- "/v/projects/geno2pheno-kpapp/dnieuw/MHCI-Spike-epitopes.tsv"
epitope_raw_ref <- read_tsv(f, show_col_types = FALSE)[, c(1,5,6)]
names(epitope_raw_ref) <- c("id", "references", "assays")
epitope_raw_ref$id <- as.character(epitope_raw_ref$id)

f <- "/v/projects/geno2pheno-kpapp/dnieuw/MHCI-Spike-epitopes_metadata.tsv"
epitope_raw <- read_tsv(f, show_col_types = FALSE)[1:17]
names(epitope_raw) <- c("iedb_iri", "object_type", "name", "modified_residue", 
                    "modifications", "start", "end", "iri", "synonyms",
                    "source_molecule", "source_molecule_iri", "molecule_parent",
                    "molecule_parent_iri", "source_organism",
                    "source_organism_iri", "species", "species_iri")
epitope_raw <- epitope_raw %>%
  separate(col = iedb_iri, into = c(NA, NA, NA, NA, "id"),
           sep = "/", remove = FALSE) %>%
  inner_join(epitope_raw_ref, by = "id") %>%
  select(id, name, start, end, references, assays, everything())
  
epitope_raw
  • Table above shows that there are 653 epitopes in the raw epitope table

Need to concentrate only those epitopes those were present in more than 1 presentation

epitope <- epitope_raw %>%
  filter(references>1)
epitope
  • There are 190 out of 653 epitopes those were published more than a single paper
epitope%>%
  filter(is.na(start))
  • There is a single epitope (1593848) where the start and end position is missing. Based on David’s alignment (“/v/projects/geno2pheno-kpapp/dnieuw/MHCI-Spike-epitopes_aligned.faa”), the position of this epitope is similar to https://www.iedb.org/epitope/1309147 epitope. There difference is a single mutation at amino acide level. So this epitope does not derive from reference sequence but the start-end position will be included the table:
epitope[epitope$id=="1593848", "start"] <- 269
epitope[epitope$id=="1593848", "end"] <- 277

Need to check if epitope sequences derived from the reference sequence.

The reference sequence is here: https://www.ncbi.nlm.nih.gov/nuccore/MN908947 Spike protein (QHD43416.1) is coded in the reference sequence position: 21563-25384

Database will be searched for the reference Spike protein sequence:

select * from datahub_0.translated_aa_ref_sequence(21563, 25384)
  • Note: the “datahub_0.barcodes_with_mutation_positions({start}, {end})” SQL function translate the reference nucleotide sequence into protein sequence (the parameters are the start and end positions of reference nucleotide sequence)
spike_seq <- str_c(spike_seq$aa_symbol, collapse = "")
spike_seq
[1] "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"

Check if the name column of the epitope table contains really the reference sequence.

for (i in 1:nrow(epitope)) {
  ref <- str_sub(spike_seq,
                 start = epitope[i, 'start'],
                 end = epitope[i, 'end'])
  if (epitope[i, 'name'] == ref) epitope[i, 'match_with_ref'] <- TRUE else epitope[i, 'match_with_ref'] <- FALSE
}

Lets see the result of the comparison

table(epitope$match_with_ref)

FALSE  TRUE 
    6   184 

There are 6 epitopes where the name column does not match with the reference.

These are the epitopes those do not match with reference sequence:

epitope[epitope$match_with_ref==FALSE,]
  • These epitopes above are important one, but now for simplicity I will exclude them, later these epitopes will be checked.

4.1.1 Table of T cell epitopes included the analysis

The table below contains the epitopes those derived from reference sequence:

epitope_selected <- epitope %>%
  filter(match_with_ref) 
epitope_selected

4.1.2 Position of epitopes on reference spike sequence

x <- epitope_selected %>%
  select(id, start, end) %>%
  mutate(start = as.numeric(start),
         end = as.numeric(end)) %>%
  arrange(start)
x$y <- 1:nrow(x)
x <- x %>%
  pivot_longer(cols = start:end, values_to = "x")
ggplot(data = x, aes(x = x, y = y, fill = id)) +
  geom_line(alpha = 0.4) +
  labs(title = "Distribution of T-cell epitopes on spike protein",
       x = "Amino acide position",
       y = "") +
  theme_minimal() +
  theme(aspect.ratio=2/6)

  • Each short horizontal line represent a T cell epitope

4.2 Barcode table

Barcode was assigned to each samples. A given barcode is basically a mutation pattern (amino acide level) of S protein.

How many different S protein exist now in the database?

SELECT count(*)
FROM barcode
  • There are 346976 barcodes in the database.

I will use only those barcodes that were assigned to more than 100 samples:

WITH barcode_dist AS (
     SELECT barcode_id, count(*) AS sample_number_in_bacode
          FROM aa_mutation_new AS a
          GROUP BY a.barcode_id
)

SELECT sample_number_in_bacode, count(*) AS number_of_barcode
     FROM barcode_dist
     GROUP BY sample_number_in_bacode
     ORDER BY number_of_barcode DESC
barcode_hist %>%
  filter(sample_number_in_bacode > 100) %>%
  summarise(`Number of barcodes` = sum(number_of_barcode))

There are 1972 barcodes that appeared more than 100 samples in the database.

barcode_hist %>%
  filter(sample_number_in_bacode > 100) %>%
  summarise(`Number of samples belongs to the selected barcodes` = sum(sample_number_in_bacode))
  • Around 1.8 million samples will be included into this analysis

List of barcodes that will be included this analysis:

WITH barcode_dist AS (
     SELECT barcode_id, count(*) AS sample_number_in_bacode
          FROM aa_mutation_new AS a
          GROUP BY a.barcode_id
)

SELECT barcode_id, sample_number_in_bacode, aa_mutation
     FROM barcode_dist
     LEFT JOIN barcode ON barcode.id=barcode_dist.barcode_id
     WHERE barcode_dist.sample_number_in_bacode > 100

Head of the table:

head(barcodes)

4.3 Match barcodes with T cell epitopes

mutation <- barcodes %>%
  mutate(mut = aa_mutation) %>%
  mutate(mut = str_remove(mut, pattern = "\\{"),
         mut = str_remove(mut, pattern = "\\}"),
         mut = str_remove_all(mut, pattern = "p\\.")) %>%
  mutate(n = (str_count(mut, pattern = ",") + 1))

max_mut <- max(mutation$n)
ids <- str_c("x", 1:max_mut)

mutation<- mutation%>%
  separate(mut, into = ids, sep = ",", fill  = "right") %>%
  pivot_longer(cols = all_of(ids), values_to = "mut",
               names_to = "mut_pos_in_barcode") %>%
  drop_na(mut) %>%
  mutate(pos = as.numeric(str_sub(mut, start = 4, end = -4)))

head(mutation)
  • Table above shows the positions of mutations in the selected barcodes.

Select those epitopes where start-end position includes any of the mutation positions above.

mutation_all <- tibble(barcode_id = integer(),
                       sample_number_in_bacode = numeric(),
                       aa_mutation = character(),
                       n = numeric(),
                       mut_pos_in_barcode = character(),
                       mut = character(),
                       pos = numeric(),
                       start = numeric(),
                       end = numeric(),
                       iedb_id= character())

for (i in row.names(epitope_selected)) {
  x <- mutation %>%
    mutate(start = as.numeric(epitope_selected[i, "start"]),
           end = as.numeric(epitope_selected[i, "end"]),
           iedb_id = as.character(epitope_selected[i, "id"])) %>%
    mutate(is_mutation_present = ifelse((pos >= start & pos <= end), TRUE, FALSE)) %>%
    filter(is_mutation_present) %>%
    select(-is_mutation_present)
  mutation_all <- mutation_all %>%
    add_row(x)
}
mutation_fin <- mutation_all_mod %>%
  select(barcode_id, iedb_id, mut) %>%
  group_by(barcode_id, iedb_id) %>%
  summarize(mutation_list = str_c(mut, collapse = ",")) %>%
  inner_join(x, by = c("barcode_id", "iedb_id")) %>%
  mutate(mutation_list = str_replace_all(mutation_list, pattern = 'Ala', replacement = 'A'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Arg', replacement = 'R'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Asn', replacement = 'N'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Asp', replacement = 'D'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Cys', replacement = 'C'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Gln', replacement = 'Q'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Glu', replacement = 'E'),
         mutation_list = str_replace_all(mutation_list, pattern = 'His', replacement = 'H'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Ile', replacement = 'I'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Leu', replacement = 'L'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Lys', replacement = 'K'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Met', replacement = 'M'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Phe', replacement = 'F'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Pro', replacement = 'P'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Ser', replacement = 'S'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Thr', replacement = 'T'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Trp', replacement = 'W'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Tyr', replacement = 'Y'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Val', replacement = 'V'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Gly', replacement = 'G')
         )
`summarise()` has grouped output by 'barcode_id'. You can override using the `.groups` argument.

There are 38155 mutated epitopes in the table above, but many of them appear in more than one times (if the same mutated epitope appears in more than one barcode, then it it will be present more than one row)

4.3.1 Mutated epitope table

mutated_epitope_table <- mutation_fin %>%
  ungroup() %>%
  select(iedb_id, ref_epitope, mut_epitope) %>%
  distinct()
mutated_epitope_table  
  • The table above contains 935 mutated epitopes. This is the final list, these need to be checked if their binding affinity changes as a result of mutation
write_tsv(mutated_epitope_table, file = "/v/projects/geno2pheno-kpapp/kpapp/t_cell_epitop/data/mutated_epitopes.tsv")
write_tsv(mutation_fin, file = "/v/projects/geno2pheno-kpapp/kpapp/t_cell_epitop/data/mutated_epitopes_barcodes.tsv")
write_tsv(mutation_all, file = "/v/projects/geno2pheno-kpapp/kpapp/t_cell_epitop/data/mutated_epitopes_barcodes_all.tsv")
---
title: "MHCI T-cell epitopes from Spike protein"
author: "Krisztian Papp"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_notebook: 
    collapsed: false
    fig_caption: yes
    fig_height: 4
    fig_width: 5
    smooth_scroll: yes
    theme: sandstone
    toc: yes
    toc_depth: 5
    toc_float: yes
    number_sections: true
---

# Aims

* Create a list of mutated MHCI T-cell epitopes those appear in the COVEO database

# Summary

* There are 653 epitopes in the original downloaded epitope list. There are 190 out of 653 epitopes those were published more than a single paper. 184 out of 190 epitopes derived from reference sequence (this was used in this report, later the extra 6 mutated epitopes will be included too) [see](#a1)
* Position of analyzed T cell epitopes on Spike reference sequence [see](#a2)
* Barcode was assigned to each samples. A given barcode is basically a mutation pattern (amino acide level) of S protein. There are 346976 barcodes in the database. There are only 1972 barcodes that appeared more than 100 samples in the database (only these are included the analysis). 1784108 samples are assigned to these different 1972 barcodes.
* The selected epitopes and barcodes were matched. A table was created that contains the 935 mutated T cell epitopes [see](#a3). Downloadable table: "/v/projects/geno2pheno-kpapp/kpapp/t_cell_epitop/data/mutated_epitopes.tsv"

# Background

* T cell epitopes were downloaded (https://www.iedb.org/) by David:  ("/v/projects/geno2pheno-kpapp/dnieuw/MHCI-Spike-epitopes_metadata.tsv")

* The COVEO database will be searched for these epitopes and check the presences of their mutated version

# Analysis

```{r}
library(tidyverse)
library(openxlsx)
library(DBI)
```

```{r}
config <- config::get()
con <- dbConnect(
  drv = RPostgreSQL::PostgreSQL(),
  dbname = config$dbname,
  host = config$host,
  port = config$port,
  user = config$user,
  password = config$password,
  option = config$option
)
selected_schema <- str_remove(config$option, pattern = "-c search_path=")
```

## T cell epitope table

The table below is the original downloaded table:

```{r}
f <- "/v/projects/geno2pheno-kpapp/dnieuw/MHCI-Spike-epitopes.tsv"
epitope_raw_ref <- read_tsv(f, show_col_types = FALSE)[, c(1,5,6)]
names(epitope_raw_ref) <- c("id", "references", "assays")
epitope_raw_ref$id <- as.character(epitope_raw_ref$id)

f <- "/v/projects/geno2pheno-kpapp/dnieuw/MHCI-Spike-epitopes_metadata.tsv"
epitope_raw <- read_tsv(f, show_col_types = FALSE)[1:17]
names(epitope_raw) <- c("iedb_iri", "object_type", "name", "modified_residue", 
                    "modifications", "start", "end", "iri", "synonyms",
                    "source_molecule", "source_molecule_iri", "molecule_parent",
                    "molecule_parent_iri", "source_organism",
                    "source_organism_iri", "species", "species_iri")
epitope_raw <- epitope_raw %>%
  separate(col = iedb_iri, into = c(NA, NA, NA, NA, "id"),
           sep = "/", remove = FALSE) %>%
  inner_join(epitope_raw_ref, by = "id") %>%
  select(id, name, start, end, references, assays, everything())
  
epitope_raw
```
* Table above shows that there are 653 epitopes in the raw epitope table

Need to concentrate only those epitopes those were present in more than 1 presentation

```{r}
epitope <- epitope_raw %>%
  filter(references>1)
epitope
```
* There are 190 out of 653 epitopes those were published more than a single paper

```{r}
epitope%>%
  filter(is.na(start))
```
* There is a single epitope (`1593848`) where the start and end position is missing. Based on David's alignment ("/v/projects/geno2pheno-kpapp/dnieuw/MHCI-Spike-epitopes_aligned.faa"), the position of this epitope is similar to https://www.iedb.org/epitope/1309147 epitope. There difference is a single mutation at amino acide level. So this epitope does not derive from reference sequence but the start-end position will be included the table:


```{r}
epitope[epitope$id=="1593848", "start"] <- 269
epitope[epitope$id=="1593848", "end"] <- 277
```

Need to check if epitope sequences derived from the reference sequence.

The reference sequence is here: https://www.ncbi.nlm.nih.gov/nuccore/MN908947
Spike protein (QHD43416.1) is coded in the reference sequence position: 21563-25384

Database will be searched for the reference Spike protein sequence:

```{sql connection=con, output.var=spike_seq}
select * from datahub_0.translated_aa_ref_sequence(21563, 25384)
```

* Note: the "datahub_0.barcodes_with_mutation_positions({start}, {end})" SQL function translate the reference nucleotide sequence into protein sequence (the parameters are the start and end positions of reference nucleotide sequence)

```{r}
spike_seq <- str_c(spike_seq$aa_symbol, collapse = "")
spike_seq
```

Check if the `name` column of the epitope table contains really the reference sequence. 

```{r}
for (i in 1:nrow(epitope)) {
  ref <- str_sub(spike_seq,
                 start = epitope[i, 'start'],
                 end = epitope[i, 'end'])
  if (epitope[i, 'name'] == ref) epitope[i, 'match_with_ref'] <- TRUE else epitope[i, 'match_with_ref'] <- FALSE
}
```

Lets see the result of the comparison

```{r}
table(epitope$match_with_ref)
```
There are 6 epitopes where the `name` column does not match with the reference. 

These are the epitopes those do not match with reference sequence:

```{r}
epitope[epitope$match_with_ref==FALSE,]
```

* These epitopes above are important one, but now for simplicity I will exclude them, later these epitopes will be checked.

### Table of T cell epitopes included the analysis {#a1}

The table below contains the epitopes those derived from reference sequence:

```{r}
epitope_selected <- epitope %>%
  filter(match_with_ref) 
epitope_selected
```
### Position of epitopes on reference spike sequence {#a2}


```{r}
x <- epitope_selected %>%
  select(id, start, end) %>%
  mutate(start = as.numeric(start),
         end = as.numeric(end)) %>%
  arrange(start)
x$y <- 1:nrow(x)
x <- x %>%
  pivot_longer(cols = start:end, values_to = "x")
```

```{r}
ggplot(data = x, aes(x = x, y = y, fill = id)) +
  geom_line(alpha = 0.4) +
  labs(title = "Distribution of T-cell epitopes on spike protein",
       x = "Amino acide position",
       y = "") +
  theme_minimal() +
  theme(aspect.ratio=2/6)
```
* Each short horizontal line represent a T cell epitope

## Barcode table

Barcode was assigned to each samples. A given barcode is basically a mutation pattern (amino acide level) of S protein.

How many different S protein exist now in the database?

```{sql connection=con}
SELECT count(*)
FROM barcode
```


* There are 346976 barcodes in the database.

I will use only those barcodes that were assigned to more than 100 samples:

```{sql connection=con, output.var = "barcode_hist"}
WITH barcode_dist AS (
     SELECT barcode_id, count(*) AS sample_number_in_bacode
          FROM aa_mutation_new AS a
          GROUP BY a.barcode_id
)

SELECT sample_number_in_bacode, count(*) AS number_of_barcode
     FROM barcode_dist
     GROUP BY sample_number_in_bacode
     ORDER BY number_of_barcode DESC
```

```{r}
barcode_hist %>%
  filter(sample_number_in_bacode > 100) %>%
  summarise(`Number of barcodes` = sum(number_of_barcode))
```

There are 1972 barcodes that appeared more than 100 samples in the database. 


```{r}
barcode_hist %>%
  filter(sample_number_in_bacode > 100) %>%
  summarise(`Number of samples belongs to the selected barcodes` = sum(sample_number_in_bacode))
```
* Around 1.8 million samples will be included into this analysis

List of barcodes that will be included this analysis:

```{sql connection=con, output.var = "barcodes"}
WITH barcode_dist AS (
     SELECT barcode_id, count(*) AS sample_number_in_bacode
          FROM aa_mutation_new AS a
          GROUP BY a.barcode_id
)

SELECT barcode_id, sample_number_in_bacode, aa_mutation
     FROM barcode_dist
     LEFT JOIN barcode ON barcode.id=barcode_dist.barcode_id
     WHERE barcode_dist.sample_number_in_bacode > 100
```

Head of the table:
```{r}
head(barcodes)
```

## Match barcodes with T cell epitopes

```{r}
mutation <- barcodes %>%
  mutate(mut = aa_mutation) %>%
  mutate(mut = str_remove(mut, pattern = "\\{"),
         mut = str_remove(mut, pattern = "\\}"),
         mut = str_remove_all(mut, pattern = "p\\.")) %>%
  mutate(n = (str_count(mut, pattern = ",") + 1))

max_mut <- max(mutation$n)
ids <- str_c("x", 1:max_mut)

mutation<- mutation%>%
  separate(mut, into = ids, sep = ",", fill  = "right") %>%
  pivot_longer(cols = all_of(ids), values_to = "mut",
               names_to = "mut_pos_in_barcode") %>%
  drop_na(mut) %>%
  mutate(pos = as.numeric(str_sub(mut, start = 4, end = -4)))

head(mutation)
```

* Table above shows the positions of mutations in the selected barcodes.

Select those epitopes where start-end position includes any of the mutation positions above.

```{r}
mutation_all <- tibble(barcode_id = integer(),
                       sample_number_in_bacode = numeric(),
                       aa_mutation = character(),
                       n = numeric(),
                       mut_pos_in_barcode = character(),
                       mut = character(),
                       pos = numeric(),
                       start = numeric(),
                       end = numeric(),
                       iedb_id= character())

for (i in row.names(epitope_selected)) {
  x <- mutation %>%
    mutate(start = as.numeric(epitope_selected[i, "start"]),
           end = as.numeric(epitope_selected[i, "end"]),
           iedb_id = as.character(epitope_selected[i, "id"])) %>%
    mutate(is_mutation_present = ifelse((pos >= start & pos <= end), TRUE, FALSE)) %>%
    filter(is_mutation_present) %>%
    select(-is_mutation_present)
  mutation_all <- mutation_all %>%
    add_row(x)
}
```

```{r}
mutation_all_mod <- mutation_all %>%
  mutate(ref_epitope = str_sub(spike_seq, start = start, end = end)) %>%
  select(-mut_pos_in_barcode) %>%
  select(-pos)

x <- mutation_all_mod %>%
  select(-mut) %>%
  select(-n) %>%
  distinct()

mutation_fin <- mutation_all_mod %>%
  select(barcode_id, iedb_id, mut) %>%
  group_by(barcode_id, iedb_id) %>%
  summarize(mutation_list = str_c(mut, collapse = ",")) %>%
  inner_join(x, by = c("barcode_id", "iedb_id")) %>%
  mutate(mutation_list = str_replace_all(mutation_list, pattern = 'Ala', replacement = 'A'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Arg', replacement = 'R'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Asn', replacement = 'N'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Asp', replacement = 'D'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Cys', replacement = 'C'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Gln', replacement = 'Q'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Glu', replacement = 'E'),
         mutation_list = str_replace_all(mutation_list, pattern = 'His', replacement = 'H'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Ile', replacement = 'I'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Leu', replacement = 'L'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Lys', replacement = 'K'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Met', replacement = 'M'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Phe', replacement = 'F'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Pro', replacement = 'P'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Ser', replacement = 'S'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Thr', replacement = 'T'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Trp', replacement = 'W'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Tyr', replacement = 'Y'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Val', replacement = 'V'),
         mutation_list = str_replace_all(mutation_list, pattern = 'Gly', replacement = 'G')
         )
```


```{r}
for (i in row.names(mutation_fin)){
  mut_list <- unlist(strsplit(as.character(mutation_fin[i, "mutation_list"]), ","))
  mutation_fin[i, "mut_epitope"] <- mutation_fin[i, "ref_epitope"]
  for (j in mut_list) {
    replacement_pos <- as.numeric(str_sub(j, start = 2, end = -2)) - as.numeric(mutation_fin[i, "start"]) +1
    replacement_AA <- str_sub(j, start = -1, end = -1)
    mutation_fin[i, "mut_epitope"] <- str_c(str_sub(mutation_fin[i, "mut_epitope"],
                                                    start = 1,
                                                    end = (replacement_pos - 1)),
                                            replacement_AA,
                                            str_sub(mutation_fin[i, "mut_epitope"],
                                                    start = (replacement_pos + 1),
                                                    end = -1))
  }
}
head(mutation_fin)
```

There are 38155 mutated epitopes in the table above, but many of them appear in more than one times (if the same mutated epitope appears in more than one barcode, then it it will be present more than one row) 

### Mutated epitope table {#a3}

```{r}
mutated_epitope_table <- mutation_fin %>%
  ungroup() %>%
  select(iedb_id, ref_epitope, mut_epitope) %>%
  distinct()
mutated_epitope_table  
```
* The table above contains 935 mutated epitopes. This is the final list, these need to be checked if their binding affinity changes as a result of mutation


```{r}
write_tsv(mutated_epitope_table, file = "/v/projects/geno2pheno-kpapp/kpapp/t_cell_epitop/data/mutated_epitopes.tsv")
```

```{r}
write_tsv(mutation_fin, file = "/v/projects/geno2pheno-kpapp/kpapp/t_cell_epitop/data/mutated_epitopes_barcodes.tsv")
```

```{r}
write_tsv(mutation_all, file = "/v/projects/geno2pheno-kpapp/kpapp/t_cell_epitop/data/mutated_epitopes_barcodes_all.tsv")
```







