# Positions with growing AF¶

### Goal:¶

Find mutations that had low average allele frequency (AF) a few weeks ago, but now have 0.5 larger AF. We do not distinguish type of mutations (indels, etc.), only count any type at the position where it is registered in VCF.

### Conclusions:¶

• SQL database is fast an efficient for such studies
• In general Oxford nanopore allele frequencies tend to avoid very high (~1) and very low (~0) values
• Some positions (e.g. 28916) had low AF mutations for a long time, and switched to high AF recently
• Some positions are present with both low and high AF (in different samples)
In [3]:
# load python libraries
import pandas as pd
import pandas.io.sql as sqlio
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import psycopg2

In [4]:
# this connects to database and executes a query
def psql(query):
conn = psycopg2.connect("dbname=***** user=*** password=**** host=postgres.coveo port=5432")
conn.close()
return dummy


### The query¶

We build the query by incrementally filtering the whole data set of mutations. To do so wr run sub-queries and create temporary data tables.

#### 1. Reliable mutations¶

Select reliable mutations based on Poisson statistics, allele frequency > 3 times (random) standard deviation. Calculate the number of occurances and average AF on weekly basis

In [ ]:
query0 = '''
CREATE TEMP TABLE IF NOT EXISTS weekly_AF AS
SELECT pos,date_isoweek, date_isoyear, AVG(af) as avgAF, STDDEV(af) as stdAF, COUNT(*) as cnt
FROM vcf_all JOIN meta on vcf_all.ena_run = meta.ena_run
WHERE ann_num=1
AND af>3/SQRT(dp)
AND clean_host = 'Homo sapiens'
GROUP BY pos,date_isoyear,date_isoweek;

'''


#### 2. Mutations that occur in many samples¶

Subset for the genomic positions that have mutation at least in 200 of the collected samples

In [ ]:
query1 = '''
CREATE TEMP TABLE IF NOT EXISTS frequent_pos AS
SELECT pos FROM weekly_AF
GROUP BY pos
HAVING SUM(cnt)>200;

'''
query2 = '''
CREATE TEMP TABLE IF NOT EXISTS frequent_AF AS
SELECT w.* FROM weekly_AF w JOIN frequent_pos f
ON w.pos = f.pos;

'''


#### 3. Average AF in the last 10 and in previous weeks¶

Get the average AF for each of the selected genomic positions for the last 10 weeks and for the previous weeks, respectively

In [ ]:
query3 = '''
CREATE TEMP TABLE IF NOT EXISTS last_weeks_AF AS
SELECT pos, AVG(avgAF) AS avg_AF FROM frequent_AF
WHERE date_isoyear=2021 AND date_isoweek>24
GROUP BY pos
HAVING SUM(cnt) > 50;

'''
query4 = '''
CREATE TEMP TABLE IF NOT EXISTS previous_weeks_AF AS
SELECT pos, AVG(avgAF) AS avg_AF FROM frequent_AF
WHERE NOT (date_isoyear=2021 AND date_isoweek>24)
GROUP BY pos
HAVING SUM(cnt) > 50;

'''


#### 4. Positions with AF jump¶

Find genomic positions with significant recent jump (>0.5) in AF

In [ ]:
query5 = '''
CREATE TEMP TABLE IF NOT EXISTS jump_AF AS
SELECT p.pos
FROM last_weeks_AF l JOIN previous_weeks_AF p
ON l.pos = p.pos
WHERE l.avg_AF - p.avg_AF > 0.5;

'''


#### 5. Put all together and get detailed information for the selected positions¶

For the selected genomic positions with "jumping AF" collect detailed data, with individual AF values, country of origin, sequencing platform

In [148]:
query6 = '''
SELECT vcf_all.pos,meta.clean_collection_date, vcf_all.af, clean_country,instrument_platform
FROM vcf_all JOIN meta ON vcf_all.ena_run = meta.ena_run
JOIN jump_AF ON vcf_all.pos = jump_AF.pos
WHERE ann_num=1
AND af>3/SQRT(dp)
AND clean_host = 'Homo sapiens'
'''


#### Run the query¶

It takes around 5 minutes

In [149]:
%%time
query = query0 + query1 + query2 + query3 + query4 + query5 + query6
growing_AF = psql(query)

CPU times: user 901 ms, sys: 169 ms, total: 1.07 s
Wall time: 4min 57s


#### The genomic positions with "jumping AF" values¶

In [150]:
np.sort(pd.unique(growing_AF.pos))

Out[150]:
array([  444,   675,   691,   749,   805,  1048,  1125,  1281,  1474,
1499,  2246,  4183,  6622,  6671,  8936,  9554, 10201, 10993,
11332, 11333, 12170, 13210, 13390, 13593, 13638, 14008, 14030,
14185, 14801, 15328, 15451, 15569, 15629, 15760, 16054, 16170,
17302, 17632, 18197, 18402, 18713, 18876, 18927, 20663, 20691,
20991, 22403, 24883, 24905, 24928, 24942, 25171, 25817, 26096,
26625, 26710, 26763, 27407, 27489, 27727, 28149, 28270, 28299,
28559, 28916, 29024, 29288])

### Visualize the AF time evolution¶

Plot symbol colors mark the sequencing platform. We use the same time and AF scale on all figures.

In [151]:
# date type conversion needed for plotting
growing_AF['datetime'] = growing_AF['clean_collection_date'].apply(pd.to_datetime)

In [155]:
for i in np.sort(pd.unique(growing_AF.pos)):
_ = growing_AF[growing_AF.pos==i]
plt.figure(figsize=(16,3))
_ = _[_.instrument_platform=='ILLUMINA']
plt.plot(_.datetime, _.af,'.',label='ILLUMINA pos='+str(i), alpha=0.5)
_ = growing_AF[growing_AF.pos==i]
_ = _[_.instrument_platform!='ILLUMINA']
plt.plot(_.datetime, _.af,'.',label='NANOPORE pos='+str(i), alpha=0.5)
plt.xlabel("date")
plt.ylabel("allele freq")
plt.xlim(pd.to_datetime('2020-01-01'),pd.to_datetime('2021-10-01'))
plt.ylim(0,1.1)
plt.legend()
plt.show()