Flu NA data exploration

Simple machine learning methods for genotype to phenotype prediction

Author
Affiliation

Istvan Csabai

ELTE Eötvös Loránd University

Published

June 7, 2023

Short summary
In this notebook, we explore the relation between mutations and antigenic properties of flu NA protein. In collaboration with Ron Fouchier (EMC), Daniel Remondini (UNIBO), Francesco Durazzi (UNIBO)

1 Data

1.1 The titer table

Titer data for 301 strains measured against 24 sera. Date of the flu strains are encoded in the sample names.

Code
###---#| output: false
# df = pd.read_json('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8.ace')
df = pd.read_csv('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_dist.csv')
df = df.rename(columns={"Unnamed: 0":"strain"})
df
strain F12050 F12052 F12021 F12026 F12027 F12029 F12031 F12034 F12035 ... F12048 F13039 F13041 F13045 F14004 F20028 F20030 F20031 F20034 F20035
0 M1/57 1.574416 3.402557 2.898778 1.942968 7.229467 7.093932 7.939687 9.517563 10.359909 ... 9.615901 12.123658 12.828431 12.156852 13.593330 11.677755 14.704255 15.289147 15.123345 16.697156
1 B1/68 2.600630 1.430592 3.880875 2.615082 6.864780 6.216929 6.578444 7.853448 7.738927 ... 8.441434 9.317535 10.502744 9.998033 11.486862 9.712056 12.089350 12.497773 12.437677 13.903903
2 BI/16190/68 3.011607 1.195503 4.316032 2.675801 6.538964 5.848507 6.188585 7.434459 7.276061 ... 8.140292 8.973482 10.118619 9.659428 11.182185 9.304252 11.655434 12.054789 12.000709 13.458979
3 BI/21793/72 7.559219 4.792659 8.788941 5.179885 1.874458 1.322527 1.207710 2.748615 5.021993 ... 7.520932 9.757911 9.416751 9.660916 11.522350 7.563519 10.007944 10.281167 10.315149 11.478862
4 BI/1761/76 7.689574 4.992261 8.939485 5.262361 1.545108 1.211213 1.464863 2.929424 5.314439 ... 7.522217 9.992182 9.556238 9.789711 11.645269 7.641819 10.178269 10.485362 10.496395 11.672114
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
296 rec A/Hong Kong/2671/2019 15.341979 12.378078 16.883226 13.812782 10.841371 10.086912 10.042302 9.007271 6.301289 ... 8.565258 5.863700 4.496721 6.152295 6.982223 5.022823 1.221563 1.358903 0.831859 1.841209
297 rec A/SouthAustralia/34/2019 15.753855 12.787633 17.182149 14.297536 11.494104 10.685924 10.389738 9.263233 6.418597 ... 9.602043 6.270547 5.450944 7.065334 7.865231 6.094661 2.147825 0.836129 1.483691 0.680975
298 rec A/Netherlands/16/2020 15.090016 12.109388 16.643504 13.518047 10.459409 9.719283 9.698070 8.667638 6.036871 ... 8.283305 5.851700 4.367475 6.031341 6.947526 4.717432 1.189683 1.515029 0.981041 2.132923
299 rec A/Netherlands/122/2020 16.855485 13.849411 18.172795 15.362140 12.403461 11.611856 11.043661 9.755267 7.122762 ... 11.133307 7.770089 7.105910 8.720389 9.514380 7.596464 3.792909 2.257337 3.140030 1.052983
300 rec A/Cambodia/e0826360/2020 16.228734 13.254948 17.613645 14.776721 11.951957 11.143226 10.745410 9.561260 6.756559 ... 10.239361 6.802866 6.088806 7.695222 8.460796 6.726722 2.793297 1.354244 2.126825 0.184532

301 rows × 25 columns

1.2 Amino acid sequences

Aligned neuraminidase (NA) sequennces for the 301 flu strains.

  • BI/16190/68 is one of the earliest cases, we may take that as reference (Same was used in the HA study)
Code
fastaNames = ! grep ">" ../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta | sed 's/>//g'
tableNames = df['strain']
#tableNames[:5]
set(fastaNames)==set(tableNames.values) # check if both data sources have the same samples
Code
from Bio import SeqIO

ref_id = 'BI/16190/68'

fasta_seq = []
fasta_id = []

fnam = "../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta"

fasta_sequences = SeqIO.parse(open(fnam),'fasta')       
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    fasta_seq.append(sequence)
    fasta_id.append(name)
    if name==ref_id:
        ref_seq1nt = sequence

fasta_df = pd.DataFrame(np.column_stack([fasta_id,fasta_seq]),columns=['strain','seq'])
Code
# SeqIO does not parse names with whitespaces
fasta_df['strain'] = fastaNames
fasta_df
strain seq
0 AK/4/93 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECNSPP...
1 AM/1609/77 MNPNQKIITIGSVSLTIATVCFLMQIAILVTTVTLHFKQYECDSPP...
2 AM/4112/92 MNPNQKIITIGSVSLTIATICFLMQITILVTTVTLHFKQYECNSPP...
3 AT/211/89 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
4 AT/3572/88 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
... ... ...
296 rec A/Switzerland/8060/2017 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
297 rec A/Switzerland/9715293/2013 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
298 wt A/Netherlands/2249/2013 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
299 wt A/Perth/16/2009 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
300 wt A/Victoria/361/2011 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...

301 rows × 2 columns

Code
df = pd.merge(fasta_df,df,on="strain")
df.head(3)
Code
years = df['strain'].str.split("/").str[-1].values.astype(int)
years = np.array([ y+(y<2000)*1900 for y in [y+(y<23)*2000 for y in years ]], dtype=int)
years

1.3 MDS projection of titer data

There are 24 measured values for each strain. These 24 dimensional data vectors are projected down to 3D with multidimensional scaling (MDS). This method tries to keep the topology indformation of the original data. Please note that a new MDS projection of the same data may provide a different projection depending on the random_state seed value.

Code
embedding = MDS(n_components=3, random_state=42)
#Xmds_transformed = embedding.fit_transform(Xmds[:,np.argsort(model2.feature_importances_)[::-1][:200]])
titer_mds = embedding.fit_transform(df.iloc[:,2:].values)
titer_mds.shape
# Wall time: 57.6 s
Code
%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0].set_xlabel("titer MDS0")
axs[0].set_ylabel("titer MDS1")
#axs[0].set_title('titer01')

axs[1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('titer02')
axs[1].set_xlabel("titer MDS0")
axs[1].set_ylabel("titer MDS2")

im = axs[2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('titer12')
axs[2].set_xlabel("titer MDS1")
axs[2].set_ylabel("titer MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("3D MDS projections of the titer data")

Try rotating to optimal view in 3D!

Code
%matplotlib notebook
Code
fig = px.scatter_3d(x=titer_mds[:,0],y=titer_mds[:,1],z=titer_mds[:,2],
              color=years)
fig.show()

2 Blosum representation of amino acids and mutations

In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. There are different versions, we will use the BLOSUM62

Code
# https://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt

#   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
blosum62_raw = np.array ([
['A',  4, -1, -2, -2,  0, -1, -1,  0, -2, -1, -1, -1, -1, -2, -1,  1,  0, -3, -2,  0, -2, -1,  0], 
['R', -1,  5,  0, -2, -3,  1,  0, -2,  0, -3, -2,  2, -1, -3, -2, -1, -1, -3, -2, -3, -1,  0, -1], 
['N', -2,  0,  6,  1, -3,  0,  0,  0,  1, -3, -3,  0, -2, -3, -2,  1,  0, -4, -2, -3,  3,  0, -1], 
['D', -2, -2,  1,  6, -3,  0,  2, -1, -1, -3, -4, -1, -3, -3, -1,  0, -1, -4, -3, -3,  4,  1, -1], 
['C',  0, -3, -3, -3,  9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1, -3, -3, -2], 
['Q', -1,  1,  0,  0, -3,  5,  2, -2,  0, -3, -2,  1,  0, -3, -1,  0, -1, -2, -1, -2,  0,  3, -1], 
['E', -1,  0,  0,  2, -4,  2,  5, -2,  0, -3, -3,  1, -2, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['G',  0, -2,  0, -1, -3, -2, -2,  6, -2, -4, -4, -2, -3, -3, -2,  0, -2, -2, -3, -3, -1, -2, -1], 
['H', -2,  0,  1, -1, -3,  0,  0, -2,  8, -3, -3, -1, -2, -1, -2, -1, -2, -2,  2, -3,  0,  0, -1], 
['I', -1, -3, -3, -3, -1, -3, -3, -4, -3,  4,  2, -3,  1,  0, -3, -2, -1, -3, -1,  3, -3, -3, -1], 
['L', -1, -2, -3, -4, -1, -2, -3, -4, -3,  2,  4, -2,  2,  0, -3, -2, -1, -2, -1,  1, -4, -3, -1], 
['K', -1,  2,  0, -1, -3,  1,  1, -2, -1, -3, -2,  5, -1, -3, -1,  0, -1, -3, -2, -2,  0,  1, -1], 
['M', -1, -1, -2, -3, -1,  0, -2, -3, -2,  1,  2, -1,  5,  0, -2, -1, -1, -1, -1,  1, -3, -1, -1], 
['F', -2, -3, -3, -3, -2, -3, -3, -3, -1,  0,  0, -3,  0,  6, -4, -2, -2,  1,  3, -1, -3, -3, -1], 
['P', -1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4,  7, -1, -1, -4, -3, -2, -2, -1, -2], 
['S',  1, -1,  1,  0, -1,  0,  0,  0, -1, -2, -2,  0, -1, -2, -1,  4,  1, -3, -2, -2,  0,  0,  0], 
['T',  0, -1,  0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1,  1,  5, -2, -2,  0, -1, -1,  0], 
['W', -3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1,  1, -4, -3, -2, 11,  2, -3, -4, -3, -2], 
['Y', -2, -2, -2, -3, -2, -1, -2, -3,  2, -1, -1, -2, -1,  3, -3, -2, -2,  2,  7, -1, -3, -2, -1], 
['V',  0, -3, -3, -3, -1, -2, -2, -3, -3,  3,  1, -2,  1, -1, -2, -2,  0, -3, -1,  4, -3, -2, -1], 
['B', -2, -1,  3,  4, -3,  0,  1, -1,  0, -3, -4,  0, -3, -3, -2,  0, -1, -4, -3, -3,  4,  1, -1], 
['Z', -1,  0,  0,  1, -3,  3,  4, -2,  0, -3, -3,  1, -1, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['X',  0, -1, -1, -1, -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2,  0,  0, -2, -1, -1, -1, -1, -1]
])

blosum62 = pd.DataFrame(blosum62_raw).rename(columns={0:'AminoAcid'})
cols = blosum62.columns.drop('AminoAcid')
blosum62[cols] = blosum62[cols].apply(pd.to_numeric, errors='coerce')

blosum62['-'] = -np.diag(blosum62_raw[:,1:].astype(int))
#blosum62.loc[len(blosum62)] = list(np.append('-',np.append(-np.diag(blosum62_raw[:,1:].astype(int)),0) ))
_ = list(-np.diag(blosum62_raw[:,1:].astype(np.int64)))
blosum62.loc[len(blosum62)] = blosum62.loc[len(blosum62)-1]
blosum62.iloc[len(blosum62)-1,0] = '-'
blosum62.iloc[len(blosum62)-1,1:] = _ + [0]

aminoAcids = list(blosum62.AminoAcid.values)
blosum62_matrix = blosum62.iloc[:,1:].values.astype(float)
#
blosum62.columns = np.concatenate([['AminoAcids'],aminoAcids])

blosum62
AminoAcids A R N D C Q E G H ... P S T W Y V B Z X -
0 A 4 -1 -2 -2 0 -1 -1 0 -2 ... -1 1 0 -3 -2 0 -2 -1 0 -4
1 R -1 5 0 -2 -3 1 0 -2 0 ... -2 -1 -1 -3 -2 -3 -1 0 -1 -5
2 N -2 0 6 1 -3 0 0 0 1 ... -2 1 0 -4 -2 -3 3 0 -1 -6
3 D -2 -2 1 6 -3 0 2 -1 -1 ... -1 0 -1 -4 -3 -3 4 1 -1 -6
4 C 0 -3 -3 -3 9 -3 -4 -3 -3 ... -3 -1 -1 -2 -2 -1 -3 -3 -2 -9
5 Q -1 1 0 0 -3 5 2 -2 0 ... -1 0 -1 -2 -1 -2 0 3 -1 -5
6 E -1 0 0 2 -4 2 5 -2 0 ... -1 0 -1 -3 -2 -2 1 4 -1 -5
7 G 0 -2 0 -1 -3 -2 -2 6 -2 ... -2 0 -2 -2 -3 -3 -1 -2 -1 -6
8 H -2 0 1 -1 -3 0 0 -2 8 ... -2 -1 -2 -2 2 -3 0 0 -1 -8
9 I -1 -3 -3 -3 -1 -3 -3 -4 -3 ... -3 -2 -1 -3 -1 3 -3 -3 -1 -4
10 L -1 -2 -3 -4 -1 -2 -3 -4 -3 ... -3 -2 -1 -2 -1 1 -4 -3 -1 -4
11 K -1 2 0 -1 -3 1 1 -2 -1 ... -1 0 -1 -3 -2 -2 0 1 -1 -5
12 M -1 -1 -2 -3 -1 0 -2 -3 -2 ... -2 -1 -1 -1 -1 1 -3 -1 -1 -5
13 F -2 -3 -3 -3 -2 -3 -3 -3 -1 ... -4 -2 -2 1 3 -1 -3 -3 -1 -6
14 P -1 -2 -2 -1 -3 -1 -1 -2 -2 ... 7 -1 -1 -4 -3 -2 -2 -1 -2 -7
15 S 1 -1 1 0 -1 0 0 0 -1 ... -1 4 1 -3 -2 -2 0 0 0 -4
16 T 0 -1 0 -1 -1 -1 -1 -2 -2 ... -1 1 5 -2 -2 0 -1 -1 0 -5
17 W -3 -3 -4 -4 -2 -2 -3 -2 -2 ... -4 -3 -2 11 2 -3 -4 -3 -2 -11
18 Y -2 -2 -2 -3 -2 -1 -2 -3 2 ... -3 -2 -2 2 7 -1 -3 -2 -1 -7
19 V 0 -3 -3 -3 -1 -2 -2 -3 -3 ... -2 -2 0 -3 -1 4 -3 -2 -1 -4
20 B -2 -1 3 4 -3 0 1 -1 0 ... -2 0 -1 -4 -3 -3 4 1 -1 -4
21 Z -1 0 0 1 -3 3 4 -2 0 ... -1 0 -1 -3 -2 -2 1 4 -1 -4
22 X 0 -1 -1 -1 -2 -1 -1 -1 -1 ... -2 0 0 -2 -1 -1 -1 -1 -1 1
23 - -4 -5 -6 -6 -9 -5 -5 -6 -8 ... -7 -4 -5 -11 -7 -4 -4 -4 1 0

24 rows × 25 columns

2.1 BLOSUM62 substitution w.r.t the ref seq

BI/16190/68 is one of the earliest cases, we may take that as reference. Same was used in the HA study. We calculate the BLOSUM62 substitution vectors for the mutations in each strain with respect to this reference.

Please note that in the FASTA file there were spurious single amino-acid extra deletions at the last positions, they were deleted.

Intra-sequence deletions are encoded with the negative value of the matching AA.

Code
dummy = []
for x in fasta_df.iterrows():
    idx1 = x[1]['strain']
    #print(idx1)
    # there are sporadic lower case letters and probably false indels
    seq1 = x[1]['seq'].upper()
    if len(seq1)>len(ref_seq1nt):
        seq1 = seq1.replace('-','')
#    else: # spurious single deletions are replaced with ref
#        ii = seq1.find('-')
#        if ii>=0:
#            s = list(seq1)
#            s[ii] = ref_seq1nt[ii]
#            seq1 = ''.join(s)
    dummy.append([idx1,
                  [ blosum62_matrix[aminoAcids.index(seq1[i]),aminoAcids.index(ref_seq1nt[i])]  for i in range(len(seq1)) ]])
        
ref_blosum62 = pd.DataFrame(dummy,columns=['strain','blosum62'])

#CPU times: user 128 ms, sys: 15.7 ms, total: 143 ms
Code
embedding = MDS(n_components=3, random_state=42)
blosum_mds= embedding.fit_transform(np.row_stack(ref_blosum62['blosum62'].values))
blosum_mds.shape
#1min 16s

The sequence projections are similar, but not the same as the titer MDS projections. But we did not expect it either as beyond rotations and mirror images, top-bottom flip, other small differences are expected. On the other hand these projections also nicely correlate with the date, maybe even better than the titer projections. This can be understood, as sequences evolve with time, but sometimes antigenic changes do not follow.

Code
%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(blosum_mds[:,0],blosum_mds[:,1],c=years, cmap=cmap, s=siz)
#axs[0].set_title('blosum01')
axs[0].set_xlabel("BLOSUM62 MDS0")
axs[0].set_ylabel("BLOSUM62 MDS1")

axs[1].scatter(blosum_mds[:,0],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('blosum02')
axs[1].set_xlabel("BLOSUM62 MDS0")
axs[1].set_ylabel("BLOSUM62 MDS2")


im = axs[2].scatter(blosum_mds[:,1],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('blosum12')
axs[2].set_xlabel("BLOSUM62 MDS1")
axs[2].set_ylabel("BLOSUM62 MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("Projections of the BLOSUM62 encoded sequence data")

Try rotating to optimal view in 3D!

Code
fig = px.scatter_3d(x=blosum_mds[:,0],y=blosum_mds[:,1],z=blosum_mds[:,2],
              color=years)
fig.show()

2.2 Predict date from titer

Up to now we used unsupervised projections. Now we will perform basic machine learning regressions. Part of the data will be used as training set to fit the relations, than it will be evaluated on an independent test set. The train test split can be done randomly or based on dates. The first option makes the predictions easier as similar data can be present in the training and test set. Extrapolating future properties would be the most interesting, interpolation is easier.

First let check if we can estimate the date from the titer values. This may not has any use, but can inform us on the relation between properties and in a sense measure the information content.

Code
%matplotlib inline
X = df.iloc[:,2:].values 
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.5035031707908746
Test RMS: 2.725250995631021  MAD: 1.986651428801343

As expected it is hard to extrapolate future as the training set does not contain future dates. The exact values are not correct, but a trend can be seen.

Code

X = df.iloc[:,2:].values 
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.5910112899690743
Test RMS: 17.360129600702983  MAD: 15.309164997196916

2.3 Predict date from amino acids

The same as before, but now the model’s inputs are not the titer values, but the BLOSUM62 encoded sequences.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Blosum62")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.44203423102499495
Test RMS: 0.9996396331279678  MAD: 0.6920104662905252

2.4 Important positions

The model allows us to list the AA positions that were most influential in estimating the date of the strain.

Code

#  - train on all data
plt.figure(figsize=(12,6))
model.fit(X, years)
plt.bar(x=range(len(model.feature_importances_)),height=model.feature_importances_)
plt.plot(model.feature_importances_,'o')
plt.xlabel("NA - amino acid position")
plt.ylabel("Feature importance")
Text(0, 0.5, 'Feature importance')

Code
top20 = np.argsort(model.feature_importances_)[::-1]
dftop_date = pd.DataFrame(index=top20.astype(int), data=model.feature_importances_[top20],columns=['Importance']).head(20).T
dftop_date.rename_axis(columns="Position")
Position 142 140 148 380 368 42 220 400 214 366 371 345 401 266 248 463 312 146 198 337
Importance 0.368931 0.120453 0.104074 0.08837 0.054158 0.044541 0.037659 0.025245 0.018386 0.01421 0.011318 0.011128 0.009755 0.009006 0.008715 0.008032 0.006504 0.006491 0.00637 0.004328

2.5 Extrapolate future

As for the titer it is not possible to guess future dates, but we see even better correlated trend.

Code


X = np.row_stack(ref_blosum62['blosum62'].values)
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5,c='orange', label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.423688698237624
Test RMS: 16.237387002262007  MAD: 14.246689010423573

2.6 Interpolate

Now not the last decades are kept as test set, but dates between 1990 and 2000. Interpolation is pretty good for this range.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)