Flu NA data exploration

Simple machine learning methods for genotype to phenotype prediction

Author
Affiliation

Istvan Csabai

ELTE Eötvös Loránd University

Published

June 7, 2023

Short summary
In this notebook, we explore the relation between mutations and the antigenic properties of the flu NA protein. In collaboration with Ron Fouchier (EMC), Daniel Remondini (UNIBO), and Francesco Durazzi (UNIBO).

1 Data

1.1 The titer table

Titer data for 301 strains measured against 24 sera. The dates of the flu strains are encoded in the sample names.

Code
###---#| output: false
import pandas as pd
# df = pd.read_json('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8.ace')
df = pd.read_csv('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_dist.csv')
df = df.rename(columns={"Unnamed: 0":"strain"})
df
strain F12050 F12052 F12021 F12026 F12027 F12029 F12031 F12034 F12035 ... F12048 F13039 F13041 F13045 F14004 F20028 F20030 F20031 F20034 F20035
0 M1/57 1.574416 3.402557 2.898778 1.942968 7.229467 7.093932 7.939687 9.517563 10.359909 ... 9.615901 12.123658 12.828431 12.156852 13.593330 11.677755 14.704255 15.289147 15.123345 16.697156
1 B1/68 2.600630 1.430592 3.880875 2.615082 6.864780 6.216929 6.578444 7.853448 7.738927 ... 8.441434 9.317535 10.502744 9.998033 11.486862 9.712056 12.089350 12.497773 12.437677 13.903903
2 BI/16190/68 3.011607 1.195503 4.316032 2.675801 6.538964 5.848507 6.188585 7.434459 7.276061 ... 8.140292 8.973482 10.118619 9.659428 11.182185 9.304252 11.655434 12.054789 12.000709 13.458979
3 BI/21793/72 7.559219 4.792659 8.788941 5.179885 1.874458 1.322527 1.207710 2.748615 5.021993 ... 7.520932 9.757911 9.416751 9.660916 11.522350 7.563519 10.007944 10.281167 10.315149 11.478862
4 BI/1761/76 7.689574 4.992261 8.939485 5.262361 1.545108 1.211213 1.464863 2.929424 5.314439 ... 7.522217 9.992182 9.556238 9.789711 11.645269 7.641819 10.178269 10.485362 10.496395 11.672114
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
296 rec A/Hong Kong/2671/2019 15.341979 12.378078 16.883226 13.812782 10.841371 10.086912 10.042302 9.007271 6.301289 ... 8.565258 5.863700 4.496721 6.152295 6.982223 5.022823 1.221563 1.358903 0.831859 1.841209
297 rec A/SouthAustralia/34/2019 15.753855 12.787633 17.182149 14.297536 11.494104 10.685924 10.389738 9.263233 6.418597 ... 9.602043 6.270547 5.450944 7.065334 7.865231 6.094661 2.147825 0.836129 1.483691 0.680975
298 rec A/Netherlands/16/2020 15.090016 12.109388 16.643504 13.518047 10.459409 9.719283 9.698070 8.667638 6.036871 ... 8.283305 5.851700 4.367475 6.031341 6.947526 4.717432 1.189683 1.515029 0.981041 2.132923
299 rec A/Netherlands/122/2020 16.855485 13.849411 18.172795 15.362140 12.403461 11.611856 11.043661 9.755267 7.122762 ... 11.133307 7.770089 7.105910 8.720389 9.514380 7.596464 3.792909 2.257337 3.140030 1.052983
300 rec A/Cambodia/e0826360/2020 16.228734 13.254948 17.613645 14.776721 11.951957 11.143226 10.745410 9.561260 6.756559 ... 10.239361 6.802866 6.088806 7.695222 8.460796 6.726722 2.793297 1.354244 2.126825 0.184532

301 rows × 25 columns

1.2 Amino acid sequences

Aligned neuraminidase (NA) sequences for the 301 flu strains.

  • BI/16190/68 is one of the earliest strains; we may take it as the reference (the same reference was used in the HA study)
Code
fastaNames = ! grep ">" ../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta | sed 's/>//g'
tableNames = df['strain']
#tableNames[:5]
set(fastaNames)==set(tableNames.values) # check if both data sources have the same samples
Code
import numpy as np
from Bio import SeqIO

ref_id = 'BI/16190/68'

fasta_seq = []
fasta_id = []

fnam = "../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta"

fasta_sequences = SeqIO.parse(open(fnam),'fasta')       
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    fasta_seq.append(sequence)
    fasta_id.append(name)
    if name==ref_id:
        ref_seq1nt = sequence

fasta_df = pd.DataFrame(np.column_stack([fasta_id,fasta_seq]),columns=['strain','seq'])
Code
# SeqIO truncates identifiers at whitespace, so restore the full names parsed above
fasta_df['strain'] = fastaNames
fasta_df
strain seq
0 AK/4/93 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECNSPP...
1 AM/1609/77 MNPNQKIITIGSVSLTIATVCFLMQIAILVTTVTLHFKQYECDSPP...
2 AM/4112/92 MNPNQKIITIGSVSLTIATICFLMQITILVTTVTLHFKQYECNSPP...
3 AT/211/89 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
4 AT/3572/88 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
... ... ...
296 rec A/Switzerland/8060/2017 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
297 rec A/Switzerland/9715293/2013 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
298 wt A/Netherlands/2249/2013 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
299 wt A/Perth/16/2009 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
300 wt A/Victoria/361/2011 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...

301 rows × 2 columns

Code
df = pd.merge(fasta_df,df,on="strain")
df.head(3)
Code
# the year is the last '/'-separated token of the strain name
years = df['strain'].str.split("/").str[-1].values.astype(int)
# expand two-digit years: 0-22 -> 2000s, 23-99 -> 1900s; four-digit years pass through
years = np.array([ y+(y<2000)*1900 for y in [y+(y<23)*2000 for y in years ]], dtype=int)
years
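
A quick sanity check of the two-digit-year handling (a sketch): '57' should map to 1957, while a four-digit '2020' should pass through unchanged.

Code
# expected output: M1/57 1957, rec A/Netherlands/16/2020 2020
for name in ['M1/57', 'rec A/Netherlands/16/2020']:
    print(name, years[(df['strain'] == name).values][0])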

1.3 MDS projection of titer data

There are 24 measured values for each strain. These 24-dimensional data vectors are projected down to 3D with multidimensional scaling (MDS). This method tries to preserve the topological information of the original data. Please note that a new MDS run on the same data may yield a different projection, depending on the random_state seed value.

Code
from sklearn.manifold import MDS

embedding = MDS(n_components=3, random_state=42)
#Xmds_transformed = embedding.fit_transform(Xmds[:,np.argsort(model2.feature_importances_)[::-1][:200]])
titer_mds = embedding.fit_transform(df.iloc[:,2:].values)
titer_mds.shape
# Wall time: 57.6 s
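
The run-to-run variability mentioned above can be monitored through the final stress of the embedding; a minimal sketch:

Code
# lower stress means the 3D distances reproduce the 24-dimensional
# distances better; the value varies with random_state
print(embedding.stress_)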
Code
%matplotlib inline
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0].set_xlabel("titer MDS0")
axs[0].set_ylabel("titer MDS1")
#axs[0].set_title('titer01')

axs[1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('titer02')
axs[1].set_xlabel("titer MDS0")
axs[1].set_ylabel("titer MDS2")

im = axs[2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('titer12')
axs[2].set_xlabel("titer MDS1")
axs[2].set_ylabel("titer MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("3D MDS projections of the titer data")

Try rotating to an optimal view in 3D!

Code
%matplotlib notebook
Code
import plotly.express as px

fig = px.scatter_3d(x=titer_mds[:,0],y=titer_mds[:,1],z=titer_mds[:,2],
              color=years)
fig.show()

2 Blosum representation of amino acids and mutations

In bioinformatics, a BLOSUM (BLOcks SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences and are based on local alignments. There are different versions; we will use BLOSUM62.

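As a cross-check of the hard-coded table below, recent Biopython versions (assumed >= 1.75) ship BLOSUM62 directly; a minimal sketch:

Code
# load the canonical BLOSUM62 and look up a few scores
from Bio.Align import substitution_matrices

bl62 = substitution_matrices.load("BLOSUM62")
print(bl62['A', 'A'], bl62['A', 'R'], bl62['W', 'W'])  # 4.0 -1.0 11.0
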
Code
# https://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt

#   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
blosum62_raw = np.array ([
['A',  4, -1, -2, -2,  0, -1, -1,  0, -2, -1, -1, -1, -1, -2, -1,  1,  0, -3, -2,  0, -2, -1,  0], 
['R', -1,  5,  0, -2, -3,  1,  0, -2,  0, -3, -2,  2, -1, -3, -2, -1, -1, -3, -2, -3, -1,  0, -1], 
['N', -2,  0,  6,  1, -3,  0,  0,  0,  1, -3, -3,  0, -2, -3, -2,  1,  0, -4, -2, -3,  3,  0, -1], 
['D', -2, -2,  1,  6, -3,  0,  2, -1, -1, -3, -4, -1, -3, -3, -1,  0, -1, -4, -3, -3,  4,  1, -1], 
['C',  0, -3, -3, -3,  9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1, -3, -3, -2], 
['Q', -1,  1,  0,  0, -3,  5,  2, -2,  0, -3, -2,  1,  0, -3, -1,  0, -1, -2, -1, -2,  0,  3, -1], 
['E', -1,  0,  0,  2, -4,  2,  5, -2,  0, -3, -3,  1, -2, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['G',  0, -2,  0, -1, -3, -2, -2,  6, -2, -4, -4, -2, -3, -3, -2,  0, -2, -2, -3, -3, -1, -2, -1], 
['H', -2,  0,  1, -1, -3,  0,  0, -2,  8, -3, -3, -1, -2, -1, -2, -1, -2, -2,  2, -3,  0,  0, -1], 
['I', -1, -3, -3, -3, -1, -3, -3, -4, -3,  4,  2, -3,  1,  0, -3, -2, -1, -3, -1,  3, -3, -3, -1], 
['L', -1, -2, -3, -4, -1, -2, -3, -4, -3,  2,  4, -2,  2,  0, -3, -2, -1, -2, -1,  1, -4, -3, -1], 
['K', -1,  2,  0, -1, -3,  1,  1, -2, -1, -3, -2,  5, -1, -3, -1,  0, -1, -3, -2, -2,  0,  1, -1], 
['M', -1, -1, -2, -3, -1,  0, -2, -3, -2,  1,  2, -1,  5,  0, -2, -1, -1, -1, -1,  1, -3, -1, -1], 
['F', -2, -3, -3, -3, -2, -3, -3, -3, -1,  0,  0, -3,  0,  6, -4, -2, -2,  1,  3, -1, -3, -3, -1], 
['P', -1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4,  7, -1, -1, -4, -3, -2, -2, -1, -2], 
['S',  1, -1,  1,  0, -1,  0,  0,  0, -1, -2, -2,  0, -1, -2, -1,  4,  1, -3, -2, -2,  0,  0,  0], 
['T',  0, -1,  0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1,  1,  5, -2, -2,  0, -1, -1,  0], 
['W', -3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1,  1, -4, -3, -2, 11,  2, -3, -4, -3, -2], 
['Y', -2, -2, -2, -3, -2, -1, -2, -3,  2, -1, -1, -2, -1,  3, -3, -2, -2,  2,  7, -1, -3, -2, -1], 
['V',  0, -3, -3, -3, -1, -2, -2, -3, -3,  3,  1, -2,  1, -1, -2, -2,  0, -3, -1,  4, -3, -2, -1], 
['B', -2, -1,  3,  4, -3,  0,  1, -1,  0, -3, -4,  0, -3, -3, -2,  0, -1, -4, -3, -3,  4,  1, -1], 
['Z', -1,  0,  0,  1, -3,  3,  4, -2,  0, -3, -3,  1, -1, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['X',  0, -1, -1, -1, -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2,  0,  0, -2, -1, -1, -1, -1, -1]
])

blosum62 = pd.DataFrame(blosum62_raw).rename(columns={0:'AminoAcid'})
cols = blosum62.columns.drop('AminoAcid')
blosum62[cols] = blosum62[cols].apply(pd.to_numeric, errors='coerce')

# deletions ('-') are scored as the negative of the diagonal (self-substitution) score
gap_scores = -np.diag(blosum62_raw[:,1:].astype(int))
blosum62['-'] = gap_scores                                    # '-' column
blosum62.loc[len(blosum62)] = ['-'] + list(gap_scores) + [0]  # '-' row; '-' vs '-' scores 0

aminoAcids = list(blosum62.AminoAcid.values)
blosum62_matrix = blosum62.iloc[:,1:].values.astype(float)
#
blosum62.columns = np.concatenate([['AminoAcids'],aminoAcids])

blosum62
AminoAcids A R N D C Q E G H ... P S T W Y V B Z X -
0 A 4 -1 -2 -2 0 -1 -1 0 -2 ... -1 1 0 -3 -2 0 -2 -1 0 -4
1 R -1 5 0 -2 -3 1 0 -2 0 ... -2 -1 -1 -3 -2 -3 -1 0 -1 -5
2 N -2 0 6 1 -3 0 0 0 1 ... -2 1 0 -4 -2 -3 3 0 -1 -6
3 D -2 -2 1 6 -3 0 2 -1 -1 ... -1 0 -1 -4 -3 -3 4 1 -1 -6
4 C 0 -3 -3 -3 9 -3 -4 -3 -3 ... -3 -1 -1 -2 -2 -1 -3 -3 -2 -9
5 Q -1 1 0 0 -3 5 2 -2 0 ... -1 0 -1 -2 -1 -2 0 3 -1 -5
6 E -1 0 0 2 -4 2 5 -2 0 ... -1 0 -1 -3 -2 -2 1 4 -1 -5
7 G 0 -2 0 -1 -3 -2 -2 6 -2 ... -2 0 -2 -2 -3 -3 -1 -2 -1 -6
8 H -2 0 1 -1 -3 0 0 -2 8 ... -2 -1 -2 -2 2 -3 0 0 -1 -8
9 I -1 -3 -3 -3 -1 -3 -3 -4 -3 ... -3 -2 -1 -3 -1 3 -3 -3 -1 -4
10 L -1 -2 -3 -4 -1 -2 -3 -4 -3 ... -3 -2 -1 -2 -1 1 -4 -3 -1 -4
11 K -1 2 0 -1 -3 1 1 -2 -1 ... -1 0 -1 -3 -2 -2 0 1 -1 -5
12 M -1 -1 -2 -3 -1 0 -2 -3 -2 ... -2 -1 -1 -1 -1 1 -3 -1 -1 -5
13 F -2 -3 -3 -3 -2 -3 -3 -3 -1 ... -4 -2 -2 1 3 -1 -3 -3 -1 -6
14 P -1 -2 -2 -1 -3 -1 -1 -2 -2 ... 7 -1 -1 -4 -3 -2 -2 -1 -2 -7
15 S 1 -1 1 0 -1 0 0 0 -1 ... -1 4 1 -3 -2 -2 0 0 0 -4
16 T 0 -1 0 -1 -1 -1 -1 -2 -2 ... -1 1 5 -2 -2 0 -1 -1 0 -5
17 W -3 -3 -4 -4 -2 -2 -3 -2 -2 ... -4 -3 -2 11 2 -3 -4 -3 -2 -11
18 Y -2 -2 -2 -3 -2 -1 -2 -3 2 ... -3 -2 -2 2 7 -1 -3 -2 -1 -7
19 V 0 -3 -3 -3 -1 -2 -2 -3 -3 ... -2 -2 0 -3 -1 4 -3 -2 -1 -4
20 B -2 -1 3 4 -3 0 1 -1 0 ... -2 0 -1 -4 -3 -3 4 1 -1 -4
21 Z -1 0 0 1 -3 3 4 -2 0 ... -1 0 -1 -3 -2 -2 1 4 -1 -4
22 X 0 -1 -1 -1 -2 -1 -1 -1 -1 ... -2 0 0 -2 -1 -1 -1 -1 -1 1
23 - -4 -5 -6 -6 -9 -5 -5 -6 -8 ... -7 -4 -5 -11 -7 -4 -4 -4 1 0

24 rows × 25 columns

2.1 BLOSUM62 substitution w.r.t. the reference sequence

BI/16190/68 is one of the earliest strains, so we take it as the reference; the same reference was used in the HA study. We calculate the BLOSUM62 substitution vectors for the mutations in each strain with respect to this reference.

Please note that the FASTA file contained spurious single amino-acid deletions at the last positions; these were removed.

Intra-sequence deletions are encoded as the negative of the matching amino acid's diagonal (self-substitution) score.

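For example, with the matrix above a conserved aspartate scores +6, a D to G substitution scores -1, and a deletion of D scores -6. A minimal sketch of the per-position lookup used in the loop below (blosum_score is a hypothetical helper name):

Code
def blosum_score(strain_aa, ref_aa):
    # row: strain amino acid (or '-'), column: reference amino acid
    return blosum62_matrix[aminoAcids.index(strain_aa), aminoAcids.index(ref_aa)]

print(blosum_score('D', 'D'), blosum_score('G', 'D'), blosum_score('-', 'D'))  # 6.0 -1.0 -6.0
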
Code
dummy = []
for x in fasta_df.iterrows():
    idx1 = x[1]['strain']
    #print(idx1)
    # there are sporadic lower case letters and probably false indels
    seq1 = x[1]['seq'].upper()
    if len(seq1)>len(ref_seq1nt):
        seq1 = seq1.replace('-','')
#    else: # spurious single deletions are replaced with ref
#        ii = seq1.find('-')
#        if ii>=0:
#            s = list(seq1)
#            s[ii] = ref_seq1nt[ii]
#            seq1 = ''.join(s)
    dummy.append([idx1,
                  [ blosum62_matrix[aminoAcids.index(seq1[i]),aminoAcids.index(ref_seq1nt[i])]  for i in range(len(seq1)) ]])
        
ref_blosum62 = pd.DataFrame(dummy,columns=['strain','blosum62'])

#CPU times: user 128 ms, sys: 15.7 ms, total: 143 ms
Code
embedding = MDS(n_components=3, random_state=42)
blosum_mds= embedding.fit_transform(np.row_stack(ref_blosum62['blosum62'].values))
blosum_mds.shape
#1min 16s

The sequence projections are similar to, but not the same as, the titer MDS projections. We did not expect them to match exactly: beyond rotations, mirror images, and top-bottom flips, other small differences are expected. On the other hand, these projections also correlate nicely with the date, perhaps even better than the titer projections. This can be understood, as sequences evolve steadily with time, while antigenic changes do not always follow.

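The agreement between the two embeddings, up to translation, rotation, reflection and uniform scaling, can be quantified with a Procrustes analysis; a minimal sketch, assuming SciPy is available:

Code
# disparity is the sum of squared differences after optimal alignment;
# 0 would mean identical embeddings up to similarity transforms
from scipy.spatial import procrustes

mtx1, mtx2, disparity = procrustes(titer_mds, blosum_mds)
print(disparity)
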
Code
%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(blosum_mds[:,0],blosum_mds[:,1],c=years, cmap=cmap, s=siz)
#axs[0].set_title('blosum01')
axs[0].set_xlabel("BLOSUM62 MDS0")
axs[0].set_ylabel("BLOSUM62 MDS1")

axs[1].scatter(blosum_mds[:,0],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('blosum02')
axs[1].set_xlabel("BLOSUM62 MDS0")
axs[1].set_ylabel("BLOSUM62 MDS2")


im = axs[2].scatter(blosum_mds[:,1],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('blosum12')
axs[2].set_xlabel("BLOSUM62 MDS1")
axs[2].set_ylabel("BLOSUM62 MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("Projections of the BLOSUM62 encoded sequence data")

Try rotating to an optimal view in 3D!

Code
fig = px.scatter_3d(x=blosum_mds[:,0],y=blosum_mds[:,1],z=blosum_mds[:,2],
              color=years)
fig.show()

2.2 Predict date from titer

Up to now we used unsupervised projections. Now we will perform basic machine learning regressions. Part of the data is used as a training set to fit the relations, which are then evaluated on an independent test set. The train-test split can be done randomly or based on dates. The first option makes the predictions easier, as similar data can be present in both the training and the test set. Extrapolating future properties would be the most interesting; interpolation is easier.

First let's check whether we can estimate the date from the titer values. This may not have any direct use, but it can inform us about the relation between the properties and, in a sense, measure the information content.

Code
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X = df.iloc[:,2:].values 
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.5035031707908746
Test RMS: 2.725250995631021  MAD: 1.986651428801343
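
For context, these errors can be compared with a trivial baseline that always predicts the mean training-set year; a minimal sketch:

Code
# trivial baseline: always predict the mean year of the training set
from sklearn.dummy import DummyRegressor

base = DummyRegressor(strategy='mean').fit(X_train, y_train)
base_rms = np.sqrt(np.mean((base.predict(X_test) - y_test)**2))
print('Baseline test RMS:', base_rms)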

As expected, it is hard to extrapolate into the future, since the training set does not contain future dates. The exact values are not correct, but a trend can be seen.

Code

X = df.iloc[:,2:].values 
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.5910112899690743
Test RMS: 17.360129600702983  MAD: 15.309164997196916

2.3 Predict date from amino acids

The same as before, but now the model’s inputs are not the titer values, but the BLOSUM62 encoded sequences.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Blosum62")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.44203423102499495
Test RMS: 0.9996396331279678  MAD: 0.6920104662905252

2.4 Important positions

The model allows us to list the AA positions that were most influential in estimating the date of the strain.

Code

#  - train on all data
plt.figure(figsize=(12,6))
model.fit(X, years)
plt.bar(x=range(len(model.feature_importances_)),height=model.feature_importances_)
plt.plot(model.feature_importances_,'o')
plt.xlabel("NA - amino acid position")
plt.ylabel("Feature importance")
Text(0, 0.5, 'Feature importance')

Code
srt = np.argsort(model.feature_importances_)[::-1]
dftop_date = pd.DataFrame(index=srt.astype(int), data=model.feature_importances_[srt],columns=['Importance']).head(20).T
dftop_date.rename_axis(columns="Position")
Position 142 140 148 380 368 42 220 400 214 366 371 345 401 266 248 463 312 146 198 337
Importance 0.368931 0.120453 0.104074 0.08837 0.054158 0.044541 0.037659 0.025245 0.018386 0.01421 0.011318 0.011128 0.009755 0.009006 0.008715 0.008032 0.006504 0.006491 0.00637 0.004328
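
Impurity-based importances from gradient boosting are known to be somewhat biased; permutation importance could serve as a cross-check. A minimal sketch, not part of the original analysis (note that the model above was refit on all data, so a strict check would refit on X_train first):

Code
# permutation importance: how much does shuffling each position hurt the score?
from sklearn.inspection import permutation_importance

perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top = np.argsort(perm.importances_mean)[::-1][:20]
print(list(zip(top, perm.importances_mean[top])))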

2.5 Extrapolate future

As with the titers, it is not possible to guess future dates, but the trend is even more strongly correlated.

Code


X = np.row_stack(ref_blosum62['blosum62'].values)
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5,c='orange', label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.423688698237624
Test RMS: 16.237387002262007  MAD: 14.246689010423573

2.6 Interpolate

Now the test set is not the last decades but the dates between 1990 and 2000. Interpolation is pretty good for this range.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange',label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.40164925862060424
Test RMS: 2.506124099751673  MAD: 2.3518982738108387

2.7 Predict titer MDS 3D map coordinates from BLOSUM62 (not that important)

Use the BLOSUM62 encoded sequences as input and try to estimate the 3 MDS titer coordinates. The task is not easy: the model also has to mimic the MDS projection, which is a complex nonlinear transformation.

For the training set the correlation is very good, but that may be a result of overfitting. For the test set there are larger errors.

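One way to hedge against this overfitting would be the built-in early stopping of gradient boosting; a sketch with untuned parameters (available in scikit-learn >= 0.20):

Code
# hold out 10% of the training data internally and stop when the validation
# score has not improved for 10 consecutive iterations
from sklearn.ensemble import GradientBoostingRegressor

model_es = GradientBoostingRegressor(n_iter_no_change=10, validation_fraction=0.1,
                                     random_state=0)
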
Code
X = np.row_stack(ref_blosum62['blosum62'].values)
y = titer_mds.copy()
X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
# the identical random_state keeps these splits aligned with the X/y split above
year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []

for i in range(3):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    print('Train RMS:',RMS_train)
    print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)

fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
for i in range(3):
    axs[i].scatter(y_train[:,i],trains[i], alpha=1, label='train')
    axs[i].scatter(y_test[:,i],tests[i], alpha=0.5, c='orange', label= 'test')
    axs[i].set_title("Titer MDS "+str(i)+" from Blosum62")
_ = plt.legend()
Train RMS: 1.686085448601462
Test RMS: 3.4235557468050892  MAD: 2.5787491053749187
Train RMS: 1.76055945608206
Test RMS: 2.9031495256923745  MAD: 2.1939659988484204
Train RMS: 1.0749697143048584
Test RMS: 2.189875130864545  MAD: 1.702785397625311
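
The per-coordinate loop above could be written more compactly with scikit-learn's MultiOutputRegressor wrapper, which fits one independent regressor per target; a minimal sketch:

Code
from sklearn.multioutput import MultiOutputRegressor

multi = MultiOutputRegressor(GradientBoostingRegressor())
multi.fit(X_train, y_train)          # y_train has shape (n_samples, 3)
y_pred_test = multi.predict(X_test)  # shape (n_test, 3)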

Code

%matplotlib inline
fig, axs = plt.subplots(2, 3, figsize=(16,12), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0,0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0,0].set_title('titer01')
axs[0,1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
axs[0,1].set_title('titer02')
im = axs[0,2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
axs[0,2].set_title('titer12')

axs[1,0].scatter(trains[0],trains[1],c=year_train, cmap=cmap, s=siz)
axs[1,0].scatter(tests[0],tests[1],c=year_test, cmap=cmap, s=siz)
axs[1,0].set_title('predicted titer01')

axs[1,1].scatter(trains[0],trains[2],c=year_train, cmap=cmap, s=siz)
axs[1,1].scatter(tests[0],tests[2],c=year_test, cmap=cmap, s=siz)
axs[1,1].set_title('predicted titer02')

im = axs[1,2].scatter(trains[1],trains[2],c=year_train, cmap=cmap, s=siz)
im = axs[1,2].scatter(tests[1],tests[2],c=year_test, cmap=cmap, s=siz)
axs[1,2].set_title('predicted titer12')

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')

2.7.1 Importance scores

Here we can get importance scores for all three coordinate models. They differ: for each of them, different positions seem to be important.

Code
# train on all
X = np.row_stack(ref_blosum62['blosum62'].values)
y = titer_mds.copy()

model = GradientBoostingRegressor()

nTiter = y.shape[1]
importance_blosum_MDS3D = []

for i in range(nTiter):
    model.fit(X, y[:,i])
    importance_blosum_MDS3D.append(model.feature_importances_)
importance_blosum_MDS3D = np.array(importance_blosum_MDS3D)
Code
fig, ax = plt.subplots(1,1,figsize=(16,4))
names = np.arange(len(importance_blosum_MDS3D),dtype=int).astype(str)
img = ax.imshow((importance_blosum_MDS3D+1e-7),aspect='auto',interpolation='nearest')
plt.title("")
plt.xlabel("NA amino acid position")
plt.ylabel("Titer ID")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
_ =fig.colorbar(img,label="importance")

Code
x = importance_blosum_MDS3D.mean(axis=0)
srt = np.argsort(x)[::-1]
dftop_MDS = pd.DataFrame(index=srt.astype(int), data=x[srt],columns=['Importance']).head(20).T
dftop_MDS.rename_axis(columns="Position")
Position 368 42 140 152 244 337 401 367 312 198 436 246 371 220 247 335 389 343 308 80
Importance 0.273574 0.102225 0.051161 0.046234 0.042765 0.039661 0.039077 0.037007 0.028736 0.022507 0.021308 0.01982 0.018747 0.017828 0.016492 0.011336 0.011193 0.010379 0.008198 0.007892

2.8 Predict individual titer values from blosum - Random split

This is more interesting: can we predict the original 24 titer values (not their MDS projections) from the sequences?

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()
X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]

for i in range(nTiter):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    #RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    #MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    #print('Train RMS:',RMS_train)
    #print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)


#fig, axs = plt.subplots(6, 4, figsize=(16,6), constrained_layout=True)
names = df.iloc[:,2:].columns
plt.figure(figsize=(16,16), constrained_layout=True)
for i in range(nTiter):
    ax = plt.subplot(6, 4, i + 1)
    ax.scatter(y_train[:,i],trains[i], alpha=0.5, label='train')
    ax.scatter(y_test[:,i],tests[i], alpha=0.5, c='orange' , label= 'test')
    #rTrain = np.corrcoef(y_train[:,i],trains[i])[0,1]
    rmsTest = np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
    rTest = np.corrcoef(y_test[:,i],tests[i])[0,1]
    ax.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
    ax.legend()
_ = plt.suptitle("Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)
#plt.legend()

The correlation is not bad. Let's look at all 24 titer values cumulatively in one plot.

Code

rmsTest = np.sqrt(np.mean((y_test.flatten()-np.array(tests).T.flatten())**2))
#np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
rTest = np.corrcoef(y_test.flatten(),np.array(tests).T.flatten())[0,1]

rmsTrain = np.sqrt(np.mean((y_train.flatten()-np.array(trains).T.flatten())**2))
rTrain = np.corrcoef(y_train.flatten(),np.array(trains).T.flatten())[0,1]

#plt.figure(figsize=(6,6), constrained_layout=True)
fig, axs = plt.subplots(1, 2, figsize=(12,6), constrained_layout=True)

axs[0].scatter(y_train.flatten(),np.array(trains).T.flatten(), alpha=0.5, label='train')
axs[0].set_title("TRAIN  RMS:"+str(rmsTrain)[:4]+"    R:"+str(rTrain)[:4])
axs[0].set_xlabel("True titer")
axs[0].set_ylabel("Estimated titer")
axs[0].legend()
axs[1].scatter(y_test.flatten(),np.array(tests).T.flatten(), alpha=0.5, c='orange', label= 'test')
axs[1].set_title("TEST RMS:"+str(rmsTest)[:4]+"    R:"+str(rTest)[:4])
axs[1].set_xlabel("True titer")
axs[1].set_ylabel("Estimated titer")
#plt.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
axs[1].legend()
_ = plt.suptitle("Random split - Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)

2.9 Predict individual titer values from blosum - date split

Same as above, but instead of a random train-test split, the date range 1990-2000 forms the test set and all other strains form the training set.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()

idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = y[idx]
y_test = y[~idx]

#X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
#year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
#idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]
for i in range(nTiter):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    #RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    #MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    #print('Train RMS:',RMS_train)
    #print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)

#fig, axs = plt.subplots(6, 4, figsize=(16,6), constrained_layout=True)
names = df.iloc[:,2:].columns
plt.figure(figsize=(16,16), constrained_layout=True)
for i in range(nTiter):
    ax = plt.subplot(6, 4, i + 1)
    ax.scatter(y_train[:,i],trains[i], alpha=0.5, label='train')
    ax.scatter(y_test[:,i],tests[i], alpha=0.5, c='orange' , label= 'test')
    #rTrain = np.corrcoef(y_train[:,i],trains[i])[0,1]
    rmsTest = np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
    rTest = np.corrcoef(y_test[:,i],tests[i])[0,1]
    ax.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
    ax.legend()
_ = plt.suptitle("Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)
#plt.legend()

Code

rmsTest = np.sqrt(np.mean((y_test.flatten()-np.array(tests).T.flatten())**2))
#np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
rTest = np.corrcoef(y_test.flatten(),np.array(tests).T.flatten())[0,1]

rmsTrain = np.sqrt(np.mean((y_train.flatten()-np.array(trains).T.flatten())**2))
rTrain = np.corrcoef(y_train.flatten(),np.array(trains).T.flatten())[0,1]

#plt.figure(figsize=(6,6), constrained_layout=True)
fig, axs = plt.subplots(1, 2, figsize=(12,6), constrained_layout=True)

axs[0].scatter(y_train.flatten(),np.array(trains).T.flatten(), alpha=0.5, label='train')
axs[0].set_title("TRAIN before 1990 and after 2000    RMS:"+str(rmsTrain)[:4]+"    R:"+str(rTrain)[:4])
axs[0].set_xlabel("True titer")
axs[0].set_ylabel("Estimated titer")
axs[0].legend()
axs[1].scatter(y_test.flatten(),np.array(tests).T.flatten(), alpha=0.5, c='orange', label= 'test')
axs[1].set_title("TEST 1990-2000    RMS:"+str(rmsTest)[:4]+"    R:"+str(rTest)[:4])
axs[1].set_xlabel("True titer")
axs[1].set_ylabel("Estimated titer")
#plt.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
axs[1].legend()
_ = plt.suptitle("Date split - Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)

2.10 Importance scores

These scores may be more relevant for the immune properties. It is interesting to see that, for different sera, different positions come up with high importance scores. Now there is no train-test split; we use all the data.

Code
# train on all
X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()

model = GradientBoostingRegressor()
#model = LinearRegression()

nTiter = df.iloc[:,2:].values.shape[1]
importance_blosum_titer = []
for i in range(nTiter):
    model.fit(X, y[:,i])
    importance_blosum_titer.append(model.feature_importances_)
importance_blosum_titer = np.array(importance_blosum_titer)
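
From this array, the single most influential position for each serum can be read off directly; a minimal sketch:

Code
# most influential NA position per serum, using the array built above
serum_names = df.iloc[:,2:].columns
for serum, pos in zip(serum_names, np.argmax(importance_blosum_titer, axis=1)):
    print(serum, pos)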
Code
fig, ax = plt.subplots(1,1,figsize=(16,8))
names = df.iloc[:,2:].columns
img = ax.imshow((importance_blosum_titer+1e-7),aspect='auto',interpolation='nearest')
plt.title("")
plt.xlabel("NA amino acid position")
plt.ylabel("Titer ID")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
_ =fig.colorbar(img,label="importance")

We can get a cumulative top list by averaging the importance scores for the 24 titer predictions.

Code
x = importance_blosum_titer.mean(axis=0)
srt = np.argsort(x)[::-1]
dftop_blosum_titer = pd.DataFrame(index=srt.astype(int), data=x[srt],columns=['Importance']).head(20).T
dftop_blosum_titer.rename_axis(columns="Position")
Position 337 335 343 80 198 369 22 436 92 42 41 368 312 149 220 431 247 338 17 246
Importance 0.120426 0.098993 0.067251 0.050272 0.044978 0.043522 0.039964 0.039215 0.038438 0.037392 0.036661 0.03622 0.032844 0.027331 0.020773 0.018414 0.016897 0.013058 0.012213 0.011736