Flu NA data exploration

Simple machine learning methods for genotype to phenotype prediction

Author
Affiliation

Istvan Csabai

ELTE Eötvös Loránd University

Published

June 8, 2023

Short summary
In this notebook, we explore the relation between mutations and the antigenic properties of the flu NA protein. In collaboration with Ron Fouchier (EMC), Daniel Remondini (UNIBO), and Francesco Durazzi (UNIBO)

1 Data

1.1 The titer table

Titer data for 301 strains measured against 25 sera. The dates of the flu strains are encoded in the strain names.

Code
# imports used throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.manifold import MDS
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

# df = pd.read_json('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8.ace')
df = pd.read_csv('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_dist.csv')
df = df.rename(columns={"Unnamed: 0":"strain"})
df
strain F12050 F12052 F12021 F12026 F12027 F12029 F12031 F12034 F12035 ... F12048 F13039 F13041 F13045 F14004 F20028 F20030 F20031 F20034 F20035
0 M1/57 1.574416 3.402557 2.898778 1.942968 7.229467 7.093932 7.939687 9.517563 10.359909 ... 9.615901 12.123658 12.828431 12.156852 13.593330 11.677755 14.704255 15.289147 15.123345 16.697156
1 B1/68 2.600630 1.430592 3.880875 2.615082 6.864780 6.216929 6.578444 7.853448 7.738927 ... 8.441434 9.317535 10.502744 9.998033 11.486862 9.712056 12.089350 12.497773 12.437677 13.903903
2 BI/16190/68 3.011607 1.195503 4.316032 2.675801 6.538964 5.848507 6.188585 7.434459 7.276061 ... 8.140292 8.973482 10.118619 9.659428 11.182185 9.304252 11.655434 12.054789 12.000709 13.458979
3 BI/21793/72 7.559219 4.792659 8.788941 5.179885 1.874458 1.322527 1.207710 2.748615 5.021993 ... 7.520932 9.757911 9.416751 9.660916 11.522350 7.563519 10.007944 10.281167 10.315149 11.478862
4 BI/1761/76 7.689574 4.992261 8.939485 5.262361 1.545108 1.211213 1.464863 2.929424 5.314439 ... 7.522217 9.992182 9.556238 9.789711 11.645269 7.641819 10.178269 10.485362 10.496395 11.672114
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
296 rec A/Hong Kong/2671/2019 15.341979 12.378078 16.883226 13.812782 10.841371 10.086912 10.042302 9.007271 6.301289 ... 8.565258 5.863700 4.496721 6.152295 6.982223 5.022823 1.221563 1.358903 0.831859 1.841209
297 rec A/SouthAustralia/34/2019 15.753855 12.787633 17.182149 14.297536 11.494104 10.685924 10.389738 9.263233 6.418597 ... 9.602043 6.270547 5.450944 7.065334 7.865231 6.094661 2.147825 0.836129 1.483691 0.680975
298 rec A/Netherlands/16/2020 15.090016 12.109388 16.643504 13.518047 10.459409 9.719283 9.698070 8.667638 6.036871 ... 8.283305 5.851700 4.367475 6.031341 6.947526 4.717432 1.189683 1.515029 0.981041 2.132923
299 rec A/Netherlands/122/2020 16.855485 13.849411 18.172795 15.362140 12.403461 11.611856 11.043661 9.755267 7.122762 ... 11.133307 7.770089 7.105910 8.720389 9.514380 7.596464 3.792909 2.257337 3.140030 1.052983
300 rec A/Cambodia/e0826360/2020 16.228734 13.254948 17.613645 14.776721 11.951957 11.143226 10.745410 9.561260 6.756559 ... 10.239361 6.802866 6.088806 7.695222 8.460796 6.726722 2.793297 1.354244 2.126825 0.184532

301 rows × 25 columns

Code
# sanity check: are any of the 301 titer rows exact duplicates?
df.iloc[:,1:].shape, df.iloc[:,1:].drop_duplicates().shape

1.2 Amino acid sequences

Aligned neuraminidase (NA) sequences for the 301 flu strains.

  • BI/16190/68 is one of the earliest strains, so we may take it as the reference (the same strain was used in the HA study)
Code
fastaNames = ! grep ">" ../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta | sed 's/>//g'
tableNames = df['strain']
#tableNames[:5]
set(fastaNames)==set(tableNames.values) # check if both data sources have the same samples
Code
from Bio import SeqIO

ref_id = 'BI/16190/68'

fasta_seq = []
fasta_id = []

fnam = "../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta"

fasta_sequences = SeqIO.parse(open(fnam),'fasta')       
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    fasta_seq.append(sequence)
    fasta_id.append(name)
    if name==ref_id:
        ref_seq1nt = sequence

fasta_df = pd.DataFrame(np.column_stack([fasta_id,fasta_seq]),columns=['strain','seq'])
Code
# SeqIO truncates record ids at the first whitespace, so restore the full strain names
fasta_df['strain'] = fastaNames
fasta_df
strain seq
0 AK/4/93 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECNSPP...
1 AM/1609/77 MNPNQKIITIGSVSLTIATVCFLMQIAILVTTVTLHFKQYECDSPP...
2 AM/4112/92 MNPNQKIITIGSVSLTIATICFLMQITILVTTVTLHFKQYECNSPP...
3 AT/211/89 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
4 AT/3572/88 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
... ... ...
296 rec A/Switzerland/8060/2017 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
297 rec A/Switzerland/9715293/2013 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
298 wt A/Netherlands/2249/2013 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
299 wt A/Perth/16/2009 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
300 wt A/Victoria/361/2011 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...

301 rows × 2 columns

Code
fasta_df.iloc[:,1:].shape, fasta_df.iloc[:,1:].drop_duplicates().shape
Code
df = pd.merge(fasta_df,df,on="strain")
df.head(3)
Code
# the strain name ends in a 2- or 4-digit year; expand two-digit years
# (values below 23 belong to the 2000s, remaining small values to the 1900s)
years = df['strain'].str.split("/").str[-1].values.astype(int)
years = np.array([ y+(y<2000)*1900 for y in [y+(y<23)*2000 for y in years ]], dtype=int)
years
Code
dfy = df.copy()
#dfy.iloc[:,2:-1] = dfy.iloc[:,2:-1].reindex(sorted(dfy.iloc[:,2:-1].columns), axis=1)
dfy['year'] = years
dfy = dfy.sort_values('year')
fig, ax = plt.subplots(1,1,figsize=(16,8))
names = dfy.iloc[:,2:-1].columns
strains = dfy['strain'].values
yy = dfy.year.values 
img = ax.imshow(dfy.iloc[:,2:-1].T,aspect='auto',interpolation='nearest')
plt.title("All titer values")
plt.xlabel("strains ordered by years")
plt.ylabel("sera")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
#ax.set_xticks(range(len(strains)))
#ax.set_xticklabels(strains,rotation=45,size=3)
ax.set_xticks(range(len(yy)))
ax.locator_params(nbins=10, axis='x')
ax.set_xticklabels(yy,rotation=45)
_ =fig.colorbar(img,label="measured titer")

1.3 Remove duplicates

Code
print("Number of all strains:",len(df))
df = df.drop_duplicates(subset=['seq'])
print("Number strains after removing duplicates:",len(df))
Number of all strains: 301
Number strains after removing duplicates: 203
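
How do the 301 strains collapse into 203 unique sequences? A quick sketch (using fasta_df, which still holds all 301 strains, so it can be run even after the drop above):

Code
# Sketch: sizes of the groups of strains that share an identical NA sequence
dup_sizes = fasta_df['seq'].value_counts()
print((dup_sizes > 1).sum(), "sequences are shared by more than one strain")
print(dup_sizes[dup_sizes > 1].sum(), "strains belong to such groups")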
Code
# recompute the years for the deduplicated strain set
years = df['strain'].str.split("/").str[-1].values.astype(int)
years = np.array([ y+(y<2000)*1900 for y in [y+(y<23)*2000 for y in years ]], dtype=int)

1.4 MDS projection of titer data

There are 25 measured values for each strain. These 25-dimensional data vectors are projected down to 3D with multidimensional scaling (MDS). This method tries to preserve the topological information of the original data. Please note that re-running MDS on the same data may yield a different projection depending on the random_state seed value.

Code
embedding = MDS(n_components=3, random_state=42)
#Xmds_transformed = embedding.fit_transform(Xmds[:,np.argsort(model2.feature_importances_)[::-1][:200]])
titer_mds = embedding.fit_transform(df.iloc[:,2:].values)
titer_mds.shape
# Wall time: 57.6 s
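
The seed dependence is easy to demonstrate (a sketch; the seed values are arbitrary): two runs reach comparable stress, but the resulting point clouds differ by a rotation or reflection.

Code
# Sketch: two MDS runs with different seeds give similar stress values
# but differently oriented embeddings
emb_a = MDS(n_components=3, random_state=42).fit(df.iloc[:,2:].values)
emb_b = MDS(n_components=3, random_state=7).fit(df.iloc[:,2:].values)
print(emb_a.stress_, emb_b.stress_)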
Code
%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0].set_xlabel("titer MDS0")
axs[0].set_ylabel("titer MDS1")
#axs[0].set_title('titer01')

axs[1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('titer02')
axs[1].set_xlabel("titer MDS0")
axs[1].set_ylabel("titer MDS2")

im = axs[2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('titer12')
axs[2].set_xlabel("titer MDS1")
axs[2].set_ylabel("titer MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("3D MDS projections of the titer data")

Try rotating to find an optimal view in 3D!

Code
%matplotlib notebook
Code
fig = px.scatter_3d(x=titer_mds[:,0],y=titer_mds[:,1],z=titer_mds[:,2],
              color=years)
fig.show()

2 Blosum representation of amino acids and mutations

In bioinformatics, a BLOSUM (BLOcks SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices score alignments between evolutionarily divergent protein sequences and are based on local alignments. There are different versions; we will use BLOSUM62.
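
For reference, newer Biopython versions (≥ 1.75) ship the same matrix, so the hand-typed table below can be cross-checked (a sketch):

Code
# Sketch: load BLOSUM62 from Biopython and spot-check two entries
from Bio.Align import substitution_matrices

bl62 = substitution_matrices.load("BLOSUM62")
print(bl62["W", "W"], bl62["W", "A"])  # expected: 11.0 and -3.0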

Code
# https://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt

#   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
blosum62_raw = np.array ([
['A',  4, -1, -2, -2,  0, -1, -1,  0, -2, -1, -1, -1, -1, -2, -1,  1,  0, -3, -2,  0, -2, -1,  0], 
['R', -1,  5,  0, -2, -3,  1,  0, -2,  0, -3, -2,  2, -1, -3, -2, -1, -1, -3, -2, -3, -1,  0, -1], 
['N', -2,  0,  6,  1, -3,  0,  0,  0,  1, -3, -3,  0, -2, -3, -2,  1,  0, -4, -2, -3,  3,  0, -1], 
['D', -2, -2,  1,  6, -3,  0,  2, -1, -1, -3, -4, -1, -3, -3, -1,  0, -1, -4, -3, -3,  4,  1, -1], 
['C',  0, -3, -3, -3,  9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1, -3, -3, -2], 
['Q', -1,  1,  0,  0, -3,  5,  2, -2,  0, -3, -2,  1,  0, -3, -1,  0, -1, -2, -1, -2,  0,  3, -1], 
['E', -1,  0,  0,  2, -4,  2,  5, -2,  0, -3, -3,  1, -2, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['G',  0, -2,  0, -1, -3, -2, -2,  6, -2, -4, -4, -2, -3, -3, -2,  0, -2, -2, -3, -3, -1, -2, -1], 
['H', -2,  0,  1, -1, -3,  0,  0, -2,  8, -3, -3, -1, -2, -1, -2, -1, -2, -2,  2, -3,  0,  0, -1], 
['I', -1, -3, -3, -3, -1, -3, -3, -4, -3,  4,  2, -3,  1,  0, -3, -2, -1, -3, -1,  3, -3, -3, -1], 
['L', -1, -2, -3, -4, -1, -2, -3, -4, -3,  2,  4, -2,  2,  0, -3, -2, -1, -2, -1,  1, -4, -3, -1], 
['K', -1,  2,  0, -1, -3,  1,  1, -2, -1, -3, -2,  5, -1, -3, -1,  0, -1, -3, -2, -2,  0,  1, -1], 
['M', -1, -1, -2, -3, -1,  0, -2, -3, -2,  1,  2, -1,  5,  0, -2, -1, -1, -1, -1,  1, -3, -1, -1], 
['F', -2, -3, -3, -3, -2, -3, -3, -3, -1,  0,  0, -3,  0,  6, -4, -2, -2,  1,  3, -1, -3, -3, -1], 
['P', -1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4,  7, -1, -1, -4, -3, -2, -2, -1, -2], 
['S',  1, -1,  1,  0, -1,  0,  0,  0, -1, -2, -2,  0, -1, -2, -1,  4,  1, -3, -2, -2,  0,  0,  0], 
['T',  0, -1,  0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1,  1,  5, -2, -2,  0, -1, -1,  0], 
['W', -3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1,  1, -4, -3, -2, 11,  2, -3, -4, -3, -2], 
['Y', -2, -2, -2, -3, -2, -1, -2, -3,  2, -1, -1, -2, -1,  3, -3, -2, -2,  2,  7, -1, -3, -2, -1], 
['V',  0, -3, -3, -3, -1, -2, -2, -3, -3,  3,  1, -2,  1, -1, -2, -2,  0, -3, -1,  4, -3, -2, -1], 
['B', -2, -1,  3,  4, -3,  0,  1, -1,  0, -3, -4,  0, -3, -3, -2,  0, -1, -4, -3, -3,  4,  1, -1], 
['Z', -1,  0,  0,  1, -3,  3,  4, -2,  0, -3, -3,  1, -1, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['X',  0, -1, -1, -1, -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2,  0,  0, -2, -1, -1, -1, -1, -1]
])

blosum62 = pd.DataFrame(blosum62_raw).rename(columns={0:'AminoAcid'})
cols = blosum62.columns.drop('AminoAcid')
blosum62[cols] = blosum62[cols].apply(pd.to_numeric, errors='coerce')

# gaps ('-') are scored as minus the diagonal (self-match) value of the
# aligned amino acid; the '-' vs '-' entry is set to 0
gap_scores = list(-np.diag(blosum62_raw[:,1:].astype(int)))
blosum62['-'] = gap_scores
blosum62.loc[len(blosum62)] = ['-'] + gap_scores + [0]

aminoAcids = list(blosum62.AminoAcid.values)
blosum62_matrix = blosum62.iloc[:,1:].values.astype(float)
#
blosum62.columns = np.concatenate([['AminoAcids'],aminoAcids])

blosum62
AminoAcids A R N D C Q E G H ... P S T W Y V B Z X -
0 A 4 -1 -2 -2 0 -1 -1 0 -2 ... -1 1 0 -3 -2 0 -2 -1 0 -4
1 R -1 5 0 -2 -3 1 0 -2 0 ... -2 -1 -1 -3 -2 -3 -1 0 -1 -5
2 N -2 0 6 1 -3 0 0 0 1 ... -2 1 0 -4 -2 -3 3 0 -1 -6
3 D -2 -2 1 6 -3 0 2 -1 -1 ... -1 0 -1 -4 -3 -3 4 1 -1 -6
4 C 0 -3 -3 -3 9 -3 -4 -3 -3 ... -3 -1 -1 -2 -2 -1 -3 -3 -2 -9
5 Q -1 1 0 0 -3 5 2 -2 0 ... -1 0 -1 -2 -1 -2 0 3 -1 -5
6 E -1 0 0 2 -4 2 5 -2 0 ... -1 0 -1 -3 -2 -2 1 4 -1 -5
7 G 0 -2 0 -1 -3 -2 -2 6 -2 ... -2 0 -2 -2 -3 -3 -1 -2 -1 -6
8 H -2 0 1 -1 -3 0 0 -2 8 ... -2 -1 -2 -2 2 -3 0 0 -1 -8
9 I -1 -3 -3 -3 -1 -3 -3 -4 -3 ... -3 -2 -1 -3 -1 3 -3 -3 -1 -4
10 L -1 -2 -3 -4 -1 -2 -3 -4 -3 ... -3 -2 -1 -2 -1 1 -4 -3 -1 -4
11 K -1 2 0 -1 -3 1 1 -2 -1 ... -1 0 -1 -3 -2 -2 0 1 -1 -5
12 M -1 -1 -2 -3 -1 0 -2 -3 -2 ... -2 -1 -1 -1 -1 1 -3 -1 -1 -5
13 F -2 -3 -3 -3 -2 -3 -3 -3 -1 ... -4 -2 -2 1 3 -1 -3 -3 -1 -6
14 P -1 -2 -2 -1 -3 -1 -1 -2 -2 ... 7 -1 -1 -4 -3 -2 -2 -1 -2 -7
15 S 1 -1 1 0 -1 0 0 0 -1 ... -1 4 1 -3 -2 -2 0 0 0 -4
16 T 0 -1 0 -1 -1 -1 -1 -2 -2 ... -1 1 5 -2 -2 0 -1 -1 0 -5
17 W -3 -3 -4 -4 -2 -2 -3 -2 -2 ... -4 -3 -2 11 2 -3 -4 -3 -2 -11
18 Y -2 -2 -2 -3 -2 -1 -2 -3 2 ... -3 -2 -2 2 7 -1 -3 -2 -1 -7
19 V 0 -3 -3 -3 -1 -2 -2 -3 -3 ... -2 -2 0 -3 -1 4 -3 -2 -1 -4
20 B -2 -1 3 4 -3 0 1 -1 0 ... -2 0 -1 -4 -3 -3 4 1 -1 -4
21 Z -1 0 0 1 -3 3 4 -2 0 ... -1 0 -1 -3 -2 -2 1 4 -1 -4
22 X 0 -1 -1 -1 -2 -1 -1 -1 -1 ... -2 0 0 -2 -1 -1 -1 -1 -1 1
23 - -4 -5 -6 -6 -9 -5 -5 -6 -8 ... -7 -4 -5 -11 -7 -4 -4 -4 1 0

24 rows × 25 columns

2.1 BLOSUM62 substitution w.r.t. the reference sequence

BI/16190/68 is one of the earliest strains, so we take it as the reference (the same was used in the HA study). We calculate the BLOSUM62 substitution vectors for the mutations in each strain with respect to this reference.

Please note that the FASTA file contained spurious extra single-amino-acid deletions at the last positions; these were removed.

Intra-sequence deletions are encoded with minus the diagonal (self-substitution) BLOSUM62 score of the matching reference amino acid.
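
A toy example of the encoding (a sketch with made-up three-residue sequences, using the blosum62_matrix and aminoAcids objects built above):

Code
# Sketch: encode a toy query against a toy reference; an identical residue gets
# the (positive) diagonal score, a deletion gets minus the reference diagonal
toy_ref = "MNP"
toy_seq = "MK-"
vec = [ blosum62_matrix[aminoAcids.index(a),aminoAcids.index(r)]
        for a, r in zip(toy_seq, toy_ref) ]
print(vec)  # [5.0, 0.0, -7.0]: M->M = +5, N->K = 0, deleted P -> -7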

Code
dummy = []
for x in df.iterrows():
    idx1 = x[1]['strain']
    #print(idx1)
    # there are sporadic lower case letters and probably false indels
    seq1 = x[1]['seq'].upper()
    if len(seq1)>len(ref_seq1nt):
        seq1 = seq1.replace('-','')
#    else: # spurious single deletions are replaced with ref
#        ii = seq1.find('-')
#        if ii>=0:
#            s = list(seq1)
#            s[ii] = ref_seq1nt[ii]
#            seq1 = ''.join(s)
    dummy.append([idx1,
                  [ blosum62_matrix[aminoAcids.index(seq1[i]),aminoAcids.index(ref_seq1nt[i])]  for i in range(len(seq1)) ]])
        
ref_blosum62 = pd.DataFrame(dummy,columns=['strain','blosum62'])

#CPU times: user 128 ms, sys: 15.7 ms, total: 143 ms
Code
embedding = MDS(n_components=3, random_state=42)
blosum_mds= embedding.fit_transform(np.row_stack(ref_blosum62['blosum62'].values))
blosum_mds.shape
#1min 16s

The sequence projections are similar to, but not identical with, the titer MDS projections. This is expected: beyond rotations, mirror images, and top-bottom flips, other small differences can occur. On the other hand, these projections also correlate nicely with the date, perhaps even better than the titer projections. This makes sense, as sequences evolve steadily with time while antigenic changes do not always follow.
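
One way to quantify the similarity (a sketch, assuming SciPy is available): Procrustes analysis aligns the two point clouds up to translation, rotation, reflection and scaling, and returns a disparity score.

Code
# Sketch: disparity between the titer and sequence embeddings after optimal alignment
from scipy.spatial import procrustes

mtx1, mtx2, disparity = procrustes(titer_mds, blosum_mds)
print(disparity)  # smaller values mean more similar configurations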

Code
%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(blosum_mds[:,0],blosum_mds[:,1],c=years, cmap=cmap, s=siz)
#axs[0].set_title('blosum01')
axs[0].set_xlabel("BLOSUM62 MDS0")
axs[0].set_ylabel("BLOSUM62 MDS1")

axs[1].scatter(blosum_mds[:,0],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('blosum02')
axs[1].set_xlabel("BLOSUM62 MDS0")
axs[1].set_ylabel("BLOSUM62 MDS2")


im = axs[2].scatter(blosum_mds[:,1],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('blosum12')
axs[2].set_xlabel("BLOSUM62 MDS1")
axs[2].set_ylabel("BLOSUM62 MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("Projections of the BLOSUM62 encoded sequence data")

Try rotating to find an optimal view in 3D!

Code
fig = px.scatter_3d(x=blosum_mds[:,0],y=blosum_mds[:,1],z=blosum_mds[:,2],
              color=years)
fig.show()

2.2 Predict date from titer

Up to now we used unsupervised projections. Now we perform basic machine learning regressions: part of the data is used as a training set to fit the relations, and the model is then evaluated on an independent test set. The train-test split can be done randomly or based on dates. The first option makes prediction easier, since similar data can be present in both the training and the test set. Extrapolating future properties would be the most interesting; interpolation is easier.

First, let us check whether we can estimate the date from the titer values. This may not have any direct use, but it informs us about the relation between the properties and, in a sense, measures the information content.
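
Since the same RMS and MAD evaluation recurs in every experiment below, a small helper could wrap it (a sketch; the name report_errors is ours):

Code
# Sketch of a helper computing the error measures used throughout this notebook
def report_errors(y_true, y_pred, label=""):
    rms = np.sqrt(np.mean((y_pred - y_true)**2))  # root mean square error
    mad = np.mean(np.abs(y_pred - y_true))        # mean absolute deviation
    print(label, 'RMS:', rms, ' MAD:', mad)
    return rms, mad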

Code
%matplotlib inline
X = df.iloc[:,2:].values 
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.39037686470911254
Test RMS: 3.866197790653565  MAD: 2.5511748530778915

As expected, it is hard to extrapolate into the future, since the training set contains no future dates. The exact values are off, but a trend is visible.

Code

X = df.iloc[:,2:].values 
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.3762572373169865
Test RMS: 17.695494435932396  MAD: 15.699180703931628

2.3 Predict date from amino acids

The same as before, but now the model's inputs are not the titer values but the BLOSUM62-encoded sequences.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Blosum62")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.2968988425143859
Test RMS: 1.8825734305831914  MAD: 0.9007159532010938

2.4 Important positions

The model allows us to list the AA positions that were most influential in estimating the date of the strain.

Code

#  - train on all data
plt.figure(figsize=(12,6))
model.fit(X, years)
plt.bar(x=range(len(model.feature_importances_)),height=model.feature_importances_)
plt.plot(model.feature_importances_,'o')
plt.xlabel("NA - amino acid position")
plt.ylabel("Feature importance")
Text(0, 0.5, 'Feature importance')

Code
top20 = np.argsort(model.feature_importances_)[::-1]
dftop_date = pd.DataFrame(index=top20.astype(int), data=model.feature_importances_[top20],columns=['Importance']).head(20).T
dftop_date.rename_axis(columns="Position")
Position 142 380 140 148 368 248 42 214 400 337 401 366 371 146 463 345 434 220 198 389
Importance 0.33694 0.172388 0.103847 0.071032 0.060787 0.03746 0.035222 0.02284 0.017074 0.015146 0.012363 0.012339 0.010792 0.008597 0.008354 0.007542 0.007469 0.007191 0.005671 0.005365
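
The positions above are 0-based indices into the alignment columns, which may differ from standard NA numbering. To see what actually varies at the top position, we can tabulate the residues there (a sketch, using position 142 from the table above):

Code
# Sketch: residues occurring at one influential alignment column
pos = 142  # 0-based alignment index of the top feature above
print(df['seq'].str.upper().str[pos].value_counts())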

2.5 Extrapolate future

As with the titers, it is not possible to guess future dates, but the trend is even better correlated.

Code


X = np.row_stack(ref_blosum62['blosum62'].values)
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5,c='orange', label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.3508946620338886
Test RMS: 15.45656849472422  MAD: 13.64992307797976

2.6 Interpolate

Now the test set is not the last decades but the strains dated between 1990 and 2000. Interpolation is pretty good in this range.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange',label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.28871900588519556
Test RMS: 2.0734810260544148  MAD: 1.6127714884907578

2.7 Predict titer MDS 3D map coordinates from BLOSUM62 (not that important)

We use the BLOSUM62-encoded sequences as input and try to estimate the 3 MDS titer coordinates. The task is not easy: the model must also mimic the MDS projection, which is a complex nonlinear transformation.

For the training set the correlation is very good, but that may be the result of overfitting; on the test set the errors are larger.

Code
X = np.row_stack(ref_blosum62['blosum62'].values)
y = titer_mds.copy()
X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []

for i in range(3):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    print('Train RMS:',RMS_train)
    print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)

fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
for i in range(3):
    axs[i].scatter(y_train[:,i],trains[i], alpha=1, label='train')
    axs[i].scatter(y_test[:,i],tests[i], alpha=0.5, c='orange', label= 'test')
    axs[i].set_title("Titer MDS "+str(i)+" from Blosum62")
_ = plt.legend()
Train RMS: 0.8467581574082075
Test RMS: 2.3886533922034943  MAD: 1.7520059195020274
Train RMS: 1.5348564419999207
Test RMS: 3.141217283960353  MAD: 2.3909111793389775
Train RMS: 1.490509233475044
Test RMS: 3.3366229858250156  MAD: 2.5586576079737684
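
To check how much of the good training fit is overfitting, a cross-validated error can be computed for a single coordinate (a sketch; 5-fold CV with default hyperparameters):

Code
# Sketch: 5-fold cross-validated RMS for titer MDS coordinate 0
from sklearn.model_selection import cross_val_score

cv_rms = -cross_val_score(GradientBoostingRegressor(), X, y[:,0],
                          cv=5, scoring='neg_root_mean_squared_error').mean()
print(cv_rms)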

Code

%matplotlib inline
fig, axs = plt.subplots(2, 3, figsize=(16,12), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0,0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0,0].set_title('titer01')
axs[0,1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
axs[0,1].set_title('titer02')
im = axs[0,2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
axs[0,2].set_title('titer12')

axs[1,0].scatter(trains[0],trains[1],c=year_train, cmap=cmap, s=siz)
axs[1,0].scatter(tests[0],tests[1],c=year_test, cmap=cmap, s=siz)
axs[1,0].set_title('predicted titer01')

axs[1,1].scatter(trains[0],trains[2],c=year_train, cmap=cmap, s=siz)
axs[1,1].scatter(tests[0],tests[2],c=year_test, cmap=cmap, s=siz)
axs[1,1].set_title('predicted titer02')

im = axs[1,2].scatter(trains[1],trains[2],c=year_train, cmap=cmap, s=siz)
im = axs[1,2].scatter(tests[1],tests[2],c=year_test, cmap=cmap, s=siz)
axs[1,2].set_title('predicted titer12')

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')

2.7.1 Importance scores

Here we can get importance scores for each of the three MDS-coordinate models. They differ: for each coordinate, different positions appear to be important.

Code
# train on all
X = np.row_stack(ref_blosum62['blosum62'].values)
y = titer_mds.copy()

model = GradientBoostingRegressor()

nTiter = y.shape[1]
importance_blosum_MDS3D = []

for i in range(nTiter):
    model.fit(X, y[:,i])
    importance_blosum_MDS3D.append(model.feature_importances_)
importance_blosum_MDS3D = np.array(importance_blosum_MDS3D)
Code
fig, ax = plt.subplots(1,1,figsize=(16,4))
names = np.arange(len(importance_blosum_MDS3D),dtype=int).astype(str)
img = ax.imshow((importance_blosum_MDS3D+1e-7),aspect='auto',interpolation='nearest')
plt.title("")
plt.xlabel("NA amino acid position")
plt.ylabel("Titer ID")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
_ =fig.colorbar(img,label="importance")

Code
x = importance_blosum_MDS3D.mean(axis=0)
srt = np.argsort(x)[::-1]
dftop_MDS = pd.DataFrame(index=srt.astype(int), data=x[srt],columns=['Importance']).head(20).T
dftop_MDS.rename_axis(columns="Position")
Position 368 335 42 244 312 337 198 369 246 80 436 401 220 17 400 140 247 92 430 346
Importance 0.244614 0.069826 0.064236 0.061688 0.05997 0.053575 0.048406 0.037802 0.032628 0.029656 0.026239 0.020708 0.018504 0.016499 0.013726 0.011957 0.010052 0.008089 0.007759 0.00667
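
Several positions appear both here and in the date-prediction top list above; the overlap can be computed directly (a sketch, using the two top-20 tables built above):

Code
# Sketch: positions important both for date prediction and for the titer MDS map
common = set(dftop_date.columns) & set(dftop_MDS.columns)
print(sorted(common))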

2.8 Predict individual titer values from blosum - Random split

This is more interesting: can we predict the original 25 titer values (not the projected MDS) from the sequences?

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()
X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]

for i in range(nTiter):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    #RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    #MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    #print('Train RMS:',RMS_train)
    #print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)


#fig, axs = plt.subplots(6, 4, figsize=(16,6), constrained_layout=True)
names = df.iloc[:,2:].columns
plt.figure(figsize=(16,16), constrained_layout=True)
for i in range(nTiter):
    ax = plt.subplot(6, 4, i + 1)
    ax.scatter(y_train[:,i],trains[i], alpha=0.5, label='train')
    ax.scatter(y_test[:,i],tests[i], alpha=0.5, c='orange' , label= 'test')
    #rTrain = np.corrcoef(y_train[:,i],trains[i])[0,1]
    rmsTest = np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
    rTest = np.corrcoef(y_test[:,i],tests[i])[0,1]
    ax.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
    ax.legend()
_ = plt.suptitle("Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)
#plt.legend()

The correlation is not bad. Let us look at all the titer values cumulatively in one plot.

Code

rmsTest = np.sqrt(np.mean((y_test.flatten()-np.array(tests).T.flatten())**2))
#np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
rTest = np.corrcoef(y_test.flatten(),np.array(tests).T.flatten())[0,1]

rmsTrain = np.sqrt(np.mean((y_train.flatten()-np.array(trains).T.flatten())**2))
rTrain = np.corrcoef(y_train.flatten(),np.array(trains).T.flatten())[0,1]

#plt.figure(figsize=(6,6), constrained_layout=True)
fig, axs = plt.subplots(1, 2, figsize=(12,6), constrained_layout=True)

axs[0].scatter(y_train.flatten(),np.array(trains).T.flatten(), alpha=0.5, label='train')
axs[0].set_title("TRAIN  RMS:"+str(rmsTrain)[:4]+"    R:"+str(rTrain)[:4])
axs[0].set_xlabel("True titer")
axs[0].set_ylabel("Estimated titer")
axs[0].legend()
axs[1].scatter(y_test.flatten(),np.array(tests).T.flatten(), alpha=0.5, c='orange', label= 'test')
axs[1].set_title("TEST RMS:"+str(rmsTest)[:4]+"    R:"+str(rTest)[:4])
axs[1].set_xlabel("True titer")
axs[1].set_ylabel("Estimated titer")
#plt.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
axs[1].legend()
_ = plt.suptitle("Random split - Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)

2.9 Predict individual titer values from blosum - date split

Same as above, but instead of a random train-test split, the strains dated 1990-2000 form the test set and all others the training set.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()

idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = y[idx]
y_test = y[~idx]

#X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
#year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
#idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]
for i in range(nTiter):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    #RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    #MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    #print('Train RMS:',RMS_train)
    #print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)

#fig, axs = plt.subplots(6, 4, figsize=(16,6), constrained_layout=True)
names = df.iloc[:,2:].columns
plt.figure(figsize=(16,16), constrained_layout=True)
for i in range(nTiter):
    ax = plt.subplot(6, 4, i + 1)
    ax.scatter(y_train[:,i],trains[i], alpha=0.5, label='train')
    ax.scatter(y_test[:,i],tests[i], alpha=0.5, c='orange' , label= 'test')
    #rTrain = np.corrcoef(y_train[:,i],trains[i])[0,1]
    rmsTest = np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
    rTest = np.corrcoef(y_test[:,i],tests[i])[0,1]
    ax.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
    ax.legend()
_ = plt.suptitle("Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)
#plt.legend()

Code

rmsTest = np.sqrt(np.mean((y_test.flatten()-np.array(tests).T.flatten())**2))
#np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
rTest = np.corrcoef(y_test.flatten(),np.array(tests).T.flatten())[0,1]

rmsTrain = np.sqrt(np.mean((y_train.flatten()-np.array(trains).T.flatten())**2))
rTrain = np.corrcoef(y_train.flatten(),np.array(trains).T.flatten())[0,1]

#plt.figure(figsize=(6,6), constrained_layout=True)
fig, axs = plt.subplots(1, 2, figsize=(12,6), constrained_layout=True)

axs[0].scatter(y_train.flatten(),np.array(trains).T.flatten(), alpha=0.5, label='train')
axs[0].set_title("TRAIN before 1990 and after 2000    RMS:"+str(rmsTrain)[:4]+"    R:"+str(rTrain)[:4])
axs[0].set_xlabel("True titer")
axs[0].set_ylabel("Estimated titer")
axs[0].legend()
axs[1].scatter(y_test.flatten(),np.array(tests).T.flatten(), alpha=0.5, c='orange', label= 'test')
axs[1].set_title("TEST 1990-2000    RMS:"+str(rmsTest)[:4]+"    R:"+str(rTest)[:4])
axs[1].set_xlabel("True titer")
axs[1].set_ylabel("Estimated titer")
#plt.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
axs[1].legend()
_ = plt.suptitle("Date split - Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)

2.10 Importance scores

These scores may be more relevant for the immune properties. It is interesting that different positions come up with high importance scores for different sera. Here there is no train-test split; we use all the data.

Code
# train on all
X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]
importance_blosum_titer = []
for i in range(nTiter):
    model.fit(X, y[:,i])
    importance_blosum_titer.append(model.feature_importances_)
importance_blosum_titer = np.array(importance_blosum_titer)
Code
fig, ax = plt.subplots(1,1,figsize=(16,8))
names = df.iloc[:,2:].columns
img = ax.imshow((importance_blosum_titer+1e-7),aspect='auto',interpolation='nearest')
plt.title("")
plt.xlabel("NA amino acid position")
plt.ylabel("Titer ID")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
_ =fig.colorbar(img,label="importance")
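
To make the per-serum differences explicit, the single most important position for each serum can be listed (a sketch):

Code
# Sketch: the most important alignment position for each serum
top_pos = importance_blosum_titer.argmax(axis=1)
for serum, pos in zip(names, top_pos):
    print(serum, pos)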

We can get a cumulative top list by averaging the importance scores over all the titer predictions.

Code
x = importance_blosum_titer.mean(axis=0)
srt = np.argsort(x)[::-1]
dftop_blosum_titer = pd.DataFrame(index=srt.astype(int), data=x[srt],columns=['Importance']).head(20).T
dftop_blosum_titer.rename_axis(columns="Position")
Position 337 335 22 343 80 198 369 17 92 312 436 41 368 220 42 247 196 338 244 149
Importance 0.129865 0.091392 0.074904 0.058259 0.047636 0.036571 0.036069 0.035766 0.035186 0.033318 0.033074 0.033059 0.031427 0.022868 0.020601 0.015602 0.014866 0.014212 0.013396 0.012325
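
Finally, the mean importance profile of the titer models can be compared with that of the MDS-coordinate models (a sketch; a high correlation would suggest both views pick up similar positions):

Code
# Sketch: Pearson correlation between the two mean importance profiles
prof_titer = importance_blosum_titer.mean(axis=0)
prof_mds = importance_blosum_MDS3D.mean(axis=0)
print(np.corrcoef(prof_titer, prof_mds)[0,1])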