Personalized medicine
Some molecular markers were profiled for individuals with and without disease X.
Let us load the data acquired from this cohort study:
import pandas as pd
df = pd.read_csv('expression.csv')
df.head(20)
index | mRNA A | protein A | mRNA B | protein B | class |
---|---|---|---|---|---|
0 | 5.8 | 16.2 | 17.3 | 15.4 | healthy |
1 | 7.2 | 17.4 | 16.9 | 13.2 | healthy |
2 | 7.7 | 18.2 | 14.1 | 13.0 | healthy |
3 | 8.8 | 30.0 | 9.1 | 8.9 | healthy |
4 | 10.0 | 21.1 | 8.1 | 6.9 | healthy |
5 | 10.8 | 23.3 | 7.2 | 5.7 | healthy |
6 | 11.1 | 24.4 | 6.9 | 5.1 | healthy |
7 | 11.4 | 26.1 | 5.8 | 4.7 | healthy |
8 | 12.8 | 27.2 | 5.6 | 4.6 | healthy |
9 | 12.9 | 27.4 | 5.4 | 4.1 | healthy |
10 | 11.1 | 26.1 | 8.4 | 8.2 | ill |
11 | 11.8 | 26.3 | 8.2 | 8.0 | ill |
12 | 12.4 | 27.1 | 6.5 | 7.7 | ill |
13 | 12.9 | 28.6 | 6.4 | 6.3 | ill |
14 | 13.3 | 30.1 | 5.6 | 6.1 | ill |
15 | 13.9 | 30.3 | 5.1 | 5.8 | ill |
16 | 14.1 | 30.7 | 4.4 | 5.8 | ill |
17 | 14.2 | 31.3 | 3.9 | 4.7 | ill |
18 | 15.1 | 32.1 | 3.6 | 4.5 | ill |
19 | 18.1 | 34.9 | 2.5 | 2.1 | ill |
Let us now plot the population-wide expression of genes A and B, and of their corresponding proteins:
import matplotlib.pyplot as plt
import seaborn as sns
dfm = df.melt(id_vars='class', var_name='columns')
sns.displot(dfm, x='value', col='columns', hue="class", kind="kde", rug=True,
            col_wrap=2, fill=True, facet_kws={'sharey': True, 'sharex': True})
plt.show()
1. Can you form some hypotheses about what may explain the differences between RNA and protein expression?
Answer notes
Expression of molecules for case-control individuals:
- gene/protein A is generally more expressed in ill individuals (e.g. an oncogene in the case of cancer)
- gene/protein B is generally less expressed in ill individuals (e.g. a tumor suppressor gene in the case of cancer)
- gene/protein A appears to be slightly more discriminative of the illness condition than gene/protein B
Expression of RNA versus proteins:
- protein A levels are higher than mRNA A levels: proteins generally have a longer lifetime than the mRNA they are translated from
- yet the expression of protein B is in line with the expression of mRNA B: a good share of protein B may be migrating to tissues other than the ones profiled
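The notes above can be checked numerically. The following is a minimal sketch in plain Python, assuming the 20 records are re-typed by hand from the table shown earlier (the names `rows`, `class_means`, `ratio_a` and `ratio_b` are illustrative, not part of the original notebook). It computes the mean of each marker per class and the overall protein-to-mRNA ratio for genes A and B.

```python
from statistics import mean

# Values re-typed from the table above: (mRNA A, protein A, mRNA B, protein B, class).
rows = [
    (5.8, 16.2, 17.3, 15.4, "healthy"), (7.2, 17.4, 16.9, 13.2, "healthy"),
    (7.7, 18.2, 14.1, 13.0, "healthy"), (8.8, 30.0, 9.1, 8.9, "healthy"),
    (10.0, 21.1, 8.1, 6.9, "healthy"), (10.8, 23.3, 7.2, 5.7, "healthy"),
    (11.1, 24.4, 6.9, 5.1, "healthy"), (11.4, 26.1, 5.8, 4.7, "healthy"),
    (12.8, 27.2, 5.6, 4.6, "healthy"), (12.9, 27.4, 5.4, 4.1, "healthy"),
    (11.1, 26.1, 8.4, 8.2, "ill"), (11.8, 26.3, 8.2, 8.0, "ill"),
    (12.4, 27.1, 6.5, 7.7, "ill"), (12.9, 28.6, 6.4, 6.3, "ill"),
    (13.3, 30.1, 5.6, 6.1, "ill"), (13.9, 30.3, 5.1, 5.8, "ill"),
    (14.1, 30.7, 4.4, 5.8, "ill"), (14.2, 31.3, 3.9, 4.7, "ill"),
    (15.1, 32.1, 3.6, 4.5, "ill"), (18.1, 34.9, 2.5, 2.1, "ill"),
]
columns = ["mRNA A", "protein A", "mRNA B", "protein B"]

# Mean expression of each marker within each class.
class_means = {
    label: {col: mean(r[i] for r in rows if r[4] == label)
            for i, col in enumerate(columns)}
    for label in ("healthy", "ill")
}

def ratio(protein_idx, mrna_idx):
    """Overall protein-to-mRNA ratio across all individuals."""
    return sum(r[protein_idx] for r in rows) / sum(r[mrna_idx] for r in rows)

ratio_a = ratio(1, 0)  # protein A vs mRNA A
ratio_b = ratio(3, 2)  # protein B vs mRNA B

for label, means in class_means.items():
    print(label, {col: round(v, 2) for col, v in means.items()})
print("protein/mRNA ratio, gene A:", round(ratio_a, 2))
print("protein/mRNA ratio, gene B:", round(ratio_b, 2))
```

A ratio well above 1 for gene A and close to 1 for gene B is what the second group of notes describes. The means are only a summary; the density plots remain the better view of how much the two classes overlap.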
Let us now consider data science techniques! We learn the following decision tree from the given data:
from sklearn import tree
X, y = df.drop('class', axis=1), df['class']
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
plt.figure(figsize=(11, 9))
tree.plot_tree(clf,feature_names=X.columns.values.tolist(),class_names=['healthy','ill'],impurity=False,filled=False)
plt.show()
2. A friend of ours was profiled, yielding the following record: (mRNA-A=11, protein-A=18, mRNA-B=5.5, protein-B=5.4)
Provide your guess on whether our friend tested positive for disease X.
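# after your guess, discover the result by running this code
# (a notebook cell displays only the value of its last expression;
#  run each line in its own cell to see both predictions)
clf.predict([[11, 18, 5.5, 5.4]])[0]  # our friend's record
clf.predict([[15, 21, 4, 3]])[0]  # another record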
'ill'
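For some intuition about which markers a tree is likely to split on, one can ask how well a single threshold on each marker separates the two classes. The sketch below is a simplification written for illustration: it scores thresholds by plain accuracy, whereas scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default and splits recursively, so it is not the library's algorithm. The data is re-typed from the table above, and `best_stump_accuracy` is a hypothetical helper name.

```python
# Same 20 records as in the table above: (mRNA A, protein A, mRNA B, protein B, class).
rows = [
    (5.8, 16.2, 17.3, 15.4, "healthy"), (7.2, 17.4, 16.9, 13.2, "healthy"),
    (7.7, 18.2, 14.1, 13.0, "healthy"), (8.8, 30.0, 9.1, 8.9, "healthy"),
    (10.0, 21.1, 8.1, 6.9, "healthy"), (10.8, 23.3, 7.2, 5.7, "healthy"),
    (11.1, 24.4, 6.9, 5.1, "healthy"), (11.4, 26.1, 5.8, 4.7, "healthy"),
    (12.8, 27.2, 5.6, 4.6, "healthy"), (12.9, 27.4, 5.4, 4.1, "healthy"),
    (11.1, 26.1, 8.4, 8.2, "ill"), (11.8, 26.3, 8.2, 8.0, "ill"),
    (12.4, 27.1, 6.5, 7.7, "ill"), (12.9, 28.6, 6.4, 6.3, "ill"),
    (13.3, 30.1, 5.6, 6.1, "ill"), (13.9, 30.3, 5.1, 5.8, "ill"),
    (14.1, 30.7, 4.4, 5.8, "ill"), (14.2, 31.3, 3.9, 4.7, "ill"),
    (15.1, 32.1, 3.6, 4.5, "ill"), (18.1, 34.9, 2.5, 2.1, "ill"),
]
columns = ["mRNA A", "protein A", "mRNA B", "protein B"]
labels = [r[4] for r in rows]

def best_stump_accuracy(values, labels):
    """Best accuracy of a one-threshold rule on a single marker, in either direction."""
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = 0.0
    for t in thresholds:
        # Rule: predict "ill" when the value lies above the threshold.
        hits = sum((v > t) == (lab == "ill") for v, lab in zip(values, labels))
        accuracy = hits / len(values)
        # The mirrored rule (predict "ill" below the threshold) is right
        # exactly where this rule is wrong.
        best = max(best, accuracy, 1 - accuracy)
    return best

accuracies = {
    col: best_stump_accuracy([r[i] for r in rows], labels)
    for i, col in enumerate(columns)
}
ranking = sorted(accuracies, key=accuracies.get, reverse=True)

for col in ranking:
    print(f"{col}: {accuracies[col]:.2f}")
```

On this small cohort the two A markers come out ahead of the two B markers, in line with the earlier note that gene/protein A is slightly more discriminative. With only 20 individuals, differences of one or two records should not be over-interpreted.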
Genomics, epigenomics, and glycoproteomics
The gravity of disease X varies considerably across the 10 diseased individuals in the monitored population.
Let us now explore the role of gene C in the gravity of disease X. What we know regarding gene C:
- mutation C1 is associated with 30-60% malformed C proteins when it occurs in one allele, and with 70-100% malformed C proteins when it occurs in both alleles;
- lack/excess of glycosylation of protein C can be further associated with protein malformations;
- the level of methylation in the promoter region of gene C varies widely across individuals.
Let us load the data acquired from this second cohort study:
import pandas as pd
df = pd.read_csv('omics.csv')
df.head(20)
index | mutation C1 | glycosylation C | methylation C (0-1) | gravity (1-5) |
---|---|---|---|---|
0 | none | lack | 0.01 | 1 |
1 | one allele | normal | 0.10 | 1 |
2 | none | normal | 0.01 | 2 |
3 | one allele | lack | 0.20 | 3 |
4 | both alleles | excessive | 0.30 | 3 |
5 | one allele | excessive | 0.20 | 3 |
6 | one allele | lack | 0.40 | 4 |
7 | both alleles | excessive | 0.60 | 4 |
8 | both alleles | excessive | 0.40 | 5 |
9 | both alleles | excessive | 0.30 | 5 |
3. Let us now see how the collected markers are correlated with disease gravity. Any guess?
sns.lmplot(data=df, x='gravity (1-5)', y='methylation C (0-1)', line_kws={'color': 'g'})
plt.show()
fig, ax = plt.subplots(1,2,figsize=(10,3))
sns.histplot(data=df, x="gravity (1-5)", hue="mutation C1", kde=True, bins=5, binrange=(0.5,5.5), ax=ax[0])
sns.histplot(data=df, x="gravity (1-5)", hue="glycosylation C", kde=True, bins=5, binrange=(0.5,5.5), ax=ax[1])
plt.show()
Answer notes:
- Genomics: mutated C alleles are associated with increased gravity (e.g. a hampered tumor suppressor gene unable to regulate proliferation in cancer)
- Epigenomics: higher promoter methylation (associated with lower expression) is correlated with higher gravity (e.g. cancer with C being a tumor suppressor/DNA repair gene)
- Glycoproteomics: excessive glycosylation is also associated with disease gravity (e.g. by negatively impacting the final structure of protein C, hence hampering its function)
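These associations can also be put into numbers. The sketch below is a rough check in plain Python, assuming the 10 records are re-typed from the table above; `pearson` and `mean_gravity_by` are illustrative helper names. It computes the Pearson correlation between promoter methylation and gravity, and the mean gravity per mutation status and per glycosylation status.

```python
from math import sqrt
from statistics import mean

# Values re-typed from the table above:
# (mutation C1, glycosylation C, methylation C, gravity).
records = [
    ("none", "lack", 0.01, 1),
    ("one allele", "normal", 0.10, 1),
    ("none", "normal", 0.01, 2),
    ("one allele", "lack", 0.20, 3),
    ("both alleles", "excessive", 0.30, 3),
    ("one allele", "excessive", 0.20, 3),
    ("one allele", "lack", 0.40, 4),
    ("both alleles", "excessive", 0.60, 4),
    ("both alleles", "excessive", 0.40, 5),
    ("both alleles", "excessive", 0.30, 5),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_gravity_by(field):
    """Mean gravity for each category of the given column (0 or 1)."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[field], []).append(rec[3])
    return {category: mean(values) for category, values in groups.items()}

r = pearson([rec[2] for rec in records], [rec[3] for rec in records])
by_mutation = mean_gravity_by(0)
by_glycosylation = mean_gravity_by(1)

print("methylation vs gravity, Pearson r:", round(r, 2))
print("mean gravity by mutation C1:", by_mutation)
print("mean gravity by glycosylation C:", by_glycosylation)
```

With 10 individuals these figures describe the sample rather than establish an effect, and gravity is an ordinal 1-5 score, so a rank-based correlation would be a reasonable alternative to Pearson's r.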
We sequenced gene C; the protein it encodes has the following amino acid sequence:
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGP
4. Using UniProt, can you identify what type of gene this is? What does the encoded protein look like?
Link: https://www.uniprot.org/
Answer, obtained with the BLAST tool in UniProt: p53 (a tumor suppressor gene that is linked to approximately 50% of cancers)
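Before pasting a sequence into a search tool, it can help to check which alphabet it uses, since that determines whether it should be searched as a nucleotide or as a protein sequence. The following is a small heuristic sketch (the function name `guess_sequence_type` is made up for this example), applied to the first residues of the sequence shown above.

```python
def guess_sequence_type(seq):
    """Heuristic: only A, C, G, T, U or N suggests a nucleotide sequence.

    A short peptide made solely of those letters would be misclassified,
    so treat the result as a hint rather than a guarantee.
    """
    letters = set(seq.upper())
    if letters <= set("ACGTUN"):
        return "nucleotide"
    return "protein"

# First residues of the sequence given above.
prefix = "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAM"
kind = guess_sequence_type(prefix)
print(kind)
```

Letters such as M, E, P, Q and S do not occur in the nucleotide alphabet, so the sequence is read as a chain of amino acids.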
Now, let us assume that malformation of protein C is only observed in cancer cells.
We want to check whether white blood cells can detect cells with malformed protein C and kill them.
This can be done by marking malformed C (whether driven by mutations or glycosylation).
Yet, to determine whether gene C is eligible for successful marking, we first need to check whether the encoded protein is able to exit the cell and stay at its surface.
5. Knowing C is CD47, check the odds of finding the protein CD47 at the surface of the target cells using UniProt.
Answer: High. CD47 is generally observed at the cell surface, hence it is a good candidate for marking/immunotherapy.
CD47 is an oncogene, aiding cell proliferation.
Link: https://www.uniprot.org/uniprotkb/Q08722/entry#subcellular_location