Student Thesis Projects

We are looking for motivated students (f/m/d) from bioinformatics, informatics, molecular biology, medicine, biotechnology or similar fields. We offer Master's, Bachelor's, and selected internship projects in translational bioinformatics with a focus on omics data. The specific know-how requirements vary by project but generally include good programming and data analysis skills, as well as a willingness to learn the relevant biomedical context.

If you are interested, please contact Dieter Beule.

Example Projects

GC-MS Metabolomics Quality Control and Peak Picking (Master Thesis)

Background: Gas chromatography-mass spectrometry (GC-MS) is a commonly used technique in metabolomics for the identification and quantification of the small-molecule complement (typically <1000 Da) of a biological system. Typical first-level analysis workflows consist of noise filtering, feature detection and alignment for compound identification. Data interpretation, both bioinformatic and statistical, usually depends on the details of the experimental design of the individual study and is not part of the pipeline. Within the Metabolomics Platform, a first-level analysis pipeline has been established using third-party and in-house developed open-source software. While the workflow has been used successfully in a number of small-scale studies, it still involves a substantial amount of manual interaction with the data to confirm and validate algorithmically detected features before the results are satisfactory. This is a bottleneck for processing epidemiological-size studies and also introduces undesirable operator biases.
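As a rough illustration of the first two steps named above (noise filtering followed by feature detection), the following minimal Python sketch smooths a synthetic single-ion trace with a moving average and then reports local maxima above a height threshold. The function names, the window size, the threshold and the synthetic trace are all illustrative assumptions; this is not the platform's pipeline.

```python
import math


def moving_average(signal, window=5):
    """Smooth a trace with a centered moving average (window shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def pick_peaks(signal, min_height):
    """Return indices of local maxima whose intensity is at least min_height."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] >= min_height
        and signal[i] > signal[i - 1]
        and signal[i] >= signal[i + 1]
    ]


def gaussian(x, center, width, height):
    """Idealized chromatographic peak shape (assumption for the synthetic data)."""
    return height * math.exp(-0.5 * ((x - center) / width) ** 2)


# Synthetic trace: two Gaussian peaks plus a small deterministic ripple as "noise".
trace = [
    gaussian(x, 30, 3, 100) + gaussian(x, 70, 4, 60) + 2 * math.sin(1.7 * x)
    for x in range(100)
]

smoothed = moving_average(trace, window=5)
peaks = pick_peaks(smoothed, min_height=20)
print(peaks)
```

On this synthetic trace the sketch recovers the two simulated peak apexes at indices 30 and 70. Real GC-MS data would additionally need baseline correction, deconvolution of co-eluting compounds and alignment across samples.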

Project Description: The aim of the project is to improve the workflow for GC-MS peak picking and quality control [1] by transferring and adapting deep neural networks and machine learning methods used in LC-MS data analysis [2]. The improved workflow should be applied to existing data sets (e.g. a time series from a mouse model of Alzheimer's disease). The influence of the workflow improvements on existing downstream analyses (e.g. differential regulation, pathway enrichment analysis) and on the biomedical interpretation should be explored. This project is run in collaboration with Jennifer Kirwan in the Metabolomics facility.

In this project, you will:

  • implement reproducible analysis workflows for GC-MS data
  • learn to design ML algorithms and evaluate their performance
  • perform an in-depth metabolomics data analysis

Required skills

  • software development with Python and git; some R knowledge is a plus
  • basic understanding of high-dimensional data analysis (e.g., multivariate statistics, PCA, etc.)
  • interest in machine learning and metabolomics

[1] Borgsmüller et al. WiPP: Workflow for Improved Peak Picking for Gas Chromatography-Mass Spectrometry (GC-MS) Data. Metabolites 2019 https://doi.org/10.3390/metabo9090171.

[2] Gloaguen, Y.; Kirwan, J.; Beule, D. Deep Learning assisted Peak Curation for large scale LC-MS Metabolomics. Anal. Chem. 2022. https://doi.org/10.1021/acs.analchem.1c02220

Analysis and visualization of Imaging Mass-Cytometry Data

Imaging Mass-Cytometry (IMC) is a novel method for single-cell, spatially resolved omics analysis of tissues. Existing computational tools for the analysis and visualization of such datasets require extensive manual input, are hard to integrate into high-performance computing infrastructure and automated computational pipelines, and build upon proprietary software.

In this project, you will:

  • implement reproducible analysis workflows for IMC data connecting to our in-house data management system
  • implement a web-based app for interactive visualization of IMC data building upon similar tools for single-cell genomics data
  • analyze IMC data for a range of human tissues and disease conditions, investigating the interplay between diseased tissue and surrounding immune cells in a spatially resolved manner

Required skills

  • software development with Python, Dash and git
  • basic understanding of high-dimensional data analysis (e.g., multivariate statistics, PCA, etc.)
  • experience in image analysis is a plus

Examples of Past Completed Projects

Below you will find representative examples of projects completed by past students.

Somatic Mutational Signatures (Bachelor Thesis)

Background: Somatic mutations are present in all cells of the human body and accumulate throughout life. They are the consequence of multiple mutational processes, including imperfect DNA replication, mutagen exposure, enzymatic modification of DNA and defective DNA repair. Different mutational processes generate unique combinations of mutation types, termed “Mutational Signatures”. The concept was originally introduced by Alexandrov et al. (Nature 2013; https://cancer.sanger.ac.uk/cosmic/signatures) and has since been studied and extended by many groups. It has been possible to associate some of these signatures with specific biochemical processes leading to the fixation of single nucleotide mutations in cancer cells during disease onset and progression.

Project Description: The student will consolidate and further develop the current in-house pipeline for the computation of mutational signatures from exomes and genomes. The student will test, evaluate and integrate open-source software tools for signature analysis and for filtering variants by quality and artifacts. Once the workflow is in place, it will allow exploration of the effect that somatic variant filtration may have on signature detection and strength in patients. The student will process and analyze data sets to check that somatic variant filtering algorithms do not significantly alter the relative composition of mutational signatures in patients and, where some filtration algorithms do affect signature profiles, to identify such patterns. Furthermore, clinical and research data sets will be analyzed. Specifically, two questions could/should be addressed. First, Signature 3 has been connected with germline and somatic BRCA1 and BRCA2 mutations in several cancer types. For an existing cohort of patients (cooperation with Charité CCCC, Molecular Tumor Conference), the clinical usefulness of Signature 3 analysis should be explored and compared to existing mutational and transcriptome analysis. Second, Signature 18 has been connected with neuroblastoma; however, this signature is similar to DNA degradation effects occurring in library preparation (oxoG). Methods for reliably separating the two effects should be established. Many data sets are available, including a neuroblastoma cohort (genomes and exomes, in-house as well as public).
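One simple way to check whether a filtering step changes the relative composition of signatures is to compare the per-patient exposure profiles before and after filtering. The sketch below uses cosine similarity plus per-signature shifts in relative contribution; this choice of measure, the function names and the exposure values are assumptions for illustration and are not taken from the in-house pipeline.

```python
import math


def normalize(exposures):
    """Convert absolute signature exposures to relative contributions."""
    total = sum(exposures.values())
    return {name: value / total for name, value in exposures.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two exposure profiles keyed by signature name."""
    names = sorted(set(a) | set(b))
    va = [a.get(n, 0.0) for n in names]
    vb = [b.get(n, 0.0) for n in names]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


# Hypothetical exposures for one patient (mutations attributed to each signature).
unfiltered = {"Signature 1": 120, "Signature 3": 80, "Signature 18": 40}
filtered = {"Signature 1": 110, "Signature 3": 78, "Signature 18": 12}

before, after = normalize(unfiltered), normalize(filtered)
similarity = cosine_similarity(before, after)

# Per-signature change in relative contribution caused by the filter.
shift = {name: after[name] - before[name] for name in before}
most_reduced = min(shift, key=shift.get)

print(f"cosine similarity: {similarity:.3f}")
print(f"most reduced signature: {most_reduced}")
```

In this made-up example the filter mainly removes mutations attributed to Signature 18, which is the kind of pattern one would look for if oxoG artifacts were being filtered out. A similarity close to 1 across a cohort would suggest that a filter leaves the signature composition largely intact.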

This thesis has led to the publication:
Schumann, F.; Blanc, E.; Messerschmidt, C.; Blankenstein, T.; Busse, A.; Beule, D. SigsPack, a Package for Cancer Mutational Signatures. BMC Bioinformatics 2019, 20 (1), 450. https://doi.org/10.1186/s12859-019-3043-7.

GC-MS Metabolomics Peak Picking (Master Thesis)

Background: Gas chromatography-mass spectrometry (GC-MS) is a commonly used technique in metabolomics for the identification and quantification of the small-molecule complement (typically <1000 Da) of a biological system. Typical first-level analysis workflows consist of noise filtering, feature detection and alignment for compound identification. Data interpretation, both bioinformatic and statistical, usually depends on the details of the experimental design of the individual study and is not part of the pipeline. Within the Metabolomics Platform, a first-level analysis pipeline has been established using the software tools ChromaTOF (data format conversion and noise filtering) and MAUI (feature detection and alignment, as well as manual editing of feature assignments, especially splitting and merging of neighboring peaks). These tools also support the specific external and internal calibration and alkane mixtures that the unit has established. While the workflow has been used successfully in a number of small-scale studies, it still involves a substantial amount of manual interaction with the data to confirm and validate algorithmically detected features before the results are satisfactory. This is a bottleneck for processing epidemiological-size studies and also introduces undesirable operator biases.

Project Description: The aim of the project is to improve the workflow to make it better suited for large-scale studies and less prone to the above-mentioned biases through better automation and algorithmic improvements. First, the feature detection and alignment functionality must be separated from the manual editing and visualization functionality and turned into, or replaced by, command-line modules (the basic feasibility of this step has already been confirmed). The next step involves a comparative analysis of existing GC-MS noise filtering, feature detection and alignment tools/modules (including but not limited to CAMERA, XCMS, metaMS). This will require establishing an automated data analysis pipeline (e.g. using Snakemake), creating wrapper code, and interlinking modules where necessary. Scoring metrics for the quantitative comparison of processing results need to be defined and established. Existing repositories of carefully manually curated data sets will be used as a “silver standard”. Besides the comparison of existing modules, a systematic investigation/screening of important algorithmic parameters is expected in order to identify optimal processing parameters for the various algorithms/tools/modules. Adapting processing parameters to signal quality should be considered, as should the automation of processing parameter optimization. Ideally, the pipeline output format should be designed such that MAUI can still be used to visualize, check and edit the pipeline output. This will ensure that the developed pipeline can easily be adopted into routine usage. Additional algorithmic improvements within the pipeline might be realized by better integration of signal-to-noise filtering with feature detection and alignment functionality. Finally, the improved workflow must be applied to existing data sets (e.g. a time series from a mouse model of Alzheimer's disease). The influence of the workflow improvements on existing downstream analyses (e.g. differential regulation, pathway enrichment analysis) and on the biomedical interpretation should be explored. This project is run in collaboration with Jennifer Kirwan in the Metabolomics facility.
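A scoring metric of the kind mentioned above could, for instance, compare the peaks reported by a tool against the manually curated “silver standard” and summarize the agreement as precision, recall and F1. The sketch below is one possible definition under stated assumptions: peaks are represented only by retention time, matching uses a fixed tolerance, and the function names and example values are hypothetical.

```python
def match_peaks(detected, reference, tolerance=0.05):
    """Count detected peaks that match a reference peak within a retention-time tolerance.

    Both lists hold retention times in minutes. They are sorted internally and
    walked with two pointers, so each reference peak is matched at most once.
    """
    detected, reference = sorted(detected), sorted(reference)
    matches, i, j = 0, 0, 0
    while i < len(detected) and j < len(reference):
        diff = detected[i] - reference[j]
        if abs(diff) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # detected peak without reference partner (false positive)
        else:
            j += 1  # reference peak that was missed (false negative)
    return matches


def score(detected, reference, tolerance=0.05):
    """Summarize agreement with the curated reference as precision, recall and F1."""
    tp = match_peaks(detected, reference, tolerance)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical retention times (minutes) for one sample.
curated = [5.10, 7.42, 9.80, 12.35]
detected = [5.12, 7.40, 8.00, 12.60]

result = score(detected, curated)
print(result)
```

Here two of the four detected peaks match curated peaks, so precision, recall and F1 are all 0.5. The same score could serve as the objective when screening processing parameters, by running each tool over a grid of settings and ranking the settings by F1 on the curated data.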

This thesis has led to the publication:
Borgsmüller, N.; Gloaguen, Y.; Opialla, T.; Blanc, E.; Sicard, E.; Royer, A.-L.; Le Bizec, B.; Durand, S.; Migné, C.; Pétéra, M.; Pujos-Guillot, E.; Giacomoni, F.; Guitton, Y.; Beule, D.; Kirwan, J. WiPP: Workflow for Improved Peak Picking for Gas Chromatography-Mass Spectrometry (GC-MS) Data. Metabolites 2019, 9 (9). https://doi.org/10.3390/metabo9090171.

Detection of Focal CNVs in Whole Exome Sequencing Data (Bachelor Thesis)

Background: High-throughput sequencing has become the de facto standard for the identification of genomic variants in the germline (Yang et al., 2013; Bamshad et al., 2011) and in tumors (Weinstein et al., 2013). While whole genome sequencing has become more affordable in recent years, whole exome sequencing (WES) is still the main workhorse for the identification of clinically relevant variants, as shown, e.g., by the gnomAD project, which aggregates eight times as many exomes as genomes (Karczewski et al., 2019). Most analyses of WES data in rare disease genetics focus on the detection of small coding variants, but WES can also be used for the identification of copy number variants (CNVs), which are also highly relevant for Mendelian diseases (Zhang et al., 2009). Many existing tools focus on the identification of large-scale CNVs (Fromer et al., 2014; Talevich et al., 2016), and comparatively few address focal small variation in the germline (e.g., Johansson et al., 2016) or in somatic samples (Koboldt et al., 2012).

Project Description: The aim of this Bachelor’s thesis is the identification of small copy number variants in existing in-house rare disease WES data sets (several hundred, mostly trios) based on exome kits from different vendors, and the establishment of a corresponding analysis pipeline. The WES data have already been analyzed for small coding variants and large-scale CNVs, but single-exon deletions or partial deletions of exons have not been specifically targeted yet. The student must design and implement counting and normalization strategies applicable to whole exon loss (counting fragments) and partial exon loss (counting coverage in windows of longer exons). The student must design and implement quality control methods for identifying unreliable exons and low-quality samples, as well as visualizations thereof. The methods must be applied to the data described above and should be compared to results from existing tools using suitable measures, which the student must define. The student should design, implement and test approaches for detecting outliers in exon coverage in single samples, in order to genotype (partial) exon-level copy number variation in single samples. Optionally, the student could cooperate with clinical scientists on the phenotypic relevance of the focal events found and/or integrate the methods described above to create a focal CNV caller package.
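To make the counting, normalization and outlier detection steps more concrete, the following Python sketch normalizes per-exon fragment counts by each sample's total, compares every exon to the cohort median, and flags exons whose log2 ratio suggests a loss or gain. It is a minimal sketch under stated assumptions: the function names, the thresholds (-0.7 and 0.45) and the counts are hypothetical, and a real analysis would also need to handle kit-specific targets, GC content and unreliable exons.

```python
import math
import statistics


def normalize_counts(counts):
    """Scale each sample's exon fragment counts to fractions of its total."""
    normalized = {}
    for sample, exon_counts in counts.items():
        total = sum(exon_counts.values())
        normalized[sample] = {exon: c / total for exon, c in exon_counts.items()}
    return normalized


def log2_ratios(normalized, pseudo=1e-6):
    """log2 ratio of each sample's exon coverage to the cohort median for that exon."""
    exons = list(next(iter(normalized.values())))
    medians = {e: statistics.median(s[e] for s in normalized.values()) for e in exons}
    return {
        sample: {e: math.log2((cov[e] + pseudo) / (medians[e] + pseudo)) for e in exons}
        for sample, cov in normalized.items()
    }


def call_outliers(ratios, loss=-0.7, gain=0.45):
    """Flag exons whose log2 ratio falls below the loss or above the gain threshold."""
    calls = []
    for sample, exon_ratios in ratios.items():
        for exon, ratio in exon_ratios.items():
            if ratio <= loss:
                calls.append((sample, exon, "loss"))
            elif ratio >= gain:
                calls.append((sample, exon, "gain"))
    return calls


# Hypothetical fragment counts; sample S3 carries a heterozygous deletion of exon E2.
counts = {
    "S1": {"E1": 200, "E2": 40, "E3": 150, "E4": 100, "E5": 110},
    "S2": {"E1": 220, "E2": 44, "E3": 165, "E4": 110, "E5": 121},
    "S3": {"E1": 200, "E2": 20, "E3": 150, "E4": 100, "E5": 110},
    "S4": {"E1": 180, "E2": 36, "E3": 135, "E4": 90, "E5": 99},
    "S5": {"E1": 210, "E2": 42, "E3": 150, "E4": 105, "E5": 115},
}

calls = call_outliers(log2_ratios(normalize_counts(counts)))
print(calls)
```

With these made-up counts only exon E2 in sample S3 is flagged as a loss. The same functions could be applied to partial exon loss by using window identifiers instead of exon identifiers as keys.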

The paper for this thesis is in preparation.

Last modified: Jun 20, 2022