Research Group

Data science


Data science unifies statistics, data analysis and their related methods in order to understand and analyse phenomena in applications ranging from healthcare to finance.

It is an interdisciplinary research area that employs techniques and theories drawn from many fields. In our department, our main areas of research interest are statistics and actuarial science, and operational research. Professor George Alfred Barnard was one of the first professors of the department (1966–1975). He served as President of the Operational Research Society (1962–1964), the Institute of Mathematics and its Applications (1970–1971) and the Royal Statistical Society (1971–1972).

Many of our academics are also members of the university's Institute for Analytics and Data Science, and work closely with the Institute for Social and Economic Research, the Essex Business School, the School of Computer Science and Electronic Engineering, and the School of Life Sciences.

The data science research group has the following four research themes:

  • Actuarial Mathematics theme (Tolulope Fadina, Junlei Hu, Peng Liu, Spyridon Vrontos, and Jackie Wong) - Theme members conduct multidisciplinary research in the broad areas of actuarial science and finance, including predictability, asset-liability management, risk management and risk theory, mathematical finance, financial data science and applied probability in actuarial science and queueing systems.
  • Data Science and Statistical Learning theme (Joe Bailey, Mario Gutierrez-Roig, Stella Hadjiantoni, Andrew Harrison, Berthold Lausen, and Osama Mahmoud) - Theme members work on a range of data science methodologies covering artificial intelligence, statistical learning, computational statistics, epidemiology, bioinformatics and environmental statistics.
  • Operational Research (OR) theme (Georgios Amanatidis, Fanlin Meng, Abdel Salhi, and Xinan Yang) - Theme members conduct multidisciplinary research in the broad areas of OR and mathematical modelling including linear and nonlinear programming, combinatorial optimisation, deterministic and stochastic dynamic programming, algorithm (heuristics) design and analysis (including the novel Plant Propagation Algorithm (PPA) developed by Salhi), implementation of algorithms, data analytics and applications in portfolio selection, labour scheduling, green distribution, and predictive modelling.
  • Statistical Methodology theme (Yanchun Bao, Hongsheng Dai, and Yassir Rabhi) - Theme members work on a broad range of statistics and applied probability topics, including Bayesian statistics, longitudinal and survival analysis, causal inference, applied probability, exact Monte Carlo simulation such as Monte Carlo (or Bayesian) Fusion methods, nonparametric estimation for bivariate survival functions, semiparametric and nonparametric methods for length-biased and censored data.

Data science projects

Essex Data Science Seminar Series

Our group runs a regular research seminar series throughout the academic year. Along with hosting talks from our academics and research students, we also invite experts from other institutions to present their latest work.

Our seminars are open to anyone at the University of Essex who may be interested in the topic being discussed.

Upcoming research seminars

Autumn term 2021

4th November 2021 - "Analysing the behaviour of dairy cows" - Kareemah Chopra, University of Essex.

Highlights of Summer 2021 seminars

Slope-Hunter: A robust method for collider bias correction in conditional genome-wide association studies

Our department's own Dr Osama Mahmoud led this seminar on bias correction in genetic studies.

Osama explained that studying genetic associations conditioned on another phenotype, for example a study of blood pressure conditional on weight, could be affected by selection bias. An example of this is the study of genetic associations with prognosis (e.g. survival, subsequent events).

Selection on disease status can induce associations between causes of incidence and prognosis, potentially leading to selection bias - also called "index event bias" or "collider bias". At the moment, one method for adjusting genetic associations for this bias assumes there is no genetic correlation between incidence and prognosis, which may not be a plausible assumption.

Osama proposed the ‘Slope-Hunter’ approach, which has two stages. In the first stage, he showed how to use cluster-based techniques to identify two classes of variants: those affecting neither incidence nor prognosis (these should not suffer bias, and only a random sub-sample of them is retained in the analysis), and those affecting prognosis only (excluded from the analysis).

In the second stage, Osama demonstrated a cluster-based model to identify the class of variants affecting incidence only. This class was used to estimate the adjustment factor. Simulation studies showed that the approach eliminates the bias and outperforms alternatives in the presence of genetic correlation, and performs as well as alternatives under no genetic correlation when its assumptions are satisfied.
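The selection effect described in this talk can be illustrated with a small simulation. The sketch below is not Slope-Hunter and does not come from the seminar; it is a minimal toy model, under assumed effect sizes, in which a genetic factor affects incidence only and an unmeasured factor affects both incidence and prognosis. All names (`simulate_collider_bias`, `pearson`) and numerical settings are hypothetical choices made for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_collider_bias(n=50000, threshold=1.0, seed=1):
    """Return (correlation in everyone, correlation in cases only).

    Assumed toy model:
      g  - genetic factor that affects incidence only
      u  - unmeasured factor that affects incidence and prognosis
      incidence occurs when g + u + noise exceeds a threshold
      prognosis depends on u and noise, never on g
    """
    rng = random.Random(seed)
    all_g, all_prog, case_g, case_prog = [], [], [], []
    for _ in range(n):
        g = rng.gauss(0, 1)
        u = rng.gauss(0, 1)
        is_case = (g + u + rng.gauss(0, 1)) > threshold
        prognosis = u + rng.gauss(0, 1)
        all_g.append(g)
        all_prog.append(prognosis)
        if is_case:  # selection on disease status
            case_g.append(g)
            case_prog.append(prognosis)
    return pearson(all_g, all_prog), pearson(case_g, case_prog)


if __name__ == "__main__":
    r_all, r_cases = simulate_collider_bias()
    print("correlation in whole population:", round(r_all, 3))
    print("correlation among cases only:  ", round(r_cases, 3))
```

In this toy model the genetic factor has no effect on prognosis, so the correlation in the whole population should be close to zero, while restricting the analysis to cases should produce a spurious negative correlation. That induced association is the kind of bias that correction methods aim to remove.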

Recent papers

Paving the Road to Open Science via Automated Data Documentation

In this talk University of Essex graduate Ahmed Abdelmaksoud looked at how research can be transformed by better data documentation.

Many research projects collect data, which gives us an unparalleled opportunity not only to produce integrated research, but also to improve open access. For example, socio-economic projects could collect and generate data that would also be useful for a health research project. This pre-collected data could help shorten the research timeframe for the health project, while any insights from a health perspective could be used to generate further impact on the original socio-economic project, or provide a foundation for a third piece of research.

Additionally, better data collection and documentation of the process of collecting the data can allow other researchers to replicate a study, confirming that the conclusions drawn in the original are accurate (or alternatively identifying issues that need to be addressed).

At the moment there is no standardised way to document data, especially as it often has to be done manually and requires a great deal of time and effort for little reward beyond altruism.

In his talk Ahmed discussed a potential automated solution. Called ‘D2’, it is written in ‘R’ and mainly employs ‘R-Markdown’ and ‘Knitr’ to generate documentation for the most commonly used data formats. Minimal input is required from the user, so researchers can utilise it without it taking up too much time.

Ahmed's proposal was particularly interesting as the pandemic and rush to create a Covid-19 vaccine have shown the importance of open science and sharing of data. A tool like this could be a cost and time-effective way to improve this in the future.

Highlights of Spring 2021 seminars

Selection bias, missing data and causal inference

In this talk Professor Kate Tilling from the University of Bristol discussed her work on data and bias.

When utilising data, different statistical methods can be used for causal inference, but these rely on untestable assumptions. Kate used directed acyclic graphs (DAGs) to show that causal inference is also impacted by selection (for example picking people to take part in a study) or missing data, and gave several examples in which these issues cause bias within studies.

It was a good reminder for our data science audience to keep bias in mind when working with data, particularly when it comes to working in the fields of health and medicine.

Related papers

Approximating images by optimally arranging polygons: a heuristic study into computational art

Dr Daan van den Berg, from the University of Amsterdam, gave a talk on his research on computational art.

Daan started off explaining that it is possible to approximate artistic images from a limited number of stacked semi-transparent coloured polygons. He demonstrated this with sets of randomised polygons and showed how to find optimal arrangements for several well-known paintings using three iterative optimisation algorithms: stochastic hillclimbing, simulated annealing and the plant propagation algorithm.

A discussion was then held on the performance of the algorithms, and Daan demonstrated how the found objective values related to the polygonal invariants.
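As a rough illustration of the first of these algorithms, the sketch below applies stochastic hillclimbing to a much simplified version of the problem. It is not the speaker's implementation: it assumes a tiny greyscale grid, axis-aligned rectangles in place of general polygons, and a mutation that replaces one rectangle at random. The names `render`, `mutate`, `hill_climb` and `make_target`, and all parameter values, are hypothetical.

```python
import random

W, H = 16, 16  # size of the toy greyscale canvas (an assumption)


def make_target():
    """Toy target image: a bright square on a dark background."""
    return [[1.0 if 4 <= x < 12 and 4 <= y < 12 else 0.0 for x in range(W)]
            for y in range(H)]


def random_rect(rng):
    """A rectangle is (x0, y0, x1, y1, grey level, opacity)."""
    return (rng.randrange(W), rng.randrange(H),
            rng.randrange(W), rng.randrange(H),
            rng.random(), rng.uniform(0.2, 1.0))


def render(rects):
    """Alpha-blend the stacked rectangles, in order, onto a black canvas."""
    canvas = [[0.0] * W for _ in range(H)]
    for x0, y0, x1, y1, grey, alpha in rects:
        for y in range(min(y0, y1), max(y0, y1) + 1):
            for x in range(min(x0, x1), max(x0, x1) + 1):
                canvas[y][x] = (1 - alpha) * canvas[y][x] + alpha * grey
    return canvas


def error(canvas, target):
    """Sum of squared pixel differences (the objective to minimise)."""
    return sum((canvas[y][x] - target[y][x]) ** 2
               for y in range(H) for x in range(W))


def mutate(rects, rng):
    """Return a copy of the arrangement with one rectangle replaced."""
    new = list(rects)
    new[rng.randrange(len(new))] = random_rect(rng)
    return new


def hill_climb(target, n_rects=8, iterations=1000, seed=0):
    """Keep a mutated arrangement only if it is no worse than the current one."""
    rng = random.Random(seed)
    current = [random_rect(rng) for _ in range(n_rects)]
    start_err = current_err = error(render(current), target)
    for _ in range(iterations):
        candidate = mutate(current, rng)
        cand_err = error(render(candidate), target)
        if cand_err <= current_err:
            current, current_err = candidate, cand_err
    return current, start_err, current_err


if __name__ == "__main__":
    _, start, final = hill_climb(make_target())
    print("error before:", round(start, 2), "error after:", round(final, 2))
```

Because a candidate is accepted only when it does not increase the error, the objective value can never get worse over the run. Simulated annealing differs mainly in that it also accepts some worsening moves, with a probability that shrinks over time.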

Related papers

Highlights of Autumn 2020 seminars

EpiViz: an implementation of Circos plots for epidemiologists

Matt Lee, a PhD student from the University of Bristol, delivered a talk on the use of Circos plots in epidemiology.

Biological pathways involve numerous processes, but epidemiology studies predominantly focus on single exposure and single outcome associations. This is primarily because identifying meaningful intermediate associations that can be taken forward for further analysis is complex.

In his talk, Matt discussed how tools like EpiViz can be used to produce simple and efficient Circos plots for those new to programming and data visualisation. By giving people a tool that makes data visualisation easier to produce, epidemiologists can gain a better understanding of the results of complex epidemiological studies. Greater insight into the results can help increase the impact of such studies.

Related papers

A Statistician’s Botanical Garden - The Ideas behind Trees, Model-Based Trees and Random Forests

Classification and regression trees, model-based trees and random forests are powerful statistical methods from the field of machine learning. However, while individual trees are easy to interpret, random forests are "black box" prediction methods. Despite this, they provide variable importance measures that are used to judge the relevance of the individual predictor variables.

In this seminar, Professor Carolin Strobl introduced the rationale behind trees, model-based trees and random forests, and illustrated their potential for high-dimensional data exploration, while also pointing out limitations and potential pitfalls in their practical application.

Related papers

Detecting the hierarchical structure of the cell nucleus

Chromatin consists of DNA wrapped around histones and forms complex three-dimensional structures within the cell nucleus with various degrees of compaction.

Genes have been shown to be repressed by their proximity to the nuclear periphery or activated by being in contact with special regulatory regions called enhancers. Thus the relative positioning of genes and their interactions with other regions are very important in determining whether they are expressed or not.

In this talk, Iona Olan from the University of Cambridge discussed her work on cellular senescence, a phenotype in which the chromatin interaction network changes dramatically relative to normal cells. Senescence corresponds to permanent cell cycle arrest and has been shown to act as a protective barrier against tumourigenesis.

Related papers

Our academics

Dr Georgios Amanatidis

Lecturer in Mathematics

Department of Mathematical Sciences, University of Essex

Dr Joseph Bailey

Lecturer in Environmetrics

Department of Mathematical Sciences, University of Essex

Research area: Statistics

Dr Yanchun Bao

Lecturer in Data Science and Statistics

Department of Mathematical Sciences, University of Essex

Research area: Statistics

Dr Hongsheng Dai

Reader in Statistics

Department of Mathematical Sciences, University of Essex

Research area: Statistics

Dr Tolulope Fadina

Lecturer in Actuarial Science and Finance

Department of Mathematical Sciences, University of Essex

Research area: Actuarial science.

Dr Dongjiao Ge

Lecturer in Data Science

Department of Mathematical Sciences, University of Essex

Research area: Data science and statistical learning.

Dr Mario Gutierrez-Roig

Lecturer in Data Science and Statistics

Department of Mathematical Sciences, University of Essex

Research area: Statistics

Dr Stella Hadjiantoni

Lecturer in Data Science and Statistics

Department of Mathematical Sciences, University of Essex

Research area: Statistics

Dr Andrew Harrison

Senior Lecturer in Data Science

Department of Mathematical Sciences, University of Essex

Research area: Statistics

Dr Junlei Hu

Lecturer in Actuarial Science

Department of Mathematical Sciences, University of Essex

Research area: Actuarial science

Professor Berthold Lausen

Professor of Data Science

Department of Mathematical Sciences, University of Essex

Research area: Statistics.

Dr Peng Liu

Lecturer in Actuarial Science and Finance

Department of Mathematical Sciences, University of Essex

Research area: Actuarial science.

Dr Osama Mahmoud

Lecturer in Data Science and Statistics

Department of Mathematical Sciences, University of Essex

Research area: Statistics.

Dr Felipe Maldonado

Lecturer in Operational Research

Department of Mathematical Sciences, University of Essex

Research area: Operational Research

Dr Fanlin Meng

Lecturer in Data Science

Department of Mathematical Sciences, University of Essex

Research area: Statistics

Dr John O'Hara

Senior Lecturer in Actuarial Science

Department of Mathematical Sciences, University of Essex

Research area: Actuarial science.

Dr Yassir Rabhi

Lecturer in Data Science and Statistics

Department of Mathematical Sciences, University of Essex

Research area: Statistics.

Professor Abdellah Salhi

Professor of Operational Research

Department of Mathematical Sciences, University of Essex

Research area: Operational research

Dr Spyridon Vrontos

Senior Lecturer in Actuarial Science

Department of Mathematical Sciences, University of Essex

Research area: Actuarial science

Dr Jackie Wong Siaw Tze

Lecturer in Actuarial Science

Department of Mathematical Sciences, University of Essex

Research area: Statistics, actuarial science.

Dr Xinan Yang

Senior Lecturer in Operational Research

Department of Mathematical Sciences, University of Essex

Research area: Operational research

Dr Jingjing Zhang

Lecturer in Statistics

Department of Mathematical Sciences, University of Essex

Research area: Statistical methodology.