3D Interactive Visualization to Explore Immune-Specific *omics Datasets

A Cornell Tech graduate thesis done in collaboration with Mount Sinai, Gumus Lab

Tools & Software: Figma, Python, JavaScript, CSS, HTML, Three.js, Rhino 3D

Role: Data Visualization Designer & Developer

NYC, 2020

This project is the result of a Cornell Tech Master's thesis, done in collaboration with the Gumus Lab research group at Mount Sinai.

Challenge

To visually represent the changing relationship between kinases and substrates (proteins), so that researchers can uncover and share insights from their experiments.

Solution

A web-based tool to generate, explore, and share interactive phosphoproteomics networks in 3D

Interact with the tool here!

Impact

Users gave very positive feedback. Visualizations generated with our tool have been included in scientific papers to communicate findings. A paper about the tool is currently under review at Patterns, a Cell Press journal.

1

Context and Challenge

Why is Data Visualization necessary?

To better understand the human immune system and how diseases influence immune cells and protein expression, researchers are interested in observing cells before and after infection, vaccination, or treatment, collecting massive *omics datasets. More specifically, some of these researchers study protein phosphorylation and interactions, analyzing perturbations in the system depending on the state of infection.

This complex phosphoproteomic data is best represented and understood with network visualization. Typically, nodes are kinases or substrates, which are types of proteins, and edges are the interactions between them. When kinases interact with substrates, the substrates are up-regulated (positive phosphorylation) or down-regulated (negative phosphorylation). Each substrate section (accession site) is affected differently, and its regulation changes over time. A tool that allows researchers to visualize this data in an intuitive, interactive, and unbiased way can help them better understand the immune system, which components are perturbed in disease states, and how to reverse these changes to better treat infectious diseases such as HIV or COVID-19.

Our pilot was conducted using HIV data, and further iterations were developed using COVID-19 and Chronic Fatigue Syndrome (CFS) data. All data was provided by Dr. Jeffrey Johnson and Dr. Phillip Comella from Mount Sinai.

Many dimensions of interest need to be visualized and coexist in the same visual space

*Omics datasets are massive, with lots of data points for each node, and lots of dimensions to represent visually.

These are (modified) screenshots of the COVID-19 dataset we used to build our first visualization, which contained more than 14,000 entries.

2

Research Process

Exploratory & Definition Phase

Developing & Testing Phase

Exploratory Interviews to define research objectives

Usability Testing to identify the tool's requirements

Lo-Fi & Hi-Fi Prototyping

Coding!

Exploratory Interviews to Define Research Objectives

We conducted exploratory semi-structured interviews with seven users (proteomics researchers) at Mount Sinai. During this process, we gathered information about the context of use and their current pain points. This information helped us define our two research objectives for this project:

Enable researchers’ interactive exploration and understanding of large, complex immune-specific proteomics data

Facilitate the communication of researchers’ findings to scientific and nonscientific communities through the tool

Usability Testing with Existing Solutions to Identify Pain Points

After our interviews, we conducted usability studies in which we observed users' current workflows with existing visualization solutions. This uncovered our users' pain points, which we used to define our tool's requirements. These are some of the strongest pain points our users encountered:

Current tools...

Require manual parameter adjustment to minimize clutter and keep the network readable. All the existing solutions require some manual adjustment to improve the clarity and usefulness of the visualization. Usually, the manual process can be automated using specific programming languages (e.g., Perl). A user with prior coding experience described this process as a “nightmare”, and it took him several weeks to generate a visualization that fit his needs using one of the state-of-the-art tools.

Require local installation, as opposed to running in the browser, which makes it difficult to share networks with other researchers in the field. A primary goal of our users is to help inspire new research by allowing other researchers to explore and interact with their network.

Do not allow for 3D networks. Visualizing a large number of nodes in only two dimensions is challenging and can leave the visualization too cluttered to be useful.

Do not allow users to compare between states easily. Current tools are static, and if users want to compare the network between different states, they need to create multiple networks and compare them manually.

Take too long to run and create a visualization from the imported data. The loading time of some of these tools can exceed one hour, sometimes only to report that there was an error in the data import process.

Lo-Fi & Hi-Fi Prototyping

Next, we started a round of paper and high-fidelity prototyping (in Figma), followed by multiple rounds of feedback from our users. The prototypes let us present our assumptions about the functionality and workflow quickly and efficiently.

Exploring different ways to represent the nodes in 2D and in 3D with different accession sites. The image on the right was built using Rhino 3D.

[V0] Lo-Fi prototyping

[V1] Lo-Fi prototyping

Exploring ways to compare phosphoproteomic networks from different infections, such as HIV, COVID-19, and Dengue

[V2] Hi-Fi prototyping

Exploring and testing three different ways to visualize and compare the phosphorylation states (timestamps)

The user can use a slider to visualize the data changing over time. In this example, the user can visualize the change in phosphorylation at timestamps 0, 2, 4, 7, 12, and 24h. This was the preferred option among our users and the one that we implemented.

The user can simultaneously compare all the states (timestamps) in the same node. The substrates are partitioned vertically into as many sections as there are selected states. In this example, there are six selected states. We also include a "Show disparity" feature, which, if turned on, selects the nodes that change the most across the states.

The user can simultaneously compare all the states (timestamps) using a split screen. Since the network can be rotated and zoomed in and out with the mouse, the challenge here was to make sure all networks would move in exactly the same way, to allow for proper node comparison.
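A minimal sketch of how a "Show disparity" selection like the one in the second option could work. The disparity measure (largest value minus smallest value across the selected states), the function names, and the sample values are all assumptions made for illustration; they are not taken from the tool's source code.

```javascript
// Hypothetical input: each node carries one phosphorylation value per
// selected state (here six states, e.g. 0, 2, 4, 7, 12, and 24h).
// Assumption: "disparity" is the range of values across states.
function disparity(values) {
  return Math.max(...values) - Math.min(...values);
}

// Return the IDs of the topN nodes whose values change the most.
function mostChangedNodes(nodes, topN) {
  return nodes
    .map((node) => ({ id: node.id, disparity: disparity(node.values) }))
    .sort((a, b) => b.disparity - a.disparity) // largest change first
    .slice(0, topN)
    .map((entry) => entry.id);
}

// Illustrative values only, not real experimental data.
const sampleNodes = [
  { id: "A", values: [0.1, 0.3, 0.2, 0.1, 0.2, 0.1] },
  { id: "B", values: [-1.2, -0.4, 0.6, 1.5, 1.1, 0.9] },
  { id: "C", values: [0.5, 0.4, 0.9, 0.2, 0.3, 0.5] },
];

const selected = mostChangedNodes(sampleNodes, 2);
console.log(selected);
```

Other measures, such as variance or the largest change between consecutive timestamps, would slot into `disparity` without changing the rest of the sketch.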

Coding! Technical Implementation

The 3D network visualization was created using JavaScript. Specifically, we used Three.js, a library that uses WebGL to create 3D web-based graphics, and 3d-force-graph, a library that uses Three.js to create force-directed graphs.

The most challenging step when designing a generalized tool is handling data import when input data files are not necessarily standardized across users in the proteomics community. As observed in our past iterations, different researchers have different data templates and attributes as well as processing methods.

First, we created an ideal JSON file that contained nodes, edges, and node/edge attributes nested accordingly. The minimum amount of information users need to upload to visualize their data is node IDs and links between nodes. The user can also specify a “group” attribute for nodes, representing kinases and substrates, for example. The user can also input a list attribute for nodes called “positions”, which contains more granular information if the node has multiple partitions with different information within it (e.g., accession sites with varying phosphorylation and p-values). The user can specify any number of source, target, partition, and edge attributes.

Example of our data structure in a JSON file
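To make the structure described above concrete, here is a hedged sketch of what such a file might contain, written as a JavaScript object, together with a small check for the minimum requirements (node IDs and links between nodes). The `nodes`/`links` key names follow the convention that 3d-force-graph expects; the partition attribute names (`site`, `phosphorylation`, `pvalue`), the group numbering, and all values are assumptions for illustration and may differ from the tool's actual format.

```javascript
// Illustrative graph: one kinase linked to one substrate with two partitions.
const exampleGraph = {
  nodes: [
    { id: "KINASE_A", group: 1 },
    {
      id: "SUBSTRATE_B",
      group: 2,
      // "positions" holds one entry per partition (e.g. accession site).
      positions: [
        { site: "S10", phosphorylation: 1.4, pvalue: 0.01 },
        { site: "T25", phosphorylation: -0.8, pvalue: 0.04 },
      ],
    },
  ],
  links: [{ source: "KINASE_A", target: "SUBSTRATE_B" }],
};

// Check the minimum requirements: every node has an id, and every link
// points at existing node ids. Returns a list of human-readable problems.
function validateGraph(graph) {
  const errors = [];
  const ids = new Set();
  for (const node of graph.nodes || []) {
    if (node.id === undefined) {
      errors.push("node without id");
    } else {
      ids.add(node.id);
    }
    if (node.positions !== undefined && !Array.isArray(node.positions)) {
      errors.push(`"positions" of node "${node.id}" must be a list`);
    }
  }
  for (const link of graph.links || []) {
    if (!ids.has(link.source)) errors.push(`unknown link source "${link.source}"`);
    if (!ids.has(link.target)) errors.push(`unknown link target "${link.target}"`);
  }
  return errors;
}

console.log(validateGraph(exampleGraph));
```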

The functionality to transform any input data file into this ideal JSON format was built in JavaScript using Danfo.js, a library built on TensorFlow.js that reads the CSV into a data table format, similar to Python's pandas library. To transform the data into the desired JSON format, we iterate through the state column to consider one state at a time. For each state, unique gene IDs are extracted from the source column and assigned a group value of 1. Then, the attributes of these unique gene IDs are set as source attributes. This procedure is repeated for the target nodes.

Several assumptions are made regarding the links: (1) users’ input data will contain separate columns for target nodes and source nodes, (2) if partitions exist, edge attributes are equal, as there is only one edge connected to a node with partitions, and (3) if partitions exist, the data corresponding to each partition will be in its own row.
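The transformation described above can be sketched in plain JavaScript. This is an illustration under stated assumptions, not the tool's actual code: a plain array of row objects stands in for the Danfo.js data table, the column names (`state`, `source`, `target`, `site`, `phosphorylation`) are hypothetical, and the group value of 2 for target nodes is assumed, since only the value for source nodes is stated.

```javascript
// Convert flat rows into one { nodes, links } graph per state.
function rowsToGraphsByState(rows) {
  const graphs = {};
  const states = [...new Set(rows.map((row) => row.state))];

  for (const state of states) {
    const stateRows = rows.filter((row) => row.state === state);
    const nodes = new Map(); // node id -> node object
    const links = new Map(); // "source->target" -> link object

    // Unique gene IDs from the source column become group-1 nodes.
    for (const row of stateRows) {
      if (!nodes.has(row.source)) {
        nodes.set(row.source, { id: row.source, group: 1 });
      }
    }

    // Repeat for targets (group 2 is an assumption). Each row describes
    // one partition of its target node (assumption 3 in the text).
    for (const row of stateRows) {
      if (!nodes.has(row.target)) {
        nodes.set(row.target, { id: row.target, group: 2 });
      }
      const target = nodes.get(row.target);
      if (row.site !== undefined) {
        if (!target.positions) target.positions = [];
        target.positions.push({
          site: row.site,
          phosphorylation: row.phosphorylation,
        });
      }
      // One edge per source/target pair, even when the target has
      // several partitions (assumption 2 in the text).
      const key = `${row.source}->${row.target}`;
      if (!links.has(key)) {
        links.set(key, { source: row.source, target: row.target });
      }
    }

    graphs[state] = { nodes: [...nodes.values()], links: [...links.values()] };
  }
  return graphs;
}

// Illustrative rows only, not real experimental data.
const sampleRows = [
  { state: "0h", source: "K1", target: "S1", site: "S10", phosphorylation: 0.5 },
  { state: "0h", source: "K1", target: "S1", site: "T25", phosphorylation: -0.2 },
  { state: "2h", source: "K1", target: "S1", site: "S10", phosphorylation: 1.1 },
];

const graphsByState = rowsToGraphsByState(sampleRows);
console.log(JSON.stringify(graphsByState["0h"], null, 2));
```

Keeping one graph per state is one possible design; it would let a time slider swap between precomputed graphs instead of recomputing them on every change.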

This is an example of how a dataset gets translated into visual elements:

3

Result

Interact with the tool here!

or go to this link: https://irenefp.github.io/general-covid.html

Watch me present and demonstrate the tool to our users! (User feedback included!)

4

Impact

Scientific Publications Under Review:

[Cell Press Journal] Submitted a paper to Patterns, a Cell Press journal. The manuscript is currently under peer review.
[Nature] Built a visualization using Chronic Fatigue Syndrome data and co-authored a paper with Dr. Phillip Comella, which is under review at Nature.

Presentations & Positive Feedback

Presentation at Mount Sinai Dengue Human Immunology Project Consortium (DHIPC)
Received positive feedback from the scientific community. After Dr. Zeynep Gumus tweeted about the tool, researchers such as Dr. Mehdi Bouhaddou, from the Krogan Lab at Mount Sinai, commented on it. Dr. Bouhaddou was interested in using the tool for their studies.