| Title: |
Large-scale discovery, analysis, and design of protein energy landscapes |
| Authors: |
Ramos Ferrari, Allan Jhonathan; Dixit, Sugyan; Thibeault, Jane; Garcia, Mario; Houliston, Scott; Ludwig, Robert; Notin, Pascal; Phoumyvong, Claire; Martell, Cydney; Jung, Michelle D.; Tsuboyama, Kotaro; Carter, Lauren; Arrowsmith, Cheryl; Guttman, Miklos; Rocklin, Gabriel |
| Publisher Information: |
Zenodo |
| Publication Year: |
2025 |
| Collection: |
Zenodo |
| Subject Terms: |
Hydrogen Deuterium Exchange-Mass Spectrometry; Protein biophysics; Protein energy landscapes; Protein design |
| Description: |
*** IMPORTANT! Please register to use these data so that we can continue to release new useful datasets! This will take 10 seconds! ***
This repository contains datasets generated for our study on protein energy landscapes using our multiplex hydrogen-deuterium exchange (mHDX) analysis. The datasets include raw and processed HDX data, NMR results, curated subsets, and machine learning splits with interpretable and deep learning-derived features. These resources support various analyses, including protein stability assessment, EX1 kinetics evaluation, and predictive modeling.
Available Datasets:
Dataset_0_InitialOrder: Initial DNA sequences from all libraries (15,715 unique sequences).
Dataset_1_UnfilteredData: Minimally filtered HDX data based on confident identifications and PO score < 50 (8,293 unique sequences).
Dataset_2_SuccessfulHDX: Proteins passing quality control metrics, including EX1 kinetics (5,778 unique sequences).
Dataset_3_MeasurablyStable: Proteins reaching full deuteration with ΔGunfold > 2 kcal/mol and passing the EX1 kinetics filter (3,590 unique sequences).
Dataset_4_HDXNMR: HDX-NMR results per condition, including average ΔGopen per position (16 unique sequences).
Dataset_5_MesophilicThermophilic: Subset of proteins from natural domains classified as mesophilic or thermophilic based on optimal growth temperature (>40°C) (1,637 unique sequences).
Dataset_6_splits_interpretable: Machine learning splits with interpretable features (3,193 unique sequences).
Dataset_6_splits_esm2: Machine learning splits with ESM2-derived features (3,465 unique sequences).
Dataset_6_splits_unirep: Machine learning splits with UniRep-derived features (3,465 unique sequences).
Dataset_6_splits_saprot: Machine learning splits with SaProt-derived features (3,465 unique sequences).
Dataset_7_mHDX_cDNA: Subset of Dataset_2 (best PO-scored candidate, EX1 kinetics excluded) overlapping with cDNA proteolysis assay data from Tsuboyama et al. (2023) (4,464 unique sequences).
Dataset_8_PDFs: Comprehensive plots ... |
| Document Type: |
dataset |
| Language: |
unknown |
| Relation: |
https://zenodo.org/records/14983481; oai:zenodo.org:14983481; https://doi.org/10.5281/zenodo.14983481 |
| DOI: |
10.5281/zenodo.14983481 |
| Availability: |
https://doi.org/10.5281/zenodo.14983481; https://zenodo.org/records/14983481 |
| Rights: |
Creative Commons Attribution 4.0 International ; cc-by-4.0 ; https://creativecommons.org/licenses/by/4.0/legalcode |
| Accession Number: |
edsbas.1C8E0CE2 |
| Database: |
BASE |
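
The numeric thresholds quoted in the Description (PO score < 50 for Dataset_1_UnfilteredData; ΔGunfold > 2 kcal/mol plus full deuteration and the EX1 kinetics filter for Dataset_3_MeasurablyStable) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record type and every field name below are hypothetical, the example values are made up for illustration, and the actual column names in the Zenodo files may differ. The unspecified quality control metrics of Dataset_2_SuccessfulHDX are deliberately left out.

```python
from dataclasses import dataclass

# Thresholds taken from the dataset description.
PO_SCORE_MAX = 50.0          # Dataset_1: PO score < 50
DG_UNFOLD_MIN_KCAL = 2.0     # Dataset_3: dG_unfold > 2 kcal/mol


@dataclass
class HdxRecord:
    """Hypothetical per-protein record; field names are assumptions."""
    sequence: str
    confident_identification: bool
    po_score: float
    reaches_full_deuteration: bool
    dg_unfold_kcal_mol: float
    passes_ex1_filter: bool


def meets_dataset_1_criteria(r: HdxRecord) -> bool:
    # Confident identification and PO score strictly below 50.
    return r.confident_identification and r.po_score < PO_SCORE_MAX


def meets_dataset_3_criteria(r: HdxRecord) -> bool:
    # Full deuteration, dG_unfold strictly above 2 kcal/mol, EX1 filter passed.
    return (
        r.reaches_full_deuteration
        and r.dg_unfold_kcal_mol > DG_UNFOLD_MIN_KCAL
        and r.passes_ex1_filter
    )


# Illustrative, made-up records (not taken from the datasets).
records = [
    HdxRecord("SEQ_A", True, 12.0, True, 3.1, True),
    HdxRecord("SEQ_B", True, 75.0, True, 4.0, True),   # PO score too high
    HdxRecord("SEQ_C", True, 20.0, True, 1.5, True),   # dG_unfold too low
    HdxRecord("SEQ_D", True, 30.0, True, 2.8, False),  # fails EX1 filter
]

dataset_1_like = [r.sequence for r in records if meets_dataset_1_criteria(r)]
dataset_3_like = [r.sequence for r in records if meets_dataset_3_criteria(r)]
```

The two predicates are kept independent because the Description does not state explicitly that Dataset_3 is drawn from Dataset_1; combine them only if the downloaded files confirm that relationship.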