| Title: |
Adaptive Dirichlet Process mixture model with unknown concentration parameter and variance: Scaling high dimensional clustering via collapsed variational inference |
| Authors: |
Pal, Annesh; Mimoun, Aguirre; Thiébaut, Rodolphe; Hejblum, Boris P. |
| Contributors: |
Statistics In System biology and Translational Medicine (SISTM); Centre Inria de l'Université de Bordeaux; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Bordeaux population health (BPH); Université de Bordeaux (UB)-Institut de Santé Publique, d'Épidémiologie et de Développement (ISPED)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Université de Bordeaux (UB)-Institut de Santé Publique, d'Épidémiologie et de Développement (ISPED)-Institut National de la Santé et de la Recherche Médicale (INSERM); Vaccine Research Institute Créteil, France (VRI); Université Paris-Est Créteil Val-de-Marne - Paris 12 (UPEC UP12); Bordeaux population health (BPH); Université de Bordeaux (UB)-Institut de Santé Publique, d'Épidémiologie et de Développement (ISPED)-Institut National de la Santé et de la Recherche Médicale (INSERM); Centre Hospitalier Universitaire de Bordeaux (CHU Bordeaux); ANR-22-PESN-0003,SMATCH,Statistical and AI based Methods for Advanced Clinical Trial CHallenges in Digital Health(2022) |
| Source: |
https://inria.hal.science/hal-05490235 ; 2026. |
| Publisher Information: |
CCSD |
| Publication Year: |
2026 |
| Subject Terms: |
Bayesian Nonparametrics; Clustering; Dirichlet process mixture model; Variational inference; Unstructured covariance; Concentration parameter; High-dimension; Gene expression data; [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM]; [SDV.SPEE]Life Sciences [q-bio]/Santé publique et épidémiologie; [STAT.ME]Statistics [stat]/Methodology [stat.ME] |
| Description: |
We propose a novel method that performs adaptive clustering with a Dirichlet process mixture model (DPMM) using collapsed variational inference (VI), while incorporating weakly informative priors for the DP concentration parameter alpha and the base distribution G0. We illustrate the importance of the G0 covariance structure and prior choice by considering different parameterisations of the data covariance matrix. On high-dimensional Gaussian simulations, our model demonstrates substantially faster convergence than a state-of-the-art MCMC slice sampler. We further evaluate performance on Negative Binomial simulations and conduct sensitivity analyses to assess robustness under realistic data conditions. An application to a publicly available leukemia transcriptomic data set comprising 72 samples and 2,194 genes successfully recovers every known sub-type, while identifying additional gene expression-based sub-clusters with meaningful biological interpretation. |
| Document Type: |
report |
| Language: |
English |
| Relation: |
info:eu-repo/semantics/altIdentifier/arxiv/2601.21106; ARXIV: 2601.21106 |
| Availability: |
https://inria.hal.science/hal-05490235; https://inria.hal.science/hal-05490235v1/document; https://inria.hal.science/hal-05490235v1/file/2601.21106v1.pdf |
| Rights: |
https://creativecommons.org/licenses/by-sa/4.0/ ; info:eu-repo/semantics/OpenAccess |
| Accession Number: |
edsbas.C2141E1 |
| Database: |
BASE |