class: middle, center # Biostatistics for Fluid Biomarkers Michael Donohue, PhD University of Southern California ### Biomarkers in Neurodegenerative Disorders University of Gothenburg April 20, 2020 .pull-left[ <img src="./images/atri.png" width="57%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="./images/actc_logo.png" width="47%" style="display: block; margin: auto;" /> ] --- # About me .large[ - 2001 - 2005: PhD Mathematics, University of California, San Diego - *Rank Regression & Synergy Detection* - 2005 - 2015: University of California, San Diego - Alzheimer's Disease Neuroimaging Initiative (ADNI) - Alzheimer's Disease Cooperative Study (ADCS) - 2015 - Present: University of Southern California, San Diego - Associate Director of Biostatistics, [Alzheimer's Therapeutic Research Institute (ATRI)](https://keck.usc.edu/atri/) - Biostatistics Unit Co-Lead, [Alzheimer's Clinical Trial Consortium (ACTC)](https://www.actcinfo.org/) ] --- # Course Overview .large[ Topics: - 9:00 - 9:50 -- Biostatistics for Fluid Biomarkers - 10:00 - 10:50 -- Biostatistics for Imaging Biomarkers - 11:00 - 11:50 -- Modeling Longitudinal Data Emphases: - Visualization - Demonstrations using R, code available from: - [https://github.com/atrihub/biomarkers-neuro-disorders-2020](https://github.com/atrihub/biomarkers-neuro-disorders-2020) ] --- # Session 1 Outline .large[ - Batch Effects - Experimental Design (Sample Randomization) - Statistical Models for Assay Calibration/Quantification - Classification (Supervised Learning) - Logistic Regression - Binary Trees - Random Forest - Mixture Modeling (Unsupervised Learning) - Univariate - Bivariate ] --- class: inverse, middle, center # Batch Effects --- # Batch Effects: Boxplot <img src="fluid_fig/batch_data_plot-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Coefficient of Variation .pull-left[ <table class="table table-striped table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> batch </th> <th style="text-align:right;"> N </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> SD/Mean = CV (%) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 790 </td> <td style="text-align:right;"> 379 </td> <td style="text-align:right;"> 48 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 925 </td> <td style="text-align:right;"> 299 </td> <td style="text-align:right;"> 32 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 725 </td> <td style="text-align:right;"> 389 </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 951 </td> <td style="text-align:right;"> 332 </td> <td style="text-align:right;"> 35 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 690 </td> <td style="text-align:right;"> 312 </td> <td style="text-align:right;"> 45 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 867 </td> <td style="text-align:right;"> 349 </td> <td style="text-align:right;"> 40 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 837 </td> <td style="text-align:right;"> 446 </td> <td style="text-align:right;"> 53 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 914 </td> <td style="text-align:right;"> 348 </td> <td style="text-align:right;"> 38 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 883 </td> <td style="text-align:right;"> 271 </td> <td style="text-align:right;"> 31 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 763 </td> <td style="text-align:right;"> 266 </td> <td style="text-align:right;"> 35 </td> </tr> </tbody> </table> ] .pull-right[ - Coefficient of Variation (CV) = SD/Mean - Often used for quality control (reject batch with CV > `\(x\)`) ]
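
---

# Computing Batch Summaries in R

A minimal sketch of how the per-batch summaries above can be computed, assuming the simulated `batch_data` (with columns `batch` and `Biomarker`) that appears in the ANOVA on the next slide:

```r
library(dplyr)

# N, mean, SD, and CV (%) for each batch
batch_data %>%
  group_by(batch) %>%
  summarise(
    N    = n(),
    Mean = mean(Biomarker),
    SD   = sd(Biomarker),
    CV   = 100 * SD / Mean  # coefficient of variation (%)
  )
```

Batches with unusually large CV can then be flagged for re-assay or excluded according to a pre-specified quality-control rule.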
--- # Testing for Batch Effects

```r
anova(lm(Biomarker ~ batch, batch_data))

Analysis of Variance Table

Response: Biomarker
           Df   Sum Sq Mean Sq F value  Pr(>F)    
batch       9  3573109  397012    3.37 0.00051 ***
Residuals 490 57758046  117874                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

* Batch explains a significant amount of the variation in this simulated data * R note: the `batch` variable must be a `factor`, not `numeric` (otherwise the model fits a single slope for batch rather than separate batch means) --- # Batch Effects: Confounds <img src="fluid_fig/unnamed-chunk-4-1.svg" width="100%" style="display: block; margin: auto;" /> --- class: inverse, middle, center # Experimental Design for Fluid Biomarkers --- # Randomized assignment of samples to plates <img src="fluid_fig/unnamed-chunk-5-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Experimental Design for Fluid Biomarkers .large[ - Randomize samples to batches/plates - Longitudinally collected samples (samples collected over time on the same individual): - If batch effects are expected to be larger than storage effects, consider randomizing *individuals* to batches - (Keep all samples from an individual on the same plate) - Randomization can be stratified to ensure important factors (e.g. treatment group, age, APOE `\(\epsilon4\)`) are balanced ] --- # Sample Randomization We use the `R` package [SRS](https://github.com/atrihub/SRS) ("Subject Randomization System"), which we have modified to handle plate-capacity constraints and to keep samples from the same subject together. <table class="table table-striped table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Subject ID </th> <th style="text-align:left;"> Num.
of samples </th> <th style="text-align:left;"> Group </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Plate </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 11 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 10 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> young </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 12 </td> </tr> </tbody> </table> --- # Sample Randomization .pull-left[ <table class="table table-striped table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> Plate </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 11 </td> <td style="text-align:left;"> 12 </td> <td style="text-align:left;"> 13 </td> </tr> <tr> <td style="text-align:left;"> old </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> </tr> <tr> <td style="text-align:left;"> 
young </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Num. samples </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 29 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 27 </td> <td style="text-align:left;"> 30 </td> </tr> </tbody> </table> ] .pull-right[ - Numbers of young and old are well balanced across the 13 plates - Number of samples per plate is also reasonable (plate capacity was set at 30 samples) ] --- class: inverse, middle, center # Calibration --- # Calibration .large[ - Calibration: developing a map from "raw" assay responses to concentrations (ng/ml) using samples of *known* concentrations - We will explore some approaches to calibration with methods from the `R` package `calibFit` (Haaland, Samarov, and McVey, 2011; Davidian and Haaland, 1990) - The package includes some example data: - High Performance Liquid Chromatography (HPLC) and - Enzyme Linked Immunosorbent Assay (ELISA) ] --- # Calibration .pull-leftWider[ <img src="fluid_fig/calibFit_fits-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-rightNarrower[ - *Calibration* is *inverse regression* in which these fitted curves would be used to map assay responses from samples of unknown concentration (vertical axis) to concentration values (horizontal axis). - Both fits exhibit *heteroscedasticity*: the error variance is not constant with respect to Concentration - Most models assume *homoscedasticity*, or constant error variance. ] --- # Residuals (Response - Fitted values) <img src="fluid_fig/calibFit_residuals-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Typical Regression Typically, regression models are of the form: `\begin{equation} Y_{i}=f(x_i,\beta)+\epsilon_{i}, \end{equation}` where: - `\(Y_{i}\)` is the observed response/outcome for the `\(i\)`th individual ( `\(i=1,\ldots,n\)` ) - `\(x_i\)` are covariates/predictors for the `\(i\)`th individual - `\(\beta\)` are regression coefficients to be estimated - `\(f(\cdot,\cdot)\)` is the model (assumed "known" or to be estimated) - In linear regression `\(f(x_i,\beta)=x_i\beta\)` - `\(\epsilon_i\)` is the residual error - We assume `\(\epsilon_i\sim\mathcal{N}(0,\sigma^2)\)` - `\(\sigma\)` is the *constant* standard deviation (*homoscedastic*) If the standard deviation is not actually constant (*heteroscedastic*), estimates might be unreliable.
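
---

# Heteroscedasticity: A Simulated Illustration

A minimal sketch (entirely simulated data, not the HPLC/ELISA examples) of what heteroscedastic errors look like when an ordinary linear model is fit: the residual spread grows with the fitted mean, producing the fan shape seen in the calibration residual plots.

```r
set.seed(20200420)

x  <- runif(200, 1, 100)               # "known" concentrations
mu <- 10 + 5 * x                       # true mean response
y  <- mu + rnorm(200, sd = 0.10 * mu)  # error SD proportional to the mean

fit <- lm(y ~ x)                       # assumes constant error variance

plot(fitted(fit), resid(fit),
  xlab = "Fitted values", ylab = "Residuals",
  main = "Residual spread increases with the mean")
abline(h = 0, lty = 2)
```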
--- # Modeling Heteroscedastic Errors The `calibFit` package includes models of the form: `\begin{equation} Y_{ij}=f(x_i,\beta)+\sigma g(\mu_i,z_i,\theta) \epsilon_{ij}, \end{equation}` where: - `\(Y_{ij}\)` are observed assay values/responses for the `\(i\)`th individual ( `\(i=1,\ldots,n\)` ) and `\(j\)`th replicate - `\(g(\mu_i,z_i,\theta)\)` is a function that allows the variances to depend on: - `\(\mu_i\)` (the mean response `\(f(x_i,\beta)\)`), - covariates `\(z_i\)`, and - a parameter ("known" or unknown) `\(\theta\)`. - `\(\epsilon_{ij}\sim\mathcal{N}(0,1)\)` In particular, `calibFit` implements the Power of the Mean (POM) function `\begin{equation} g(\mu_i,\theta) = \mu_i^{\theta} \end{equation}` which results in `\begin{equation} \operatorname{var}(Y_{ij}) = \sigma^2\mu_i^{2\theta} \end{equation}` --- # Residuals From Fits with POM <img src="fluid_fig/calibFit_pom_residuals-1.svg" width="100%" style="display: block; margin: auto;" /> --- # HPLC Calibration With/Without POM Variance <img src="fluid_fig/unnamed-chunk-8-1.svg" width="100%" style="display: block; margin: auto;" /> --- # ELISA Calibration With/Without POM Variance <img src="fluid_fig/unnamed-chunk-9-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Calibrated Estimates for Each Sample .pull-left[ <img src="fluid_fig/unnamed-chunk-10-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fluid_fig/unnamed-chunk-11-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Calibration Statistics .large[ - _Minimum Detectable Concentration (MDC)_ is, for an increasing (decreasing) curve, the lowest concentration whose fitted response can be distinguished from the response at concentration 0 - For an increasing curve, `\(x_{\textrm{MDC}} = \min\{x : f(x, \beta) \geq \textrm{UCL}_0\}\)` - For a decreasing curve, `\(x_{\textrm{MDC}} = \min\{x : f(x, \beta) \leq \textrm{LCL}_0\}\)` - `\(\textrm{LCL}_0\)` and `\(\textrm{UCL}_0\)` are lower/upper confidence limits for the response at 0; `\(\textrm{LCL}_x\)` and `\(\textrm{UCL}_x\)` are the corresponding limits at concentration `\(x\)` - _Reliable Detection Limit (RDL)_, for an increasing (decreasing) curve, is the lowest concentration that has a high probability of producing a response that is significantly greater (less) than the response at 0. - For an increasing curve, `\(x_{\textrm{RDL}} = \min\{x : \textrm{LCL}_x \geq \textrm{UCL}_0\}\)` - For a decreasing curve, `\(x_{\textrm{RDL}} = \min\{x : \textrm{UCL}_x \leq \textrm{LCL}_0\}\)` - _Limit of Quantitation (LOQ)_ is the lowest concentration at which the coefficient of variation is less than a fixed percent (default is 20% in the `calibFit` package). ]
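
---

# Fitting a POM Variance Model: an R Sketch

The `calibFit` functions fit the POM model directly; as an alternative sketch (simulated data, and `nlme` rather than `calibFit`), the same power-of-the-mean variance structure `\(\operatorname{var}(Y) = \sigma^2\mu^{2\theta}\)` can be estimated with `nlme::gls()` and `varPower()`:

```r
library(nlme)

set.seed(1)
d  <- data.frame(conc = rep(c(1, 2, 5, 10, 25, 50, 100), each = 4))
mu <- 20 + 8 * d$conc                               # true mean response
d$resp <- mu + rnorm(nrow(d), sd = 0.5 * mu^0.6)    # SD grows with the mean

fit_const <- gls(resp ~ conc, data = d)             # constant variance
fit_pom   <- gls(resp ~ conc, data = d,
                 weights = varPower(form = ~ fitted(.)))  # var = sigma^2 * |fitted|^(2*theta)

fit_pom$modelStruct$varStruct   # estimated power (theta)
anova(fit_const, fit_pom)       # compare constant-variance and POM fits
```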
--- class: inverse, middle, center # Supervised Learning ## Classification --- # Classification .pull-leftWider[ <img src="fluid_fig/unnamed-chunk-12-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-rightNarrower[ - Data from [adni.loni.usc.edu](https://adni.loni.usc.edu) - CSF Abeta 1-42 and t-tau assayed using the automated Roche Elecsys and cobas e 601 immunoassay analyzer system - Keep only the time points associated with the first assay; ignore subsequent time points - We'll ignore MCI and focus on CN vs Dementia ] --- # Classification <img src="fluid_fig/unnamed-chunk-13-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Receiver Operating Characteristic (ROC) Curves .pull-left[ <img src="fluid_fig/unnamed-chunk-14-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ For each potential threshold applied to CSF `\(\textrm{A}\beta 42\)`, we calculate: - Sensitivity: True Positive Rate = TP/(TP+FN) - Specificity: True Negative Rate = TN/(TN+FP) This traces out the ROC curve. A typical summary of a classifier's performance is the Area Under the Curve (AUC). AUC = 0.83 in this case, with 95% CI (0.80, 0.86). AUCs close to one indicate good performance. The threshold shown here maximizes the distance between the curve and the diagonal line (chance) (Youden, 1950). ] --- # Comparing ROC Curves .pull-left[ <img src="fluid_fig/unnamed-chunk-15-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ | Marker | AUC | 95% CI | P-value `\(^*\)` | | ---------------------- |:--------------------:| ----------------------------:| ------------:| | `\(\textrm{A}\beta\)` | 0.83 | 0.8, 0.86 | | | Tau | 0.78 | 0.75, 0.82 | 0.07 | | Tau/ `\(\textrm{A}\beta\)` | 0.9 | 0.87, 0.92 | <0.001 | `\(^*\)` Bootstrap test comparing each row to `\(\textrm{A}\beta\)` (Robin, Turck, Hainard, Tiberti, Lisacek, Sanchez, and Müller, 2011) So the ratio of Tau / `\(\textrm{A}\beta\)` shows the best discrimination of CN from Dementia cases. ] --- # Youden's Cutoff for Tau / `\(\textrm{A}\beta\)` Ratio <img src="fluid_fig/unnamed-chunk-17-1.svg" width="100%" style="display: block; margin: auto;" /> Line is Tau = 0.394 `\(\times\)` Abeta (or Tau/Abeta = 0.394, Youden's cutoff)
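
---

# Youden's Cutoff in R

A sketch of how a Youden-optimal threshold can be obtained with the `pROC` package, assuming a hypothetical analysis data set `dd` with diagnosis `DX` (CN vs Dementia) and Elecsys `ABETA` and `TAU` values (the data frame and column names are assumptions, not the code behind these slides):

```r
library(pROC)

dd$ratio <- dd$TAU / dd$ABETA   # Tau/Abeta ratio

roc_ratio <- roc(dd$DX, dd$ratio, levels = c("CN", "Dementia"))

auc(roc_ratio)      # area under the ROC curve
ci.auc(roc_ratio)   # 95% confidence interval for the AUC

# threshold maximizing sensitivity + specificity - 1 (Youden, 1950)
coords(roc_ratio, x = "best", best.method = "youden")
```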
--- # Logistic Regression <table> <thead> <tr> <th style="text-align:left;"> Coefficient </th> <th style="text-align:right;"> Estimate </th> <th style="text-align:right;"> Std. Error </th> <th style="text-align:right;"> z value </th> <th style="text-align:left;"> Pr(>|z|) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.89 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> -6.7 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(ABETA) </td> <td style="text-align:right;"> -1.59 </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:right;"> -10.6 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(TAU) </td> <td style="text-align:right;"> 1.26 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 9.0 </td> <td style="text-align:left;"> <0.001 </td> </tr> </tbody> </table> --- # Logistic Regression Predicted Probabilities <img src="fluid_fig/unnamed-chunk-19-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Comparing ROC Curves .pull-left[ <img src="fluid_fig/unnamed-chunk-20-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ | Marker | AUC | 95% CI | P-value `\(^*\)` | | ---------------------- |:------------------------:| --------------------------------:| ------------:| | `\(\textrm{A}\beta\)` | 0.83 | 0.8, 0.86 | | | Tau | 0.78 | 0.75, 0.82 | 0.07 | | Tau/ `\(\textrm{A}\beta\)` | 0.9 | 0.87, 0.92 | <0.001 | | Logistic model | 0.9 | 0.87, 0.92 | <0.001 | `\(^*\)` Bootstrap test comparing each row to `\(\textrm{A}\beta\)` (Robin, Turck, Hainard, et al., 2011) Logistic model ROC is very similar to Tau/ `\(\textrm{A}\beta\)` ratio ROC. ] --- # Logistic Regression with Age and APOE <table> <thead> <tr> <th style="text-align:left;"> Coefficient </th> <th style="text-align:right;"> Estimate </th> <th style="text-align:right;"> Std. Error </th> <th style="text-align:right;"> z value </th> <th style="text-align:left;"> Pr(>|z|) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.12 </td> <td style="text-align:right;"> 0.17 </td> <td style="text-align:right;"> -6.5 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(ABETA) </td> <td style="text-align:right;"> -1.43 </td> <td style="text-align:right;"> 0.16 </td> <td style="text-align:right;"> -9.0 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(TAU) </td> <td style="text-align:right;"> 1.19 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 8.5 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> scale(I(AGE + Years.bl)) </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 1.2 </td> <td style="text-align:left;"> 0.230 </td> </tr> <tr> <td style="text-align:left;"> as.factor(APOE4)1 </td> <td style="text-align:right;"> 0.37 </td> <td style="text-align:right;"> 0.25 </td> <td style="text-align:right;"> 1.5 </td> <td style="text-align:left;"> 0.144 </td> </tr> <tr> <td style="text-align:left;"> as.factor(APOE4)2 </td> <td style="text-align:right;"> 1.26 </td> <td style="text-align:right;"> 0.45 </td> <td style="text-align:right;"> 2.8 </td> <td style="text-align:left;"> 0.005 </td> </tr> </tbody> </table> This model does not provide a much better ROC.
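
---

# Logistic Model and ROC Comparison: an R Sketch

A sketch of the logistic regression and bootstrap AUC comparison, again assuming a hypothetical complete-case data set `dd` with `DX`, `ABETA`, and `TAU` (names are assumptions; `pROC` provides the bootstrap test cited above):

```r
library(pROC)

# logistic regression for Dementia vs CN with standardized CSF markers
fit <- glm(DX == "Dementia" ~ scale(ABETA) + scale(TAU),
           family = binomial, data = dd)
summary(fit)

roc_abeta <- roc(dd$DX, dd$ABETA, levels = c("CN", "Dementia"))
roc_logit <- roc(dd$DX, predict(fit, type = "response"),
                 levels = c("CN", "Dementia"))

auc(roc_logit); ci.auc(roc_logit)

# bootstrap test comparing the two AUCs
roc.test(roc_abeta, roc_logit, method = "bootstrap")
```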
--- # Tree-based Methods <img src="fluid_fig/unnamed-chunk-24-1.svg" width="100%" style="display: block; margin: auto;" /> Hothorn, Hornik, and Zeileis (2006) --- # Tree-based Methods <img src="fluid_fig/unnamed-chunk-25-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Comparing ROC Curves .pull-left[ <img src="fluid_fig/unnamed-chunk-26-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ | Marker | AUC | 95% CI | P-value `\(^*\)` | | ---------------------- |:------------------------:| --------------------------------:| ------------:| | `\(\textrm{A}\beta\)` | 0.83 | 0.8, 0.86 | | | Tau | 0.78 | 0.75, 0.82 | 0.07 | | Tau/ `\(\textrm{A}\beta\)` | 0.9 | 0.87, 0.92 | <0.001 | | Logistic model | 0.9 | 0.87, 0.92 | <0.001 | | Binary Tree | 0.88 | 0.86, 0.91 | <0.001 | | Random Forest | 0.95 | 0.93, 0.96 | <0.001 | `\(^*\)` Bootstrap test comparing each row to `\(\textrm{A}\beta\)` (Robin, Turck, Hainard, et al., 2011) Random Forests (Breiman, 2001; Hothorn, Buehlmann, Dudoit, Molinaro, and Van Der Laan, 2006) re-fit binary trees on random subsamples of the data, then aggregate the resulting trees into a "forest". This results in smoother predictions and a smoother ROC curve. ] --- class: inverse, middle, center # Unsupervised Learning ## Mixture Modeling --- # Unsupervised Learning .large[ - The classification techniques we just reviewed can be thought of as *Supervised Learning*, in which we attempt to learn known "labels" (CN, Dementia). - *Mixture Modeling* is a type of *Unsupervised Learning* in which we try to identify clusters that appear to arise from different underlying distributions - Don't confuse *Mixture Models* with *Mixed-Effects Models* (which we'll discuss later) - Think: "Mixture of Distributions" ] --- # Distribution of ABETA <img src="fluid_fig/unnamed-chunk-28-1.svg" width="100%" style="display: block; margin: auto;" /> - Distribution is bimodal - Can we identify the two sub-distributions? - We'll explore this with the `mixtools` package (Benaglia, Chauveau, Hunter, and Young, 2009) --- # Distribution of ABETA <img src="fluid_fig/unnamed-chunk-30-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Posterior Membership Probabilities <table> <thead> <tr> <th style="text-align:right;"> Abeta </th> <th style="text-align:right;"> Prob. Abnormal </th> <th style="text-align:right;"> Prob.
Normal </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1033 </td> <td style="text-align:right;"> 0.58 </td> <td style="text-align:right;"> 0.42 </td> </tr> <tr> <td style="text-align:right;"> 1036 </td> <td style="text-align:right;"> 0.57 </td> <td style="text-align:right;"> 0.43 </td> </tr> <tr> <td style="text-align:right;"> 1044 </td> <td style="text-align:right;"> 0.53 </td> <td style="text-align:right;"> 0.47 </td> </tr> <tr> <td style="text-align:right;"> 1048 </td> <td style="text-align:right;"> 0.52 </td> <td style="text-align:right;"> 0.48 </td> </tr> <tr> <td style="text-align:right;"> 1061 </td> <td style="text-align:right;"> 0.46 </td> <td style="text-align:right;"> 0.54 </td> </tr> <tr> <td style="text-align:right;"> 1071 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> 0.58 </td> </tr> <tr> <td style="text-align:right;"> 1071 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> 0.58 </td> </tr> <tr> <td style="text-align:right;"> 1072 </td> <td style="text-align:right;"> 0.41 </td> <td style="text-align:right;"> 0.59 </td> </tr> </tbody> </table> --- ## Bivariate Density <iframe src="3dfig.html" width="100%" height="500" id="igraph" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- # Bivariate Density Contour Plot <img src="fluid_fig/unnamed-chunk-33-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Bivariate Mixture Model Posterior Probabilities .pull-left[ <img src="fluid_fig/unnamed-chunk-34-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fluid_fig/unnamed-chunk-35-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Summary .large[ - Batch Effects - Experimental Design (Sample Randomization) - Statistical Models for Assay Calibration/Quantification - Classification (Supervised Learning) - Logistic Regression - Binary Trees - Random Forest - Mixture Modeling (Unsupervised Learning) - Univariate - Bivariate ] --- # References Benaglia, T, D. Chauveau, D. R. Hunter, et al. (2009). "mixtools: An R Package for Analyzing Finite Mixture Models". In: _Journal of Statistical Software_ 32.6, pp. 1-29. URL: [http://www.jstatsoft.org/v32/i06/](http://www.jstatsoft.org/v32/i06/). Breiman, L. (2001). "Random forests". In: _Machine learning_ 45.1, pp. 5-32. Davidian, M. and P. D. Haaland (1990). "Regression and calibration with nonconstant error variance". In: _Chemometrics and Intelligent Laboratory Systems_ 9.3, pp. 231-248. Haaland, P, D. Samarov, and E. McVey (2011). _calibFit: Statistical models and tools for assay calibration_. R package version 2.1.0. URL: [https://CRAN.R-project.org/package=calibFit](https://CRAN.R-project.org/package=calibFit). Hothorn, T, P. Buehlmann, S. Dudoit, et al. (2006). "Survival Ensembles". In: _Biostatistics_ 7.3, pp. 355-373. Hothorn, T, K. Hornik, and A. Zeileis (2006). "Unbiased Recursive Partitioning: A Conditional Inference Framework". In: _Journal of Computational and Graphical Statistics_ 15.3, pp. 651-674. Robin, X, N. Turck, A. Hainard, et al. (2011). "pROC: an open-source package for R and S+ to analyze and compare ROC curves". In: _BMC Bioinformatics_ 12, p. 77. Youden, W. J. (1950). "Index for rating diagnostic tests". In: _Cancer_ 3.1, pp. 32-35.