ANOVA, short for “Analysis of Variance”, is a statistical tool used to test for differences between the means of two or more groups. Analysts often use ANOVA to determine the influence that independent variables have on a dependent variable of interest.
In Uncountable, you can run ANOVA by clicking on the Calculate tab, and then “Run ANOVA”.
First, select the output you are most interested in analyzing by clicking on the “Select Output” box.
Next, you will be able to select the different components, or “features”, you wish to include in the analysis. These features can be categorical inputs, like the type of instrument used or a particular instrument setting, or they can be continuous variables, like the amount of polymer or solvent in the formulation, or even input calculations such as total cost or calculated specific gravity. Numeric features are listed in the middle dropdown labeled “Select Numeric Variables”, and categorical features are on the right side of the page under “Select Categorical Variables”. You may select multiple features of each type.
In the example below, I’ve selected Tensile as my property of interest, along with one continuous variable and one categorical variable as features. After that, one equation and three tables will appear on the ANOVA page.
At the top of the page, you will see the equation with the coefficients and features used in the ANOVA written out. By default, the ANOVA analysis uses a linear function. If you would like to add more complicated terms, like polynomial or second-order terms, you can do so by clicking the “OPTIONS” button on the right side of the Features panel.
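For intuition, a linear model of this form can be sketched as an ordinary least-squares fit. All data, feature names, and coefficients below are made up for illustration; this is not Uncountable’s actual implementation.

```python
import numpy as np

# Hypothetical data: tensile strength modeled from one numeric feature
# (polymer amount) and one two-level categorical feature (instrument),
# dummy-coded as 0/1.
polymer = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
instrument_b = np.array([0, 0, 0, 1, 1, 1])  # 1 if Instrument B was used
tensile = np.array([10.1, 12.0, 13.8, 17.0, 19.2, 20.9])

# Design matrix: intercept, numeric term, categorical dummy term
X = np.column_stack([np.ones_like(polymer), polymer, instrument_b])
coef, *_ = np.linalg.lstsq(X, tensile, rcond=None)

intercept, b_polymer, b_instrument = coef
print(f"tensile ≈ {intercept:.2f} + {b_polymer:.2f}*polymer "
      f"+ {b_instrument:.2f}*[Instrument B]")
```

Each coefficient scales its feature’s contribution to the predicted output, which is what the equation at the top of the page displays.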
Table 1 lists the summary statistics for the output in question, including the number of samples, as well as the predictive accuracy of a model trained on the specific features selected on the page. The predictive accuracy can be assessed using the root mean squared error (RMSE) and r^2 values.
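A minimal sketch of how these two accuracy metrics are computed from observed and predicted values (all numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical observed outputs and model predictions
observed = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
predicted = np.array([10.5, 11.5, 14.2, 15.8, 18.4])

residuals = observed - predicted
rmse = np.sqrt(np.mean(residuals ** 2))          # typical prediction error

ss_res = np.sum(residuals ** 2)                  # sum of squares of residuals
ss_tot = np.sum((observed - observed.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot                  # fraction of variance explained

print(f"RMSE = {rmse:.3f}, r^2 = {r_squared:.3f}")
```

A smaller RMSE and an r^2 close to 1 both indicate that the model’s predictions track the observed data closely.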
Table 2 describes how the variation in the output is attributed to the features used in the ANOVA model. The key columns to observe in this table are “% of total variation” and “p Value”. The % of total variation is the percentage of the variability in the output that can be attributed to the feature listed in the first column; the higher the value, the greater the effect that feature has on the output. Any unexplained variation in the output is reported under the “Residuals” row. The “p Value” is the probability of observing a variance reduction at least as extreme as the one measured if the data had been generated under the null hypothesis, i.e., if the feature had no real effect on the output. The lower the p Value, the stronger the evidence that the feature’s impact on the output is not due to random noise.
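For intuition, the “% of total variation” and the p value for a single categorical feature can be sketched with a one-way ANOVA. The groups and values below are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical tensile measurements for three instrument settings
group_a = np.array([10.0, 11.0, 10.5, 11.5])
group_b = np.array([13.0, 14.0, 13.5, 14.5])
group_c = np.array([12.0, 12.5, 11.5, 12.0])

all_vals = np.concatenate([group_a, group_b, group_c])
grand_mean = all_vals.mean()

# Sum of squares explained by the feature (between-group variation)
ss_feature = sum(len(g) * (g.mean() - grand_mean) ** 2
                 for g in (group_a, group_b, group_c))
ss_total = np.sum((all_vals - grand_mean) ** 2)

pct_of_total = 100 * ss_feature / ss_total  # "% of total variation"
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

print(f"% of total variation = {pct_of_total:.1f}%, p = {p_value:.2g}")
```

Here the instrument setting explains most of the variation, and the small p value indicates the effect is very unlikely to be random noise.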
Table 3 highlights the number of samples used for each feature, the mean value for each feature, and the linear regression coefficient for each feature.
In both Tables 2 and 3, the p Summary column shows a shorthand notation based on the p value (the column to its left):
p <= 0.0001: ****
0.0001 < p <= 0.001: ***
0.001 < p <= 0.01: **
0.01 < p <= 0.05: *
Nothing is displayed if p is above 0.05.
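This mapping can be expressed as a small helper function (a sketch for illustration, not Uncountable’s code):

```python
def p_summary(p: float) -> str:
    """Map a p value to the star notation shown in the p Summary column."""
    if p <= 0.0001:
        return "****"
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return ""  # nothing is displayed above 0.05

print(p_summary(0.0004))  # → "***"
```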
Treating Replicates
By default, we use the average of experiment replicates. For example, if Experiment 1 has three tensile replicates of 10, 12, and 14, the model takes their average, 12, and treats it as a single data point. However, if you prefer to treat replicates as separate data points (this can increase the number of data points and result in a smaller p value), uncheck the “Use Average of Experiment Replicates” box, which is directly below the “Select Output” box.
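A sketch of how the two treatments can change the p value, using hypothetical replicate data and a one-way ANOVA from scipy. Whether unchecking the box lowers the p value depends on your data; it is not guaranteed.

```python
import numpy as np
from scipy import stats

# Hypothetical tensile replicates for experiments in two groups
group_1 = [[10.0, 12.0, 14.0], [11.0, 12.0, 13.0], [10.5, 12.5, 12.8]]
group_2 = [[14.0, 16.0, 15.0], [15.0, 17.0, 16.0], [14.5, 15.5, 16.3]]

# Default behavior: average replicates -> one data point per experiment
avg_1 = [np.mean(reps) for reps in group_1]
avg_2 = [np.mean(reps) for reps in group_2]
_, p_averaged = stats.f_oneway(avg_1, avg_2)

# Unchecked box: every replicate becomes its own data point
flat_1 = [v for reps in group_1 for v in reps]
flat_2 = [v for reps in group_2 for v in reps]
_, p_separate = stats.f_oneway(flat_1, flat_2)

# In this example, the larger sample yields a smaller p value
print(f"averaged: p = {p_averaged:.4g}, separate: p = {p_separate:.4g}")
```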
Export your results
Once you have completed your analysis, you can export the table of coefficients and the associated predictions for each data point included in the analysis by clicking the “Export to XLSX” button on the right side of the screen.
ANOVA-related FAQs
Why are there three different tables? What is each table useful for?
- The first table contains Summary Statistics about the overall model and the variables you have incorporated into it. The Root Mean Square Error is an overall indicator of how accurate the model’s predictions are in general, and the r^2, or Coefficient of Determination, is an overall indicator of how much of the spread, or variance, of the data is captured by the model.
- The second table is the ANOVA (Analysis of Variance) table, and includes results for tests of statistical significance. This table helps to determine whether specific individual variables are significant and should be included in the overall model and analysis.
- The third table displays the Coefficients of the model. These coefficients are multiplied with the independent variable terms as part of the overall multilinear model, as displayed in the equation at the top of the ANOVA tools view.
Why are there two different p-values in two different tables?
- P-values are part of a procedure known as hypothesis testing. For both p-values, a lower value indicates greater significance. A common threshold for statistical significance in scientific usage is p <= 0.05.
- In the ANOVA table, the p-value tests the significance of a particular variable, i.e., whether it contributes to explaining the variance in the overall model. This first p-value helps you decide whether it is worthwhile to include a particular variable in the model.
- In the Coefficients table, the p-value comes from a t-test of whether the particular variable’s coefficient is significantly different from zero. This can be helpful in assessing the reliability of the magnitude of that coefficient.
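For a single numeric feature, the two tests are closely related (the F statistic equals the squared t statistic, so the p values coincide); they generally differ for categorical features with more than two levels. A sketch with hypothetical data:

```python
import numpy as np
from scipy import stats

# Hypothetical data: one numeric predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.2, 11.8, 14.1, 15.9, 18.2, 19.8])

res = stats.linregress(x, y)  # slope, intercept, rvalue, pvalue, stderr

# Coefficients-table style t-test: is the slope different from zero?
t_stat = res.slope / res.stderr
p_from_t = 2 * stats.t.sf(abs(t_stat), df=len(x) - 2)

# ANOVA-table style F-test: does x explain variance in y?
f_stat = t_stat ** 2  # with one numeric term, F = t^2
p_from_f = stats.f.sf(f_stat, 1, len(x) - 2)

print(f"t-test p = {p_from_t:.3g}, F-test p = {p_from_f:.3g}")
```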
How do I decide whether to include or exclude an input/independent variable?
- Include variables that, in your scientific understanding, would affect the final measurable outcome, and that vary through the experiments. One simple way to check whether the variables are significant is to see if the p-value of the ANOVA table for that particular variable is small enough.
Why can’t I just add all available variables as predictors to build a large model?
- Technically speaking, you can include all available variables as predictors.
- Most often, we want to include only as many variables as are needed to adequately explain the dependent variable. Including too many independent variables can cause over-fitting, where the model loses predictive power on unseen, future data points.
How do I decide whether my overall model is good?
- Three simple indicators: Root Mean Square Error, r^2, and the Sum of Squares of the Residuals. The first two are included in Table 1 and the last is included in Table 2.
- The Root Mean Square Error should be small.
- The r^2 should be high, ideally close to 1.
- The Sum of Squares of the Residuals should be small.