Whenever you learn a Bayesian network from a small dataset, you must consider whether the number of observations is sufficient for correctly estimating all Probability Tables and Conditional Probability Tables in the network.
For instance, using the Occurrences Report, you can evaluate whether all Conditional Probability Tables in your network meet the rule-of-thumb criterion of at least 5 observations per cell.
For a deeper analysis, BayesiaLab can produce the Confidence Intervals Report, which we discuss on this page.
To understand how Confidence Intervals can be computed, we first need to explain how the probabilities in the Probability Tables and Conditional Probability Tables, the so-called parameters, are estimated.
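In BayesiaLab, these parameters are estimated using Maximum Likelihood, i.e., using the frequencies observed in the dataset:

$$\hat{P}(X = x_i) = \frac{N(X = x_i)}{N}, \qquad N = \sum_j N(X = x_j)$$

That is, the estimated probability of a state is the number of times that state was observed divided by the number of observations the estimate is based on.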
So, the Parameter Estimation is straightforward and happens entirely in the background in BayesiaLab.
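To make the frequency-based estimate concrete, here is a minimal Python sketch. It is not BayesiaLab code: the function name `estimate_probabilities` and the sample data are hypothetical and serve only to illustrate how a probability is obtained from counts.

```python
from collections import Counter


def estimate_probabilities(observations):
    """Maximum Likelihood estimate: the relative frequency of each state."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    return {state: count / total for state, count in counts.items()}


# Hypothetical observations of a single node with three states.
sample = ["Normal"] * 6 + ["Overweight"] * 3 + ["Obese"] * 1
probabilities = estimate_probabilities(sample)
print(probabilities)
```

Note that the result only contains the ratios; the counts that produced them are no longer visible.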
As a result, we may not always be aware of what numbers gave rise to the probabilities we see in a Probability Table or Conditional Probability Table, as the following diagram illustrates:
However, in terms of our confidence in the estimate, the two approaches are not the same. Our intuition tells us that we should have more confidence in the 0.1 value calculated based on the sample of 10,000.
BayesiaLab uses precisely this approach, the Frequentist Confidence Interval for a proportion, in the Confidence Intervals Report.
However, in BayesiaLab, you can avoid resorting to the Rule of Three heuristic by using Uniform Prior Samples.
Within this network, focus on the three nodes BMI, Age, and Gender:
Go to Main Menu > Network > Reports > Confidence Intervals to start the Confidence Intervals Report.
The Confidence Intervals Report window opens up.
At the top of the report, the Confidence Level that serves as the basis for the reported Confidence Intervals is displayed.
Then, for each node, one table is shown.
For each cell containing a parameter estimate, an adjacent cell to the right displays the corresponding Confidence Interval in percentage points.
The color-coding scheme is identical to the one used in the Occurrences Report.
The fields in the report are color-coded to highlight potential issues:
Cells with 0 Occurrences are marked with a red background.
Cells with 5 Occurrences are highlighted with a yellow background. This is generally considered the minimum acceptable number of Occurrences.
Cells with 40 or more Occurrences are marked with a green background.
You can adjust the Confidence Level used for this report.
Go to Main Menu > Window > Preferences > Tools > Statistical Tools.
Select the desired value from the Confidence Level dropdown menu.
Note that your selection here also applies to all other statistical tools and tests used in BayesiaLab.
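As an illustration of how a Confidence Level translates into the multiplier used in a normal-approximation interval, the following Python sketch computes the two-sided standard normal quantile. The function name `z_multiplier` is hypothetical, and it is an assumption that BayesiaLab derives its multiplier in this standard way; the source does not document the internal computation.

```python
from statistics import NormalDist


def z_multiplier(confidence_level):
    """Two-sided standard normal quantile for a confidence level in (0, 1)."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be strictly between 0 and 1")
    tail = (1.0 - confidence_level) / 2.0  # probability mass in each tail
    return NormalDist().inv_cdf(1.0 - tail)


# A 95% Confidence Level corresponds to the familiar multiplier of about 1.96.
for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 3))
```

A higher Confidence Level yields a larger multiplier and therefore a wider interval for the same data.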
In the Maximum Likelihood estimate, $\hat{P}$ is the estimated probability, $x_i$ is the state of variable $X$, and $N(\cdot)$ represents the number of occurrences of the argument in the dataset.
So, BayesiaLab could have estimated a probability of 0.1 (or 10%) for $X = x_i$ in numerous ways, e.g., based on a sample of 10 or 10,000: $\frac{1}{10} = \frac{1{,}000}{10{,}000} = 0.1$.
From Frequentist Statistics, we know how to calculate a Confidence Interval for a proportion in a sample, which is exactly what the parameter represents.
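So, for a Confidence Level of 95%, the Confidence Interval is calculated as:

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{N}}$$

where $\hat{p} = \hat{P}(X = x_i)$ is the estimated probability and $N$ is the number of observations the estimate is based on.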
If no occurrences were observed for a given state, e.g., $N(X = x_i) = 0$, the Rule of Three would have to be used instead to produce Confidence Intervals:

$$\left[0, \frac{3}{N}\right]$$
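The two calculations described above, the 95% interval for a proportion and the Rule of Three for a state with zero occurrences, can be sketched in Python as follows. This is an illustration only, not BayesiaLab's implementation: the function names are hypothetical, the interval is assumed to be the standard normal-approximation interval for a proportion, and clamping the bounds to the range from 0 to 1 is an added assumption.

```python
import math

Z_95 = 1.96  # standard normal multiplier for a 95% Confidence Level


def margin_95(occurrences, total):
    """Half-width of the normal-approximation interval for a proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = occurrences / total
    return Z_95 * math.sqrt(p * (1.0 - p) / total)


def confidence_interval_95(occurrences, total):
    """Return (lower, upper) bounds at the 95% level."""
    if total <= 0:
        raise ValueError("total must be positive")
    if occurrences == 0:
        # Rule of Three: with zero occurrences, use the interval [0, 3/N].
        return 0.0, min(1.0, 3.0 / total)
    p = occurrences / total
    margin = margin_95(occurrences, total)
    # Assumption: clamp the bounds so they remain valid probabilities.
    return max(0.0, p - margin), min(1.0, p + margin)


# The same estimate of 0.1, obtained from two very different sample sizes.
small = margin_95(1, 10)
large = margin_95(1000, 10000)
print(f"n=10:     0.1 +/- {small:.4f}")
print(f"n=10,000: 0.1 +/- {large:.4f}")
print("zero occurrences in 100:", confidence_interval_95(0, 100))
```

With a sample of 10, the margin is roughly 18.6 percentage points, whereas with a sample of 10,000 it shrinks to roughly 0.6 percentage points, which matches the intuition that the larger sample deserves more confidence.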
To illustrate the Confidence Intervals Report, we use the following network: NHANES_DEMO_BMX.xbl