## Introduction

These are my lecture notes for the UC Berkeley course Data Science for Research Psychology, instructed by Professor Charles Frye. This post covers some basic data science and statistics. Most of the content below is from the course's lecture slides, with some of my own understanding added. You can check the original slides here. If there are any copyright issues, please contact me: haroldliuj@gmail.com.

There is some Chinese in this post, since I think my native language can explain some points more accurately. If you can understand Chinese, that's great. If you can't, that content won't keep you from getting the whole picture. Feel free to translate it with Google!

Have a nice trip!

## Lecture 07 Categorical Effects

### Mixture Distribution

- The combination of different distributions: each observation is drawn from one of several component distributions
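As a quick illustration (the component parameters here are made up, not from the slides), we can sample from a 50/50 mixture of two Normals in plain Python:

```python
import random

def sample_mixture(n, weight=0.5):
    # each draw comes from one of two component Normals,
    # chosen at random with probability `weight`
    draws = []
    for _ in range(n):
        if random.random() < weight:
            draws.append(random.gauss(0.0, 1.0))  # component 1
        else:
            draws.append(random.gauss(5.0, 1.0))  # component 2
    return draws

draws = sample_mixture(1000)
```

A histogram of `draws` would show two bumps, one per component.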

### Categorical Effects Models

- Mixture distribution-based models are also called **categorical effects models**
- Whenever the parameters of one variable depend on another grouping variable, we say there is a categorical effect: the *category* or *group* that an observation comes from has an *effect* on the distribution we choose to model the observation.

#### ANOVA Model

- Normal mixture distribution

### Analysis of Variance

- Core idea:
**if group membership has a strong effect on an observed variable, then splitting the data up by group should reduce the variance substantially, while the mean squared difference of the groups should go up. This isn’t true of every possible way that an observed variable might depend on group membership, but it is true in many cases, as described below in “When to Use ANOVA”.**

#### Implicit Model in ANOVA

- Grand Mean: The average value of all observations
- Group Effect: The difference between that grand mean and the average of observations in a particular group
- Unknown Effects: The difference between the sum of the first two terms and the value that was observed. (Gaussian)
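The three-part decomposition above can be sketched in plain Python (the group names and numbers are invented for illustration):

```python
data = {"corgi": [10.0, 12.0, 11.0], "husky": [20.0, 22.0, 21.0]}

all_values = [x for group in data.values() for x in group]
grand_mean = sum(all_values) / len(all_values)       # average of everything

# group effect: group mean minus the grand mean
group_effects = {g: sum(xs) / len(xs) - grand_mean for g, xs in data.items()}

# unknown effects: whatever is left over for each observation
residuals = {g: [x - (grand_mean + group_effects[g]) for x in xs]
             for g, xs in data.items()}
```

Each observation is then exactly grand mean + group effect + residual.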

### $F$ Statistic

#### When to use ANOVA

- However, in the case that we have more than two groups (more than two dog breeds), this requires us to perform multiple hypothesis tests. Each time we perform a hypothesis test, there is a chance of a false positive. As we perform more and more hypothesis tests, the chance that at least one of them fails goes up very quickly. This chance is called the *familywise error rate*, since it's the rate at which an entire family of tests has an error. Issues with rising familywise error rates are called issues of *multiple comparisons*, since they arise from comparing multiple test statistics to their critical values.
- The utility of ANOVA comes from the fact that it lets us test the hypothesis that the mean of **at least one of the levels of the variable we are using to group our observations** is different from the overall mean without specifying which level it is.
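A rough sketch of the $F$ statistic computation (the data is made up; this mirrors the standard between/within mean-square ratio, not code from the slides):

```python
def f_statistic(groups):
    # F = between-group mean square / within-group mean square
    k = len(groups)                           # number of groups
    n = sum(len(g) for g in groups)           # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    between = sum(len(g) * (gm - grand_mean) ** 2
                  for g, gm in zip(groups, group_means))
    within = sum(sum((x - gm) ** 2 for x in g)
                 for g, gm in zip(groups, group_means))

    return (between / (k - 1)) / (within / (n - k))

# well-separated groups give a large F; similar groups give F near 1
f = f_statistic([[10.0, 11.0, 12.0], [20.0, 21.0, 22.0]])
```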

## Lecture 08 Multi-way Modeling

We can use a multi-way model to see many variables' effects on the value that we want to observe.

### Randomness

- Our ignorance of some factors leads to randomness
- If we knew which factors were present and absent, the data would look deterministic

### Use Posterior to See Effects

```python
with pm.Model() as synthetic_data_model:
    ...
```

```python
with synthetic_data_model:
    ...
```

Then use the posterior samples of `mus` to calculate the effect of one factor, or of two factors together.

#### See factors’ effect separately

- Set a baseline
- Subtract the baseline's mean from the mean of the group where one factor occurs and the other doesn't

#### See two factors’ interaction

##### No Interaction

If we plot one factor with the other either present or absent: if the two graphs overlap greatly, we can say that Factor 2 does not interact with Factor 1.

##### Compute Interaction

- Two factors share the same baseline

```python
def compute_interaction_effect(accuracies):
    ...
```

The idea here is that if the two factors do not interact, then `prediction_from_separate` is similar to `actually_observed_value`, since the combined effect of the two factors should equal the sum of their separate effects. So `interaction_effect` is the difference between `actually_observed_value` and `prediction_from_separate`.
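A minimal pure-Python sketch of this computation; the dictionary layout and the numbers are my own, not the course code:

```python
def compute_interaction_effect(accuracies):
    # accuracies maps (factor1_present, factor2_present) -> mean accuracy
    baseline = accuracies[(False, False)]
    effect1 = accuracies[(True, False)] - baseline
    effect2 = accuracies[(False, True)] - baseline

    # no interaction: the combined effect is just the sum of separate effects
    prediction_from_separate = baseline + effect1 + effect2
    actually_observed_value = accuracies[(True, True)]

    return actually_observed_value - prediction_from_separate

accuracies = {(False, False): 0.5, (True, False): 0.6,
              (False, True): 0.7, (True, True): 0.9}
interaction = compute_interaction_effect(accuracies)
```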

Two factors have different baselines

- In this case, `hard` regards `young` as the baseline and `old` regards `young` as the baseline

```python
with pm.Model() as eeg_combined_model:
    means = pm.Normal("mus", mu=0, sd=1e6, shape=(3, 2))
    sigma = pm.Exponential("sigma", lam=0.1)
    observations = pm.Normal("d", mu=means[difficulty_indexer, age_indexer],
                             sd=sigma, observed=data["d"])
```

In this case:

```python
def compute_delta_age_easy(mus):
    ...
```

```python
def compute_interaction_effect(mus):
    ...
```

## Lecture 09 Regression

Here we use Sir Francis Galton's parent-child height dataset (source).

First, we map the parents' mid-height to a categorical variable (0-5).

### Categorical Model

```python
grand_mean = df.groupby("midparental_height_category_idx")["height"].mean().mean()
```

##### We can use the mean as a predictor for a categorical model with a Normal likelihood

But we cannot encode every continuous variable as categorical: if the numerical values are meaningful (e.g. they have an order relationship), categories are unnatural, because true categorical variables do not depend on order.

### Numerical Regression Model (Linear Regression)

#### Basic Idea:

We can represent the predicted height as:

$$\text{predicted height} = \text{slope} \times \text{midparental height} + \text{intercept}$$

so we can assume:

$$\text{height} \approx \text{predicted height}$$

**But there are issues with this:**

- We didn't define what we meant by "$\approx$". This left us with no choice but to pick the parameters of the line by hand.
- We no longer have a probabilistic model for our data.
- It's unclear how we might express our uncertainty about the size of our errors or our uncertainty about the parameters of the line.

#### Solution:

We solve both of these issues at once by specifying a likelihood

##### Build Model

```python
with pm.Model() as linear_model:
    ...
```

The `mu = slope * midparental_height + intercept` part gives the line that we had in the last section, but with `sigma` we also know the expected variability in observed values.

##### Make Predictions

We can use `pm.find_MAP` to find the best slope and intercept given the data we observed.

#### Linear Regression Model

If the function relating the two variables is linear, the model is a linear regression model.

#### Logistic Regression Model

```python
with pm.Model() as golf_binomial:
    ...
```

### Deeper Discussion of the Linear Model

#### Log Probabilities

Taking the logarithm of the Normal distribution turns the ratio term in the density into a subtraction and turns the shape of the curve from a bell into a parabola.

- Notice two things: first, the value most probably observed is equal to the prediction, and second, lots of other values are also quite probable — more so the closer they are to the prediction.
- The “steepness” of the parabola and its height also depend on the standard deviation, as we can see if we use the logarithm rules on the logarithm of the Normal distribution function.

Mathematical details below:
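A sketch of those details, applying the logarithm rules to the Normal density

$$p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

gives

$$\log p(x \mid \mu, \sigma) = -\frac{(x-\mu)^2}{2\sigma^2} - \log\sigma - \frac{1}{2}\log 2\pi,$$

a downward parabola in $x$ centered on the prediction $\mu$, whose steepness and height both depend on $\sigma$.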

##### Maximum Likelihood Estimation

Changing the parameters to maximize the value of the log-likelihood.
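A toy illustration in plain Python (the data and the grid are made up): brute-force search over candidate means shows that the Normal log-likelihood is maximized at the sample mean.

```python
import math

def normal_loglik(mu, sigma, data):
    # log-likelihood of the data under Normal(mu, sigma)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

heights = [63.2, 65.1, 66.0, 67.4, 70.3]  # made-up data

# brute-force MLE: try a grid of candidate means, keep the best
candidates = [60 + 0.01 * k for k in range(1001)]
best_mu = max(candidates, key=lambda m: normal_loglik(m, 2.0, heights))

sample_mean = sum(heights) / len(heights)
```

`best_mu` lands (up to the grid resolution) on `sample_mean`.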

**Doing MLE in pyMC**

`pyMC` doesn't have a `find_MLE` function directly, but **with the right choice of prior, we can turn MAP inference into MLE**.

- Since MAP inference maximizes $p(\text{params} \mid \text{data}) \propto p(\text{data} \mid \text{params})\, p(\text{params})$:
- If we choose a prior where $p(slope, intercept, \sigma)$ is constant, then maximizing the posterior is the same as maximizing the likelihood $p(\text{data} \mid \text{params})$.
- The idea here is that the more likely the data looks given the parameters, the more likely it is that those parameters are correct.

```python
with pm.Model() as ordinary_least_squares:
    ...
```

This is called the **“ordinary least squares”** model because

- maximizing the likelihood means minimizing the squared error
- it’s the “ordinary” or “typical” model for frequentists

#### Measure the Performance

##### Variance Explained

- $VE = 1$ means perfect performance
- If the variance explained by a linear model is not 0, we say the variables are correlated
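A small sketch of the computation (function names and data are mine): variance explained is one minus the ratio of residual variance to total variance.

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_explained(observed, predicted):
    # VE = 1 - variance of the residuals / variance of the observations
    residuals = [o - p for o, p in zip(observed, predicted)]
    return 1 - variance(residuals) / variance(observed)

observed = [1.0, 2.0, 3.0, 4.0]
# perfect predictions explain all of the variance
ve_perfect = variance_explained(observed, observed)
# predicting the mean everywhere explains none of it
ve_mean = variance_explained(observed, [2.5] * 4)
```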

##### $R^2$ (Correlation)

#### z-scoring(standardize)

- The correlation is equal to the slope of the MLE regression line for **standardized data**
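We can check this claim numerically in plain Python (made-up data): the OLS slope fitted to z-scored variables equals the Pearson correlation coefficient.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def zscore(xs):
    # standardize: subtract the mean, divide by the standard deviation
    m = mean(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

def ols_slope(x, y):
    # slope of the least-squares line predicting y from x
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 4.0, 3.5, 5.5]

# the slope on standardized data is the correlation coefficient
r = ols_slope(zscore(x), zscore(y))
```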

#### Measure the Correlation

##### ROPE: Region of Practical Equivalence

- The core idea of ROPE is that if two variables of interest are linearly correlated, there will be a nonzero `slope` when we regress one of them (the dependent variable) on the other (the independent variable). If the slope is close to 0, we can say these two variables have no linear correlation.
- If the fraction of posterior samples that are inside of the ROPE for that parameter is large (e.g. $\geq 0.99$), then we can say **we are very certain the parameter is not meaningfully different from 0**
- If the **MAP** estimate of the parameter is exactly 0, we can say **the most likely single choice for the parameter is 0**
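Computing the fraction of posterior samples inside a ROPE is a one-liner; the interval and the samples below are made up:

```python
def fraction_in_rope(samples, low=-0.05, high=0.05):
    # fraction of posterior samples that fall inside the ROPE [low, high]
    inside = sum(1 for s in samples if low <= s <= high)
    return inside / len(samples)

slope_samples = [0.01, -0.02, 0.03, 0.2, -0.01]  # made-up posterior samples
frac = fraction_in_rope(slope_samples)
```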

#### Ridge Regression

Just make the prior of the slope Normal:

```python
with pm.Model() as ridge_regression_model:
    ...
```

We can see the overlap between the posterior and the ROPE is now much stronger. But the MAP value is still not 0.

#### LASSO Regression

Set the slope's prior to `Laplace`:

```python
with pm.Model() as lasso_regression_model:
    ...
```

- In LASSO regression, the MAP estimate is 0. But note that the posterior probability of the parameter being exactly 0 is still 0, because the posterior is a density over continuous values.

#### If our data is not Normal, we can use a different likelihood

- One of the most common causes of non-Normality is *outliers*: large, rare effects.

##### Solution: Robust Regression

- A technique for doing accurate regression in the presence of outliers
- Specifically, choose a probability distribution with "heavy tails", like `Cauchy`, `StudentT`, or `Laplace`

## Lecture 10 Formulas and Linear Models

### Building Models with Formulas

For example, the model

becomes

So the model

becomes

```python
with pm.Model() as mean_glm_model:
    ...
```

#### Build Categorical Model with Formula

```python
with pm.Model() as categorical_glm_model:
    ...
```

By default, categorical variables are handled by converting them to `"Treatment Coding"`: `Intercept` corresponds to what we called the "mean of the baseline group", and `C(is_setosa)[T.True]` corresponds to what we called the "effect of the `is_setosa` factor".

#### Multiway Categorical Linear Model

##### Prediction

Define a function, `make_predictions_twoway`, that takes as its first two arguments two `Series` of factor indices, one for the first factor and one for the second factor, and an `np.ndarray` of `group_means` as its third argument, and returns a `Series` of predictions from the two-way categorical model with those `group_means` as its parameters. The index for factor 1 should be used to select the row of `group_means`, while the index for factor 2 should be used to select the column.

So if the `group_means` were the array

```python
[[0, 1, 2],
 ...]
```

then `make_predictions_twoway` would return

```python
pd.Series([0, 1, 5])
```

on the inputs

```python
pd.Series([0, 0, 1]), pd.Series([0, 1, 2])
```

```python
def make_predictions_twoway(factor1, factor2, group_means):
    ...
```
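A pure-Python sketch following the spec above, with plain lists standing in for `Series` and `ndarray`; the second row of `group_means` is hypothetical (the source truncates it), chosen so its last entry matches the expected output:

```python
def make_predictions_twoway(factor1, factor2, group_means):
    # factor 1 selects the row of group_means, factor 2 the column
    return [group_means[i][j] for i, j in zip(factor1, factor2)]

group_means = [[0, 1, 2],
               [3, 4, 5]]  # second row is hypothetical
preds = make_predictions_twoway([0, 0, 1], [0, 1, 2], group_means)
```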

##### Model Encoding

**Encoding Multiway Categorical Data**

In the linear model encoding of data for categorical models, including multiway models, each data point is still represented by a `Series` beginning with `1`. The other entries of the `Series` are all `1` or `0`. As with a one-way model, the length of the `Series` is equal to the total number of groups. In addition to starting with a `1`, there are `1`s to indicate whether this data point comes from a non-zero level of each factor and whether it comes from a given _combination_ of non-zero levels, as in

```python
[1, int(factor1_idx == 1), int(factor1_idx == 2), ...
```

for a model with two factors with total numbers of factor levels `J` and `K`.

We will focus on the simplest case: a multiway model with two factors, each of which has two levels. We will call such a model a _two-by-two_ model. With two levels of each of the two factors, for a total of four groups, a datapoint from level 0 of both factors would be coded as

```python
[1, 0, 0, 0]
```

and one from level 1 of both factors would be coded as

```python
[1, 1, 1, 1]
```

while a datapoint from level 1 of factor 1 and level 0 of factor 2 would be coded as

```python
[1, 1, 0, 0]
```

```python
def encode_data_twobytwo(factor1_idx, factor2_idx):
    ...
```
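A sketch matching the three encodings listed above (the function body is my reconstruction, not the course solution):

```python
def encode_data_twobytwo(factor1_idx, factor2_idx):
    # [constant, factor 1 at level 1, factor 2 at level 1, both at level 1]
    return [1,
            int(factor1_idx == 1),
            int(factor2_idx == 1),
            int(factor1_idx == 1 and factor2_idx == 1)]
```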

**Encoding Multiway Categorical Parameters**

When the inputs to categorical models are written in the linear model encoding, the outputs are computed in the same way as for the alternate encoding of a linear regression model:

Again, $\text{data features}[0]$ is always `1`. In categorical models, instead of the $\text{data features}$ being the observed value, they are instead `0` or `1`, indicating which groups the data point belongs to, in terms of levels and combinations of levels from each factor.

For a two-by-two model, then, the prediction for the observations in level 0 of both factors, aka the baseline group, is

$$\text{prediction}_{00} = \text{parameters}[0]$$

while that for an observation in level 1 of both factors is

$$\text{prediction}_{11} = \text{parameters}[0] + \text{parameters}[1] + \text{parameters}[2] + \text{parameters}[3]$$

and so

$$\text{parameters}[3] = \text{prediction}_{11} - \text{prediction}_{10} - \text{prediction}_{01} + \text{prediction}_{00}$$

which is what we called the "interaction of the factors" when working with categorical models in the mean parameterization.

In general, when we computed effects and interactions in categorical models, we compared the `group_means`

to each other.

When the inputs to categorical models are written in the linear model encoding, the parameters are directly in terms of the effects and interactions.

As a concrete example, for a collection of `group_means`

```python
[[1, 2],
 ...]
```

the corresponding `parameters` are

```python
[1, 0, 1, 2]
```

```python
def encode_parameters_twobytwo_model(group_means):
    ...
```
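A sketch of the mean-to-parameter conversion (my reconstruction); the second row of `group_means` below is hypothetical (the source truncates it) but chosen to be consistent with the stated parameters `[1, 0, 1, 2]`:

```python
def encode_parameters_twobytwo_model(group_means):
    # convert a 2x2 table of group means into linear-model parameters:
    # [baseline, factor 1 effect, factor 2 effect, interaction]
    m = group_means
    baseline = m[0][0]
    effect1 = m[1][0] - m[0][0]
    effect2 = m[0][1] - m[0][0]
    interaction = m[1][1] - m[1][0] - m[0][1] + m[0][0]
    return [baseline, effect1, effect2, interaction]

group_means = [[1, 2],
               [1, 4]]  # second row is hypothetical
params = encode_parameters_twobytwo_model(group_means)
```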

#### Build Interacting Models

```python
with pm.Model() as combined_glm_model:
    ...
```

In this case, an “interaction” means that the slope of the line is different for setosas and non-setosas, in addition to the intercept being different.

## Lecture 11 Over-Fitting and Cross-Validation

If we use $R^2$ as our criterion, then MLE will always be the winner, since the MLE parameters "explain" the data as well as possible.

But the goal of our task is not just to explain the data but to predict new data.

### Linearization

If two variables `x` and `y` do not have a linear relationship, we can use a function `f(x)` to transform one variable so that the relationship becomes linear.

- e.g. if y ~ 1/x, take f(x) = 1/x
- then y ~ f(x): in terms of the transformed variable, the relationship is linear

#### Polynomial Models

Any function can be approximated as a weighted sum of polynomial (power) functions:
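For instance, a few power functions already approximate $\cos(x)$ well near zero (a standard Taylor-series example, not from the slides):

```python
import math

def poly_approx(x, weights):
    # weighted sum of power functions: w0 + w1*x + w2*x^2 + ...
    return sum(w * x ** k for k, w in enumerate(weights))

# Taylor weights approximating cos(x) near zero: 1 - x^2/2 + x^4/24
weights = [1.0, 0.0, -0.5, 0.0, 1.0 / 24.0]

approx = poly_approx(0.5, weights)
exact = math.cos(0.5)
```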

### Cross-Validation

In cross-validation, the data is split into two pieces: one set for picking parameters with MLE/MAP or getting the posterior and the other set for checking $R^2$ (or a similar metric).

These are called the _training_ and _testing_ sets: one is used to “train” the model, improving its performance, much like an athlete trains, while the other is used to “test” its performance.

#### Leave One Out (LOO)

Use only one data point as the test set. The advantage of this method is that it isn't influenced by how we split the training and test sets, and it uses almost all of the data for training, so we can make the bias lower.
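A minimal LOO sketch for the simplest possible model, predicting each held-out point with the mean of the rest (names and data are mine):

```python
def loo_mse(data):
    # leave-one-out: hold out each point, fit (here: the mean) on the rest
    errors = []
    for i, held_out in enumerate(data):
        train = data[:i] + data[i + 1:]
        prediction = sum(train) / len(train)  # "training" the mean model
        errors.append((held_out - prediction) ** 2)
    return sum(errors) / len(errors)

score = loo_mse([1.0, 2.0, 3.0])
```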

## Lecture 12 Markov Chain and Monte Carlo

- Any method that uses samples to approximate distributions and averages is a Monte Carlo method.
- The method we use to draw samples is called Markov Chain sampling
- Using samples from a Markov Chain to approximate Bayesian posteriors is known as Bayesian Markov Chain Monte Carlo

### Monte Carlo Sampling

#### Question: How to draw samples from a circular area

- We can mimic the *diffusion effect* that a drop of ink has in a plate of water

##### Core Idea

- We initialize the first point inside the circle
- On each step, we adjust the position of our point by a small amount

```python
def collision_result(starting_point):
    ...
```

- Repeat step 2 and check if the point is in the circle

```python
def sample_from_circle_diffusion(circle_checker, init, collider, n):
    ...
```

By running this function we can see the program is exploring the circle
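A pure-Python reconstruction of the diffusion sampler matching the signatures above; the step size and the rejection-style `collision_result` are my assumptions:

```python
import random

def in_unit_circle(point):
    x, y = point
    return x * x + y * y <= 1.0

def collision_result(point, step=0.1):
    # propose a small random displacement of the current point
    x, y = point
    return (x + random.uniform(-step, step), y + random.uniform(-step, step))

def sample_from_circle_diffusion(circle_checker, init, collider, n):
    # random walk: take a small step, keep it only if it stays in the circle
    samples = [init]
    current = init
    for _ in range(n):
        proposal = collider(current)
        if circle_checker(proposal):  # hard accept/reject rule
            current = proposal
        samples.append(current)
    return samples

walk = sample_from_circle_diffusion(in_unit_circle, (0.0, 0.0),
                                    collision_result, 1000)
```

Every point of `walk` stays inside the circle, and neighboring points are close together.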

##### Trace

A trace is a visualization of a sampler's trajectory that focuses on one variable at a time.

Notice the relatively slow motion of the particle: when it is at an extremely positive or negative value (close to -1 and 1), it will tend to stay near that value for tens of steps. This is visible as a “waviness” of the trajectory.

If we shuffle the data, the “wave” will disappear and the trace will become:

Because we use a Markov Chain to generate points, each point is correlated with the points some steps before it.

The correlation between values as a function of the time-lag between them is known as the *autocorrelation*.

**Auto-correlation is bad for a sampler: it can be shown that the higher the auto-correlation, the worse our Monte Carlo estimates will be.**
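Autocorrelation at a given lag can be computed directly (a standard estimator, not course code); the chain below is made up to show high lag-1 autocorrelation:

```python
def autocorrelation(xs, lag):
    # correlation between the chain and itself shifted by `lag` steps
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(n - lag))
    return cov / var

# a slowly wandering chain: neighboring values are very similar
chain = [0.0, 0.1, 0.2, 0.1, 0.0, -0.1, -0.2, -0.1]
lag1 = autocorrelation(chain, 1)
```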

### Eliminating Auto-correlation: Metropolis-Hastings

Metropolis-Hastings is a generalized version of the diffusion sampler used above.

```python
def metropolis_hastings(
    ...
```

**The biggest difference is that the acceptance criterion is soft, instead of hard**

```python
def metropolis_criterion(logp, current, proposal):
    ...
```
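A sketch of the soft criterion (my reconstruction): accept a downhill move with probability equal to the density ratio, which on the log scale means comparing the log-ratio to the log of a uniform draw.

```python
import math
import random

def metropolis_criterion(logp, current, proposal):
    # soft acceptance: always accept moves to higher log-probability;
    # accept downhill moves with probability p(proposal) / p(current)
    log_ratio = logp(proposal) - logp(current)
    return math.log(random.random()) < log_ratio

# example: an unnormalized standard Normal log-density
logp = lambda x: -0.5 * x ** 2

# a move toward the mode is always accepted
accepted = metropolis_criterion(logp, 2.0, 0.0)
```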

#### Metropolis-Hastings-type algorithms help us sample from posteriors

The goal of Markov Chain Monte Carlo is not just to sample uniformly from simple shapes like the circle. Instead, the goal is to sample from interesting distributions that we can't write down, like the posterior of a complicated model.

One of the major benefits of the Metropolis-Hastings algorithm comes from the fact that it uses a _ratio_ of probabilities.

First, consider the definition of the posterior from Bayes' Rule:

$$p(\text{params} \mid \text{data}) = \frac{p(\text{data} \mid \text{params})\, p(\text{params})}{p(\text{data})}$$

The troublesome part of this equation is the denominator. The numerator is part of our modeling process: it has the likelihood and the prior (in that order).

Now let's consider comparing two possible values of the params, $A$ and $B$:

$$p(A \mid \text{data}) = \frac{p(\text{data} \mid A)\, p(A)}{p(\text{data})}, \qquad p(B \mid \text{data}) = \frac{p(\text{data} \mid B)\, p(B)}{p(\text{data})}$$

If we take the ratio of the probabilities, we get a complicated-looking expression:

$$\frac{p(A \mid \text{data})}{p(B \mid \text{data})} = \frac{p(\text{data} \mid A)\, p(A) / p(\text{data})}{p(\text{data} \mid B)\, p(B) / p(\text{data})}$$

But it simplifies, because $p(\text{data})$ is in the denominator of both the top and bottom:

$$\frac{p(A \mid \text{data})}{p(B \mid \text{data})} = \frac{p(\text{data} \mid A)\, p(A)}{p(\text{data} \mid B)\, p(B)}$$

This is only in terms of the likelihood and prior!

The trace of this method looks like above

Because our initial guess had low posterior probability, there was a brief period in which our samples were slightly "off". This period is known as the "burn-in" time of our sampler.

And auto-correlation looks like:

### pyMC combines the model-specification API we’ve worked with throughout the semester with _very_ sophisticated versions of Metropolis-Hastings.

The auto-correlation from pyMC is low, and the trace looks more independent.