## Introduction

These are my lecture notes for the UC Berkeley course Data Science for Research Psychology, instructed by Professor Charles Frye. This post covers some basic data science and statistics. Most of the content below is from the course's lecture slides, with some of my own understanding added. You can check the original slides here. If there are any copyright issues, please contact me: haroldliuj@gmail.com.

There is some Chinese in this post, since I think my native language can explain some points more accurately. If you can understand Chinese, that's great. If you can't, that content won't keep you from getting the whole picture. Feel free to translate it with Google!

Have a nice trip!

## Lecture 07 Categorical Effects

### Mixture Distribution

- The combination of different distributions: each observation is drawn from one of several component distributions
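As a quick illustration (the component parameters here are made up, not from the slides), we can sample from a 50/50 mixture of two Normals in plain Python:

```python
import random

def sample_mixture(n, weight=0.5):
    # each draw comes from one of two component Normals,
    # chosen at random with probability `weight`
    draws = []
    for _ in range(n):
        if random.random() < weight:
            draws.append(random.gauss(0.0, 1.0))  # component 1
        else:
            draws.append(random.gauss(5.0, 1.0))  # component 2
    return draws

draws = sample_mixture(1000)
```

A histogram of `draws` would show two bumps, one per component.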

### Categorical Effects Models

- Mixture distribution-based models are also called **categorical effects models**
- Whenever the parameters of one variable depend on another grouping variable, we say there is a categorical effect: the *category* or *group* that an observation comes from has an *effect* on the distribution we choose to model the observation.

#### ANOVA Model

- Normal mixture distribution

### Analysis of Variance

- Core idea:
**if group membership has a strong effect on an observed variable, then splitting the data up by group should reduce the variance substantially, while the mean squared difference of the groups should go up. This isn’t true of every possible way that an observed variable might depend on group membership, but it is true in many cases, as described below in “When to Use ANOVA”.**

#### Implicit Model in ANOVA

- Grand Mean: The average value of all observations
- Group Effect: The difference between that grand mean and the average of observations in a particular group
- Unknown Effects: The difference between the sum of the first two terms and the value that was observed. (Gaussian)
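The three-part decomposition above can be sketched in plain Python (the group names and numbers are invented for illustration):

```python
data = {"corgi": [10.0, 12.0, 11.0], "husky": [20.0, 22.0, 21.0]}

all_values = [x for group in data.values() for x in group]
grand_mean = sum(all_values) / len(all_values)       # average of everything

# group effect: group mean minus the grand mean
group_effects = {g: sum(xs) / len(xs) - grand_mean for g, xs in data.items()}

# unknown effects: whatever is left over for each observation
residuals = {g: [x - (grand_mean + group_effects[g]) for x in xs]
             for g, xs in data.items()}
```

Each observation is then exactly grand mean + group effect + residual.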

### $F$ Statistic

#### When to use ANOVA

- However, in the case that we have more than two groups (more than two dog breeds), this requires us to perform multiple hypothesis tests. Each time we perform a hypothesis test, there is a chance of a false positive. As we perform more and more hypothesis tests, the chance that at least one of them fails goes up very quickly. This chance is called the *familywise error rate*, since it's the rate at which an entire family of tests has an error. Issues with rising familywise error rates are called issues of *multiple comparisons*, since they arise from comparing multiple test statistics to their critical values.
- The utility of ANOVA comes from the fact that it lets us test the hypothesis that the mean of **at least one of the levels of the variable we are using to group our observations** is different from the overall mean without specifying which level it is.
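A rough sketch of the $F$ statistic computation (the data is made up; this mirrors the standard between/within mean-square ratio, not code from the slides):

```python
def f_statistic(groups):
    # F = between-group mean square / within-group mean square
    k = len(groups)                           # number of groups
    n = sum(len(g) for g in groups)           # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    between = sum(len(g) * (gm - grand_mean) ** 2
                  for g, gm in zip(groups, group_means))
    within = sum(sum((x - gm) ** 2 for x in g)
                 for g, gm in zip(groups, group_means))

    return (between / (k - 1)) / (within / (n - k))

# well-separated groups give a large F; similar groups give F near 1
f = f_statistic([[10.0, 11.0, 12.0], [20.0, 21.0, 22.0]])
```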

## Lecture 08 Multi-way Modeling

We can use a multi-way model to see many variables' effects on the value that we want to observe.

### Randomness

- Our ignorance of some factors leads to randomness
- If we knew which factors were present and absent, the data would look deterministic

### Use Posterior to See Effects

```python
with pm.Model() as synthetic_data_model:
    ...
```

```python
with synthetic_data_model:
    ...
```

Then use the posterior samples of `mus` to calculate the effect of one factor, or of two factors together.

#### See factors’ effect separately

- Set a baseline
- Subtract the baseline's mean from the mean of the group where one factor occurs and the other doesn't

#### See two factors’ interaction

##### No Interaction

If we plot one factor with the other either present or absent: if the two graphs overlap greatly, we can say that Factor 2 does not interact with Factor 1.

##### Compute Interaction

- Two factors share the same baseline

```python
def compute_interaction_effect(accuracies):
    ...
```

The idea here is that if the two factors do not interact, then `prediction_from_separate` is similar to `actually_observed_value`, since the combined effect of the two factors should equal the sum of their separate effects. So `interaction_effect` is the difference between `actually_observed_value` and `prediction_from_separate`.
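A minimal pure-Python sketch of this computation; the dictionary layout and the numbers are my own, not the course code:

```python
def compute_interaction_effect(accuracies):
    # accuracies maps (factor1_present, factor2_present) -> mean accuracy
    baseline = accuracies[(False, False)]
    effect1 = accuracies[(True, False)] - baseline
    effect2 = accuracies[(False, True)] - baseline

    # no interaction: the combined effect is just the sum of separate effects
    prediction_from_separate = baseline + effect1 + effect2
    actually_observed_value = accuracies[(True, True)]

    return actually_observed_value - prediction_from_separate

accuracies = {(False, False): 0.5, (True, False): 0.6,
              (False, True): 0.7, (True, True): 0.9}
interaction = compute_interaction_effect(accuracies)
```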

Two factors have different baselines

- In this case, `hard` regards `young` as the baseline and `old` regards `young` as the baseline

```python
with pm.Model() as eeg_combined_model:
    means = pm.Normal("mus", mu=0, sd=1e6, shape=(3, 2))
    sigma = pm.Exponential("sigma", lam=0.1)
    observations = pm.Normal("d", mu=means[difficulty_indexer, age_indexer],
                             sd=sigma, observed=data["d"])
```

In this case:

```python
def compute_delta_age_easy(mus):
    ...
```

```python
def compute_interaction_effect(mus):
    ...
```

## Lecture 09 Regression

Here we use Sir Francis Galton's parent-child height dataset (source).

First, we map the parents' mid-height to a categorical variable (0-5).

### Categorical Model

```python
grand_mean = df.groupby("midparental_height_category_idx")["height"].mean().mean()
```

##### We can use the mean as a predictor for a categorical model with a Normal likelihood

But we cannot encode every continuous variable as categorical: if the numerical values are meaningful (e.g. they have an order relationship), categories are unnatural, because true categorical variables do not depend on order.

### Numerical Regression Model (Linear Regression)

#### Basic Idea:

We can represent the predicted height as:

$$\text{predicted height} = \text{slope} \times \text{midparental height} + \text{intercept}$$

so we can assume:

$$\text{height} \approx \text{predicted height}$$

**But there are issues with this:**

- We didn't define what we meant by "$\approx$". This left us with no choice but to pick the parameters of the line by hand.
- We no longer have a probabilistic model for our data.
- It's unclear how we might express our uncertainty about the size of our errors or our uncertainty about the parameters of the line.

#### Solution:

We solve both of these issues at once by specifying a likelihood

##### Build Model

```python
with pm.Model() as linear_model:
    ...
```

The `mu = slope * midparental_height + intercept` part gives the line that we had in the last section, but with `sigma` we also know the expected variability in observed values.

##### Make Predictions

We can use `pm.find_MAP` to find the best slope and intercept given the data we observed.

#### Linear Regression Model

If the function relating the two variables is linear, the model is a linear regression model.

#### Logistic Regression Model

```python
with pm.Model() as golf_binomial:
    ...
```

### Deeper Discussion of the Linear Model

#### Log Probabilities

Taking the logarithm of the Normal distribution turns the ratio term in the density into a subtraction and turns the shape of the curve from a bell into a parabola.

- Notice two things: first, the value most probably observed is equal to the prediction, and second, lots of other values are also quite probable — more so the closer they are to the prediction.
- The “steepness” of the parabola and its height also depend on the standard deviation, as we can see if we use the logarithm rules on the logarithm of the Normal distribution function.

Mathematical details below:
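A sketch of those details, applying the logarithm rules to the Normal density

$$p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

gives

$$\log p(x \mid \mu, \sigma) = -\frac{(x-\mu)^2}{2\sigma^2} - \log\sigma - \frac{1}{2}\log 2\pi,$$

a downward parabola in $x$ centered on the prediction $\mu$, whose steepness and height both depend on $\sigma$.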

##### Maximum Likelihood Estimation

Changing the parameters to maximize the value of the log-likelihood.
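A toy illustration in plain Python (the data and the grid are made up): brute-force search over candidate means shows that the Normal log-likelihood is maximized at the sample mean.

```python
import math

def normal_loglik(mu, sigma, data):
    # log-likelihood of the data under Normal(mu, sigma)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

heights = [63.2, 65.1, 66.0, 67.4, 70.3]  # made-up data

# brute-force MLE: try a grid of candidate means, keep the best
candidates = [60 + 0.01 * k for k in range(1001)]
best_mu = max(candidates, key=lambda m: normal_loglik(m, 2.0, heights))

sample_mean = sum(heights) / len(heights)
```

`best_mu` lands (up to the grid resolution) on `sample_mean`.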

**Doing MLE in pyMC**

`pyMC` doesn't have a `find_MLE` function directly, but **with the right choice of prior, we can turn MAP inference into MLE**.

- Since MAP inference maximizes $p(\text{params} \mid \text{data}) \propto p(\text{data} \mid \text{params})\, p(\text{params})$:
- If we choose a prior where $p(slope, intercept, \sigma)$ is constant, then maximizing the posterior is the same as maximizing the likelihood $p(\text{data} \mid \text{params})$.
- The idea here is that the more likely the data looks given the parameters, the more likely it is that those parameters are correct.

```python
with pm.Model() as ordinary_least_squares:
    ...
```

This is called the **“ordinary least squares”** model because

- maximizing the likelihood means minimizing the squared error
- it’s the “ordinary” or “typical” model for frequentists

#### Measure the Performance

##### Variance Explained

- $VE = 1$ means perfect performance
- If the variance explained by a linear model is not 0, we say the variables are correlated
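A small sketch of the computation (function names and data are mine): variance explained is one minus the ratio of residual variance to total variance.

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_explained(observed, predicted):
    # VE = 1 - variance of the residuals / variance of the observations
    residuals = [o - p for o, p in zip(observed, predicted)]
    return 1 - variance(residuals) / variance(observed)

observed = [1.0, 2.0, 3.0, 4.0]
# perfect predictions explain all of the variance
ve_perfect = variance_explained(observed, observed)
# predicting the mean everywhere explains none of it
ve_mean = variance_explained(observed, [2.5] * 4)
```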

##### $R^2$ (Correlation)

#### z-scoring(standardize)

- The correlation is equal to the slope of the MLE regression line for **standardized data**
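We can check this claim numerically in plain Python (made-up data): the OLS slope fitted to z-scored variables equals the Pearson correlation coefficient.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def zscore(xs):
    # standardize: subtract the mean, divide by the standard deviation
    m = mean(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

def ols_slope(x, y):
    # slope of the least-squares line predicting y from x
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 4.0, 3.5, 5.5]

# the slope on standardized data is the correlation coefficient
r = ols_slope(zscore(x), zscore(y))
```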

#### Measure the Correlation

##### ROPE: Region of Practical Equivalence

- The core idea of ROPE is that if two variables of interest are linearly correlated, there will be a nonzero `slope` when we regress one of them (the dependent variable) on the other (the independent variable). If the slope is close to 0, we can say these two variables have no linear correlation.
- If the fraction of posterior samples that are inside of the ROPE for that parameter is large (e.g. $\geq 0.99$), then we can say **we are very certain the parameter is not meaningfully different from 0**
- If the **MAP** estimate of the parameter is exactly 0, we can say **the most likely single choice for the parameter is 0**
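Computing the fraction of posterior samples inside a ROPE is a one-liner; the interval and the samples below are made up:

```python
def fraction_in_rope(samples, low=-0.05, high=0.05):
    # fraction of posterior samples that fall inside the ROPE [low, high]
    inside = sum(1 for s in samples if low <= s <= high)
    return inside / len(samples)

slope_samples = [0.01, -0.02, 0.03, 0.2, -0.01]  # made-up posterior samples
frac = fraction_in_rope(slope_samples)
```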

#### Ridge Regression

Just make the prior of the slope Normal:

```python
with pm.Model() as ridge_regression_model:
    ...
```

We can see the overlap between the posterior and the ROPE is now much stronger. But the MAP value is still not 0.

#### LASSO Regression

Set the slope's prior to `Laplace`:

```python
with pm.Model() as lasso_regression_model:
    ...
```

- In LASSO regression, the MAP estimate is 0. But note that the posterior probability of the parameter being exactly 0 is still 0, because the posterior is a density over continuous values.

#### If our data is not Normal, we can use a different likelihood

- One of the most common causes of non-Normality is *outliers*: large, rare effects.

##### Solution: Robust Regression

- A technique for doing accurate regression in the presence of outliers
- Specifically, choose a probability distribution with "heavy tails", like `Cauchy`, `StudentT`, or `Laplace`

## Lecture 10 Formulas and Linear Models

### Building Models with Formulas

For example, the model

becomes

So the model

becomes

```python
with pm.Model() as mean_glm_model:
    ...
```

#### Build Categorical Model with Formula

```python
with pm.Model() as categorical_glm_model:
    ...
```

By default, categorical variables are handled by converting them to `"Treatment Coding"`: `Intercept` corresponds to what we called the "mean of the baseline group", and `C(is_setosa)[T.True]` corresponds to what we called the "effect of the `is_setosa` factor".

#### Multiway Categorical Linear Model

##### Prediction

Define a function, `make_predictions_twoway`, that takes as its first two arguments two `Series` of factor indices, one for the first factor and one for the second factor, and an `np.ndarray` of `group_means` as its third argument, and returns a `Series` of predictions from the two-way categorical model with those `group_means` as its parameters. The index for factor 1 should be used to select the row of `group_means`, while the index for factor 2 should be used to select the column.

So if the `group_means` were the array

```python
[[0, 1, 2],
 ...]
```

then `make_predictions_twoway` would return

```python
pd.Series([0, 1, 5])
```

on the inputs

```python
pd.Series([0, 0, 1]), pd.Series([0, 1, 2])
```

```python
def make_predictions_twoway(factor1, factor2, group_means):
    ...
```
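A pure-Python sketch following the spec above, with plain lists standing in for `Series` and `ndarray`; the second row of `group_means` is hypothetical (the source truncates it), chosen so its last entry matches the expected output:

```python
def make_predictions_twoway(factor1, factor2, group_means):
    # factor 1 selects the row of group_means, factor 2 the column
    return [group_means[i][j] for i, j in zip(factor1, factor2)]

group_means = [[0, 1, 2],
               [3, 4, 5]]  # second row is hypothetical
preds = make_predictions_twoway([0, 0, 1], [0, 1, 2], group_means)
```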

##### Model Encoding

**Encoding Multiway Categorical Data**

In the linear model encoding of data for categorical models, including multiway models, each data point is still represented by a `Series` beginning with `1`. The other entries of the `Series` are all `1` or `0`. As with a one-way model, the length of the `Series` is equal to the total number of groups. In addition to starting with a `1`, there are `1`s to indicate whether this data point comes from a non-zero level of each factor and whether it comes from a given _combination_ of non-zero levels, as in

```python
[1, int(factor1_idx == 1), int(factor1_idx == 2), ...
```

for a model with two factors with total numbers of factor levels `J` and `K`.

We will focus on the simplest case: a multiway model with two factors, each of which has two levels. We will call such a model a _two-by-two_ model. With two levels of each of the two factors, for a total of four groups, a datapoint from level 0 of both factors would be coded as

```python
[1, 0, 0, 0]
```

and one from level 1 of both factors would be coded as

```python
[1, 1, 1, 1]
```

while a datapoint from level 1 of factor 1 and level 0 of factor 2 would be coded as

```python
[1, 1, 0, 0]
```

```python
def encode_data_twobytwo(factor1_idx, factor2_idx):
    ...
```
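A sketch matching the three encodings listed above (the function body is my reconstruction, not the course solution):

```python
def encode_data_twobytwo(factor1_idx, factor2_idx):
    # [constant, factor 1 at level 1, factor 2 at level 1, both at level 1]
    return [1,
            int(factor1_idx == 1),
            int(factor2_idx == 1),
            int(factor1_idx == 1 and factor2_idx == 1)]
```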

**Encoding Multiway Categorical Parameters**

When the inputs to categorical models are written in the linear model encoding, the outputs are computed in the same way as for the alternate encoding of a linear regression model:

Again, $\text{data features}[0]$ is always `1`. In categorical models, instead of the $\text{data features}$ being the observed value, they are instead `0` or `1`, indicating which groups the data point belongs to, in terms of levels and combinations of levels from each factor.

For a two-by-two model, then, the prediction for the observations in level 0 of both factors, aka the baseline group, is

$$\text{prediction}_{00} = \text{parameters}[0]$$

while that for an observation in level 1 of both factors is

$$\text{prediction}_{11} = \text{parameters}[0] + \text{parameters}[1] + \text{parameters}[2] + \text{parameters}[3]$$

and so

$$\text{parameters}[3] = \text{prediction}_{11} - \text{prediction}_{10} - \text{prediction}_{01} + \text{prediction}_{00}$$

which is what we called the "interaction of the factors" when working with categorical models in the mean parameterization.

In general, when we computed effects and interactions in categorical models, we compared the `group_means`

to each other.

When the inputs to categorical models are written in the linear model encoding, the parameters are directly in terms of the effects and interactions.

As a concrete example, for a collection of `group_means`

```python
[[1, 2],
 ...]
```

the corresponding `parameters` are

```python
[1, 0, 1, 2]
```

```python
def encode_parameters_twobytwo_model(group_means):
    ...
```
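A sketch of the mean-to-parameter conversion (my reconstruction); the second row of `group_means` below is hypothetical (the source truncates it) but chosen to be consistent with the stated parameters `[1, 0, 1, 2]`:

```python
def encode_parameters_twobytwo_model(group_means):
    # convert a 2x2 table of group means into linear-model parameters:
    # [baseline, factor 1 effect, factor 2 effect, interaction]
    m = group_means
    baseline = m[0][0]
    effect1 = m[1][0] - m[0][0]
    effect2 = m[0][1] - m[0][0]
    interaction = m[1][1] - m[1][0] - m[0][1] + m[0][0]
    return [baseline, effect1, effect2, interaction]

group_means = [[1, 2],
               [1, 4]]  # second row is hypothetical
params = encode_parameters_twobytwo_model(group_means)
```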

#### Build Interacting Models

```python
with pm.Model() as combined_glm_model:
    ...
```

In this case, an “interaction” means that the slope of the line is different for setosas and non-setosas, in addition to the intercept being different.

## Lecture 11 Over-Fitting and Cross-Validation

If we use $R^2$ as our criterion, then MLE will always be the winner, since the MLE parameters "explain" the data as well as possible.

But the goal of our task is not just to explain the data but to predict new data.

### Linearization

If two variables `x` and `y` do not have a linear relationship, we can use a function `f(x)` to transform one variable so that the relationship becomes linear.

- e.g. if y ~ 1/x, take f(x) = 1/x
- then y ~ f(x): in terms of the transformed variable, the relationship is linear

#### Polynomial Models

Any function can be approximated as a weighted sum of polynomial (power) functions:
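For instance, a few power functions already approximate $\cos(x)$ well near zero (a standard Taylor-series example, not from the slides):

```python
import math

def poly_approx(x, weights):
    # weighted sum of power functions: w0 + w1*x + w2*x^2 + ...
    return sum(w * x ** k for k, w in enumerate(weights))

# Taylor weights approximating cos(x) near zero: 1 - x^2/2 + x^4/24
weights = [1.0, 0.0, -0.5, 0.0, 1.0 / 24.0]

approx = poly_approx(0.5, weights)
exact = math.cos(0.5)
```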

### Cross-Validation

In cross-validation, the data is split into two pieces: one set for picking parameters with MLE/MAP or getting the posterior and the other set for checking $R^2$ (or a similar metric).

These are called the _training_ and _testing_ sets: one is used to “train” the model, improving its performance, much like an athlete trains, while the other is used to “test” its performance.

#### Leave One Out (LOO)

Use only one data point as the test set. The advantage of this method is that it isn't influenced by how we split the training and test sets, and it uses almost all of the data for training, so we can make the bias lower.
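A minimal LOO sketch for the simplest possible model, predicting each held-out point with the mean of the rest (names and data are mine):

```python
def loo_mse(data):
    # leave-one-out: hold out each point, fit (here: the mean) on the rest
    errors = []
    for i, held_out in enumerate(data):
        train = data[:i] + data[i + 1:]
        prediction = sum(train) / len(train)  # "training" the mean model
        errors.append((held_out - prediction) ** 2)
    return sum(errors) / len(errors)

score = loo_mse([1.0, 2.0, 3.0])
```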

## Lecture 12 Markov Chain and Monte Carlo

- Any method that uses samples to approximate distributions and averages is a Monte Carlo method.
- The method we use to draw samples is called Markov Chain sampling
- Using samples from a Markov Chain to approximate Bayesian posteriors is known as Bayesian Markov Chain Monte Carlo

### Monte Carlo Sampling

#### Question: How to draw samples from a circular area

- We can mimic the *diffusion effect* that a drop of ink has in a plate of water

##### Core Idea

- We initialize the first point inside the circle
- On each step, we adjust the position of our point by a small amount

```python
def collision_result(starting_point):
    ...
```

- Repeat step 2 and check if the point is in the circle

```python
def sample_from_circle_diffusion(circle_checker, init, collider, n):
    ...
```

By running this function we can see the program is exploring the circle
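A pure-Python reconstruction of the diffusion sampler matching the signatures above; the step size and the rejection-style `collision_result` are my assumptions:

```python
import random

def in_unit_circle(point):
    x, y = point
    return x * x + y * y <= 1.0

def collision_result(point, step=0.1):
    # propose a small random displacement of the current point
    x, y = point
    return (x + random.uniform(-step, step), y + random.uniform(-step, step))

def sample_from_circle_diffusion(circle_checker, init, collider, n):
    # random walk: take a small step, keep it only if it stays in the circle
    samples = [init]
    current = init
    for _ in range(n):
        proposal = collider(current)
        if circle_checker(proposal):  # hard accept/reject rule
            current = proposal
        samples.append(current)
    return samples

walk = sample_from_circle_diffusion(in_unit_circle, (0.0, 0.0),
                                    collision_result, 1000)
```

Every point of `walk` stays inside the circle, and neighboring points are close together.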

##### Trace

A trace is a visualization of a sampler's trajectory that focuses on one variable at a time.

Notice the relatively slow motion of the particle: when it is at an extremely positive or negative value (close to -1 and 1), it will tend to stay near that value for tens of steps. This is visible as a “waviness” of the trajectory.

If we shuffle the data, the “wave” will disappear and the trace will become:

Because we use a Markov Chain to generate points, each point is correlated with the points some steps before it.

The correlation between values as a function of the time-lag between them is known as the *autocorrelation*.

**Auto-correlation is bad for a sampler: it can be shown that the higher the auto-correlation, the worse our Monte Carlo estimates will be.**
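Autocorrelation at a given lag can be computed directly (a standard estimator, not course code); the chain below is made up to show high lag-1 autocorrelation:

```python
def autocorrelation(xs, lag):
    # correlation between the chain and itself shifted by `lag` steps
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(n - lag))
    return cov / var

# a slowly wandering chain: neighboring values are very similar
chain = [0.0, 0.1, 0.2, 0.1, 0.0, -0.1, -0.2, -0.1]
lag1 = autocorrelation(chain, 1)
```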

### Eliminating Auto-correlation: Metropolis-Hastings

Metropolis-Hastings is a generalized version of the diffusion sampler used above.

```python
def metropolis_hastings(
    ...
```

**The biggest difference is that the acceptance criterion is soft, instead of hard**

```python
def metropolis_criterion(logp, current, proposal):
    ...
```
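A sketch of the soft criterion (my reconstruction): accept a downhill move with probability equal to the density ratio, which on the log scale means comparing the log-ratio to the log of a uniform draw.

```python
import math
import random

def metropolis_criterion(logp, current, proposal):
    # soft acceptance: always accept moves to higher log-probability;
    # accept downhill moves with probability p(proposal) / p(current)
    log_ratio = logp(proposal) - logp(current)
    return math.log(random.random()) < log_ratio

# example: an unnormalized standard Normal log-density
logp = lambda x: -0.5 * x ** 2

# a move toward the mode is always accepted
accepted = metropolis_criterion(logp, 2.0, 0.0)
```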

#### Metropolis-Hastings-type algorithms help us sample from posteriors

The goal of Markov Chain Monte Carlo is not just to sample uniformly from simple shapes like the circle. Instead, the goal is to sample from interesting distributions that we can't write down, like the posterior of a complicated model.

One of the major benefits of the Metropolis-Hastings algorithm comes from the fact that it uses a _ratio_ of probabilities.

First, consider the definition of the posterior from Bayes' Rule:

$$p(\text{params} \mid \text{data}) = \frac{p(\text{data} \mid \text{params})\, p(\text{params})}{p(\text{data})}$$

The troublesome part of this equation is the denominator. The numerator is part of our modeling process: it has the likelihood and the prior (in that order).

Now let's consider comparing two possible values of the params, $A$ and $B$:

$$p(A \mid \text{data}) = \frac{p(\text{data} \mid A)\, p(A)}{p(\text{data})}, \qquad p(B \mid \text{data}) = \frac{p(\text{data} \mid B)\, p(B)}{p(\text{data})}$$

If we take the ratio of the probabilities, we get a complicated-looking expression:

$$\frac{p(A \mid \text{data})}{p(B \mid \text{data})} = \frac{p(\text{data} \mid A)\, p(A) / p(\text{data})}{p(\text{data} \mid B)\, p(B) / p(\text{data})}$$

But it simplifies, because $p(\text{data})$ is in the denominator of both the top and bottom:

$$\frac{p(A \mid \text{data})}{p(B \mid \text{data})} = \frac{p(\text{data} \mid A)\, p(A)}{p(\text{data} \mid B)\, p(B)}$$

This is only in terms of the likelihood and prior!

The trace of this method looks like above

Because our initial guess had low posterior probability, there was a brief period in which our samples were slightly "off". This period is known as the "burn-in" time of our sampler.

And auto-correlation looks like:

### pyMC combines the model-specification API we’ve worked with throughout the semester with _very_ sophisticated versions of Metropolis-Hastings.

The auto-correlation from pyMC is low, and the trace looks more independent.