0%

## Introduction

This is my lecture notes for UC Berkeley course Data Science for Research Psychology instructed by Professor Charles Frye. This post contains some basic data science and statistic knowledge. Most of the content showed following is from this course’s lecture slides with some of my understanding. You can check the original slides at here. If there are any copyright issues please contact me: haroldliuj@gmail.com.

There are some Chinese in this post, since I think my native language can explain those point more accurately. If you can understand Chinese, that’s great. If you can’t, those content won’t inhibit you from learning the whole picture. Feel free to translate it with Google!

Have a nice trip!

## Lecture 07 Categorical Effects

### Mixture Distribution

• The combination of different distributions

### Categorical Effects Models

• Mixture Distribution-based model are also called categorical effect models
• Whenever the parameters of one variable depend on another grouping variable, we say there is a categorical effect:
• the category or group that an observation comes from has an effect on the distribution we choose to model the observation.

#### ANOVA Model

• Normal mixture distribution

### Analysis of Variance

• Core idea: if group membership has a strong effect on an observed variable, then splitting the data up by group should reduce the variance substantially, while the mean squared difference of the groups should go up. This isn’t true of every possible way that an observed variable might depend on group membership, but it is true in many cases, as described below in “When to Use ANOVA”.

#### Implict Model in ANOVA

• Grand Mean: The average value of all observations
• Group Effect: The difference between that grand mean and the average of observations in a particular group
• Unknown Effects: The difference between the sum of the first two terms and the value that was observed. (Gaussian)

### $F$ Statistic

#### When to use ANOVA

• However, in the case that we have more than two groups (more than two dog breeds) this requires us to perform multiple hypothesis tests. Each time we perform a hypothesis test, there is a chance of a false positive. As we perform more and more hypothesis tests, the chance that at least one of them fails goes up very quickly. This chance is called the *familywise error rate*, since its the rate at which an entire family of tests has an error. Issues with rising familywise error rates are called issues of *multiple comparisons*, since they arise from comparing multiple test statistics to their critical values.
• The utility of ANOVA comes from the fact that it lets us test the hypothesis that the mean of **at least one of the levels of the variable we are using to group our observations** is different from the overall mean without specifying which level it is.

## Lecture 08 Muti-way Modeling

We can use Muti-way Model to see many variables’ effect on the value that we want to observe.

### Randomness

• Our ignorance of some factors leads to randomness
• If we know which factors are present and absent, the data lokks deterministic

### Use Posterior to See Effects

The use mus posterior to calculate the effect of one factor or two factors.

#### See factors’ effect separately

1. Set a baseline
2. Use the mean that represent one factor occur while the other doesn’t subtract baseline’s mean.

#### See two factors’ interaction

##### No Interaction

If we draw a plot of one factor with another occur or not:

If two graph overlaps greatly, we can say that Factor2 does not interact with Factor1

##### Compute Interaction
• Two factors have same baseline

The idea here is if two factors do not interact then the prediction_from_separate is similar to actually_observed_value since two factors effect should equal to two factors’ separate effect. So the interaction_effect is the difference between actually_observed_value and prediction_from_separate

• Two factors have different baseline

• in this case hard regard young as baseline and old regard young as baseline

## Lecture 09 Regression

Here we use Sir Francis Galton’s parent-child height dataset (source)

First we map parents mid-height to categrical variables (0-5)

### Categorical Model

##### We can use mean as predictor for categorical model with a Normal likelihood
But we can not encode every continuous variable into categorical since if the numerical values are meaningful(like have order relationship), categories are unnatural because real categorical variables are not depend on order.

### Numerical Regression Model(Linear Regression)

#### Basic Idea:

We can represent the $predicted_height$ as:

so we can assume:

But there are issues for this:

• we didn’t define what we meant by “$\approx$”.

• This left us with no choice but to pick the parameters of the line by hand.
• we no longer have a probabilistic model for our data.

• It’s unclear how we might express our uncertainty about the size of our errors
or our uncertainty about the parameters of the line.

#### Solution:

We solve both of these issues at once by specifying a likelihood

##### Build Model

mu = slope * midparental_height + intercept part gave the line that we have in last section but with sigma we can know the expected variability in observed values

##### Make Predictions

We can use pm.find_MAP to find the best slope and intercept under the circumstance that we observed the data.

#### Linear Regression Model

if the function relating the two variables is linear, the model is a linear regression model

### Deeper Discuss of Linear Model

#### Log Probabilities

Turning Normal distribution to Log Normal distribution can turn the ratio term in Normal distribution to subtact and turn the shape of the curve from bell shape to parabola.

• Notice two things: first, the value most probably observed is equal to the prediction, and second, the lots of other values are also quite probable — more so the closer they are to the prediction.
• The “steepness” of the parabola and its height also depend on the standard deviation, as we can see if we use the logarithm rules on the logarithm of the Normal distribution function.

Mathematical details below:

##### Maximum Likelihood Estimation

Changing the parameter to maximize the value of the log-likelihood

Doing MLE in pyMC

pyMC doesn’t have a find_MLE function directly, but With the right chocie of prior, we can trun MAP inference into MLE

• Since:
• If we choose a prior where $p(slope, inercept, \sigma)$ is constant, then we can have:
• The idea here is the more likely the data looks given the parameters, the more likely it is that those parameters are correct.

This is called the “ordinary least squares” model because

• maximizing the likelihood means minimizing the squared error
• it’s the “ordinary” or “typical” model for frequentists

#### Measure the Performance

##### Variance Explained
• $VE == 1$ means perfect performance
• If the variance explained by a linear model is not 0, we say the variable are correlated

#### z-scoring​(standardize)

• The Correlation is equal to the slope of the MLE regression line for standardized data

#### Meature the Correlation

##### ROPE: Region of Practical Equivalence
• The core idea of ROPE is if two variables that we are interested in are linear correlated, there will be a slope if we regard one variable as dependent variable and another as independent variable. If we can see the slope is close to 0, we can say these two variables do not have linear correlation.

• If the fraction of posterior samples that are inside of the ROPE for that parameter is large(e.g. $\geq{0.99}$), then we can say we are very certain the parameters is not meaningfully different from 0
• If the MAP estimate of the parameter to be exactly, 0 we can say the most likely single choice for the parameter is 0

#### Ridge Regression

Just make the prior of Slope to be Normal

We can see the overlap between the psoterior and the ROPE is now much stronger. But the MAP value is still not 0.

#### LASSO Regression

Set Slope’s prior as Laplace

• In LASSO Regression, the MAP estimate is 0. But note that the posterior probability of the parameter being 0 is 0, because the posterior is still a density, over continuous values.

#### If our data is not Normal, we can use a different likelihood

• One of the most common cause of non-Normality is outliers: large, rare effects.
##### Solution: Robust Regression
• A technique for doing accurate regression in the presence of outliers
• Specifically, chocie of probability distribution with “heavy tails” like Cauchy, StudentT or Laplace

## Lecture 10 Formulas and Linear Models

### Building Models with Formulas

For example, the model

becomes

So the model

becomes

#### Build Categorical Model with Formula

By default, categorical vaiables are handled by converting them to "Treatment Coding"

Intercept corresponds to what we called the “mean of the baseline group” and C(is_setosa)[T.True] corresponds to what we called the “effect of the is_setosa factor”.

#### Multiway Categorical Linear Model

##### Prediction

Define a function, make_predictions_twoway,
that takes as its first two arguments two Series of factor indices,
one for the first factor and one for the second factor,
and an np.ndarray of group_means as its third argument,
and returns a Series of predictions from the two-way categorical model
with those group_means as its parameters.

The index for factor 1 should be used to select the row of group_means,
while the index for factor 2 should be used to select the column.

So if the group_means were the array

then make_predictions_twoway would return

on the inputs

##### Model Enconding

Encoding Multiway Categorical Data

In the linear model encoding of data for categorical models,including multiway models,
each data point is still represented by a Series beginning with 1.

The other entries of the Series are all 1 or 0. As with a one-way model, the length of the Series is equal to the total number of groups. In addition to starting with a 1, there are 1s to indicate whether this data point comes from a non-zero level of each factor and if it comes from a given _combination_ of non-zero levels, as in

for a a model with two factors with total numbers of factor levels J and K.

We will focus on the simplest case: a multiway model with two factors, each of which has two levels. We will call such a model a _two-by-two_ model. If there are two levels of each two factors, for a total of four groups, a datapoint from level 0 of both factors would be coded as

and one from level 1 of both factors would be coded as

while a datapoint from level 1 of factor 1 and level 0 of factor 2 would be coded as

Encoding Multiway Categorical Parameters

When the inputs to categorical models are written in the linear model encoding, the outputs are computed in the same way as or the alternate encoding of a linear regression model:

Again, $\text{data features}[0]$ is always 1. In categorical models, instead of the $\text{data features}$ being the observed value, they are instead 0 or 1 to indicate to which groups the data point belonged, as levels and combinations of levels from each factor.

For a two by two model, then, the prediction for the observations in level 0 of both facors, aka the baseline group, is

while that for an observation in level 1 of both factors is

and so

which is what we called the “interaction of the factors” when working with categorical models in the mean parameterization.

In general, when we computed effects and interactions in categorical models, we compared the group_means to each other.
When the inputs to categorical models are written in the linear model encoding, the parameters are directly in terms of the effects and interactions.

As a concrete example, for a collection of group_means

the corresponding parameters are

#### Build Interacting Models

In this case, an “interaction” means that the slope of the line is different for setosas and non-setosas, in addition to the intercept being different.

## Lecture 11 Over-Fitting and Cross-Validation

If we use $R^2$ as our criteria, then MLE will always be winner since MLE parameters “explain” the data as best as possible.

But the goal of our task is not just to explain the data but to predict data

### Linearization

If two variables x, y do not have linear relationship, we can use a function f(x) to transform one variable so that they can have linear relationship.

• e.g. y ~ 1/x, f(x) = 1 / x
• then y ~ f(x), that is y ~ x

#### Polynomial Models

Any function can be approximated as a weighted sum of polynomial (power) functions:

### Cross-Validation

In cross-validation, the data is split into two pieces: one set for picking parameters with MLE/MAP or getting the posterior and the other set for checking $R^2$ (or a similar metric).

These are called the _training_ and _testing_ sets: one is used to “train” the model, improving its performance, much like an athlete trains, while the other is used to “test” its performance.

#### Leave One Out (LOO)

Only use one data as test set. The advantage of this method is this won’t be influenced by how we split the training and test set and this method uses almost all the data for training. So we can make bias lower.

## Lecture 12 Markov Chain and Monte Carlo

• Any method that uses samples to approximate distributions and averages is a Monte Carlo method.
• The method we use to draw samples is called Mrkov Chain sampling
• Using samples from a MRakov Chain to approximate Bayesian posteriors is known as Bayesian Markov Chain Monte Carlo.

### Monte Carlo Sampling

#### Question: How to draw samples in a circle area

• We can mimic the diffusion effect that a drop of ink will have in a plate of water
##### Core Idea
1. We initial the first point that in the circle
2. On each step, we adjust the position of our point a samll amount
1. Repeat step 2 and check if the point is in the circle

By running this function we can see the program is exploring the circle

##### Trace

Trace is a visualization of a sampler’s trajectory focuses on one variable at a time

Notice the relatively slow motion of the particle: when it is at an extremely positive or negative value (close to -1 and 1), it will tend to stay near that value for tens of steps. This is visible as a “waviness” of the trajectory.

If we shuffle the data, the “wave” will disappear and the trace will become:

Because we use Markov Chain to make points, each point will correlate with the points some steps ahead.

The correlation between values as a function of the time-lag between them is known as the autocorrelation.

Auto-correlation is bad for a sampler:it can be shown that the higher this auto-correlation, the worse our Monte Carlo estimates will be.

### Eliminating Auto-correlation — $Metropolis-Hastings$

Metropolis-Hasting is a generalized version of the diffusion used above

The biggest difference is that acceptance criterion is soft, insead of hard

#### Metropolis-Hastings-type alogorithms help us ample from posteriors

The goal of Markov Chain Monte Carlo is not just to sample uniformly from simple shapes like the circle.

Instead, the goal is to sample from interesting distributions that we can’t write down,
like the posterior of a complicated model.

One of the major benefits of the Metropolis-Hastings algorithm comes from the fact that it uses a _ratio_ of probabilities.

First, consider the definition of the posterior from Bayes’ Rule:

The troublesome part of this equation is the denominator. The numerator is part of our modeling process: it has the likelihood and the prior (in that order).

Now let’s consider comparing two possible values of the params, $A$ and $B$:

If we take the ratio of the probabilities, we get a complicated-looking expression:

But it simplifies, because $p(\text{data})$ is in the denominator of both the top and bottom:

This is only in terms of the likelihood and prior!

The trace of this method looks like above

Because our initial guess had low posterior probability, there was a brief period in which our samples were slightly “off”.This period is known as the “burn-in” time of our sampler.

And auto-correlation looks like:

### pyMC combines the model-specification API we’ve worked with throughout the semester with _very_ sophisticated versions of Metropolis-Hastings.

The auto-correlation from pyMC is low and the trace seems more independent