Regressions

I doubt there is anyone who doesn’t take a step backwards when they first hear they need to perform a regression, especially the dreaded Multiple (multi-variable) Linear Regression.
Let me help everyone who is frightened by this. It looks scary, and the formulas look like some secret spy code. The good news is that most of us have done a similar analysis; we just didn’t write the regression model out.
Simply put, a regression puts the “variable of interest” (the dependent variable) on one side of the equation and the variables that we believe contribute to or explain that variable of interest (the independent variables) on the other side.
Even that explanation sounds scary to me, as simplified as it is.
The example used in many textbooks is the salary analysis.
Salary = years of experience + years of education + average salary in a field
When it is put into the standard form, we shorten the variable names and add some requisite pieces to the equation, which I’ll explain in a minute.
Variables
S – Salary (the dependent variable from dataset)
E – years of Experience (Independent variable from dataset)
D – years of Education (Independent variable from dataset)
F – average Salary in Field (Independent variable from dataset)
α – the Y-axis intercept
β – a variable’s coefficient, one for each independent variable, so subscripts are assigned to them: β1, β2, β3…
μ – the error term; this is effectively a stand-in for all the possible variables we do not know of or do not have data for.
S = α + β1E + β2D + β3F + μ
The 3 parts you can’t do without: α, β, μ.
α – the Y-axis intercept, the value of the dependent variable when every independent variable is zero.
β – the coefficient of a variable. It may look like a simple “this variable contributes 45% of the variable of interest” figure, but that IS NOT what a coefficient does in a regression. It tells us how much the variable of interest changes when that variable goes up by one unit and the other variables are held constant.
μ – the error term, sometimes written as U (unknown), e or ε (error).
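To make the three parts a little less abstract, here is a minimal Python sketch that estimates α and the β's by ordinary least squares using NumPy. Everything about the data is an assumption for illustration: the numbers are generated from coefficients I picked, they are not real salaries, and the function name fit_salary_model is just a label for this example.

```python
import numpy as np

def fit_salary_model(E, D, F, S):
    """Estimate alpha and beta1..beta3 in S = alpha + b1*E + b2*D + b3*F + mu."""
    # A column of ones lets least squares estimate the intercept alpha.
    X = np.column_stack([np.ones(len(S)), E, D, F])
    coeffs, *_ = np.linalg.lstsq(X, S, rcond=None)
    # What the fitted model leaves unexplained for each row (the estimated error).
    residuals = S - X @ coeffs
    return coeffs, residuals

# Made-up data (an assumption for this sketch, NOT a real dataset).
rng = np.random.default_rng(0)
n = 200
E = rng.uniform(0, 30, n)            # years of experience
D = rng.uniform(10, 20, n)           # years of education
F = rng.uniform(40_000, 90_000, n)   # average salary in field
mu = rng.normal(0, 2_000, n)         # everything we did not measure
S = 5_000 + 1_200 * E + 800 * D + 0.5 * F + mu

coeffs, residuals = fit_salary_model(E, D, F, S)
alpha, b1, b2, b3 = coeffs
print(f"alpha={alpha:.0f}  b1={b1:.0f}  b2={b2:.0f}  b3={b3:.2f}")
```

The estimated b1 should land near the 1,200 I used to generate the data. Read it as "one more year of experience goes with about that many more dollars of salary, with the other variables held fixed", which is why a coefficient is not a percentage contribution.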
The goal of a regression is to test how well the variables we hypothesize explain the dependent variable of interest. It would be great if our hypothesized equation (the model) explained 100% of the variable of interest. Well, let me tell you, that won’t happen. The μ error term stands in for whatever the model leaves unexplained, so we get closer but never all the way there.
It is not uncommon to find that one of the independent variables does not explain the dependent variable at all, at which point a new model omitting that variable is in order, unless our aim is to support the hypothesis that the variable has nothing to do with the dependent variable.
If we go back to the above model and find that D (years of Education) does not explain Salary, we may be inclined to re-specify the model without variable D. Personally, I would be re-examining the datasets for issues, because logically we probably all believe years of education directly affects salary. Why else would we have studied stats?
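Here is a rough sketch of what "re-specify the model without D" can look like in Python. It compares R² (the share of the variation in S that the model explains) with and without D. The data is made up, and I deliberately generated it so that D has no effect, purely to illustrate the situation described above. Comparing R² is only one informal check; a formal test on the coefficient is the usual tool and is not shown here. The name r_squared is just a helper for this example.

```python
import numpy as np

def r_squared(X, y):
    """Fit y = alpha + X @ beta by least squares and return R^2."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coeffs
    ss_res = float(resid @ resid)                  # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total variation
    return 1.0 - ss_res / ss_tot

# Made-up data (assumption): salary in thousands, D given a true effect of zero.
rng = np.random.default_rng(1)
n = 200
E = rng.uniform(0, 30, n)    # years of experience
D = rng.uniform(10, 20, n)   # years of education
F = rng.uniform(40, 90, n)   # average salary in field, in thousands
S = 5 + 1.2 * E + 0.0 * D + 0.5 * F + rng.normal(0, 2, n)

r2_full = r_squared(np.column_stack([E, D, F]), S)   # model with D
r2_reduced = r_squared(np.column_stack([E, F]), S)   # model without D
print(f"R^2 with D: {r2_full:.4f}   R^2 without D: {r2_reduced:.4f}")
```

If dropping D barely moves R², D is adding very little to the model. With real data, I would still check the dataset for problems before concluding that education does not matter.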

Modeling

One of the biggest hurdles to overcome when dealing with statistics is the vocabulary and the very intimidating use of Greek symbols in the equations …er models.

Modeling is something we all do in everyday living and decision making. A model describes a pattern for something that will happen or that has already happened.

Note: In the world of statistics, the latter is common during statistical analysis, while the former is in the planning stages for a study or in Predictive Statistics for business planning.

  • Time to Wake Up For Work: TTWUFW = Start Work At Time – (Morning Personal Prep Time + Drive Time To Work + Amount Of Buffer Time To Not Be Late).
  • Cost of groceries from the store: COGFTS = cost of all items purchased.

Each of these can be expanded to better approximate the actual outcomes.

Let’s take the Time to Wake Up For Work model as an example.

When creating this model we ask ourselves, “What are the components that contribute to or affect this outcome?”

You just got hired for a new job across town and want to know what time you need to wake up in the morning to arrive there at a comfortable time so you’re not late.

You begin working on your model based on what you know and make a list of the variables that affect the solution.

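Variables:

  • TTWUFW – Time To Wake Up For Work (the outcome)
  • SWAT – Start Work At Time
  • MPPT – Morning Personal Prep Time
  • DTTW – Drive Time To Work
  • AOBTTNBL – Amount Of Buffer Time To Not Be Late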

We place the outcome on the left side of an equation, then the equal sign, and on the right side we list all the components that contribute to or affect that outcome. We separate each component with a plus sign (+), showing that it adds or contributes to the outcome.

If we wrote this out we would say…

What time do you need to wake up for work (TTWUFW) if you start work at 8:00 am (SWAT), it takes you 30 minutes of morning prep time (MPPT), it takes 20 minutes to drive to work (DTTW), and you like to be at work 15 minutes before you start working (AOBTTNBL)?

In equation form using shortened labels for each component we get.

TTWUFW = SWAT + MPPT + DTTW + AOBTTNBL

Wait, that doesn’t look exactly like the first equation! You’re right, but this is closer to the form of a statistical model, where every component is joined with a plus sign, even though we all know from getting ready for work that some things contribute negatively to the outcome. Let’s not worry about the exact form quite yet; we will return to that later.

In the model for what time you need to wake up for work, begin with the time you need to start work and subtract from it the time for each of the pre-work activities. The result is the time you need to wake up for work.
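
As a small sketch of that subtraction, here is the wake-up model in Python using the numbers from the example above (start at 8:00 am, 30 minutes of prep, 20 minutes of driving, 15 minutes of buffer). The function name time_to_wake_up and the calendar date are placeholders I made up; only the clock time matters.

```python
from datetime import datetime, timedelta

def time_to_wake_up(start_work, prep_min, drive_min, buffer_min):
    """TTWUFW = SWAT - (MPPT + DTTW + AOBTTNBL), with durations in minutes."""
    total = timedelta(minutes=prep_min + drive_min + buffer_min)
    return start_work - total

# The date is arbitrary (an assumption); the example only cares about 8:00 am.
swat = datetime(2024, 1, 1, 8, 0)
ttwufw = time_to_wake_up(swat, prep_min=30, drive_min=20, buffer_min=15)
print(ttwufw.strftime("%H:%M"))  # 06:55
```

The three components add up to 65 minutes, so the model says to wake up at 6:55 am.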

To get a more accurate “prediction” of when we need to wake up for work, we expand each of the components as best we can to include their actual parts and rewrite the equation.

TTWUFW = SWAT – ( Time to Shower + Time to Shave + Time for personal Hygiene + Time to Dress + Time for Breakfast/Coffee + Time to Warm up Car + Time for Drive to Work + Amount of Buffer to Not Be Late + Unknown Unplanned Event time)
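
The expanded model can be sketched the same way. In this Python example every minute value is hypothetical; I picked them only to show how the components add up, so substitute your own. The dictionary name components is also just a label for this sketch.

```python
from datetime import datetime, timedelta

# Hypothetical minutes for each component (assumptions, not measurements).
components = {
    "shower": 10,
    "shave": 5,
    "personal hygiene": 5,
    "dress": 10,
    "breakfast/coffee": 15,
    "warm up car": 5,
    "drive to work": 20,
    "buffer to not be late": 15,
    "unknown unplanned event": 10,
}

swat = datetime(2024, 1, 1, 8, 0)  # start work at 8:00 am; the date is arbitrary
total_minutes = sum(components.values())
ttwufw = swat - timedelta(minutes=total_minutes)

print(total_minutes)             # 95
print(ttwufw.strftime("%H:%M"))  # 06:25
```

Change any single entry and the wake-up time moves by the same number of minutes, which is what the model is telling us.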

There is a practical limit to how detailed your “model” needs to be, but it is easy to see that the time used in any component affects what time you will need to wake up for work.