I don’t think the person exists who doesn’t take a step backwards when they first hear they need to perform regressions, especially the dreaded Multi-Variant Linear Regression (what most textbooks call multiple linear regression).

Let me help everyone who is frightened by this. It looks scary, and the formulas look like some secret spy code, but the good news is that most of us have done a similar analysis; we just didn’t write the regression model out.

Simply put, regressions put the “variable of interest” (the dependent variable) on one side of the equation and the variables that we believe contribute to or explain that variable of interest on the other side.

That explanation sounds scary even to me, as simplified as it is.

The example used in many textbooks is the salary analysis.

Salary = years of experience + years of education + average salary in a field

When it is put into the standard form, we shorten the variable names and add some requisite pieces to the equation that I’ll explain in a minute.

Variables

S – Salary (the dependent variable from dataset)

E – years of Experience (Independent variable from dataset)

D – years of Education (Independent variable from dataset)

F – average Salary in Field (Independent variable from dataset)

α – the Y-axis intercept

β – the variable’s coefficient; there is one for each variable, so subscripts are assigned to them: β1, β2, β3…

μ – the error term; effectively a stand-in for all the possible variables we do not know of or do not have data for.

S = α + β1E + β2D + β3F + μ
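To make the equation less abstract, here is a minimal sketch of how α and the β values could be estimated with ordinary least squares, in plain Python. The eight data rows are invented for illustration (salary and field average in thousands of dollars), and `fit_ols` is a hypothetical helper name, not a library function. In practice you would let SAS or another statistics package do this step.

```python
def fit_ols(rows, y):
    """Least-squares fit of y = a + b1*x1 + ... + bk*xk.

    rows: list of [x1, ..., xk] observations; y: list of outcomes.
    Returns [a, b1, ..., bk].
    """
    # Design matrix with a leading 1 in each row for the intercept (alpha).
    X = [[1.0] + [float(v) for v in r] for r in rows]
    n = len(X[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Hypothetical observations: [E (experience), D (education), F (field average)]
observations = [
    [1, 12, 50], [3, 16, 60], [5, 14, 55], [7, 18, 80],
    [10, 16, 70], [12, 12, 52], [15, 20, 90], [20, 16, 65],
]
salaries = [42, 55, 52, 72, 68, 56, 88, 78]  # S, made-up values

alpha, b1, b2, b3 = fit_ols(observations, salaries)
print(f"S = {alpha:.2f} + {b1:.2f}*E + {b2:.2f}*D + {b3:.2f}*F")

# Whatever the fitted line misses in each row plays the role of the error term.
residuals = [s - (alpha + b1 * e + b2 * d + b3 * f)
             for (e, d, f), s in zip(observations, salaries)]
print([round(r, 2) for r in residuals])
```

The residuals printed at the end are the sample counterpart of μ: the part of each salary that the three variables did not account for.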

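The three parts you can’t do without: α, β, and μ.

α – the Y-axis intercept: the value of the dependent variable when every independent variable is zero.

β – the coefficient of a variable, which measures how much that variable “explains” the “variable of interest”. It may look like a simple share, as in “this variable contributes 45% of the variable of interest”, but that IS NOT the coefficient’s function in the regression. A coefficient tells you how much the dependent variable changes when its variable goes up by one unit while the other variables stay the same.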
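A few lines of Python can show a coefficient acting as a rate rather than a percentage. The numbers below are hypothetical (salary in thousands of dollars) and are only there to illustrate the arithmetic: with β1 = 1.5, one extra year of experience raises the predicted salary by 1.5, whatever the other values are.

```python
# Hypothetical coefficients, not estimated from any real dataset.
ALPHA = 10.0    # intercept
B_EXP = 1.5     # beta1: per year of experience
B_EDU = 2.0     # beta2: per year of education
B_FIELD = 0.5   # beta3: per unit of field-average salary


def predict_salary(experience, education, field_avg):
    """Predicted S from the model S = alpha + b1*E + b2*D + b3*F (no error term)."""
    return ALPHA + B_EXP * experience + B_EDU * education + B_FIELD * field_avg


base = predict_salary(5, 16, 60)       # 10 + 7.5 + 32 + 30
one_more = predict_salary(6, 16, 60)   # same person, one more year of experience
print(base, one_more, one_more - base)  # 79.5 81.0 1.5
```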

μ – the error term, sometimes written as U (unknown), e, or ε (error).

The goal of a regression is to test how well the variables we hypothesize explain the dependent variable of interest. It would be great if our hypothesized equation (the model) explained 100% of the variable of interest. Well, let me tell you, that won’t happen. The μ error term stands in for whatever our variables leave out, but the variables themselves still won’t get us all the way there.
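One common yardstick for “how well” is R-squared, the share of the variation in the dependent variable that the model accounts for. The post does not name it, so treat this as an assumed choice of measure. The sketch below computes it from a handful of invented actual and predicted salaries (in thousands); a value of 1.0 would be the 100% that never happens in practice.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (unexplained variation / total variation)."""
    mean = sum(actual) / len(actual)
    ss_total = sum((a - mean) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical values for illustration only.
actual_salary = [40, 50, 60, 70, 80]
model_salary = [42, 48, 61, 69, 80]

print(f"{r_squared(actual_salary, model_salary):.2f}")  # 0.99
```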

It is not uncommon for one of the independent variables to be found to not explain the dependent variable at all, at which point a new model omitting that variable is in order, unless we want to support the hypothesis that the variable has nothing to do with the dependent variable.

If we go back to the above model and find that D (years of education) did not explain salary, we may be inclined to re-specify the model without variable D. Personally, I would re-examine the datasets for issues, because logically we probably all believe years of education directly affects salary. Why else would we have studied stats?

# All posts by Ronald Finnerty

# Modeling

One of the biggest hurdles to overcome when dealing with statistics is the vocabulary and the very intimidating use of Greek symbols in the equations …er, **models**.

Modeling is something we all do in everyday living and decision making. A model defines a pattern for something that will happen or that has already happened.

*Note: In the world of statistics, the latter is common during statistical analysis, while the former is in the planning stages for a study or in Predictive Statistics for business planning.*

- Time to Wake Up For Work: TTWUFW = Start Work At Time – (Morning Personal Prep Time + Drive Time To Work + Amount Of Buffer Time To Not Be Late).
- Cost of groceries from the store: COGFTS = cost of all items purchased.

Each of these can be expanded to better approximate the actual outcomes.

Let’s take the Time to Wake Up For Work model as an example.

When creating this model, we ask ourselves, “What are the components that contribute to or affect this outcome?”

You just got hired for a new job across town and want to know what time you need to wake up in the morning to arrive there at a comfortable time so you’re not late.

You begin working on your model based on what you know and make a list of the variables that affect the solution.

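Variables:

- TTWUFW – Time To Wake Up For Work (the outcome)
- SWAT – Start Work At Time
- MPPT – Morning Personal Prep Time
- DTTW – Drive Time To Work
- AOBTTNBL – Amount Of Buffer Time To Not Be Late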

We place the outcome on the left side of an equation, then the equal sign, and on the right side we list all the components that contribute to or affect that outcome. We separate each component with a plus sign (+), showing that it adds to the outcome.

If we wrote this out we would say…

What time do you need to wake up for work (TTWUFW) if you start work at 8:00 am (SWAT), it takes you 30 minutes of morning prep time (MPPT), it takes 20 minutes to drive to work (DTTW), and you like to be at work 15 minutes before you start working (AOBTTNBL)?

In equation form, using shortened labels for each component, we get:

TTWUFW = SWAT + MPPT + DTTW + AOBTTNBL

Wait, that doesn’t look exactly like the first equation! You’re right, but this is closer to the exact form of a statistical model, even though we all know from getting ready for work that some things contribute negatively to the outcome. Let’s not worry about the exact form quite yet; we will return to that later.

In the wake-up model, begin with the time you need to start work and subtract the time for each of the pre-work activities; the result is the time you need to wake up for work.
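Plugging in the numbers from the example (start at 8:00 am, 30 minutes of prep, 20 minutes of driving, 15 minutes of buffer) gives a quick check of the subtraction form. This is just a small Python sketch; the calendar date is an arbitrary placeholder needed by `datetime`, and the variable names are my own.

```python
from datetime import datetime, timedelta

# SWAT: start work at 8:00 am (the date itself is arbitrary).
start_work = datetime(2024, 1, 1, 8, 0)

prep = timedelta(minutes=30)     # MPPT
drive = timedelta(minutes=20)    # DTTW
buffer = timedelta(minutes=15)   # AOBTTNBL

# TTWUFW = SWAT - (MPPT + DTTW + AOBTTNBL)
wake_up = start_work - (prep + drive + buffer)
print(wake_up.strftime("%H:%M"))  # 06:55
```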

To get a more accurate “prediction” of when we need to wake up for work, we expand each of the components, as best we can, into its actual parts and rewrite the equation.

TTWUFW = SWAT – ( Time to Shower + Time to Shave + Time for personal Hygiene + Time to Dress + Time for Breakfast/Coffee + Time to Warm up Car + Time for Drive to Work + Amount of Buffer to Not Be Late + Unknown Unplanned Event time)

There is a practical limit to how detailed your “model” needs to be, but it is easy to see that the time used in any component affects what time you will need to wake up for work.
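Here is a rough sketch of the expanded model in Python, showing how a change in a single component moves the wake-up time. Every minute value below is made up for illustration, and `wake_up_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Hypothetical minutes for each component of the expanded model.
components = {
    "shower": 10, "shave": 5, "personal hygiene": 5, "dress": 10,
    "breakfast/coffee": 15, "warm up car": 5, "drive to work": 20,
    "buffer to not be late": 15, "unknown unplanned event": 10,
}


def wake_up_time(start_work, minutes_by_component):
    """Start time minus the total of all component times."""
    total = sum(minutes_by_component.values())
    return start_work - timedelta(minutes=total)


start_work = datetime(2024, 1, 1, 8, 0)  # 8:00 am; the date is arbitrary

baseline = wake_up_time(start_work, components)
print(baseline.strftime("%H:%M"))  # 06:25

# If the drive takes 30 minutes instead of 20, the answer moves 10 minutes earlier.
slower_drive = {**components, "drive to work": 30}
print(wake_up_time(start_work, slower_drive).strftime("%H:%M"))  # 06:15
```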

# Fair Access to Publicly Funded Research Results

The EFF (Electronic Frontier Foundation) has an initiative to get public access to publicly funded research.

I have mixed feelings on this issue.

On the one hand, we as taxpayers have paid for this research and should have a right to read it. We shouldn’t have to subscribe to extremely expensive research data stores that charge for access, as we currently do. Not having to wait for the peer-review journal process to publish research will mean faster access to more current research. On that I agree.

BUT much government-funded research produces garbage because it was poorly conceived, poorly executed, or done with an agenda that is contrary to the public good and truthfulness. Many people won’t be able to grasp how the data was analyzed, as this is rarely disclosed fully in the research. The best system we currently have is the academic or peer journal review process. This peer review/rewrite process can take multiple years, during which the value of time-critical research results diminishes. But having peers review the research can save many from acting on the results of a bad piece of research.

Refer to the article by economists Carmen Reinhart and Kenneth Rogoff about “austerity”, which “has shaped political decisions over the best way to deal with foundering economies.” Many governments have based financial strategies on this NON-peer-reviewed paper, and less than 3 years after publication a graduate student at Univ. of Mass. found math errors and omissions, leading some to say “the Reinhart/Rogoff claim was ideology, not social science.” [ibid]

Care should be taken when reviewing Research and Statistical Analysis. NEVER look at research as if it is a sound bite by a politician.

Still, I think the public might have challenged the research more quickly if there were government laws requiring disclosure.

To be fair to Carmen Reinhart and Kenneth Rogoff, they not only released their research to the doctoral student Thomas Herndon; he said recently on The Colbert Report talk show that they also sent him the original spreadsheets they had used in the calculations, and it was on page one, near the top, that he found the math error. Proof of the value of peer review.

I include below the text of the message that is sent to your representatives in Washington:

As your constituent, I am urging you to support the Fair Access to Science & Technology Research Act (FASTR is S. 350 in the Senate and H.R. 708 in the House).

Government agencies like the National Science Foundation invest millions of taxpayer dollars into scientific research every year, but the resulting research is locked up in expensive journals. As a result students and citizens have difficulty accessing information they need; professors have a harder time reviewing and teaching the state of the art; and cutting-edge research remains hidden.

FASTR fixes this. The bill makes government agencies design and implement a plan to facilitate public access to the results of their investments. Any researcher who receives federal funding must submit a copy of resulting journal articles to the funding agency, which will then make that research widely available within six months.

Please secure our rights as taxpayers and promote the progress of science by supporting FASTR.

If YOU would like to let your representatives in Washington D.C. know you support passage of this bill, you can have an email sent to them by going to this web address: Tell my representative I want Publically Funded Research Available to the Public. Just enter your zip code in the box on the right side; it will look up who your representatives are and send the message to them.

For California there is an additional initiative. You can let your representatives know by clicking here: Let California representatives know I want state funded research to be available to the public.

# Cross Functional software

Most software these days can perform many functions. Microsoft

ZZZ EXCEL AND SAS an

# Graphing and graphics

# Correlation

# Sampling: Sample Size and Population

In statistics, it is generally believed that the most accurate analysis is done with ALL of the data in a data set.

This is referred to as the Population of the data set notated as

# Learning SAS >>>do you want certification too?

# SAS

I came into the SAS world neither at a business nor during my college studies.

I was exposed to SAS because the PhD students I was helping with database issues needed their databases to work with it: SAS was their tool for statistical analysis.

So the first thing I learned was importing and exporting datasets.

I took a crash course in variables and functions, then decided I would attend the SAS Institute workshops and get the SAS Base Programmer certification.

I recommend you have access to a working copy of SAS to practice on and be familiar with SAS before going through the workshops; it will be much less stressful.

As a learning tool, SAS has a version of their Enterprise Guide program available for license for around $200 per year. Its biggest limitations currently are a limited ability to work with Microsoft Excel worksheets and files and the inability to use your own datasets. SAS, the company, obviously wants you to purchase their full commercial products to do your own data analyses. They sell annual usage licenses based on which functional modules you need, and each module typically costs $2,000 to $10,000 per year. That is cost prohibitive for most students.

Luckily, most students have a supportive professor who will allow them to use a license for research purposes, but that’s not guaranteed.

SAS has released some less expensive products since 2012 and

If you are studying for the Certification exams follow this link.

If you are wondering how to code your own statistical analysis follow this link.

I will try to tie together the statistics theory and the code snippets to help you get the job done.
