ECN 425: Introduction to Econometrics

Alvin Murphy, Arizona State University: Fall 2018

Assignment #1

Due at the beginning of class on Thursday, September 6th

PART I: DERIVING OLS ESTIMATORS

(You must show all work to receive full credit)

1) Suppose the population regression function can be written as: $y = \beta_0 + \beta_1 x + u$, where
$E(u) = 0$ and $E(u \mid x) = 0$. The sample equivalents to these two restrictions imply:

$\frac{1}{n}\sum_{i=1}^{n} \hat{u}_i = 0$ and $\frac{1}{n}\sum_{i=1}^{n} x_i \hat{u}_i = 0$.

Parts (a)-(c) of this problem ask you to derive the OLS estimators for $\beta_0$ and $\beta_1$. Please show all of your work.

(20 points: 5/5/10)

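(a) Use $\frac{1}{n}\sum_{i=1}^{n} \hat{u}_i = 0$ to demonstrate that the OLS estimator for $\beta_0$ can be written as:

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.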

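(b) Use $\frac{1}{n}\sum_{i=1}^{n} x_i \hat{u}_i = 0$ together with the result from (a) to demonstrate that the OLS
estimator for $\beta_1$ can be written as:

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}$.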

(c) Use your result from (b) together with the definition of the variance and covariance to
demonstrate that $\hat{\beta}_1 = \dfrac{\mathrm{cov}(x_i, y_i)}{\mathrm{var}(x_i)}$.
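As a numerical sanity check on the expressions in parts (a)-(c), the sketch below evaluates them on a small made-up data set. It is not a derivation and does not replace the work the problem asks for. The data values and variable names are assumptions chosen only for illustration, and Python is used here rather than Stata.

```python
# Made-up data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.1, 2.9, 5.2, 7.8, 12.5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope in the form given in part (b).
slope_b = sum(xi * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    xi * (xi - x_bar) for xi in x
)

# Slope in the form given in part (c): sample covariance over sample variance.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
slope_c = cov_xy / var_x

# Intercept in the form given in part (a), then the fitted residuals.
intercept = y_bar - slope_b * x_bar
residuals = [yi - intercept - slope_b * xi for xi, yi in zip(x, y)]

# The two sample restrictions should hold up to floating-point error.
mean_resid = sum(residuals) / n
mean_x_resid = sum(xi * ui for xi, ui in zip(x, residuals)) / n

print(slope_b, slope_c, mean_resid, mean_x_resid)
```

If the algebra in (a)-(c) is right, the two slope values agree and both residual averages are zero up to rounding error.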


2) Suppose the population regression function is $y_i = \beta_0 + \beta_1 z_i + u_i$, and you estimate the
following sample regression function: $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$, where $x \neq z$.

(20 points: 10/10)

(a) Express your estimator, $\hat{\beta}_1$, in terms of the data and parameters of the population
regression function, $\beta_1$, $x_i$, $z_i$, and $u_i$.

(b) Use your result from (a) to demonstrate that $\hat{\beta}_1$ is generally a biased estimator for $\beta_1$.

PART II: USING A FAKE DATA EXPERIMENT TO INVESTIGATE OLS ESTIMATORS

A fake data experiment can be a useful way to investigate the properties of an estimator. This

process begins by specifying the “true” economic model (i.e. the population regression

function). The next step is to use this model to generate some data that represent a population.

Finally, by taking repeated samples from the population and using these samples to estimate the

sample regression function several times, you can evaluate how well your estimator performs

(e.g. bias and variance) under specific conditions.
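The three steps described above can be sketched in a few lines of code. The example below is a minimal illustration in Python rather than Stata. The "true" parameter values, the error distribution, the seed, and all names are assumptions made up for this sketch and are unrelated to any course data file. Samples are drawn with replacement, as Stata's bsample does.

```python
import random


def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = sum(xi * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        xi * (xi - x_bar) for xi in x
    )
    return y_bar - slope * x_bar, slope


rng = random.Random(425)  # fixed seed so the sketch is reproducible

# Step 1: specify a "true" model (hypothetical parameter values).
TRUE_B0, TRUE_B1 = 2.0, 0.5

# Step 2: generate a population of 500 observations from that model.
z = [rng.uniform(0, 10) for _ in range(500)]
y = [TRUE_B0 + TRUE_B1 * zi + rng.gauss(0, 1) for zi in z]
pop_b0, pop_b1 = ols(z, y)

# Step 3: draw 20 samples of 25 observations (5%) and re-estimate each time.
slopes = []
for _ in range(20):
    idx = [rng.randrange(500) for _ in range(25)]  # sampling with replacement
    _, b1 = ols([z[i] for i in idx], [y[i] for i in idx])
    slopes.append(b1)

# Estimated bias: average sample estimate minus the population value.
bias_b1 = sum(slopes) / len(slopes) - pop_b1
print(f"population slope: {pop_b1:.3f}, estimated bias: {bias_b1:.3f}")
```

With a correctly specified model, the estimated bias should be close to zero, although any single sample slope can miss the population value by a wider margin.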

3) In this problem, you will use a fake data experiment to demonstrate the importance of correctly specifying the form of the sample regression function. More precisely, you will

compare the bias of the OLS estimator when the model is correctly specified, to the bias

when the model is incorrectly specified to use the wrong explanatory variable. In the file

“fake1.dta”, I have generated a population of 500 observations from the (true) regression

equation: $y = \beta_0 + \beta_1 z + u$, such that $E(u) = 0$, $E(u \mid z) = 0$, and $\mathrm{var}(u \mid z) = \sigma^2$.

(25 points: 5/5/5/5/5)

a) Use these data to calculate the population parameters $\beta_0$ and $\beta_1$. What are they? Please

use 2 decimal places.


b) Now, take a random 5% sample from the population and discard the remaining observations. This can be done using the command “bsample round(0.05*_N)”. Use

this random sample to calculate OLS estimates for $\beta_0$ and $\beta_1$. Report your results out to

3 decimal places.

c) Repeat part (b) 19 more times, saving the values for $\hat{\beta}_0$ and $\hat{\beta}_1$ on each iteration. Thus,
on each iteration you are reloading “fake1.dta”, taking a new randomly-chosen 5%
sample, and using that sample to generate estimates for $\beta_0$ and $\beta_1$. Save your results
from all 20 iterations in a table and use them to calculate $\mathrm{bias}(\hat{\beta}_0)$ and $\mathrm{bias}(\hat{\beta}_1)$. Of your 20 samples, what is the closest and the farthest that you come from recovering the true
values of $\beta_0$ and $\beta_1$ in any individual sample? Report the following statistics:

$\min|\hat{\beta}_0 - \beta_0|$, $\min|\hat{\beta}_1 - \beta_1|$, $\max|\hat{\beta}_0 - \beta_0|$, and $\max|\hat{\beta}_1 - \beta_1|$ (see footnote 1).

d) Repeat the exercise in parts (b) and (c), except this time you will incorrectly replace z
with x on each of the 20 iterations. Report $\mathrm{bias}(\hat{\beta}_1)$, $\min|\hat{\beta}_1 - \beta_1|$, and $\max|\hat{\beta}_1 - \beta_1|$.

e) Are your sample results from part (c) for $\mathrm{bias}(\hat{\beta}_0)$ and $\mathrm{bias}(\hat{\beta}_1)$ consistent with the theoretical properties of correctly specified OLS estimators? Are your sample results

from part (d) consistent with what you learned from problem #2 about the theoretical

properties of an OLS estimator that is incorrectly specified to use the wrong explanatory

variable? Please explain your answers.

Footnote 1. Stata hint: After typing in the commands for the first iteration in part (b), you can use the review window to click
on those same commands 19 more times, rather than typing them again.


PART III: EMPIRICAL ANALYSIS (see footnote 2)

4) Use airfare.dta to answer the following questions. (15 points: 5/5/5)

(i) Report the mean, standard deviation, minimum, and maximum airfare for: (a) one-way
flights less than 500 miles, (b) one-way flights between 500 and 1000 miles, (c) one-way
flights between 1000 and 2000 miles, and (d) one-way flights over 2000 miles.

(ii) Estimate a regression model where a one mile increase in flight distance changes the

fare by a constant dollar amount. Use your result to predict the price of flying 250 miles.

(iii) Now estimate a regression model where a one percent increase in flight distance

leads to a constant percentage change in price. Use your result to report the elasticity of

airfare to flight distance.
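The difference between the two functional forms in (ii) and (iii) can be illustrated on made-up numbers. The sketch below is in Python rather than Stata and does not use airfare.dta. The distances, the fare formula, and the variable names are all assumptions invented for the illustration. The fares are constructed with a constant elasticity of 0.4, so the log-log regression should recover that value.

```python
import math


def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = sum(xi * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        xi * (xi - x_bar) for xi in x
    )
    return y_bar - slope * x_bar, slope


# Hypothetical data: fares built with a constant elasticity of 0.4.
distance = [200.0, 450.0, 800.0, 1500.0, 2500.0]
fare = [3.0 * d ** 0.4 for d in distance]

# Level-level model: one more mile changes the fare by a constant dollar amount.
a0, a1 = ols(distance, fare)
predicted_250 = a0 + a1 * 250

# Log-log model: the slope is the elasticity of fare with respect to distance.
g0, elasticity = ols([math.log(d) for d in distance], [math.log(f) for f in fare])

print(f"dollars per mile: {a1:.4f}, predicted fare at 250 miles: {predicted_250:.2f}")
print(f"elasticity: {elasticity:.4f}")
```

In the level-level model the slope is read in dollars per mile, while in the log-log model the slope is read directly as a percentage response to a one percent change in distance.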

5) Is there adverse selection in the market for health care? I have obtained state-level data on health outcomes for the share of the population with health insurance for 2004, 2005, 2008,

2009, and 2010. These data are from the Behavioral Risk Factor Surveillance System. This

question asks you to investigate the data, run some regressions, and interpret the results. The

file BRFSS.dta contains data on state population, state population with health insurance, and

health outcomes for the insured population.

(20 points: 5/5/5/5)

a) Generate a variable, share_insured, that measures the share of the state population with health insurance. Report summary statistics for share_insured for each year (mean, st. dev., min, max). Did the share of people with health insurance in the average state
increase during the 2000s?

b) Use a simple linear regression model to estimate how the share of people with health insurance impacts health outcomes for the insured population. Report slope coefficients,

their standard errors, the number of observations, and the $R^2$ in the table below.

Footnote 2. Stata hint: you might find the “if” qualifier and the “bysort” prefix helpful on this part of the assignment.


Dependent variable                              Slope coefficient   (Std. error)   N     R2
avg. days health not good                       ___                 (___)          ___   ___
avg. days health prevented regular activity     ___                 (___)          ___   ___
disability (%)                                  ___                 (___)          ___   ___
use equipment because of disability (%)         ___                 (___)          ___   ___
exercise in past month (%)                      ___                 (___)          ___   ___
asthma (%)                                      ___                 (___)          ___   ___
diabetes (%)                                    ___                 (___)          ___   ___

c) Based on your results, how would increasing the share of the state population with health insurance by 1% affect the average days that insured consumers report their health is not

good? How would it affect the percentage of the insured consumers with diabetes? Are

these results consistent with the presence of adverse selection in the market for health

insurance? Explain your answer.

d) Does it seem reasonable to expect that the model we estimated in part (b) provides an unbiased estimator for the impact of health insurance on health outcomes in the insured

population? If so, justify your answer by explaining why you suspect SLR.1 through

SLR.4 are satisfied. If not, explain why you suspect one or more of the four SLR

assumptions are violated.