Relationship Between Preparation Time And Student Marks In A Statistics Class

Survey and Sampling Methods

Many Holmes Institute instructors believe that students need to spend at least 2 hours studying outside of class for every hour of lecture. They believe that the number of hours students spend preparing for the exam significantly affects their marks. In contrast, a few lecturers believe that the number of preparation hours does not materially affect students’ marks and that other factors should be considered. To study the relationship between the preparation time spent by each student (in hours) for the exam and the reported mark, a sample of 100 students was selected randomly from a large statistics class. The data are stored in the file named “ASSIGNMENTDATA” on the course website. Answer the 9 questions below:

A cross-sectional survey would be used: the researcher collects data from the respondents at a single point in time.

Simple random sampling could be used. This method gives every participant an equal chance of being included in the study and so reduces the chance of selection bias.

  1. On the basis of given data, determine the dependent and independent variables we should use, and why? Also, identify the data type(s) for each variable.

The dependent variable is the student’s mark, while the independent variable is the number of hours the student spends preparing for the exam. Preparation time is believed to influence the mark, so it is treated as the independent variable and the mark as the dependent variable.

  • Non-response: some participants might be unwilling to respond for their own reasons.
  • High cost of data collection: costs rise if the participants are widely dispersed.

Using 8 classes and intervals of 20 – 30, 30 – 40, etc. for both of the variables selected in question 3, develop a distribution table including class intervals, frequency, relative frequency and cumulative relative frequency for each variable. Then, draw a frequency histogram, a relative frequency histogram and a cumulative relative frequency histogram for each variable. Also, comment on the shape of the frequency histogram for each variable and provide reason(s) for your comment.

Class Interval    Frequency    Relative Frequency    Cumulative Relative Frequency
20-30             1            0.01                  0.01
30-40             8            0.08                  0.09
40-50             16           0.16                  0.25
50-60             20           0.20                  0.45
60-70             20           0.20                  0.65
70-80             17           0.17                  0.82
80-90             12           0.12                  0.94
90-100            6            0.06                  1.00

 

Class Interval    Frequency    Relative Frequency    Cumulative Relative Frequency
20-30             1            0.01                  0.01
30-40             5            0.05                  0.06
40-50             10           0.10                  0.16
50-60             17           0.17                  0.33
60-70             21           0.21                  0.54
70-80             22           0.22                  0.76
80-90             14           0.14                  0.90
90-100            10           0.10                  1.00
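The relative and cumulative relative frequency columns follow directly from the frequency counts. The sketch below recomputes them from the frequencies in the two tables above; the names `FIRST_TABLE`, `SECOND_TABLE` and `distribution` are illustrative only and are not part of the original assignment.

```python
# Class intervals and frequencies copied from the two distribution tables above.
CLASSES = ["20-30", "30-40", "40-50", "50-60", "60-70", "70-80", "80-90", "90-100"]
FIRST_TABLE = [1, 8, 16, 20, 20, 17, 12, 6]
SECOND_TABLE = [1, 5, 10, 17, 21, 22, 14, 10]


def distribution(frequencies):
    """Return (relative, cumulative relative) frequencies rounded to 2 places."""
    total = sum(frequencies)
    relative = [round(f / total, 2) for f in frequencies]
    cumulative = []
    running = 0
    for f in frequencies:
        running += f  # accumulate raw counts to avoid compounding rounding error
        cumulative.append(round(running / total, 2))
    return relative, cumulative


for name, freqs in (("first table", FIRST_TABLE), ("second table", SECOND_TABLE)):
    rel, cum = distribution(freqs)
    print(name)
    for label, f, r, c in zip(CLASSES, freqs, rel, cum):
        print(f"{label:>7} {f:>4} {r:>6.2f} {c:>6.2f}")
```

Accumulating the raw counts first and dividing afterwards keeps the last cumulative value at exactly 1.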

In the next three figures, we present the frequency histogram, the relative frequency histogram and the cumulative relative frequency histogram for the preparation time. The histograms help to visualize the distribution of the data.

Figure 1: Frequency Histogram for the preparation time

Figure 2: Relative Frequency Histogram for the preparation time

Figure 3: Cumulative Relative Frequency Histogram for the preparation time

The histograms (both frequency and relative frequency) of the preparation time show that the distribution is slightly left-skewed (it has a longer tail to the left). This is consistent with the mean (63.04) being below the median (64).

The next three figures present the frequency histogram, the relative frequency histogram and the cumulative relative frequency histogram for the student marks.

Descriptive Statistics and Analysis

Figure 4: Frequency Histogram for the student marks

Figure 5: Relative Frequency Histogram for the student marks

Figure 6: Cumulative Relative Frequency Histogram for the student marks

The histogram for the students’ marks shows that the distribution is skewed to the left (longer tail to the left). This is consistent with the mean (65.74) being below the median (68).

Draw and use an appropriate scatter plot to investigate the relationship between the two variables. Also, briefly explain the selection of each variable on the X and Y axes and the reason? Finally, draw the fitting line for the plotted observations.

Figure 7: A scatter plot of student’s marks against preparation time (number of hours)

As can be seen from the above plot, the X-axis is the preparation time while the Y-axis is the student’s marks. The independent variable goes on the X-axis, which is why preparation time was placed there, while the dependent variable goes on the Y-axis, which is why the student’s marks were placed there.

The above scatter plot shows evidence of a positive linear relationship between the two variables (preparation time and student marks). This means that an increase in the number of hours a student spends preparing for the exam tends to be accompanied by an increase in the marks obtained in that exam, and a decrease in preparation hours by a decrease in marks.

  1. Present the equation of the estimated fitting line (regression) in your answer to Question f. Then, estimate the effect of an increase in the independent variable by one unit on the dependent variable.

The estimated fitting line is approximately Marks = 28.984 + 0.583 × Preparation time. The value 28.984 is the intercept of the line, that is, the predicted mark when preparation time is zero. The slope, about 0.583, follows from the correlation coefficient in Table 4 and the standard deviations in Table 3 (0.5466 × 17.41 / 16.32); it means that a one-hour increase in the independent variable (preparation time) is associated with an increase of about 0.58 in the dependent variable (student’s marks), and a one-hour decrease with a decrease of about 0.58.
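As a cross-check, the slope and intercept of a simple regression line can be recovered from summary statistics alone, using slope = r × (s_y / s_x) and intercept = mean_y − slope × mean_x. The sketch below applies these formulas to the values reported in Table 3 and Table 4; the variable names are illustrative, and small differences from the original Excel output are possible because the inputs are rounded.

```python
# Values copied from Table 3 (descriptive statistics) and Table 4 (correlation).
r = 0.546556                    # correlation between preparation time and marks
mean_x, sd_x = 63.04, 16.32     # preparation time (hours)
mean_y, sd_y = 65.74, 17.41     # marks

# Least-squares slope and intercept expressed through summary statistics.
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

print(f"marks = {intercept:.3f} + {slope:.3f} * preparation_time")
```

With these inputs the intercept comes out at roughly 28.98 and the slope at roughly 0.58.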

  1. Prepare a numerical summary report about the data on the two variables by including the mean, median, range, variance, standard deviation, smallest and largest values, quartiles, interquartile range and the 30th percentile for each variable.

Table 3: Descriptive (summary) statistics for the preparation time and student marks

Statistic              PREPARATION TIME    MARK
Mean                   63.04               65.74
Median                 64                  68
Standard Deviation     16.32               17.41
Sample Variance        266.36              303.12
Range                  65                  75
Minimum                25                  25
Maximum                90                  100
1st Quartile           51                  54
3rd Quartile           76.25               78
Interquartile range    25.25               24
30th percentile        54                  58

Table 3 above presents the descriptive statistics for both the preparation time and the student marks. As can be seen, the average preparation time for the 100 sampled students was 63.04 hours, with a median of 64 hours. The lowest amount of time taken by a student to prepare for the exam was 25 hours while the highest was 90 hours. The standard deviation was 16.32 hours, roughly a quarter of the mean, so the preparation times are moderately spread around the mean.

Scatter Plot and Regression Analysis

On the other hand, the average student mark was 65.74, with the highest score being 100 and the lowest score recorded being 25. The median mark was 68. The standard deviation (SD = 17.41) is again roughly a quarter of the mean, so the marks are moderately spread around the mean.
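The statistics in Table 3 can be computed with Python’s standard `statistics` module. The sketch below is a minimal illustration on a small made-up list, not on the ASSIGNMENTDATA file (which is not reproduced here); the function name `summarize` and the choice of the "inclusive" quantile method are assumptions, and other software may use a different percentile convention.

```python
import statistics


def summarize(values):
    """Return the summary statistics listed in Table 3 for one variable."""
    data = sorted(values)
    # "inclusive" interpolates between the sample minimum and maximum.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    deciles = statistics.quantiles(data, n=10, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "variance": statistics.variance(data),  # sample variance (n - 1)
        "std_dev": statistics.stdev(data),      # sample standard deviation
        "min": data[0],
        "max": data[-1],
        "range": data[-1] - data[0],
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "p30": deciles[2],                      # third decile = 30th percentile
    }


# Hypothetical values for illustration only.
example = [2, 4, 4, 5, 7, 9, 10, 12]
for name, value in summarize(example).items():
    print(f"{name:>8}: {value}")
```

The range and interquartile range are derived from the other entries, which is also a quick way to check a summary table for internal consistency (for example, 90 − 25 = 65 and 76.25 − 51 = 25.25 for preparation time).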

Compute a numerical measurement which measures the strength and direction of the linear relationship between the two variables. Also, interpret this value.

Table 4: Correlation coefficient table

                    PREPARATION TIME    MARK
PREPARATION TIME    1
MARK                0.546556            1

As can be seen from the above table, there is a moderate positive linear relationship between the two variables (preparation time and student’s marks). The correlation coefficient is 0.5466. The positive sign means that students who spend more hours preparing for the exam tend to obtain higher marks, and students who spend fewer hours tend to obtain lower marks. The coefficient measures association only; on its own it does not show that extra preparation causes the higher marks.
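For reference, the Pearson correlation coefficient is the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations. The sketch below implements that formula directly; the function name `pearson_r` and the sample values are hypothetical and are not taken from the assignment data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical preparation hours and marks, for illustration only.
hours = [30, 45, 50, 62, 70, 85]
marks = [40, 52, 49, 70, 66, 88]
print(round(pearson_r(hours, marks), 4))
```

Values near +1 or −1 indicate a strong linear relationship and values near 0 a weak one, so 0.5466 sits in the moderate range.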

To determine whether or not the height of sons is related to father’s height (x1) and mother’s height (x2), data were gathered and part of the multiple regression Excel output is shown below. Fill in the table and answer the following questions.

The missing values in the table have been filled in red colour.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.5169
R Square             0.2672
Adjusted R Square    0.2635
Standard Error       8.0683
Observations         400

ANOVA
              df     SS          MS         F         Significance F
Regression    2      9421.58     4710.79    72.366    0.0000
Residual      397    25843.41    65.097
Total         399    35264.98

             Coefficients    Standard Error    t Stat     P-value
Intercept    93.8993         8.0072            11.7269    0.0000
X1           0.4849          0.0412            11.7772    0.0000
X2           -0.0229         0.0395            -0.5811    0.5615

  1. What is the standard error of estimate? What does this statistic tell you?

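The standard error of the estimate is 8.0683. This statistic measures the typical distance between the observed heights of the sons and the heights predicted by the regression line, in the units of the dependent variable; the smaller it is relative to the spread of the observed heights, the more accurate the predictions. Here it is not much smaller than the standard deviation of the sons’ heights implied by the ANOVA table (√(35264.98/399) ≈ 9.40), so predictions based on the father’s height (x1) and the mother’s height (x2) are only modestly more accurate than using the mean height alone.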

  1. What is the coefficient of determination? What does this statistic tell you?

The coefficient of determination is 0.2672; this statistic tells us that 26.72% of the variation in the dependent variable (height of son) is explained by the two independent variables (father’s height (x1) and mother’s height (x2)).

  1. What is the adjusted coefficient of determination for degrees of freedom? What do this statistic and the one referred to in part (b) tell you about how well the model fits the data?

The adjusted coefficient of determination is 0.2635. It adjusts the coefficient of determination for the number of independent variables and the sample size, so it rises only when an added variable improves the model by more than would be expected by chance. Both statistics describe the proportion of variation in the dependent variable that is explained by the independent variables; the larger their values, the better the model fits the data. Here the two values are close (0.2672 and 0.2635), and both indicate that the model explains only about a quarter of the variation in the sons’ heights.
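The regression statistics can be recomputed from the ANOVA sums of squares, which is a useful check when filling in missing cells. The sketch below uses R² = SSR / SST, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1) and standard error = √(SSE/(n − k − 1)), with the values from the summary output above; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table in the summary output.
ss_regression = 9421.58
ss_residual = 25843.41
n = 400          # observations
k = 2            # independent variables (x1, x2)

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
standard_error = math.sqrt(ss_residual / (n - k - 1))
multiple_r = math.sqrt(r_squared)

print(f"Multiple R        : {multiple_r:.4f}")
print(f"R Square          : {r_squared:.4f}")
print(f"Adjusted R Square : {adjusted_r_squared:.4f}")
print(f"Standard Error    : {standard_error:.4f}")
```

The results agree with the reported 0.5169, 0.2672, 0.2635 and 8.0683 to the precision shown.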

  1. Test the overall utility of the model. What does the test result tell you?

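As can be seen from the ANOVA table, the overall model is statistically significant at the 5% level of significance [F(2, 397) = 72.366, p < 0.001]. This indicates that at least one of the two independent variables is linearly related to the height of the son.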
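The F statistic is the ratio of the regression mean square to the residual mean square, each obtained by dividing a sum of squares by its degrees of freedom. A minimal sketch with the ANOVA values from the summary output (variable names are illustrative):

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_regression, df_regression = 9421.58, 2
ss_residual, df_residual = 25843.41, 397

ms_regression = ss_regression / df_regression   # mean square for regression
ms_residual = ss_residual / df_residual         # mean square error
f_statistic = ms_regression / ms_residual

print(f"MS regression = {ms_regression:.2f}")
print(f"MS residual   = {ms_residual:.3f}")
print(f"F({df_regression}, {df_residual}) = {f_statistic:.3f}")
```

The denominator degrees of freedom are the residual degrees of freedom (n − k − 1 = 397), not the total (399).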

The coefficient of father’s height (x1) is 0.4849; this means that, holding the mother’s height constant, a one-unit increase in the father’s height is associated with an increase in the height of the son of 0.4849 units.

The coefficient of mother’s height (x2) is -0.0229; this means that, holding the father’s height constant, a one-unit increase in the mother’s height is associated with a decrease in the height of the son of 0.0229 units.

The intercept coefficient is 93.8993; this is the predicted height of the son when both the father’s height and the mother’s height are zero. Since heights of zero lie far outside the observed data, the intercept has no practical interpretation on its own.
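Taken together, the three coefficients define the fitted equation ŷ = 93.8993 + 0.4849·x1 − 0.0229·x2. The sketch below wraps it in a small function; the function name and the input heights are hypothetical, and the unit of measurement is whatever the original data used (it is not stated in the output).

```python
# Coefficients copied from the regression output.
INTERCEPT = 93.8993
B_FATHER = 0.4849    # x1
B_MOTHER = -0.0229   # x2


def predict_son_height(father_height, mother_height):
    """Predicted son's height from the fitted multiple regression equation."""
    return INTERCEPT + B_FATHER * father_height + B_MOTHER * mother_height


# Hypothetical heights, for illustration only.
base = predict_son_height(170, 160)
taller_father = predict_son_height(171, 160)
print(f"prediction: {base:.2f}")
print(f"effect of one extra unit of father's height: {taller_father - base:.4f}")
```

Changing one input by one unit while holding the other fixed changes the prediction by exactly that variable’s coefficient, which is the interpretation given above.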

  1. Do these data allow the statistic practitioner to infer that the heights of the sons and the fathers are linearly related?

Yes, the data allow the statistics practitioner to infer that the heights of the sons and the fathers are linearly related. This is because the coefficient of the father’s height (x1) is statistically significant in the model (t = 11.7772, p < 0.001).

  1. Do these data allow the statistic practitioner to infer that the heights of the sons and the mothers are linearly related?

No, the data do not allow the statistics practitioner to infer that the heights of the sons and the mothers are linearly related. This is because the coefficient of the mother’s height (x2) is not statistically significant in the model (t = -0.5811, p = 0.5615).
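Each t statistic in the output is the coefficient divided by its standard error. The sketch below recomputes them from the rounded table values and compares their magnitude with an approximate two-sided 5% critical value of 1.97 for 397 degrees of freedom; that critical value is an approximation supplied here, not a figure from the original output, and small differences from the reported t statistics come from rounding of the inputs.

```python
# (coefficient, standard error) pairs copied from the regression output.
COEFFICIENTS = {
    "Intercept": (93.8993, 8.0072),
    "X1": (0.4849, 0.0412),
    "X2": (-0.0229, 0.0395),
}

# Approximate two-sided 5% critical value for t with 397 degrees of freedom
# (an assumption for illustration; use a t table or software for exact values).
CRITICAL_T = 1.97

t_stats = {name: coef / se for name, (coef, se) in COEFFICIENTS.items()}

for name, t in t_stats.items():
    verdict = "significant" if abs(t) > CRITICAL_T else "not significant"
    print(f"{name:>9}: t = {t:8.4f} -> {verdict} at the 5% level")
```

Only the mother’s height (X2) falls below the critical value, matching the conclusions drawn from the p-values.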