Exploratory Data Analysis And Linear Regression Model For House Prices.csv Data Set

Exploratory Data Analysis

Exploratory data analysis (EDA) was performed on the given data set in the RapidMiner software platform (Andrews, Sanchez & Johansson, 2011). During EDA the data file was connected to the Select Attributes operator to exclude variables such as id and date, which are nominal or ordinal in nature. The zip code variable was also excluded because its descriptive statistics were redundant. The Select Attributes operator was then connected to the result port of the process window to obtain descriptive statistics for all other variables (Iacoviello & Neri, 2010).

  1. Summarization of key results of exploratory data analysis:

Table 1.1: Summary of EDA values for task 1. (i)

Field name | Maximum | Minimum | Missing values | Mean | Standard deviation | Mode
price | $7700000.0 | $75000.0 | 0 | $540088.14 | $367127.19 | $450000
bedrooms | 33.0 | 0.0 | 0 | 3.37 | 0.93 | 3
bathrooms | 8.0 | 0.0 | 0 | 2.11 | 0.77 | 2.5
sqft_living | 13540.0 | 290.0 | 0 | 2079.89 | 918.44 | 1300
sqft_lot | 1651359.0 | 520.0 | 0 | 15106.96 | 41420.51 | 5000
floors | 3.5 | 1.0 | 0 | 1.49 | 0.53 | 1
waterfront | 1.0 | 0.0 | 0 | 0.01 | 0.08 | 0
view | 4.0 | 0.0 | 0 | 0.23 | 0.76 | 0
condition | 5.0 | 1.0 | 0 | 3.40 | 0.65 | 3
grade | 13.0 | 1.0 | 0 | 7.65 | 1.17 | 7
sqft_above | 9410.0 | 290.0 | 0 | 1788.39 | 828.09 | 1300
sqft_basement | 4820.0 | 0.0 | 0 | 291.50 | 442.57 | 0
sqft_living15 | 6210.0 | 399.0 | 0 | 1986.55 | 685.39 | 1540
sqft_lot15 | 871200.0 | 651.0 | 0 | 12768.45 | 27304.17 | 5000
lat | 47.77 | 47.15 | 0 | 47.56 | 0.139 | (not reported)
long | -121.31 | -122.52 | 0 | -122.21 | 0.141 | (not reported)

Exploratory data analysis found the mean price of the housing projects to be $540088.14, with a standard deviation (S.D.) of $367127.19. The average numbers of bedrooms and bathrooms were 3.37 (S.D. = 0.93) and 2.11 (S.D. = 0.77), respectively. Housing projects with three bedrooms and roughly two bathrooms (modes of 3 and 2.5) were typical of the 2014-2015 period in the United States. The average living area was 2079.89 square feet with an S.D. of 918.44 square feet, and the mode of living space was 1300 square feet (Mohit, Ibrahim & Rashid, 2010). The analysis also revealed that very few projects had a waterfront location or a view (means of 0.01 and 0.23). The mean grade was 7.65 (S.D. = 1.17), and the housing projects were located on average at latitude 47.56 (S.D. = 0.14) and longitude -122.21 (S.D. = 0.14).
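The per-column statistics reported in Table 1.1 can be reproduced outside RapidMiner. The following is a minimal sketch using only the Python standard library; the function name `summarize` and the toy `bedrooms` column are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the report does not say which form RapidMiner displays.

```python
import statistics


def summarize(values):
    """Return the Table 1.1 statistics for one numeric column.

    `values` may contain None to mark a missing entry.
    """
    present = [v for v in values if v is not None]
    return {
        "maximum": max(present),
        "minimum": min(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        # Sample standard deviation (n - 1 denominator): an assumption,
        # the report does not state which form was used.
        "std_dev": statistics.stdev(present),
        "mode": statistics.mode(present),
    }


# Hypothetical toy column, not taken from the data set.
bedrooms = [3, 2, 4, 3, None, 3, 5]
summary = summarize(bedrooms)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Applying `summarize` to each numeric column of the CSV file would give one row of the table per column.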

(iii) The number of bathrooms, the total square feet of living space (sqft_living), the location of the project (latitude), the grade of the materials used for construction (grade) and the year the house was built (yr_built) were the main variables used to predict housing prices in the model. Pearson's correlation was computed between all the relevant variables from the EDA process, and price was found to be significantly correlated with each of the above variables. The correlation analysis also showed that view and floors were significantly correlated with price.

Table 1.11: Correlation coefficient values for major predicting variables

Correlation values | price | grade | bathrooms | sqft_living | lat | Yr_built
price | 1 | 0.667 | 0.525 | 0.702 | 0.307 | 0.054

The correlation analysis was performed in RapidMiner; the process window is shown in figure 2.

The correlation matrix from RapidMiner is provided in figure 3.
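For readers without RapidMiner, the coefficients in Table 1.11 correspond to the standard Pearson formula. The sketch below is illustrative only: the function names and the toy data are hypothetical, and the t statistic shown is the usual test of zero correlation with n - 2 degrees of freedom, which is an assumption about how significance was judged.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_t(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical toy values, not taken from the data set.
sqft_living = [1000, 1500, 2000, 2500, 3000]
price = [200000, 290000, 410000, 500000, 600000]
r = pearson_r(sqft_living, price)
print(round(r, 3), round(correlation_t(r, len(price)), 2))
```

Because the t statistic grows with the sample size, a small coefficient such as the 0.054 reported for Yr_built can still be statistically significant in a large sample while explaining very little of the variation in price. The sample size is not stated in this section, so any n passed to `correlation_t` is an assumption.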

Price was significantly correlated with the floors and view of the housing project, but the scatter plots from the exploratory data analysis (figures 4 and 5) showed that these variables were not strong enough to explain and predict housing prices.

A chi-square goodness-of-fit test was also used to check the validity of the data provided for price prediction. Price was taken as the dependent (label) variable in the chi-square weighting process. The chi-square weights were 638.35 for bathrooms, 230.78 for total square feet of living space, 329.88 for grade, 7652.40 for latitude and 2095.85 for year built. These variables had also been found to be significantly correlated with price in Pearson's correlation test and in the scatter diagrams from the EDA results, so the confirmatory tests supported the selection of the five major factors. Square feet above and square feet available were also significantly correlated with price, but they were omitted from the study because square feet of living space was already a major decision variable.

Chung Chun Lin approached the prediction of real estate prices with multiple regression and a non-parametric model in 2013. Shiau Hui Kok, a professor of economics in the Department of Economics, Universiti Putra Malaysia, also analysed the factors predicting real estate prices in Malaysia from 2002 to 2015, using macroeconomic theories to predict the price of real estate. In this study, external effects such as a country's GDP per capita and the purchasing power of residents by year and place were not considered when predicting the price of housing projects. The price of a housing project is always location dependent, and in this study the latitude and longitude of housing properties in the United States were found to be correlated with price.

In 2010, Mohammad Abdul Mohit and Mansor Ibrahim found in Malaysia that low-cost housing with a proper social environment and nearby facilities was a very popular and lucrative option for buyers. They used correlation and multiple regression models to show that the social environment had a low level of association with nearby facilities and that customers were more satisfied with accommodation facilities than with environmental facilities. The study variables here were chosen on the basis of the given data, and the choice is consistent with this earlier work.
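The chi-square weights quoted above come from RapidMiner. The sketch below shows one way such a weight could be computed: discretize a numeric attribute and the label into bins, cross-tabulate them, and sum (observed - expected)^2 / expected over the cells. The equal-width binning, the number of bins and all names here are assumptions for illustration and are not necessarily what RapidMiner does, so the numbers it produces will not match those in the report.

```python
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def chi_squared(xs, ys):
    """Chi-squared statistic of the contingency table of two categorical lists."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


# Hypothetical toy values, not taken from the data set.
grade = [5, 6, 7, 7, 8, 9, 10, 11]
price = [150000, 210000, 320000, 350000, 480000, 610000, 900000, 1200000]
weight = chi_squared(equal_width_bins(grade, 2), equal_width_bins(price, 2))
print(weight)
```

A larger statistic indicates a stronger departure from independence between the binned attribute and the binned label, which is why it can serve as a relevance weight.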

Linear Regression Model

(i) The linear regression model was set up with five decision variables to observe their effect on the price of housing property. The scatter diagrams (figures 9-12) of price versus the decision variables were studied before executing the regression model. A significant degree of association was observed between price and the five decision variables. The regression process in RapidMiner was constructed as shown in figure 8.
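The model itself was fitted in RapidMiner. As a rough illustration of what a multiple linear regression fit involves, the sketch below estimates the coefficients by ordinary least squares, solving the normal equations with Gauss-Jordan elimination. The function name `fit_ols` and the toy data are hypothetical, and this is not a description of RapidMiner's internal algorithm.

```python
def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + ... + bk*xk by ordinary least squares.

    Returns [b0, b1, ..., bk], obtained by solving (X'X) b = X'y.
    """
    # Design matrix with a leading 1.0 in each row for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical toy data generated from y = 10 + 2*x1 + 3*x2.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
targets = [18, 17, 28, 27, 38]
coefficients = fit_ols(rows, targets)
print([round(c, 6) for c in coefficients])
```

The normal equations are easy to follow but can be numerically fragile when predictors have very different scales, as latitude and year built do here, so a library routine based on a QR or SVD decomposition would be a safer choice for the real data.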

The regression coefficients and significance values are stated in Table 3 (from the figure in the appendix). The p-values (significance values) were all reported as zero, that is, below the displayed precision. Hence the significance of the five decision variables was established again. The grade of the materials (122655.28) and the location (latitude) (527912.45) of the housing projects were found to have the largest positive coefficients, whereas the year built, or age of the project, had a large negative coefficient (-3300.84).

A house occupies a physical location, and its surroundings and neighbourhood facilities affect its value. Houses close to business or market areas tend to cost more than comparable houses elsewhere. The larger the floor area of a house, the more expensive it tends to be. The number of bedrooms also strongly influences a home's value, so a house with several bedrooms is likely to be valued more highly than one with a single bedroom. A house in a rural or less developed area will generally cost less than one in a well-developed or urban area. In addition, good access to highways, schools, shopping centres and local business opportunities adds to a house's value. Hence the results of the study were in line with previous observations and theories. Mohammad Abdul Mohit and Mansor Ibrahim observed this relationship earlier, in Malaysia in 2010, using multiple regression analysis.

Table 3: Results of Final Linear Regression model for task 1.2

Attribute | Coefficient | Std.error | t-statistic | p-value
grade | 122655.284 | 2124.72 | 57.72 | 0
bathrooms | 37074.53 | 3281.31 | 11.30 | 0
sqft_living | 166.82 | 3.03 | 55.114 | 0
lat | 527912.45 | 11159.32 | 47.31 | 0
Yr_built | -3300.837 | 63.31 | -52.14 | 0
intercept | -19426027.44 | 568218.49 | -34.19 | 0

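The intercept was large and negative, with a value of -19426027.44. It is the predicted price when all five predictors equal zero, a combination far outside the observed data (for example, a latitude of 0 and a construction year of 0), so it should not be read as a realistic price; it only anchors the fitted regression equation. The intercept was highly significant, with a reported p-value of zero.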
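To make the fitted equation concrete, the sketch below plugs the coefficients and intercept from Table 3 into the linear formula price = intercept + sum of (coefficient x value). The coefficient values are copied from the table; the function name `predict_price` and the example house are hypothetical and are not taken from the data set.

```python
# Coefficients and intercept copied from Table 3.
COEFFICIENTS = {
    "grade": 122655.284,
    "bathrooms": 37074.53,
    "sqft_living": 166.82,
    "lat": 527912.45,
    "yr_built": -3300.837,
}
INTERCEPT = -19426027.44


def predict_price(house):
    """Predicted price for a dict holding the five predictor values."""
    return INTERCEPT + sum(
        coefficient * house[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical house, chosen near the sample means reported in Table 1.1.
example_house = {
    "grade": 7,
    "bathrooms": 2.0,
    "sqft_living": 2000,
    "lat": 47.56,
    "yr_built": 1975,
}
predicted = predict_price(example_house)
print(f"${predicted:,.2f}")
```

Setting every predictor to zero returns the intercept itself, which shows why the large negative intercept is an extrapolation rather than a meaningful price.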

The t-statistics for all five decision variables in the regression analysis led to rejection of the null hypothesis that price is not associated with them. The t-statistics for all the variables (Table 3) fell in the rejection region at the 5% level of significance, so the alternative hypothesis of a significant association between the decision variables and price was accepted.
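Each t-statistic in Table 3 is the coefficient divided by its standard error, so the table can be checked with a few lines of code. In the sketch below the coefficients and standard errors are copied from Table 3; the critical value of 1.96 is the two-sided 5% cutoff under a large-sample normal approximation, which is an assumption because the degrees of freedom are not stated in this section.

```python
# (coefficient, standard error) pairs copied from Table 3.
TABLE_3 = {
    "grade": (122655.284, 2124.72),
    "bathrooms": (37074.53, 3281.31),
    "sqft_living": (166.82, 3.03),
    "lat": (527912.45, 11159.32),
    "yr_built": (-3300.837, 63.31),
    "intercept": (-19426027.44, 568218.49),
}

# Two-sided 5% critical value, assuming a large-sample normal approximation.
CRITICAL_T = 1.96


def t_statistic(coefficient, std_error):
    """t statistic for H0: coefficient = 0."""
    return coefficient / std_error


t_values = {name: t_statistic(c, se) for name, (c, se) in TABLE_3.items()}
for name, t in t_values.items():
    verdict = "reject H0" if abs(t) > CRITICAL_T else "fail to reject H0"
    print(f"{name}: t = {t:.2f} ({verdict})")
```

Small differences from the tabulated t-statistics are expected because the standard errors in the table are rounded.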

Price with number of bathrooms (RapidMiner result)

2.1 Tableau Desktop View of House Prices 2014 to 2015

Quarterly price variation of housing

Tableau Desktop version 10.5 was used to create the graph view of the data. The data file was connected to the Tableau application, and in a new sheet price (a measure) was placed on the vertical axis with year (a dimension) on the horizontal axis. Year was expanded to quarters to show the trend in more detail. The price of the housing projects varied from quarter to quarter in the given time frame. A sharp decline in prices was noticed in the third and fourth quarters of 2014. Prices climbed again in the first quarter of 2015 and took a sharp dip in the second quarter of 2015.
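The quarterly view built in Tableau amounts to grouping sales by year and quarter and aggregating price within each group. The sketch below does this with the Python standard library; the function names and the toy sales records are hypothetical, and the use of the mean as the aggregate is an assumption, since the report does not state which aggregation Tableau applied.

```python
from collections import defaultdict
from datetime import date


def quarter_label(day):
    """Return a label such as '2014 Q2' for a date."""
    return f"{day.year} Q{(day.month - 1) // 3 + 1}"


def mean_price_by_quarter(sales):
    """Average price per quarter from (sale_date, price) pairs."""
    buckets = defaultdict(list)
    for sold_on, price in sales:
        buckets[quarter_label(sold_on)].append(price)
    # Labels sort chronologically because the year comes first.
    return {q: sum(p) / len(p) for q, p in sorted(buckets.items())}


# Hypothetical toy sales, not taken from the data set.
sales = [
    (date(2014, 5, 2), 300000),
    (date(2014, 6, 1), 500000),
    (date(2014, 7, 15), 450000),
    (date(2015, 1, 10), 520000),
]
quarterly = mean_price_by_quarter(sales)
for quarter, average in quarterly.items():
    print(quarter, average)
```

If the chart used the sum of price instead, quarter-to-quarter changes would partly reflect how many houses were sold in each quarter rather than how prices moved.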

Results of Exploratory Data Analysis

Tableau was used to construct graphical representations to analyse the degree of association of the five decision variables with price. Line diagrams for these are provided in figures 14-17. The labels on the graphs were added by dragging the measure variables onto the Label button of the Marks section in the worksheet.

Price versus square feet for living (Tableau Desktop version)

In the line diagram of price versus square feet of living space (Sqft_living), the labels were created from the total Sqft_living sold in each period from the second quarter of 2014 to the second quarter of 2015.

In the line diagram of price versus bathrooms in figure 15, the labels were based on the total number of bathrooms in the houses sold throughout the given time frame (Vahlne & Johanson, 2017).

Price with number of bathrooms (Tableau Desktop version)

The grades assigned to the housing projects by the government or reputable local agencies were plotted against price, using the average grade. Average grades were lower in 2015 than in 2014.

The plot of latitude versus price revealed that the housing properties sold in 2015 were on average located slightly further south, whereas in 2014 comparatively more properties in the north were sold.

Price with latitude (Tableau Desktop version)

2.2 Geo map

Graph views for all the variables were created in Tableau Desktop by selecting latitude and longitude from the measures. These fields were generated by the software itself based on the data (Agnello & Schuknecht, 2011). After the zip codes were selected and assigned the zip code geographic role, the generated longitude and latitude were placed on columns and rows, which produced a map of the world with seventy null values. The location was then changed to USA based on the zip codes, and the required geo map was obtained (Kok, Ismail & Lee, 2018). Figure 18 is the geo map of the sum of square feet of living space in different locations in the United States.

A second geo map was constructed for the number of bathrooms in houses sold in different locations in the USA. The maximum number of bathrooms was used for the labels in the graph (figure 19) (Dooley & Hutchison, 2009).

The construction grade was also plotted on a geo map for different zip codes in the USA. The sum of the grades was used to select the colors for the map (Chun Lin & Mohan, 2011), and the average grade was used for the labels. The average grade was much higher in the northern part than in the southern part of the mapped area (figure 20).

A geo map for the year built was also created in Tableau Desktop across different locations in the USA from the given data set (figure 21). The labels were taken from the year built field in the dimensions section of the worksheet.

References

Agnello, L. and Schuknecht, L., 2011. Booms and busts in housing markets: determinants and implications. Journal of Housing Economics, 20(3), pp.171-190.

Andrews, D., Sanchez, A.C. and Johansson, Å., 2011. Housing markets and structural policies in OECD countries. OECD Economic Department Working Papers, (836), p.0_1.

Chun Lin, C. and Mohan, S.B., 2011. Effectiveness comparison of the residential property mass appraisal methodologies in the USA. International Journal of Housing Markets and Analysis, 4(3), pp.224-243.

Dooley, M. and Hutchison, M., 2009. Transmission of the US subprime crisis to emerging markets: Evidence on the decoupling–recoupling hypothesis. Journal of International Money and Finance, 28(8), pp.1331-1349.

Iacoviello, M. and Neri, S., 2010. Housing market spillovers: evidence from an estimated DSGE model. American Economic Journal: Macroeconomics, 2(2), pp.125-64.

Kok, S.H., Ismail, N.W. and Lee, C., 2018. The sources of house price changes in Malaysia. International Journal of Housing Markets and Analysis.

Mohit, M.A., Ibrahim, M. and Rashid, Y.R., 2010. Assessment of residential satisfaction in newly designed public low-cost housing in Kuala Lumpur, Malaysia. Habitat International, 34(1), pp.18-27.

Vahlne, J.E. and Johanson, J., 2017. The internationalization process of the firm—a model of knowledge development and increasing foreign market commitments. In International Business (pp. 145-154). Routledge.