Data Mining Project: Boston Housing Dataset Analysis

Task 1: Understand the dataset

The main objective of this project is to apply data mining techniques to the Boston housing dataset in order to resolve a business problem. The provided dataset is analysed with the Weka data mining tool to arrive at suitable business solutions. The analysis reviews current methodologies and algorithms for business analytics, which are discussed and analysed in detail.


To analyse the provided dataset, the user first needs to understand it. The Boston housing dataset is described below (Ahmadi & E Shiri Ahmad Abadi, 2013).

The provided dataset has the following attributes:

  • Id: Identifier for each data instance
  • MS Sub Class: Type of dwelling
  • MS Zoning: Zoning classification of the sale
  • Lot Frontage: Linear feet of street connected to property
  • Lot Area: Lot size in square feet
  • Street: Type of road access to property
  • Alley: Type of alley access to property
  • Lot Shape: General shape of property
  • Land Contour: Flatness of the property
  • Utilities: Type of utilities available
  • Lot Config: Lot configuration
  • BsmtHalfBath
  • FullBath
  • HalfBath
  • Bedroom
  • Kitchen
  • Kitchen Qual
  • Land Slope: Slope of property
  • Neighbourhood: Physical locations within Ames city limits
  • Condition 1: Proximity to various conditions
  • Year Built: Original construction date
  • Year Remod Add: Remodel date
  • Tot Rms Abv Grd
  • Condition 2: Proximity to various conditions
  • Bldg Type: Type of dwelling
  • House Style: Style of dwelling
  • Sale Type: Type of sale
  • Sale Condition: Condition of sale
  • Overall Qual: Rates the overall material and finish of the house
  • Overall Cond: Rates the overall condition of the house
  • Sale Price: Sale amount (among further attributes not listed here)

Summary statistics for the provided dataset are shown below.

For the Id attribute,


For the Sale Condition attribute (Arabnia, Stahlbock, Abou-Nasr & Weiss, n.d.),

A visualization of the provided dataset is shown below.

Task 2: Relationships discovery among features

In this task, the user needs to discover the relationships that exist among the attributes. Here, we apply normalization to the Boston Housing data before examining those relationships. Normalization rescales the numeric attributes to a common range so that attributes measured in different units can be compared (Azzalini & Scarpa, 2012).

In this task, the user is required to list the potential business analyses for the provided dataset. Here, we use classification and prediction algorithms to resolve the business problem and to provide effective solutions to it. Effective results provide the following benefits for a real estate consulting firm:

  1. Business benefits
  2. Improve the business process
  3. Support decision making
  4. Support strategy development.

ZeroR is the simplest classification method: it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority class, or the mean when the target is numeric (Witten, Frank & Hall, 2011). Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for other classification methods. The algorithm constructs a frequency table for the target and selects its most frequent value. Nothing can be said about the predictors' contribution to the model, because ZeroR does not use any of them. In evaluation, ZeroR predicts only the majority class correctly. As mentioned above, it is useful only for establishing a baseline performance for other classification methods. The ZeroR model is shown below (Han, Kamber & Pei, 2012).

=== Classifier model (full training set) ===

ZeroR predicts class value: 180921.19589041095

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient                 -0.0508
Mean absolute error                  57444.7035
Root mean squared error              79439.3263
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances             1460

For the numeric class (the house sale price), ZeroR predicts the mean value, 180921.19589041095, for every instance, which gives a baseline root mean squared error (RMSE) of 79439.3263. Any useful prediction model must achieve an RMSE lower than this baseline. For the Sale Condition class, ZeroR predicts Normal for all instances, as it is the majority class, and achieves an accuracy of about 82% (Kaluža, 2013).
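The baseline comparison described above can be sketched as follows, assuming a numeric target: ZeroR predicts the training mean for every instance, and a candidate model is judged by whether its RMSE falls below the ZeroR RMSE. The price values and the competing model's predictions are hypothetical and are not taken from the dataset; the function names are illustrative and are not Weka's API.

```python
import math


def zero_r_regression(train_targets):
    """Return the ZeroR prediction for a numeric target: the training mean."""
    return sum(train_targets) / len(train_targets)


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical sale prices, for illustration only (not from the dataset).
prices = [208500.0, 181500.0, 223500.0, 140000.0, 250000.0]

# ZeroR ignores every predictor and repeats the mean for each instance.
baseline = zero_r_regression(prices)
baseline_rmse = rmse(prices, [baseline] * len(prices))

# Predictions from a hypothetical competing model.
model_predictions = [200000.0, 185000.0, 220000.0, 150000.0, 240000.0]
model_rmse = rmse(prices, model_predictions)

# Root relative squared error: the model's RMSE as a percentage of the
# mean-only baseline's RMSE. 100% means "no better than ZeroR".
rrse_percent = 100.0 * model_rmse / baseline_rmse

print(f"ZeroR prediction: {baseline:.2f}")
print(f"ZeroR RMSE:       {baseline_rmse:.2f}")
print(f"Model RMSE:       {model_rmse:.2f}")
print(f"Model RRSE:       {rrse_percent:.1f} %")
```

A root relative squared error below 100% indicates that the candidate model improves on the mean-only baseline; ZeroR measured against itself scores 100%, which matches the relative errors in the output above.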

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1198               82.0548 %
Incorrectly Classified Instances       262               17.9452 %
Kappa statistic                          0
Mean absolute error                      0.1056
Root mean squared error                  0.2289
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances             1460

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    1.000    0.821      1.000    0.901      ?        0.496     0.819     Normal
                 0.000    0.000    ?          0.000    ?          ?        0.495     0.069     Abnorml
                 0.000    0.000    ?          0.000    ?          ?        0.489     0.084     Partial
                 0.000    0.000    ?          0.000    ?          ?        0.199     0.003     AdjLand
                 0.000    0.000    ?          0.000    ?          ?        0.433     0.007     Alloca
                 0.000    0.000    ?          0.000    ?          ?        0.500     0.014     Family
Weighted Avg.    0.821    0.821    ?          0.821    ?          ?        0.494     0.685

=== Confusion Matrix ===

    a    b    c    d    e    f   <-- classified as
 1198    0    0    0    0    0 |    a = Normal
  101    0    0    0    0    0 |    b = Abnorml
  125    0    0    0    0    0 |    c = Partial
    4    0    0    0    0    0 |    d = AdjLand
   12    0    0    0    0    0 |    e = Alloca
   20    0    0    0    0    0 |    f = Family
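The accuracy of about 82% and the Kappa statistic of 0 can be reproduced from the class counts in the confusion matrix alone. The sketch below assumes only those counts; the variable names are illustrative and this is not Weka's implementation.

```python
from collections import Counter

# Actual class counts, read from the rows of the confusion matrix above.
class_counts = Counter({
    "Normal": 1198,
    "Abnorml": 101,
    "Partial": 125,
    "AdjLand": 4,
    "Alloca": 12,
    "Family": 20,
})

total = sum(class_counts.values())

# ZeroR: build the frequency table and pick the most frequent class.
majority_class, majority_count = class_counts.most_common(1)[0]

# Every instance is predicted as the majority class, so only the
# majority-class instances are classified correctly.
accuracy = majority_count / total

# Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement
# expected by chance. The predicted distribution puts all mass on the
# majority class, so p_e equals the majority-class share.
expected_agreement = sum(
    (count / total) * (1.0 if label == majority_class else 0.0)
    for label, count in class_counts.items()
)
kappa = (accuracy - expected_agreement) / (1.0 - expected_agreement)

print(f"Majority class: {majority_class}")
print(f"Accuracy:       {100 * accuracy:.4f} %")
print(f"Kappa:          {kappa:.4f}")
```

A Kappa of 0 shows that the 82% accuracy is no better than chance agreement given the class distribution, which is why ZeroR serves only as a baseline.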

From the tables above, ZeroR correctly classifies 1198 of the 1460 instances (82.05%) and misclassifies the remaining 262 (17.95%). Its TP rate is 100% for the Normal class and 0% for each of the other five classes, because every instance is assigned to Normal. The other algorithms yield an average accuracy of around 85%, and the highest accuracy belongs to the MultiScheme classifier, while ZeroR sits at the bottom of the comparison with relative absolute and root relative squared errors of 100% (Maimon & Rokach, 2010). The total time required to build the model is also a basic parameter when comparing classification algorithms. The reliability of the results is commonly assessed with the mean absolute error and the root mean squared error, and the relative errors are used as well. ZeroR shows the largest error: its weighted-average FP rate of 0.821 is as high as its weighted-average TP rate. An algorithm with a lower error rate is preferred because it has more powerful classification capability. After this investigation, we can say that ZeroR is not appropriate for this data, since it has the largest number of errors among the algorithms compared and cannot classify the minority classes, although it remains useful as a baseline (Olson, 2017).

References

Ahmadi, F., & E Shiri Ahmad Abadi, M. (2013). Data Mining in Teacher Evaluation System using WEKA. International Journal Of Computer Applications, 63(10), 12-18. doi: 10.5120/10501-5268

Arabnia, H., Stahlbock, R., Abou-Nasr, M., & Weiss, G. (n.d.). DMIN 2017.

Azzalini, A., & Scarpa, B. (2012). Data Analysis and Data Mining. Oxford: Oxford University Press, USA.

Han, J., Kamber, M., & Pei, J. (2012). Data mining. Waltham, MA: Morgan Kaufmann/Elsevier.

Kaluža, B. (2013). Instant Weka how-to. Birmingham: Packt Pub.

Maimon, O., & Rokach, L. (2010). Data mining and knowledge discovery handbook. New York: Springer.

Olson, D. (2017). Descriptive Data Mining. Singapore: Springer Singapore.

Witten, I., Frank, E., & Hall, M. (2011). Data mining. Burlington, Mass.: Morgan Kaufmann Publishers.