Data Mining Project: Boston Housing Dataset Analysis
- December 21, 2023
Task 1: Understand the dataset
The main objective of this project is to apply data mining techniques to the Boston housing dataset in order to resolve a business problem. The provided dataset is analysed with the Weka data mining tool to arrive at suitable business solutions. To support the analysis, current methodologies and algorithms for business analytics are reviewed. These are discussed and analysed in detail.
To analyse the provided dataset, the user first needs to understand it. The Boston housing dataset is described below (Ahmadi & E Shiri Ahmad Abadi, 2013).
The dataset has the following attributes:
- Id: Identifies each data instance
- MS Sub Class: Identifies the dwelling type
- MS Zoning: Identifies the zoning classification of the sale
- Lot Frontage: Linear feet of street connected to property
- Lot Area: Lot size in square feet
- Street: Type of road access to property
- Alley: Type of alley access to property
- Lot Shape: General shape of property
- Land Contour: Flatness of the property
- Utilities: Type of utilities available
- Lot Config: Lot configuration
- BsmtHalfBath
- FullBath
- HalfBath
- Bedroom
- Kitchen
- Kitchen Qual
- Land Slope: Slope of property
- Neighbourhood: Physical locations within Ames city limits
- Condition 1: Proximity to various conditions
- Year Built: Original construction date
- Year Remod Add: Remodel date
- Tot Rms Abv Grd
- Condition 2: Proximity to various conditions
- Bldg Type: Type of dwelling
- House Style: Dwelling Style
- Sale Type: Type of sale
- Sale Condition: Condition of sale
- Overall Qual: Rates the overall material and finish of the house
- Overall Cond: Rates the overall condition of the house
- Sale Price: Sale amount (the dataset also contains further attributes not listed here)
Summary statistics for the provided dataset are shown below.
For the Id attribute:
For the Sale Condition attribute (Arabnia, Stahlbock, Abou-Nasr & Weiss, n.d.):
A visualization of the provided dataset is shown below.
Task 2: Relationships discovery among features
In this task, the user needs to discover the relationships that exist among the attributes. Here, we apply normalization to the Boston Housing data before examining those relationships. Normalization rescales numeric attributes to a common range so that they can be compared on the same footing; removing duplicate records is a separate cleaning step (Azzalini & Scarpa, 2012).
In this task, the user is required to list the potential business analyses for the provided dataset. Here, we use classification and prediction algorithms to resolve the business problem and to provide effective solutions to it. Effective results provide the following benefits for a real estate consulting firm:
- Business benefits
- Improve the business process
- Support decision making
- Support strategy development.
ZeroR is the simplest classification method: it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority class (Witten, Frank & Hall, 2011). Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for other classification methods.
Algorithm: Construct a frequency table for the target and select its most frequent value. For a numeric target, ZeroR predicts the mean instead.
Predictor contribution: There is nothing to be said about the contribution of the predictors to the model, because ZeroR does not use any of them.
Model evaluation: ZeroR only predicts the majority class correctly. As mentioned above, it is useful only for establishing a baseline performance for other classification methods.
The ZeroR results are shown below (Han, Kamber & Pei, 2012).
=== Classifier model (full training set) ===
ZeroR predicts class value: 180921.19589041095
Time taken to build model: 0 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient -0.0508
Mean absolute error 57444.7035
Root mean squared error 79439.3263
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 1460
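The regression behaviour of ZeroR reported above can be illustrated with a minimal Python sketch. It is not Weka's implementation: the function names are hypothetical and the prices are made-up values, not rows from the dataset. It assumes that ZeroR for a numeric target predicts the training mean for every instance, and that the relative absolute error compares a model's absolute error with that of this mean predictor, which is why ZeroR itself scores 100 %.

```python
import math

def zero_r_fit(targets):
    """ZeroR for a numeric target: the whole 'model' is the training mean."""
    return sum(targets) / len(targets)

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def relative_absolute_error(actual, predicted, baseline):
    """Absolute error of the model as a percentage of the baseline's absolute error."""
    model_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    baseline_error = sum(abs(a - baseline) for a in actual)
    return 100.0 * model_error / baseline_error

# Hypothetical sale prices, for illustration only.
train_prices = [100000.0, 200000.0, 300000.0]
test_prices = [150000.0, 250000.0]

prediction = zero_r_fit(train_prices)            # same value for every instance
predictions = [prediction] * len(test_prices)

baseline_mae = mean_absolute_error(test_prices, predictions)
baseline_rmse = root_mean_squared_error(test_prices, predictions)
baseline_rae = relative_absolute_error(test_prices, predictions, prediction)

print(prediction, baseline_mae, baseline_rmse, baseline_rae)
```

Any real model can be passed through the same three functions; it is only worth keeping if its MAE and RMSE are lower than the baseline values and its relative absolute error falls below 100 %.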
For the numeric Sale Price target, ZeroR predicts the mean value, 180921.19589041095, for every instance. Any other model must achieve an RMSE lower than ZeroR's 79439.3263 to be considered useful. For the nominal Sale Condition target, ZeroR predicts Normal for all instances, as it is the majority class, and achieves an accuracy of about 82 % (Kaluža, 2013).
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 1198 82.0548 %
Incorrectly Classified Instances 262 17.9452 %
Kappa statistic 0
Mean absolute error 0.1056
Root mean squared error 0.2289
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 1460
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 1.000 0.821 1.000 0.901 ? 0.496 0.819 Normal
0.000 0.000 ? 0.000 ? ? 0.495 0.069 Abnorml
0.000 0.000 ? 0.000 ? ? 0.489 0.084 Partial
0.000 0.000 ? 0.000 ? ? 0.199 0.003 AdjLand
0.000 0.000 ? 0.000 ? ? 0.433 0.007 Alloca
0.000 0.000 ? 0.000 ? ? 0.500 0.014 Family
Weighted Avg. 0.821 0.821 ? 0.821 ? ? 0.494 0.685
=== Confusion Matrix ===
a b c d e f <-- classified as
1198 0 0 0 0 0 | a = Normal
101 0 0 0 0 0 | b = Abnorml
125 0 0 0 0 0 | c = Partial
4 0 0 0 0 0 | d = AdjLand
12 0 0 0 0 0 | e = Alloca
20 0 0 0 0 0 | f = Family
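The accuracy and Kappa statistic above follow directly from the class counts in the confusion matrix. The sketch below is an illustration in Python, not Weka's code, and its function names are hypothetical. It rebuilds the label list from those counts, applies the ZeroR rule (always predict the most frequent class), and computes accuracy and Cohen's kappa. It evaluates on the full label list rather than with stratified cross-validation, on the assumption that Normal remains the majority class in every fold.

```python
from collections import Counter

def zero_r_classify(labels):
    """ZeroR for a nominal target: return the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def cohen_kappa(actual, predicted):
    """Agreement beyond chance between actual and predicted labels."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / n ** 2
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)

# Class counts taken from the confusion matrix rows above.
class_counts = {
    "Normal": 1198, "Abnorml": 101, "Partial": 125,
    "AdjLand": 4, "Alloca": 12, "Family": 20,
}
labels = [name for name, count in class_counts.items() for _ in range(count)]

majority = zero_r_classify(labels)
predictions = [majority] * len(labels)

acc = accuracy(labels, predictions)
kappa = cohen_kappa(labels, predictions)

print(majority, round(acc * 100, 4), kappa)
```

Because every prediction is Normal, the observed agreement equals the agreement expected by chance, so kappa is zero even though the accuracy looks high. This is why accuracy alone overstates the quality of a majority-class baseline on imbalanced data.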
From the tables and figures above, ZeroR classifies 1198 of the 1460 instances correctly (82.05 %) and misclassifies the remaining 262 (17.95 %) (Maimon & Rokach, 2010). Its relative absolute error and root relative squared error are both 100 %, which places it at the bottom of the comparison. The other algorithms yield an average accuracy of around 85 %, and the highest accuracy belongs to the MultiScheme classifier.
The time required to build the model is also an important parameter when comparing classification algorithms. To judge the reliability and validity of the results, this analysis uses two commonly used indicators, the mean absolute error and the root mean squared error, together with the relative errors. ZeroR has the highest error of the classifiers compared, with a mean absolute error of 0.1056 and a root mean squared error of 0.2289; its weighted average TP rate of 0.821 only reflects the share of the Normal class. An algorithm with a lower error rate is preferred because it has stronger classification capability. We therefore conclude that ZeroR is not suitable for this data: it makes the most errors and cannot distinguish between the classes, so it serves only as a baseline (Olson, 2017).
References
Ahmadi, F., & E Shiri Ahmad Abadi, M. (2013). Data Mining in Teacher Evaluation System using WEKA. International Journal Of Computer Applications, 63(10), 12-18. doi: 10.5120/10501-5268
Arabnia, H., Stahlbock, R., Abou-Nasr, M., & Weiss, G. (n.d.). DMIN 2017.
Azzalini, A., & Scarpa, B. (2012). Data Analysis and Data Mining. Oxford: Oxford University Press, USA.
Han, J., Kamber, M., & Pei, J. (2012). Data mining. Waltham, MA: Morgan Kaufmann/Elsevier.
Kaluža, B. (2013). Instant Weka how-to. Birmingham: Packt Pub.
Maimon, O., & Rokach, L. (2010). Data mining and knowledge discovery handbook. New York: Springer.
Olson, D. (2017). Descriptive Data Mining. Singapore: Springer Singapore.
Witten, I., Frank, E., & Hall, M. (2011). Data mining. Burlington, Mass.: Morgan Kaufmann Publishers.