Exploration Of A Retail Dataset For Machine Learning Algorithm Implementation

Program ID3 Algorithm for Supervised Learning Problem

With the emergence of big data, it has become important for industries to mine the available data about their business processes and customers in order to improve performance. In this way, organizations can gain a competitive advantage in their respective markets.

Nowadays the retail sector has become one of the most competitive industries. In order to survive in the market, organizations mostly rely on undirected mass marketing.

Every potential customer receives similar catalogues, advertising mails, pamphlets and announcements. As a result, most customers are annoyed by the huge number of offers, while the response rate for those campaigns drops for every organization.

This report covers the exploration of a dataset for a retail organization, the modelling of the selected dataset and the implementation of different machine learning algorithms. In addition, different sections of the report interpret the insights gained from exploring the selected dataset.

The selected BigMart dataset was collected from the link below:

The dataset contains 8523 rows and twelve columns, which represent different attributes of the products. The columns are: ‘Item_Identifier’, ‘Item_Weight’, ‘Item_Fat_Content’, ‘Item_Visibility’, ‘Item_Type’, ‘Item_MRP’, ‘Outlet_Identifier’, ‘Outlet_Establishment_Year’, ‘Outlet_Size’, ‘Outlet_Location_Type’, ‘Outlet_Type’ and ‘Item_Outlet_Sales’.

The statistical information for the selected dataset is as follows:

The above table provides a statistical description of the numerical columns in the selected dataset. Item_Weight has 7060 recorded values, while Item_Visibility, Item_MRP and Item_Outlet_Sales each have 8523. The table shows that the minimum, mean and maximum values of Item_MRP are 31.290, 140.9927 and 266.8884.

For Item_Weight, the minimum, mean and maximum values are 4.5550, 12.857645 and 21.3500.

For the attribute Item_Visibility, the minimum value is zero. This does not make sense: a product that is sold from a store must be visible to customers, so its visibility cannot be 0.
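One way to handle such values is to treat an exact zero as "not recorded" rather than as a real measurement. The sketch below illustrates the idea on a few hypothetical values; the sample numbers and the choice of None as the missing marker are assumptions, not the procedure used in the report.

```python
# Hypothetical Item_Visibility values; not taken from the real dataset.
visibility = [0.016, 0.0, 0.019, 0.0, 0.054]

# Assumption: an exact zero means the visibility was not recorded,
# so it is flagged as missing (None) instead of being kept as a measurement.
cleaned = [v if v > 0 else None for v in visibility]
zero_count = sum(v is None for v in cleaned)

print(cleaned)
print("zero-visibility rows:", zero_count)
```

Once flagged, these entries can be handled together with the other missing values in the dataset.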

On the other hand, Outlet_Establishment_Year ranges from 1985 to 2009. The age of a store can therefore be used to examine its impact on the sales of products from that store.
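A store's age can be derived from its establishment year with a single subtraction. The sketch below assumes a reference year of 2013; that value, the constant name REFERENCE_YEAR and the sample years are illustrative assumptions rather than details given in the report.

```python
# Assumption: the year against which store age is measured.
REFERENCE_YEAR = 2013

# Hypothetical establishment years within the 1985-2009 range.
establishment_years = [1985, 1987, 1999, 2004, 2009]

# Derived feature: how many years each outlet has been operating.
outlet_age = [REFERENCE_YEAR - year for year in establishment_years]

print(outlet_age)
```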

The lower count for Item_Weight (7060) compared with Item_Outlet_Sales (8523) indicates that 1463 Item_Weight values are missing from the selected dataset.
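The comparison of counts can be reproduced with a small summary helper. This is a minimal sketch using only the Python standard library and hypothetical rows; the helper name describe and the use of None for a missing weight are assumptions made for illustration.

```python
from statistics import mean


def describe(values):
    """Return count, min, mean and max over the non-missing entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "min": min(present),
        "mean": mean(present),
        "max": max(present),
    }


# Hypothetical rows; None marks a missing Item_Weight.
item_weight = [9.3, None, 17.5, 19.2, None, 8.93]
item_outlet_sales = [3735.14, 443.42, 2097.27, 732.38, 994.71, 556.61]

weight_stats = describe(item_weight)
sales_stats = describe(item_outlet_sales)

# A lower count in one column than in another reveals missing values.
missing = sales_stats["count"] - weight_stats["count"]
print(weight_stats, "missing weights:", missing)
```

The same subtraction applied to the counts reported above (8523 and 7060) gives the 1463 missing weights.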

For this project the linear regression model is used. Simple linear regression is a method for summarizing and describing the relationship between two continuous variables. One, denoted x, is regarded as the independent or explanatory variable. The other, denoted y, is regarded as the dependent variable.

Choose Another Algorithm and Get Approval

There are two main types of linear regression model: simple linear regression and multiple linear regression. Simple linear regression uses one independent variable. Multiple linear regression, on the other hand, uses more than one independent variable for prediction. When a straight line does not fit the data well, a curve can be fitted instead; this is known as polynomial or curvilinear regression.
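The three model forms can be written out as follows. The symbols are conventional notation introduced here for illustration (coefficients \beta, error term \varepsilon, p independent variables, polynomial degree d) and do not appear in the report itself.

```latex
% Simple linear regression: one independent variable
y = \beta_0 + \beta_1 x + \varepsilon

% Multiple linear regression: p independent variables
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon

% Polynomial (curvilinear) regression of degree d in one variable
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \varepsilon
```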

In implementing the linear regression model, the main objective is to fit a straight line through the distribution of the selected data. The best-fit line is the one that lies closest to all the points, which minimizes the error between the observed data points and the values predicted by the fitted line.

The following are some of the properties of the linear regression line:

  • The regression line passes through the mean of the independent variable and the mean of the dependent variable.
  • The method is also known as Ordinary Least Squares (OLS), because the line minimizes the sum of the squared residuals.
  • The regression coefficient describes the change in Y for a change in X. In other words, it captures the trend of Y as the value of ‘X’ increases or decreases.
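These properties can be checked on a small example. The following is a minimal sketch of a simple OLS fit in plain Python; the function name fit_line and the (Item_MRP, Item_Outlet_Sales) pairs are hypothetical and are not taken from the BigMart data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # Choosing the intercept this way forces the line through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical (Item_MRP, Item_Outlet_Sales) pairs.
mrp = [50.0, 100.0, 150.0, 200.0, 250.0]
sales = [700.0, 1500.0, 2400.0, 3100.0, 3900.0]

intercept, slope = fit_line(mrp, sales)

# Residuals are the vertical distances between observations and the line.
residuals = [y - (intercept + slope * x) for x, y in zip(mrp, sales)]
sse = sum(r ** 2 for r in residuals)  # the quantity that OLS minimizes

print("slope:", slope, "intercept:", intercept, "SSE:", sse)
```

The slope is the regression coefficient from the last bullet: the expected change in sales for a one-unit change in MRP under this toy data.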

The linear regression model is part of machine learning, a branch of artificial intelligence. Machine learning develops algorithms that enable computer systems to adapt their behaviour and make predictions for a specific data point on the basis of the empirical data used for training (Chen and Zhang 2014). A key focus of machine learning is discovering knowledge by recognizing patterns and making intelligent decisions based on data.

Machine learning can help retailers become more accurate and granular in making predictions (Davenport, 2013). Examples of machine learning include natural language processing, association rule learning and ensemble learning (Chen and Zhang 2014).

Thematic analysis was chosen as the data analysis method for this study, as it was appropriate for identifying patterns and themes relating to the use of big data analytics. Patterns and themes were identified by breaking the process of thematic analysis down into six stages, as briefly discussed below.
  1. Becoming familiar with the data gathered from the semi-structured interviews. The key approach was to transcribe the interviews and compare the transcripts with the original interviews for accuracy.
  2. Reading all of the transcripts and generating codes which described interesting features of the responses from respondents.
  3. Identifying themes, which involved reading all of the codes and assigning them a common theme. Examples of initial themes were: use, challenge, definition, barriers to adoption, future, design, value, challenge, innovations, development, perception and motivation.
  4. Sorting the codes under their respective themes, then re-reading the coded extracts under each theme in order to identify subthemes. Identifying subthemes grouped common codes together and removed some of the codes which did not form part of a logical group.

Use an Existing Package to Solve a Data Mining Problem

For the selected dataset, the number of unique values in each column was investigated, as listed below:

From the above table it can be stated that the dataset contains 10 outlets with unique identifiers and 1559 different items.
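Counting distinct values per column amounts to building a set for each column. The sketch below shows this on a few hypothetical rows with two of the twelve columns; the identifier values and the helper name unique_counts are illustrative assumptions. In pandas, the equivalent is the nunique method of a DataFrame.

```python
# Hypothetical rows showing two of the twelve columns.
rows = [
    {"Item_Identifier": "FDA15", "Outlet_Identifier": "OUT049"},
    {"Item_Identifier": "DRC01", "Outlet_Identifier": "OUT018"},
    {"Item_Identifier": "FDA15", "Outlet_Identifier": "OUT018"},
    {"Item_Identifier": "NCD19", "Outlet_Identifier": "OUT013"},
]


def unique_counts(rows):
    """Count the distinct values found in each column."""
    columns = rows[0].keys()
    return {col: len({row[col] for row in rows}) for col in columns}


counts = unique_counts(rows)
print(counts)
```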

In the next stage of the project, the factors with the greatest impact on product sales at the stores were investigated. The relationships between the different factors were found by computing the correlation between them. These relationships are depicted in the following heat map:


From the above heat map, it is evident that Item_MRP has the most significant impact on the outlet sales of the products from the stores.
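Each cell of such a heat map is a correlation coefficient between two columns. The sketch below computes the Pearson coefficient in plain Python and ranks two features by the strength of their correlation with sales; Pearson is assumed here because the report does not name the coefficient, and all sample values are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical feature columns and sales values.
features = {
    "Item_MRP": [50.0, 100.0, 150.0, 200.0, 250.0],
    "Item_Weight": [12.0, 9.0, 15.0, 8.0, 13.0],
}
sales = [700.0, 1500.0, 2400.0, 3100.0, 3900.0]

correlations = {name: pearson(values, sales) for name, values in features.items()}

# The feature with the largest absolute correlation is the strongest candidate.
strongest = max(correlations, key=lambda name: abs(correlations[name]))
print(correlations, "strongest:", strongest)
```

A high correlation shows a strong linear association with sales; on its own it does not establish that the feature causes the sales.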

In this stage, the distribution of the items according to their contents and types is examined. These are given by:

Frequency of Item_Fat_Content in the dataset (the table is not preserved; the surviving labels are ‘Low Fat’ and ‘low fat’):
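The labels ‘Low Fat’ and ‘low fat’ describe the same category with different capitalization, so a frequency count would split them unless the labels are normalized first. The sketch below shows one possible normalization; the sample labels (including ‘Regular’) and the title-casing rule are assumptions for illustration, not the cleaning step used in the report.

```python
from collections import Counter

# Hypothetical raw labels; "Low Fat" and "low fat" mean the same category.
raw_labels = ["Low Fat", "Regular", "low fat", "Low Fat", "Regular", "low fat"]

# Assumption: title-casing is enough to merge the spelling variants shown here.
normalised = [label.strip().title() for label in raw_labels]

frequency = Counter(normalised)
print(frequency)
```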

For the different outlet sizes, the frequency of stores is given by:

In further investigation, the relationship between item type and item weight was examined. A box plot was used for this, which led to the following plot:


From the above plot it is evident that household products and seafood are available in a wide range of weights.
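The spread that a box plot shows can also be summarized numerically per item type. The sketch below groups hypothetical weights by type and reports the minimum, median, maximum and range; it is a simplified subset of the statistics a box plot draws (quartiles are omitted), and the sample observations are assumptions.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (Item_Type, Item_Weight) observations.
observations = [
    ("Household", 5.0), ("Household", 13.0), ("Household", 21.0),
    ("Seafood", 6.0), ("Seafood", 12.0), ("Seafood", 20.0),
    ("Breads", 9.0), ("Breads", 11.0), ("Breads", 12.0),
]

# Group the weights by item type.
weights_by_type = defaultdict(list)
for item_type, weight in observations:
    weights_by_type[item_type].append(weight)

# Summarize the spread of each group.
summary = {
    item_type: {
        "min": min(weights),
        "median": median(weights),
        "max": max(weights),
        "range": max(weights) - min(weights),
    }
    for item_type, weights in weights_by_type.items()
}

widest = max(summary, key=lambda item_type: summary[item_type]["range"])
print(summary, "widest range:", widest)
```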

The following plot depicts several variables together, which helps to determine the relationships between them.


From the above plot it can be said that most of the stores established before 1990 have higher sales than the newer BigMart stores. Moreover, items with weights between 10 and 15 are the ones most often sold by the stores.

In this data mining project, it was found that numerous rows in the dataset contain missing values, which may lead to incorrect modelling. The missing values therefore have to be removed or imputed in order to make the dataset and the resulting model reliable.
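Imputation can be as simple as replacing each missing entry with the mean of the observed ones. The sketch below assumes mean imputation, which the report does not confirm as its chosen method; the helper name impute_mean and the sample weights are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


# Hypothetical Item_Weight column with two missing entries.
item_weight = [9.0, None, 17.0, 19.0, None, 7.0]

imputed = impute_mean(item_weight)
print(imputed)
```

Mean imputation keeps every row but shrinks the variance of the column; dropping the incomplete rows is the alternative mentioned above, at the cost of a smaller dataset.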

Statistical Information for the Selected Dataset

The analysis of big data to gain insights is a new concept. Big data analytics has been defined in various ways and there appears to be a lack of consensus on the definition. It has been defined in terms of the technologies and techniques which are used to analyse large-scale complex data to help improve the performance of a firm (Kwon et al., 2014), and as the application of advanced analytic techniques to big data sets. Fisher et al. (2012) define big data analytics as a workflow which distils terabytes of low-value data down into more granular data of high value. For the purposes of this paper, big data analytics is defined as the use of analytic techniques and technologies to analyse big data in order to obtain information which is of value for decision making.

Most big data is available only in an unstructured state, which provides no direct value to business organizations. With the right set of tools and analytics, it is possible to extract relevant insights for the organization from unstructured data.

With a specifically crafted model it is possible to make predictions for the desired feature selected from the database. A predictive model built on the selected dataset can help find trends and insights about sales and business processes that drive operational efficiencies. In this way the organization can create and launch new products and gain a competitive advantage over its competitors. Exploiting the value of big data in this way also helps remove the considerable effort required for sampling.

Furthermore, analysis of big data can bring other benefits for organizations. These benefits include the launch of new products and services with customer-centric product recommendations that better meet customer requirements. In this way, data analytics can facilitate growth in the market.

Previously, extracting insights from huge datasets was too costly.

The underlying concept of this project was knowledge discovery, known as data mining, through the use of different Python packages. The core of this process is machine learning, which relies on defining the features used for modelling.

Descriptive analytics is the set of techniques which are used to describe and report on the past (Davenport and Dyché 2013). Retailers can use descriptive analytics to describe and summarize sales by region and inventory levels. Examples of techniques include data visualization, descriptive statistics and some data mining techniques (Camm et al., 2014).

Predictive analytics consists of a set of techniques which use statistical models and empirical methods on past data in order to make accurate predictions about the future or to determine the impact of one variable on another.

In the retail industry, predictive analytics can extract patterns from data to make predictions about future sales, repeat visits by customers and the likelihood of making an online purchase (Camm et al. 2014). Examples of predictive analytic techniques which can be applied to big data include data mining techniques and linear regression.


In order to avoid a declining response and success rate, it is important to provide personalized recommendations that are specific to the needs of each customer. To do that, it is important to determine the factors that have a significant impact on the fitted regression line. Feature engineering is the most important stage for this, as it is where the variables for the modelling process are selected.

For this project we investigated multiple models from different perspectives on the selected dataset. The selected dataset is small, so it may not be representative of the large-scale sales of an organization. Throughout the project it was found that the dataset has missing values for different attributes in numerous rows; these were handled in the data cleansing process to obtain a better model.

Even though different combinations of feature sets were used for the modelling, the results deviated from each other because of the noisy dataset. The models that yielded higher accuracy nevertheless suggest that the dataset contains a meaningful amount of information.

Some of the noise in the results may have been caused by the random split of the complete dataset. The results of the project indicate that linear models are not well suited to this kind of project, as linear regression is designed to predict continuous attributes rather than categorical data.
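One way to keep a random split from adding run-to-run variation is to fix the seed of the random number generator, so that every model is evaluated on the same partition. The sketch below is an illustration under that assumption; the function name seeded_split, the 25% test fraction and the seed value are hypothetical and are not taken from the report.

```python
import random


def seeded_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows with a fixed seed and split it in two."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical row indices standing in for the dataset.
rows = list(range(20))

train_a, test_a = seeded_split(rows, seed=42)
train_b, test_b = seeded_split(rows, seed=42)

# The same seed reproduces exactly the same partition.
print(train_a == train_b and test_a == test_b)
```

Repeating the experiment over several different seeds and averaging the scores is a complementary way to see how much of the deviation comes from the split itself.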


Davenport, T.H. and Dyché, J., 2013. Big data in big companies. International Institute for Analytics, 3.