Data Profiling In Organizations: Literature Review And Analysis

Literature Review

This briefing paper examines data profiling in the context of organizations. A briefing paper is a short summary of research results written for a particular audience (Copeland and Neeley 2013). Data profiling is the process of examining the data in a database and gathering information about it (Abedjan, Golab and Naumann 2015). Its purpose is to understand the content, structure, sources and relationships of the data, which helps the organization identify abnormalities and assess data quality. Profiling relies on statistics such as the maximum, minimum, aggregate, mean, percentile, mode and frequency (Mahanti 2014). The information gathered includes length, data type, the occurrence of null values, and abstract and string patterns. Organizations use purpose-built tools for data profiling, and it is an effective way to improve the accuracy of data in the corporate database. The paper aims to provide a detailed analysis of data profiling in organizations.
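The statistics listed above can be illustrated with a small sketch. It is not taken from any of the reviewed tools; the function names `profile_column` and `string_pattern`, the sample columns and the choice to treat `None` and empty strings as nulls are all assumptions made for illustration.

```python
import re
import statistics
from collections import Counter


def string_pattern(value):
    """Reduce a value to a coarse pattern: letters become A, digits become 9."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", text)


def profile_column(values):
    """Return basic profiling statistics for one column (a list of raw values)."""
    # Assumption: None and the empty string both count as null.
    present = [v for v in values if v is not None and v != ""]
    profile = {
        "count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "data_types": sorted({type(v).__name__ for v in present}),
    }
    if not present:
        return profile

    # Frequency-based statistics work for any data type.
    profile["mode"] = Counter(present).most_common(1)[0][0]
    lengths = [len(str(v)) for v in present]
    profile["min_length"] = min(lengths)
    profile["max_length"] = max(lengths)
    profile["patterns"] = Counter(string_pattern(v) for v in present).most_common(3)

    # Numeric statistics are added only when every present value is a number.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numeric) == len(present):
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.mean(numeric)
        profile["median"] = statistics.median(numeric)
    return profile


# Hypothetical sample columns.
age_profile = profile_column([34, 41, None, 29, 41, 57])
phone_profile = profile_column(["555-0101", "555-0199", "", "5550123"])
print(age_profile)
print(phone_profile)
```

In this sketch the phone column yields two distinct string patterns, which is the kind of inconsistency a profiling report is meant to surface before any cleansing is attempted.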


This section reviews the literature on data profiling and related issues.

The Data Quality Problem

Takeshita et al. (2013) focus on the problem of data quality in data profiling. The reviewed paper presents data quality as a major concern for organizations and their business. It suggests that the growing distribution and interchange of data are the reasons behind the data quality problem, and that organizations are concerned about the need for data standardization. The article finds that organizations currently need to verify the reliability of the data source, the medium of data storage and the standardization of data in order to achieve their goals. After profiling, the further concerns are the transformation and extraction of the data.

According to Hazen et al. (2014), poor-quality data affects an organization to a large extent. The article suggests that it changes how the organization operates and is managed, and that it results in lost revenue. Data quality problems lead to a loss of about $500 billion a year. The statistics indicate that an error rate of up to 5% in the data can reduce revenue by 10% (John Walker 2014). The reviewed article identifies poor-quality data as the most important factor in the decline of a business or organization, affecting both revenue generation and employee performance.


Current Concerns Regarding Data Profiling

Data Cleansing

According to Boselli et al. (2014), data profiling supports data cleansing by providing information about vulnerable data. The reviewed article discusses typical targets of data cleansing: missing values, outliers and skewed distributions. Cleansing is followed by another round of profiling, which verifies that the cleansing effort succeeded, and the cycle is repeated until the data is clean. Adopting data cleansing helps the organization assure good data quality. The article treats data cleansing as an important aspect of data profiling that contributes to the growth of the organization.
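The profile, cleanse and re-profile cycle can be sketched as follows. This is a minimal illustration and not the method of the reviewed article: the 1.5 x IQR outlier rule, the moment-based skewness measure, the policy of simply dropping flagged values and the names `quality_report` and `cleanse` are all assumptions.

```python
import statistics


def quality_report(values):
    """Count missing values, flag IQR outliers and measure skewness."""
    present = [v for v in values if v is not None]
    report = {"missing": len(values) - len(present), "outliers": [], "skewness": None}
    if len(present) < 4:
        return report  # too few values for a meaningful quartile estimate

    # Outliers: values more than 1.5 * IQR outside the quartiles (assumed rule).
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    report["outliers"] = [v for v in present if v < low or v > high]

    # Skewness: third standardized moment; 0 for a symmetric distribution.
    mean = statistics.fmean(present)
    stdev = statistics.pstdev(present)
    if stdev > 0:
        third_moment = sum((v - mean) ** 3 for v in present) / len(present)
        report["skewness"] = third_moment / stdev ** 3
    return report


def cleanse(values):
    """Drop missing values and flagged outliers (one simple policy among many)."""
    outliers = set(quality_report(values)["outliers"])
    return [v for v in values if v is not None and v not in outliers]


# Hypothetical order amounts with one gap and one implausible entry.
orders = [120, 130, None, 125, 128, 122, 9000, 127]
before = quality_report(orders)
after = quality_report(cleanse(orders))  # re-profile to verify the cleansing
print(before)
print(after)
```

Running the report again on the cleansed data is what confirms that the cleansing step had the intended effect; in practice an organization might correct or impute values rather than drop them.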

The Data Extraction Process in Data Profiling

According to Ferrara et al. (2014), data extraction is an important aspect of data profiling. It requires immediate analysis of length, data type, value range, frequency, occurrence and variance. The reviewed article explains that data must be retrieved by collecting the data or information and transforming it into a format that can be analyzed easily. Extraction separates the relevant fragments of text, extracts the required information from those fragments and organizes it into a structured framework. The data extraction process thus supports detailed analysis and provides the data collection method.
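The three extraction steps described above (separate the relevant fragments, extract the required information, place it in a structure) can be sketched with a regular expression. The log format, the field names `order_id` and `date` and the function `extract_records` are hypothetical and chosen only for illustration.

```python
import re

# Assumed fragment format: the word "order", a numeric id, later an ISO date.
ORDER_PATTERN = re.compile(
    r"order\s+(?P<order_id>\d+).*?(?P<date>\d{4}-\d{2}-\d{2})",
    re.IGNORECASE,
)


def extract_records(text):
    """Turn free-text lines into structured records, skipping irrelevant lines."""
    records = []
    for fragment in text.splitlines():          # step 1: separate fragments
        match = ORDER_PATTERN.search(fragment)  # step 2: extract the fields
        if match:
            records.append({                    # step 3: place in a structure
                "order_id": int(match["order_id"]),
                "date": match["date"],
            })
    return records


notes = (
    "Order 1042 shipped on 2014-03-18\n"
    "General note with no order\n"
    "order 1043 delayed until 2014-04-02"
)
print(extract_records(notes))
```

Once the text is in this structured form, the length, data type, value range and frequency of each field can be analyzed directly.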

According to Huang et al. (2014), data quality management ensures the quality of an organization's data assets. The article discusses three major areas of data quality management. Completeness of the data assets is the most important. The next is accuracy: to be accurate, data should meet the internal and external requirements of the organization. The article treats business processes and decision making as internal requirements and regulations as external requirements. Integrity is the third important aspect of data quality management. Organizations also use data integration to move data from one application to another.

The Benefits of Data Profiling

According to Ellefi et al. (2014), the benefits of data profiling include improved data quality and a shorter implementation cycle. Profiling improves the organization's understanding of its data. The reviewed article states that a corporate database is expected to be accurate and to maintain data quality, which can be achieved effectively with the help of data profiling. Statistical data are first collected from different platforms and then used to gain insight into people's perspectives and experiences. This helps the organization assess what it needs in order to achieve its aims. There are also specific concerns related to data profiling, including the tampering of information.

The Consequences of Poor Quality Data

Albers et al. (2016) state that the problem of data quality involves decomposing and collecting data. This is done mainly to remove duplicate values and errors and to strip unwanted characters, white space and symbols from the data. The reviewed article states that data profiling is used to remove unwanted characters, symbols and spaces from text, but that it falls short at removing errors from the text. To identify and remove errors from the data, different data processing techniques must be incorporated, such as text mining, data mining and extraction (Evans et al. 2014). According to this literature, the results of data profiling are passed to data mining techniques so that information can be extracted and discovered from the organization's database. Another important method is categorization, a traditional text mining practice, which is performed to obtain a meaningful structure for the data. The article defines categorization as a learning method that assigns documents to categories based on their content. It further describes how classifiers are trained from examples to perform the category assignment, with each category presented as a binary classification problem. These techniques analyze the words or phrases of the text and consider the frequency of each phrase.
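The categorization approach described above, with one binary classifier per category trained from word frequencies, can be sketched as follows. This is an illustrative simplification and not the method of the reviewed article: the smoothed log-likelihood scoring, the omission of class priors, the class name `BinaryTermClassifier` and the training sentences are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class BinaryTermClassifier:
    """Decides 'in category' versus 'not in category' from word frequencies."""

    def __init__(self, positive_docs, negative_docs):
        self.pos = Counter(w for d in positive_docs for w in tokenize(d))
        self.neg = Counter(w for d in negative_docs for w in tokenize(d))
        self.vocab_size = len(set(self.pos) | set(self.neg))
        self.pos_total = sum(self.pos.values())
        self.neg_total = sum(self.neg.values())

    def score(self, text):
        """Sum of add-one smoothed log-likelihood ratios; above 0 favours the category."""
        total = 0.0
        for word in tokenize(text):
            p_pos = (self.pos[word] + 1) / (self.pos_total + self.vocab_size)
            p_neg = (self.neg[word] + 1) / (self.neg_total + self.vocab_size)
            total += math.log(p_pos / p_neg)
        return total


def train_one_vs_rest(labelled_docs):
    """Train one binary classifier per category: its documents against the rest."""
    categories = {label for _, label in labelled_docs}
    return {
        category: BinaryTermClassifier(
            [text for text, label in labelled_docs if label == category],
            [text for text, label in labelled_docs if label != category],
        )
        for category in categories
    }


def categorize(classifiers, text):
    """Return every category whose binary classifier accepts the text."""
    return [c for c, clf in sorted(classifiers.items()) if clf.score(text) > 0]


# Hypothetical training examples.
training = [
    ("invoice payment overdue amount", "finance"),
    ("payment received invoice closed", "finance"),
    ("server outage network down", "operations"),
    ("network latency server restart", "operations"),
]
classifiers = train_one_vs_rest(training)
print(categorize(classifiers, "overdue invoice payment"))
print(categorize(classifiers, "server restart"))
```

Because each category is decided independently, a document can be assigned to several categories or to none, which is the usual consequence of framing categorization as a set of binary problems.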

It is recommended that organizations use data profiling, as it improves data quality. Data quality is important for an organization to earn more revenue, and profiling also reduces the time needed for the data cycle. The organization should implement a standard method for conducting data profiling, which should include publishing the results and reporting them to relevant stakeholders. Profiling is also recommended for assuring the reliability of data sources and the standardization needed to achieve the organization's business goals.

Conclusion

It can be concluded from this briefing paper that data profiling is an important process for organizations. An organization can use data profiling to maintain the quality of its data and improve data accuracy. The major concerns in data profiling are poor data quality and the process of cleansing the data, and the paper presents data cleansing as the process for ensuring data quality. It reviews the literature on data profiling with reference to the data extraction and collection processes, and it sets out the benefits of data profiling for the organization. It concludes that poor data quality is a cause of revenue loss for many organizations and provides statistical data in this context. The paper also discusses categorization as an important aspect of data profiling and the process it involves. Therefore, this briefing paper provides an overview of the topic of data profiling.

References

Abedjan, Z., Golab, L. and Naumann, F., 2015. Profiling relational data: a survey. The VLDB Journal—The International Journal on Very Large Data Bases, 24(4), pp.557-581.

Albers, A., Gladysz, B., Heitger, N. and Wilmsen, M., 2016. Categories of product innovations–A prospective categorization framework for innovation projects in early development phases based on empirical data. Procedia CIRP, 50, pp.135-140.

Boselli, R., Cesarini, M., Mercorio, F. and Mezzanzanica, M., 2014, May. Planning meets data cleansing. In Twenty-Fourth International Conference on Automated Planning and Scheduling.

Copeland, G. and Neeley, A., 2013. Identifying Competencies and Actions of Effective Turnaround Principals. Briefing Paper. Southeast Comprehensive Center.

Ellefi, M.B., Bellahsene, Z., Scharffe, F. and Todorov, K., 2014, May. Towards Semantic Dataset Profiling. In PROFILES@ESWC.

Evans, A.M., Bridgewater, B.R., Liu, Q., Mitchell, M.W., Robinson, R.J., Dai, H., Stewart, S.J., DeHaven, C.D. and Miller, L.A.D., 2014. High resolution mass spectrometry improves data quantity and quality as compared to unit mass resolution mass spectrometry in high-throughput profiling metabolomics. Metabolomics, 4(2), p.1.

Ferrara, E., De Meo, P., Fiumara, G. and Baumgartner, R., 2014. Web data extraction, applications and techniques: A survey. Knowledge-based systems, 70, pp.301-323.

Hazen, B.T., Boone, C.A., Ezell, J.D. and Jones-Farmer, L.A., 2014. Data quality for data science, predictive analytics, and big data in supply chain management: An introduction to the problem and suggestions for research and applications. International Journal of Production Economics, 154, pp.72-80.

Huang, G., Wu, X.Y., Yuan, M. and Li, R.F., 2014. Research on Data Quality of E&P Database Base on Metadata-Driven Data Quality Assessment Architecture. In Applied Mechanics and Materials (Vol. 530, pp. 813-817). Trans Tech Publications.

John Walker, S., 2014. Big data: A revolution that will transform how we live, work, and think.

Mahanti, R., 2014. Critical success factors for implementing data profiling: The first step toward data quality. Software Quality Professional, 16(2), p.13.

Takeshita, Y., Martz, T.R., Johnson, K.S., Plant, J.N., Gilbert, D., Riser, S.C., Neill, C. and Tilbrook, B., 2013. A climatology-based quality control procedure for profiling float oxygen data. Journal of Geophysical Research: Oceans, 118(10), pp.5640-5650.