Pattern Mining for Information Retrieval from Large Databases

Abstract
With the widespread use of databases and the explosive growth in their sizes, data mining has become an attractive means of retrieving useful information. Desktop search has been used by tens of millions of people, yet over the past seven years user behaviour has shifted, with many people moving to web-based applications for storing and accessing their own data. Despite the increasing amount of information available on the internet, storing files on a personal computer remains a common habit among internet users. The motivation of this work is to develop a local search engine that gives users instant access to their personal information. The quality of extracted features is the key issue in text mining because of the large number of terms, phrases, and noise. Most existing text mining methods are term-based: they extract terms from a training set to describe relevant information. However, the quality of the extracted terms may be low because text documents contain a great deal of noise. For many years researchers have used phrases, which carry more semantics than single words, to improve relevance, but many experiments do not support the effective use of phrases because they occur with low frequency and include many redundant and noisy phrases. In this paper we propose a novel pattern discovery approach for text mining. To evaluate the proposed approach, we adopt a feature extraction method for Information Retrieval (IR).
Keywords – Pattern mining, Text mining, Information retrieval, Closed pattern.
Introduction
In the past decade, a significant number of data mining techniques have been presented for retrieving information from large databases, including association rule mining, sequential pattern mining, and closed pattern mining. These methods can find patterns in a reasonable time frame, but it is difficult to use the discovered patterns in the field of text mining. Text mining is the process of discovering interesting information in text documents. Information retrieval provides many methods for finding accurate knowledge in text documents. The most commonly used methods are phrase-based approaches, but they suffer from several problems: phrases have a low frequency of occurrence, and there are large numbers of noisy phrases among them. If the minimum support is decreased, many noisy patterns are generated.

Pattern Classification Method
To find knowledge effectively without the problems of low frequency and misinterpretation, a pattern-based approach (the pattern classification method) is proposed in this paper. This approach first finds the common characteristics of patterns and evaluates the weights of terms based on the distribution of terms in the discovered patterns, which addresses the misinterpretation problem. The low-frequency problem can also be reduced by using patterns from the negative training examples. Many algorithms are used to discover patterns, such as the Apriori algorithm and the FP-tree algorithm, but these algorithms do not specify how to use the discovered patterns effectively. The pattern classification method uses closed sequential patterns to deal with the large number of discovered patterns efficiently, applying the concept of closed patterns to text mining.
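The idea of a closed pattern can be illustrated with a short, self-contained sketch. This is not the paper's implementation; it assumes the pattern supports have already been produced by a frequent sequential pattern miner, and the sample terms are hypothetical.

# Minimal sketch: keep only closed patterns, i.e. frequent patterns with no
# proper super-pattern that has exactly the same support.
def is_subsequence(small, big):
    # True if `small` occurs in `big` as a (not necessarily contiguous) subsequence.
    it = iter(big)
    return all(term in it for term in small)

def closed_patterns(patterns):
    # `patterns` maps a tuple of terms to its support count; return only the closed ones.
    closed = {}
    for p, sup in patterns.items():
        absorbed = any(q != p and sup == sup_q and is_subsequence(p, q)
                       for q, sup_q in patterns.items())
        if not absorbed:
            closed[p] = sup
    return closed

# ("carbon",) is absorbed by ("carbon", "emission") because both have support 2.
freq = {("carbon",): 2, ("carbon", "emission"): 2, ("air", "pollution"): 3}
print(closed_patterns(freq))   # {('carbon', 'emission'): 2, ('air', 'pollution'): 3}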
Preprocessing
The first step towards handling and analyzing textual data is to consider the text-based information available in free-format text documents. Real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their huge size, and such low-quality data leads to low-quality mining results. In the proposed system, preprocessing is performed on each text document as its content is stored on the desktop system. Traditionally, information would be processed manually: a human domain expert would read it thoroughly and decide whether it was good or bad (positive or negative), which is expensive in terms of the time and effort required from the expert. The preprocessing stage includes two steps.
Removing stop words and stem words
To begin the automated text classification process, the input data must be represented in a format suitable for applying textual data mining techniques. The first step is to remove the unnecessary information that appears in the form of stop words. Stop words are words that are deemed irrelevant even though they may appear frequently in the document; they include conjunctions, pronouns, auxiliary verbs, and articles (e.g. is, am, the, of, an, we, our). These words are removed because they contribute little to interpreting the meaning of the text.
Stemming is the process of conflating words to their stem, base, or root form. Many words are small syntactic variants of each other because they share a common word stem. In this paper simple stemming is applied, so that words such as 'deliver', 'delivering', and 'delivered' are reduced to a common stem. Stemming helps to capture the whole information-carrying term space and also reduces the dimensionality of the data, which ultimately benefits the classification task. Several algorithms implement stemming, including Snowball, Lancaster, and the Porter stemmer; among these, the Porter stemmer is an efficient choice. It is a simple rule-based algorithm that replaces one suffix with another. Rules have the form (condition) S1 -> S2, where S1 and S2 are suffixes. The replacements include, for example, rewriting 'sses' as 'ss' and 'ies' as 'i', removing past-tense and progressive endings, cleaning up, and replacing 'y' with 'i'.
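As an illustration of these two preprocessing steps, the following sketch uses the NLTK library's stop-word list and Porter stemmer. This is an assumption about tooling, not part of the proposed system: it requires nltk to be installed and the stopwords corpus to have been downloaded with nltk.download('stopwords').

# Illustrative preprocessing sketch (assumes the nltk package is installed and
# nltk.download('stopwords') has been run beforehand).
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # keep alphabetic tokens only
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [stemmer.stem(t) for t in tokens]              # conflate words to their stems

print(preprocess("We delivered the packages and we are delivering more."))
# 'delivered' and 'delivering' are conflated to the same stem; the stop words are dropped.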
Weight Calculation
The weight of each term is calculated by multiplying its term frequency by its inverse document frequency. Term frequency counts the occurrences of an individual term within a document, while inverse document frequency measures whether a term is common or rare across all documents.
Term Frequency:
TF(t,d) = 0.5 + 0.5 * f(t,d) / max{ f(w,d) : w belongs to d }
where d is a single document, t is a term, and f(t,d) is the raw count of t in d.
Inverse Document Frequency:
IDF(t,D) = log(total number of documents / number of documents containing the term)
where D is the collection of documents.
Weight:
Weight(t,d,D) = TF(t,d) * IDF(t,D)
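A direct transcription of these formulas into code might look like the following sketch; the variable names and sample documents are illustrative only.

# Sketch of the weight calculation above: augmented term frequency multiplied by
# inverse document frequency. `docs` is a list of already-preprocessed token lists.
import math
from collections import Counter

def tf(term, doc_tokens):
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)   # assumes the term occurs in at least one document

def weight(term, doc_tokens, docs):
    return tf(term, doc_tokens) * idf(term, docs)

docs = [["pattern", "mining", "text"], ["text", "retrieval"], ["pattern", "text", "discovery"]]
print(round(weight("pattern", docs[0], docs), 3))   # 0.405: occurs in 2 of 3 documents
print(round(weight("text", docs[0], docs), 3))      # 0.0: occurs in every document, so IDF is 0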
Clustering
A cluster is a collection of data objects that are similar to one another within the same cluster. Cluster analysis finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters. Clustering is defined as a process of grouping data or information into groups of similar types using physical or quantitative measures, and it is a form of unsupervised learning. Cluster analysis is used in many applications, such as pattern recognition, data analysis, and information discovery on the web. It supports many types of data, including data matrices, interval-scaled variables, nominal variables, binary variables, and variables of mixed types. Several families of clustering methods exist: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. In this paper a partitioning method is used for clustering.
Partitioning methods
This method classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n.
K-means algorithm
K-means is one of the simplest unsupervised learning algorithms. It takes an input parameter, k, and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high while the inter-cluster similarity is low. It is a centroid-based technique: cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Method:

Select an initial partition with k clusters containing randomly chosen samples, and compute the centroids of the clusters.
Generate a new partition by assigning each sample to the closest cluster center.
Compute new cluster centers as the centroids of the clusters.
Repeat steps 2 and 3 until an optimum value of the criterion function is found or until the cluster membership stabilizes.

This algorithm is faster than hierarchical clustering, but it is not suitable for discovering clusters with non-convex shapes.
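The steps listed above can be sketched compactly with numpy as follows. This is an illustrative implementation, not an optimized one, and the sample points are hypothetical.

# Compact sketch of the k-means steps described above.
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k samples at random as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to the closest cluster centre.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centre as the centroid of its cluster.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids (and hence the memberships) stabilise.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
labels, centres = k_means(points, k=2)
print(labels)   # two well-separated clusters, e.g. [0 0 1 1] or [1 1 0 0]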

Fig.1. K-Means Clustering
Classification
Classification predicts categorical class labels: it builds a model from a training set and the class labels of the classifying attribute, and then uses that model to classify new data. Data classification is a two-step process: (1) learning and (2) classification. Learning can be supervised or unsupervised. The accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data. Many classification methods are available, such as k-nearest neighbour, genetic algorithms, the rough set approach, and fuzzy set approaches. The technique used here measures nearness of occurrence. It assumes that the training set includes not only the data items but also the desired classification for each item. The proposed approach finds the distance from the new or incoming instance to the training samples; only the k closest entries in the training set are considered, and the new item is placed into the class that contains the most items among those k neighbours. In this way similar text documents are classified, and file indexing is performed so that files can be retrieved effectively.
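A minimal k-nearest-neighbour sketch of this distance-based classification is shown below. It assumes documents have already been converted to fixed-length weight vectors; the sample vectors and labels are hypothetical.

# Minimal k-nearest-neighbour sketch: vote among the k training samples
# closest to the query vector.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k=3):
    # `training` is a list of (vector, label) pairs.
    nearest = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [([0.9, 0.1], "sports"), ([0.8, 0.2], "sports"), ([0.1, 0.9], "finance")]
print(knn_classify([0.85, 0.15], training, k=3))   # -> 'sports'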
Result and Discussion
An input file is given and initial preprocessing is performed on it. Inverse document frequency is calculated to find matches with the training samples, and clustering is performed to find similarities between documents. Classification is then performed to determine whether the input matches any of the clusters; if it matches a particular cluster, the files in that cluster are listed. The classification stage also distinguishes the various file formats, and a report is generated showing the percentage of files available in each format; a graphical representation gives a clear view of the files available in the various formats. This method uses the smallest number of patterns for concept learning compared with other methods such as Rocchio, probabilistic models, n-gram models, concept-based models, and the widely used BM25 and SVM models. The proposed model achieves high performance and identifies the relevant information that users want. It reduces the side effects of noisy patterns because term weights are based not only on the term space but also on the discovered patterns. Proper use of the discovered patterns overcomes the misinterpretation problem and provides a feasible way to exploit the vast number of patterns generated by data mining algorithms.
Conclusion
Storing huge numbers of files on personal computers is a common habit among internet users, which is essentially justified for the following reasons:
1) The information on the web is not always permanent.
2) The information retrieved differs depending on the search query.
3) It is difficult to remember the locations of the same sites when information must be retrieved again.
4) Obtaining information from the web is not always immediate.
These habits have drawbacks, however: it is difficult to find the data when it is required. On the internet, the use of searching techniques is now widespread, but for personal computers the tools are quite limited. The normal "Search" or "Find" options can take hours to produce a result, so the time needed to reach the desired result is high. The proposed system provides accurate results compared with a normal search. All files are indexed and clustered using the efficient k-means technique, so information is retrieved in an efficient manner. The clustering-based approach provides optimized response times, and downtime and power consumption are reduced.

A Database Design and Report on the Impact of Databases in the Work Place

Table of Contents

Introduction

General Description of the Doctor’s Office Database

Visio Design of the Database

Access Design of the Database

A Database used in the Work Place

Impact of the Database on the Work Place

Business Benefits of Queries

Forms and Reports

Security Concerns and Mitigation

Conclusion

References

Appendix A: Noodletools Sources Window Screen Shot

Introduction

Why are databases so important for businesses? Databases are condensed information organized into tables, columns, and rows, which makes it easier to retrieve and manage data. Data is deleted and updated in these systems on a daily, weekly, or quarterly basis. All businesses use some sort of database daily to store information to be retrieved for later use.

 The most common form of database used in healthcare is the relational database. Relational databases can be used to track patient care in the form of treatments, outcomes of those treatments, and critical indicators of a patient’s current state such as blood pressure, heart rate, and blood glucose levels. Relational databases can also be used to interconnect with multiple informational systems throughout a healthcare facility. For example, a relational database in a cardiac care unit can be directly linked to a hospital’s registration system. Upon registration, a newly admitted patient’s demographic information is sent automatically to the cardiac database. This eliminates the need for cardiac care clinicians to input patient information into the database, freeing them to concentrate on providing the patient with the best care possible.

Relational databases have the potential to eliminate paper storage and transfer of information and to answer important questions about healthcare efficacy rather than merely serving as an accounting mechanism. For example, diabetic patients sharing similar health risk factors (for example, slightly overweight, high HbA1c and fasting blood glucose readings) can be closely monitored to determine how different drugs (for example, Glucovance) help to control those factors. From an administrative and prevention standpoint, relational databases can be used to identify at-risk patients, for example, those who have a family history of aneurysms. Once identified, patients can be screened to prevent them from succumbing to a particular disease.

 

A Database used in the Work Place 

Thousands of companies depend on the accurate recording, updating and tracking of their data on a minute-to-minute basis. Employees use this data to complete accounting reports, calculate sales estimates and invoice customers. The workers access this data through a computerized database. A proven method to manage the relationships between the various database elements is the use of a relational database management system. A fully-functional relational database management system allows users to enter new information, update current records and delete outdated data. As an example, when a salesperson sells 1,000 units, that person will enter the transaction information into the relational database management system. The data can include the salesperson’s name, the customer information, the product sold and the quantity sold. The relational database management system enters a new record in the customer table, updates the salesperson’s record and subtracts 1,000 units from the inventory record.

There are many different types of databases on the market today, and they all have a major influence on how a business operates. A business or corporation would suffer without some sort of database; without one, a business would become disorganized and hard to manage. Databases are designed to keep a business organized and structured. One widely used database is MySQL. It is used by many people because it is free, and there are several paid editions for commercial use. With this database you get a broad array of features, and each company can customize features to suit its specific needs. MySQL offers a multitude of storage engines to choose from, which lets you change how the engine that handles the data behaves. The system also processes large amounts of data with ease and has an easy-to-use interface. For these reasons it is widely used by many consumers and is rated as very reliable.

Most businesses today use a personnel database. This database is used for managing benefits, rates of pay, payroll, deductions, and W-4 and I-9 forms. Making sure employees get paid on time is key; no one wants to work for free, which is why this kind of database was designed to help business owners. A personnel data system helps a company when it has an abundance of employees, and it also helps the company stay in compliance with state and federal regulations.

Databases can have a positive or negative impact on a business's operation. The impacts a strong database can have include improved management of workflows, increased operational intelligence, more proficient risk management, improved overall business process analysis, and centralized operations management. In short, databases are a major reason that businesses are successful: they can cut operation times in half, producing greater results for a company and letting it manage its time more efficiently.

Queries are questions that you ask the database to help you find specific information. They give you the power to request material from a database that contains multiple tables. There are numerous benefits that come from a business using queries in its daily operations: they provide multiple views of the data and use interactive languages that are easy to learn and understand. Using the right database in a business can have multiple positive effects on a workplace. Queries help you make fewer mistakes when performing daily tasks; if the information is correct they minimize the possibility of an error occurring, and if a query brings up information that is not correct, you can fix it then. Using queries to find a specific item in a database also saves time. Database queries help companies improve data security, improve data sharing, integrate data effectively, minimize data inconsistency, provide better access to data, and increase the productivity of the end user. These are just a few of the ways queries help save businesses time and money.

Reports offer a practical way of communicating complex material. Reports can use pictures, graphs, and other visual aids, and they help relay specific material to your company, competitors, investors, and others. Reports are used to form a record of your business's activities. There are several types of business reports, such as analytical reports, feasibility reports, progress reports, and conference reports. Analytical reports show how well a business is doing or where it needs to improve; they are useful for finding out why a company's sales have decreased, and they include sales data, financial results, and strategies to help isolate the issue. A feasibility report is used to see whether a business venture is worth the effort; it presents the data that has been gathered in several formats along with a conclusion. Progress reports are given to a client or a director of a company to show how a project is coming along and to help estimate the time it will take to finish. Conference reports detail a business's daily or weekly outcomes. As you can see, reports contain many important facts that a company needs to stay on top of its business. Forms are created to hold the reports: a form is created by entering information such as name, address, phone number, and date. Without the forms you would not be able to create a report.

There are many security issues and concerns that businesses face. One such issue is excessive privileges, where employees have privileges that surpass their job functions. These privileges give them the ability to make changes to personal or business accounts, which can cause problems if the wrong person is in that position: account information could be changed on purpose or by accident. Such an employee has access to people's personal information, such as social security numbers, dates of birth, credit card information, and bank records, and there are dishonest people everywhere in this day and age.

Another security issue is the human factor. People are prone to make mistakes, and we become careless in a rushed society; thirty percent of all data breaches come from human carelessness, often due to lack of experience in a position. We all live in a fast-paced environment these days, and bosses want things done now, not tomorrow. This makes people rush, which throws off their processing skills, causes them to forget steps, and leads to mistakes. One way to mitigate these issues is more training before leaving a person alone to do the job by themselves. Another is to put a monitoring tool on the computers that alerts a person responsible for overseeing everyone's work. There are many other ways, but these are two that I would put in place.

In conclusion, databases were made to make our lives a little easier. They help us maintain the information needed for daily success, and they were also made so that companies could keep massive amounts of information in a single place. Queries help find this data faster by entering key words, which improves productivity in the workplace. Forms are made to store the reports, so we have the information required to present to others. There are serious threats out there, and without proper monitoring these issues can become major problems if companies do not catch them in time.

Relational Databases: Functional Dependency and Normalization

Abstract

Functional dependencies and normalization play an important role in relational database design. Functional dependencies are key to establishing the relationship between key and non-key attributes in a relation, while the normalization process removes anomalies from a relation and prevents data redundancy. This paper, intended as a graduate research paper, establishes the definitions of these concepts. It introduces functional dependency and explains its inference rules, then introduces normalization and the normal forms 1NF through 5NF, including BCNF. The paper also explains how functional dependencies and normalization are related, why they are important for relational databases, and the advantages of designing a normalized database.

Relational Databases: Functional Dependency and Normalization

Definitions and Concepts

Functional Dependency

A functional dependency is a constraint between two sets of attributes from the database. A functional dependency, represented by X → Y, between two sets of attributes X and Y that are subsets of a relation R specifies a constraint that, for any two tuples t1 and t2 in R that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].

This means that the values of the set of attributes Y of a tuple in R are determined by the values of the set of attributes X. In other words, the values of the set of attributes X functionally determine the values of the set of attributes Y, and we say that Y is functionally dependent on X.

The set of attributes X on the left-hand side of the functional dependency is called the determinant, and the set of attributes Y on the right-hand side is called the dependent. Despite the mathematical definition, a functional dependency cannot be determined automatically; it is a property of the semantics of the attributes, so the database designers must understand how the attributes are related to each other in order to specify a functional dependency. (Elmasri, Ramez and Shamkant B. Navathe. 2006)

Example

Consider the example of an SSN (Social Security Number) database. Every individual has a unique SSN, so the other attributes of the relation, such as name and address, can be determined using the SSN. That makes SSN the determinant, and name and address the dependents, thus establishing the functional dependencies:

SSN  name

SSN  address
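To make the definition concrete, the following sketch checks whether a functional dependency X → Y holds over a small set of tuples. The data and function name are illustrative only, not part of any cited source.

# Illustrative check of a functional dependency X -> Y over a list of rows
# (dictionaries); X and Y are tuples of attribute names.
def holds(rows, X, Y):
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False          # two tuples agree on X but differ on Y
        seen[x_val] = y_val
    return True

people = [
    {"ssn": "111", "name": "Ann", "city": "Hartford"},
    {"ssn": "222", "name": "Bob", "city": "Hartford"},
]
print(holds(people, ("ssn",), ("name",)))   # True: ssn -> name
print(holds(people, ("city",), ("name",)))  # False: same city, different names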

Inference Rules for Functional Dependency

Armstrong’s axioms are a set of inference rules used to infer all the functional dependencies on a relational database. They were developed by William W. Armstrong.

Axiom of reflexivity: if Y is a subset of X, then X determines Y.

If Y is a subset of X, then X → Y

Axiom of augmentation: if X determines Y, then XZ determines YZ for any Z.

If X → Y, then XZ → YZ

Axiom of transitivity: if X determines Y and Y determines Z, then X must determine Z.

If X → Y and Y → Z, then X → Z

Union: if X determines Y and X determines Z, then X must also determine Y and Z together.

If X → Y and X → Z, then X → YZ

Decomposition: if X determines YZ, then X determines Y and X determines Z separately.

If X → YZ, then X → Y and X → Z
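One standard use of these rules is computing the closure of an attribute set, that is, everything a set of attributes functionally determines. The sketch below applies the given dependencies repeatedly until nothing changes; the attribute names and dependencies are hypothetical.

# Sketch: closure of an attribute set under a list of functional dependencies.
def closure(attrs, fds):
    # `fds` is a list of (lhs, rhs) pairs of frozensets of attribute names.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If everything in lhs is already determined, rhs is too (transitivity).
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [(frozenset("A"), frozenset("B")),    # A -> B
       (frozenset("B"), frozenset("C"))]    # B -> C
print(closure({"A"}, fds))   # {'A', 'B', 'C'}: A -> C follows by transitivity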

Normalization

Database normalization is a process that allows the storage of data without unnecessary redundancy and thereby eliminates data inconsistency. A normalized database eliminates anomalies in updating, inserting, and deleting data, which improves the efficiency and effectiveness of the database. Users can maintain and retrieve data from a normalized database without difficulty. Data normalization can be used by the designer of a database to identify and group together related data elements (attributes) and establish relationships between the groups.

The concept of database normalization and its ‘normal forms’ was originally introduced by Edgar Codd, the inventor of the relational model. The normal forms provide the criteria for determining a table’s degree of vulnerability to logical inconsistencies and anomalies: the higher the normal form applicable to a table, the less vulnerable it is.

First Normal Form (1NF)

An entity type or table is in 1NF when each of its attributes contains only simple, atomic values and there are no repeating groups of data. The domain of every attribute in a 1NF table must include only atomic (simple, indivisible) values, and the value of any attribute in a tuple must be a single value from the domain of that attribute.

Example

Consider the address attribute in a sales database. It is not an atomic attribute, because it is made up of the atomic attributes street, city, state, and zip. For the relation to be in 1NF, the database design should use the atomic attributes street, city, state, and zip instead of a single address attribute.

Un-Normalized: sales (date, order_no, product_no, product_description, price, quantity_sold, cust_name, cust_address)

1NF: sales (date, order_no, product_no, product_description, price, quantity_sold, cust_name, cust_street, cust_city, cust_state, cust_zip)

Second Normal Form (2NF)

An entity type or table is in 2NF when it is in 1NF and every non-key attribute depends on the whole key (i.e., full functional dependency on the key). There cannot be partial dependencies.

Example

Continuing with the sales database, order_no and product_no together form the composite key for the table. There are partial dependencies: date depends on order_no but not on product_no, and product_description depends on product_no but not on order_no, which violates the requirement for 2NF. Removing these partial dependencies results in the 2NF design.

1NF: sales (date, order_no, product_no, product_description, price, quantity_sold, customer_name, customer_street, customer_city, customer_state, customer_zip)

2NF: order (date, order_no, cust_no);

product (product_no, product_description, price);

order_detail (order_no, product_no, quantity_sold);

customer (cust_no, cust_name, cust_street, cust_city, cust_state, cust_zip)

Third Normal Form (3NF)

An entity type or table is in 3NF when it is in 2NF and non-key attributes do not depend on other non-key attributes (i.e., there is no transitive dependency).

Example

Continuing with the sales database, the non-key attributes cust_city and cust_state depend on cust_zip, which is itself a non-key attribute. Creating a separate zip table transforms the design into 3NF, wherein there are no longer dependencies between non-key attributes.

2NF: order (date, order_no, cust_no);

product (product_no, product_description, price);

order_detail (order_no, product_no, quantity_sold);

customer (cust_no, cust_name, cust_street, cust_city, cust_state, cust_zip)

3NF: order (date, order_no, cust_no);

product (product_no, product_description, price);

order_detail (order_no, product_no, quantity_sold);

customer (cust_no, cust_name, cust_street, zip_code);

zip (zip_code, city, state)

Boyce Codd Normal Form (BCNF)

An entity type or table is in BCNF when it is in 3NF and, for every non-trivial functional dependency that holds on the relation, the determinant is a candidate key.

Example

Continuing with the sales database, every determinant in the decomposed relations is already a candidate key, so the 3NF relations also satisfy BCNF.

Fourth Normal Form (4NF)

An entity type or table is in 4NF when it is in BCNF and there are no non-trivial multi-valued dependencies. To move from BCNF to 4NF, move each independent multi-valued fact into a relation of its own.

Example

For example, a professor can teach multiple subjects and can also mentor multiple students. To be in 4NF, professor-to-subject and professor-to-student should be stored as separate relations, since the two facts are independent of each other.

Fifth Normal Form (5NF)

To be in 5NF, a relation must be decomposable into smaller relations that have the lossless-join property, which ensures that no spurious tuples are generated when the relations are reunited through natural joins.

Example

In the sales database example, when the 1NF sales relation was decomposed into the order, order_detail, product, and customer relations, joining those tables back together on their shared keys reproduces the original data without losing tuples or generating spurious ones.
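The lossless-join property can be demonstrated with a short pandas sketch on data resembling the appendix tables. The use of pandas is an assumption about tooling, and the values are illustrative only.

# Rejoining two relations of the decomposition on their shared key reproduces
# exactly the original line items, with no spurious rows.
import pandas as pd

order_detail = pd.DataFrame({"order_no": [1001, 1001, 1002],
                             "product_no": ["A320", "B101", "C101"],
                             "quantity_sold": [8, 4, 3]})
product = pd.DataFrame({"product_no": ["A320", "B101", "C101"],
                        "product_description": ["MP3", "Ipod", "Blu Ray"],
                        "price": [10.00, 100.00, 80.00]})

rejoined = order_detail.merge(product, on="product_no")   # natural join on the shared key
print(len(rejoined))   # 3 rows: exactly the original line items, no spurious tuples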

(Russell, Gordon. Chapter 4; Nguyen Kim Anh, Relational Design Theory)

Importance of Functional Dependency and Normalization to Relational Model

How are they related

Normalization theory draws heavily on the theory of functional dependencies. When a database designer sets out to design a database, it is essential to understand the semantics of the data, that is, how the attributes are related to one another; this helps in establishing the functional dependencies between attributes. Once the functional dependencies are identified, it is easier to bring the database design into the highest normal form possible. The rules for each normal form, starting from 1NF, are framed around maintaining the functional dependencies and are also based on the inference rules for functional dependencies (see the Inference Rules section). For example, to be in 2NF the non-key attributes must depend on the whole key, which means the functional dependencies must be satisfied; similarly, to be in 3NF, transitive dependencies must be removed, which can be done only if the functional dependencies have been established correctly.

In other words, database normalization process ensures an efficient organization of data in database tables, which results in guaranteeing that data dependencies make sense, and also reducing the space occupied by the database via eliminating redundant data.

Why are they necessary for Relational Database model?

Functional dependencies play an important role in relational database design. They are used to establish keys that are used to define normal forms for relations. In addition, they help in deriving constraints based on the relationships between attributes. As a database grows in size and complexity it is essential that order and organization be maintained to control these complexities and minimize errors and redundancy in the associated data. This goal is managed by normalization. Database normalization minimizes data duplication to safeguard databases against logical and structural problems, such as data anomalies.

Normalization can help keep the data free of errors and can also help ensure that the size of the database doesn’t grow large with duplicated data. Normalization permits us to design our relational database tables so that they “(1) contain all the data necessary for the purposes that the database is to serve, (2) have as little redundancy as possible, (3) accommodate multiple values for types of data that require them, (4) permit efficient updates of the data in the database, and (5) avoid the danger of losing data unknowingly (Wyllys, R. E., 2002).”

The resulting normalized database is highly efficient, which can be characterized by –

Increased Consistency: Information is stored in one place and one place only, reducing the possibility of inconsistent data.

Easier object-to-data mapping: Highly-normalized data schemas in general are closer conceptually to object-oriented schemas because the object-oriented goals of promoting high cohesion and loose coupling between classes results in similar solutions.

Moreover, a normalized database is advantageous when operations will be write-intensive or when ACID (Atomicity, Consistency, Isolation, Durability) compliance is required. Some advantages include:

Updates run quickly since no data is duplicated in multiple locations.

Inserts run quickly since there is only a single insertion point for a piece of data and no duplication is required.

Tables are typically smaller than the tables found in non-normalized databases. This usually allows the tables to fit into the buffer, thus offering faster performance.

Data integrity and consistency is an absolute must if the database must be ACID compliant. A normalized database helps immensely with such an undertaking.

Searching, sorting, and creating indexes can be faster, since tables are narrower, and more rows fit on a data page.

Minimizes/avoids data modification issues.

(https://en.wikipedia.org/wiki/ACID_(computer_science))

Summary

The paper defined the concept of functional dependency, which is the basic tool for analyzing relational schemas, and discussed some of its properties; functional dependencies specify semantic constraints among the attributes of a relation schema. Next it described the normalization process for achieving good designs, and presented examples to illustrate how, by using the general definitions of the normal forms, a given relation may be analyzed and decomposed to eventually yield a set of relations in 3NF. The paper also touched on the less commonly used BCNF, 4NF, and 5NF normal forms.

The paper then explained how functional dependencies and normalization are inter-related in the design of a relational database, and why they are important to that design. A normalized database is highly efficient and has many advantages.

References

Wyllys, R. E., 2002. Database management principles and applications

Elmasri, Ramez and Shamkant B. Navathe. 2006. Fundamentals of Database Systems. 5th ed. Reading, MA: Addison-Wesley

Russell, Gordon. Chapter 4 – Normalization. Database eLearning

Nguyen Kim Anh, Relational Design Theory. OpenStax CNX

Gaikwad, A.S., Kadri, F.A., Khandagle, S.S., Tava, N.I. (2017) Review on Automation Tool for ERD Normalization. International Research Journal of Engineering and Technology (IRJET) [Online]. 4 (2), pp. 1323-1325. [Accessed 07 May 2017]. Available from: https://www.irjet.net/archives/V4/i2/IRJET-V4I2259.pdf

https://en.wikipedia.org/wiki/ACID_(computer_science)

Tables

Un-normalized Table: sales

date       | order_no | product_no | product_description | price  | quantity_sold | cust_name | cust_address
12/12/2018 | 1001     | A320       | MP3                 | 10.00  | 8             | Tom       | 1 Main St, Hartford, CT 06106
12/12/2015 | 1001     | B101       | Ipod                | 100.00 | 4             | Tom       | 1 Main St, Hartford, CT 06106
01/05/2019 | 1002     | C101       | Blu Ray             | 80.00  | 3             | Aaron     | 1 Holy Lane, Manchester, 06040

1NF Table: sales

date       | order_no | product_no | product_description | price  | quantity_sold | cust_name | cust_street | cust_city  | cust_state | cust_zip
12/12/2018 | 1001     | A320       | MP3                 | 10.00  | 8             | Tom       | 1 Main St   | Hartford   | CT         | 06106
12/12/2015 | 1001     | B101       | Ipod                | 100.00 | 4             | Tom       | 1 Main St   | Hartford   | CT         | 06106
01/05/2019 | 1002     | C101       | Blu Ray             | 80.00  | 3             | Aaron     | 1 Holy Lane | Manchester | CT         | 06040

2NF Table: order

date       | order_no | cust_no
12/12/2018 | 1001     | 101
01/05/2019 | 1002     | 102

2NF Table: product

product_no | product_description | price
A320       | MP3                 | 10.00
B101       | Ipod                | 100.00
C101       | Blu Ray             | 80.00

2NF Table: order_detail

order_no | product_no | quantity_sold
1001     | A320       | 8
1001     | B101       | 4
1002     | C101       | 3

2NF Table: customer

cust_no | cust_name | cust_street   | cust_city  | cust_state | cust_zip
101     | Tom       | 1 Main Street | Hartford   | CT         | 06106
102     | Aaron     | 1 Holy Lane   | Manchester | CT         | 06040

3NF Table: customer

cust_no | cust_name | cust_street   | zip_code
101     | Tom       | 1 Main Street | 06106
102     | Aaron     | 1 Holy Lane   | 06040

3NF Table: zip

zip_code | city       | state
06106    | Hartford   | CT
06040    | Manchester | CT