Developing User Profiles With Weka Tool: Login, Access And Resource Usage Patterns

Project Aim

The provided data contains login and access records for all 20 users in a department. This project's aim is to develop a profile for each user, so that the profiles can support authorization and authentication. Each profile describes the user's login, logoff, access and session-time patterns, and contains the start time, duration, resources accessed and the types of operations performed. Here, the resources are the printer, network, files and computer. The access pattern covers the user programs executed, files accessed for reading and updating, printer usage, file sizes and library programs. The user profiles are determined using association rules with sufficient support. Profile development must avoid outliers and overfitting, and it applies data mining security techniques to provide effective user profiles for the department. All of these aspects are analyzed and discussed in detail below.


The aim of this project is to develop a profile for each user in a department, that is, to profile each user for authorization and authentication. The user profiles are developed with the Weka tool, in which association rules are used to derive the profiles. An analysis of the data mining security techniques that provide effective user profiles for a department is also covered. Specifically, the project will:

  1. Analyze the provided data to develop a profile for each user.
  2. Use association rules on the provided data to develop the user profiles.
  3. Analyze how the user profiles are developed in a department.

The provided data set is divided into three subsets: the login pattern, the e-mail pattern and the resource usage pattern. The data contains the following attributes: user ID, user program ID, library program ID, library utility ID, file ID, printer ID, e-mail program ID, time, date and host machine.
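As a first step in Weka, such a data set is typically loaded from an ARFF file and the class attribute is selected. The sketch below uses Weka's Java API and is illustrative only: the file name login_pattern.arff and the attribute name User_ID are assumptions, chosen to mirror the attributes listed above.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadPatterns {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; one ARFF file per pattern (login, e-mail, resource).
        Instances login = DataSource.read("login_pattern.arff");
        // User_ID (assumed name) is the class attribute the profiles predict.
        login.setClassIndex(login.attribute("User_ID").index());
        System.out.println(login.numInstances() + " instances, "
                + login.numAttributes() + " attributes");
    }
}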

Weka is a platform for applying machine learning methods to extract information from data. Weka stands for Waikato Environment for Knowledge Analysis. It is open-source software released under the GNU General Public License. For data analysis and the data mining process, the Weka tool is preferred, mainly because of features such as the following (Bifet, 2010):

  • Open Source Software: The tool is released as open-source software under the GNU GPL, and it is also licensed by the Pentaho Corporation, which uses Weka in its business intelligence platform.
  • Graphical User Interface: The tool has a GUI that gives straightforward access for completing machine learning projects.
  • Command Line Interface: For scripting jobs, the software also provides a command-line interface.

The data mining process consists of extracting information directly from database resources; in the course of this procedure, hidden information is also recovered. Data mining tools are used to predict future trends and behavioral patterns, enabling businesses to make informed decisions (Bouckaert, 2004).


Decision tables resemble neural networks and decision trees: they are a kind of classification algorithm used to predict data, and the model is induced in Weka alongside its other machine learning algorithms. A hierarchical table is used inside the decision table, and each entry of data is stored as key-value pairs; at a higher level of the tree, additional data attributes are stored in another table. The structure of a decision table resembles dimensional stacking. To validate the model and accommodate new attributes, a visualization strategy is applied, and various interaction types are used in the visualization schemes. For this reason, the decision table is considered a more valuable visualization technique than other, static schemes.
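As an illustration of this key-value organization, the sketch below (not Weka's internal code) stores each combination of selected attribute values as a key that maps to a user class; the attribute values and user labels are hypothetical.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TableSketch {
    public static void main(String[] args) {
        // Each key is a combination of selected attribute values; the value
        // is the user (class) most often observed with that combination.
        Map<List<String>, String> table = new HashMap<>();
        table.put(List.of("F12", "H03"), "U07");  // hypothetical entries
        table.put(List.of("F07", "H01"), "U13");

        // Lookups that match no entry fall back to the majority class.
        String user = table.getOrDefault(List.of("F12", "H01"), "U02");
        System.out.println(user);  // prints U02 (no matching entry)
    }
}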

Hypothesis

In light of the given conditions, the specified actions are carried out visually or graphically; this is what a decision table expresses. These tables correspond to programming language constructs such as switch-case statements and if-then-else conditions. Each decision is modeled individually against the variables: the condition values are combined, and the possible outcomes are predicted under the given constraints. According to those constraints, the actions to be performed are selected, and each entry of the key-value pairing is completed. When a condition is not relevant to a decision, a "don't care" symbol is used, so a value in a decision table can be blank or a hyphen; it means that the condition plays no role, or the entry is incomplete, in the decision-making process. Some decision tables use true or false values to represent the condition entries. Depending on whether every combination of conditions is covered, a table is considered balanced or incomplete (Kaluza, 2013).

The practical work applies the decision table rule learner to analyze the provided data and develop the user profiles, as sketched below; the resulting Weka output is then reproduced.
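The output below can be generated interactively in the Weka Explorer or programmatically through Weka's Java API. The following is a minimal sketch under the assumption that the e-mail pattern data sits in a file named email_pattern.arff with User_ID as its class attribute (both names are hypothetical); the option string mirrors the settings visible in the output: best-first forward search, stale search after 5 node expansions, and leave-one-out cross-validation for feature selection.

import weka.classifiers.Evaluation;
import weka.classifiers.rules.DecisionTable;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class ProfileDecisionTable {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("email_pattern.arff");  // hypothetical file
        data.setClassIndex(data.attribute("User_ID").index());

        DecisionTable table = new DecisionTable();
        // -X 1: leave-one-out cross-validation for feature selection;
        // -S:   best-first search, forward (-D 1), stale after 5 expansions (-N 5).
        table.setOptions(Utils.splitOptions(
                "-X 1 -S \"weka.attributeSelection.BestFirst -D 1 -N 5\""));
        table.buildClassifier(data);
        System.out.println(table);  // prints the rules and selected feature set

        // Evaluate on the training set, as in the listings below.
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(table, data);
        System.out.println(eval.toSummaryString(
                "\n=== Evaluation on training set ===\n", false));
        System.out.println(eval.toMatrixString("=== Confusion Matrix ==="));
    }
}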

Decision Table:

Number of training instances: 213

Number of Rules: 28

Non matches covered by Majority class.

Best first.

Start set: no attributes

Search direction: forward

Stale search after 5 node expansions

Total number of subsets evaluated: 45

Merit of best subset found:   73.709

Evaluation (for feature selection): CV (leave one out)

Feature set: 3,6,8,10,2

Time taken to build model: 0.11 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.05 seconds

Correctly Classified Instances         178               83.5681 %

Incorrectly Classified Instances        35               16.4319 %

Kappa statistic                          0.8265

Mean absolute error                      0.0732

Root mean squared error                  0.1675

Relative absolute error                 73.3946 %

Root relative squared error             75.0281 %

Total Number of Instances              213     

  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s   <-- classified as

 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  a = U02

  0 11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  b = U03

  0  0 11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  c = U05

  0  0  0 11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  d = U07


  0  0  0  0 11  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  e = U09

  0  0  0  0  0  4  0  0  0  0  0  3  4  0  0  0  0  0  0 |  f = U11

  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0 |  g = U04

  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0 |  h = U06

  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0 |  i = U08

  2  0  0  0  0  0  0  0  0 10  0  0  0  0  0  0  0  0  0 |  j = U10

  0  0  0  0  0  0  0  0  0  0 10  0  0  0  0  0  0  0  0 |  k = U01

  0  0  0  0  0  1  0  0  0  0  0  5  5  0  0  0  0  0  0 |  l = U12

  0  0  0  0  0  0  0  0  0  0  0  3  8  0  0  0  0  0  0 |  m = U13

  0  0  0  0  0  0  0  0  0  0  0  3  2  6  0  0  0  0  0 |  n = U14

  1  0  0  0  0  0  0  0  0  0  0  0  0  0 10  0  0  0  0 |  o = U15

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 11  0  0  0 |  p = U16

  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10 |  q = U17

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 11  0 |  r = U18

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 11 |  s = U19

Decision Table:

Number of training instances: 388

Number of Rules: 27

Non matches covered by Majority class.

Best first.

Start set: no attributes

Search direction: forward

Stale search after 5 node expansions

Total number of subsets evaluated: 43

Merit of best subset found:   67.784

Evaluation (for feature selection): CV (leave one out)

Feature set: 3, 2


Time taken to build model: 0.76 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.05 seconds 

Correctly Classified Instances         281               72.4227 %

Incorrectly Classified Instances       107               27.5773 %

Kappa statistic                          0.7079

Mean absolute error                      0.0654

Root mean squared error                  0.1565

Relative absolute error                 65.7274 %

Root relative squared error             70.1906 %

Total Number of Instances              388   

  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s   <-- classified as

 22  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  a = U03

  0 22  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  b = U05

  0  0 22  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  c = U07

  0  0  0 22  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  d = U09

  1  0  0  0  7  0  0  0  6  0  5  3  0  0  0  0  0  0  0 |  e = U11

  0  0  0  0  0 22  0  0  0  0  0  0  0  0  0  0  0  0  0 |  f = U04

  0  0  0  0  0  0 22  0  0  0  0  0  0  0  0  0  0  0  0 |  g = U06

  0  0  0  0  0  0  0 22  0  0  0  0  0  0  0  0  0  0  0 |  h = U08

  1  0  0  0  6  0  0  6  9  0  0  0  0  0  0  0  0  0  0 |  i = U10

  0  0  0  0  0  0  0  0  0 22  0  0  0  0  0  0  0  0  0 |  j = U02

  1  0  0  0  5  0  0  0  6  0  5  5  0  0  0  0  0  0  0 |  k = U12

  1  0  0  0  5  0  0  0  6  0  4  6  0  0  0  0  0  0  0 |  l = U13

  0  0  0  0  5  0  0  0  6  0  4  2  5  0  0  0  0  0  0 |  m = U14

  0  0  0  0  0  0  0  0  0  0  0  0  0 22  0  0  0  0  0 |  n = U16


  0  0  0  0  0  0  0  0  0  0  0  0  0  0 22  0  0  0  0 |  o = U17

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 21  0  0  0 |  p = U01

  0  0  0  0  5  0  0  0  6  0  5  5  0  0  0  0  0  0  0 |  q = U15

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  8  0 |  r = U18

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  8  0  0  0  0 |  s = U19

Decision Table:

Number of training instances: 383

Number of Rules: 28

Non matches covered by Majority class (Stahlbock, Crone & Lessmann, 2010).

Best first.

Start set: no attributes

Search direction: forward

Stale search after 5 node expansions

Total number of subsets evaluated: 49

Merit of best subset found:   72.585

Evaluation (for feature selection): CV (leave one out)

Feature set: 3, 2

Time taken to build model: 0.31 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.02 seconds 

Correctly Classified Instances         297               77.5457 %

Incorrectly Classified Instances        86               22.4543 %

Kappa statistic                          0.7621

Mean absolute error                      0.0642

Root mean squared error                  0.1541

Relative absolute error                 64.6374 %

Root relative squared error             69.181 %

Total Number of Instances              383     

=== Confusion Matrix ===

  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s   <-- classified as

 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  a = U05

  0 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  b = U07

  0  0 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  c = U09

  0  0  0 13  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0 |  d = U11

  0  0  0  0 28  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  e = U02

  0  0  0  0  0 28  0  0  0  0  0  0  0  0  0  0  0  0  0 |  f = U04

  0  0  0  0  0  0 28  0  0  0  0  0  0  0  0  0  0  0  0 |  g = U06


  0  0  0  0  0  0  0 28  0  0  0  0  0  0  0  0  0  0  0 |  h = U08

  3  0  0  2  0  0  0 14  8  0  0  0  1  0  0  0  0  0  0 |  i = U10

  0  0  0  0  0  0  0  0  0 18  0  0  0  0  0  0  0  0  0 |  j = U01

  0  0  0  0  0  0  0  0  0  0 18  0  0  0  0  0  0  0  0 |  k = U03

  0  0  0  6  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0 |  l = U12

  0  0  0  6  0  0  0  0  0  0  0 10 12  0  0  0  0  0  0 |  m = U13

  0  0  0  6  0  0  0  0  0  0  0  6  0  7  0  0  0  0  0 |  n = U14

  0  0  0  6  0  0  0  0  0  0  0 10  0  0  3  0  0  0  0 |  o = U15

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0 |  p = U16

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0 |  q = U17

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10  0 |  r = U18

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10  0  0 |  s = U19 

This project adopts an approach for developing user profiles using association rules. Association rules derive new rules from a data set by finding connections between its data items: they discover the frequent associations in the provided data and produce the frequent item sets. Association rules are very useful for decision making and effective marketing, and they serve a growing range of application areas, such as obtaining user profiles for web system personalization, extracting knowledge from software engineering data, and finding patterns in biological databases. They make it easy to evaluate the importance of attributes for classification, and they produce the most useful rules for prediction, discriminating between user profiles across the different values of the class attribute. A minimal sketch of this rule-mining step follows.
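Weka's Apriori implementation can mine such association rules directly from the nominal pattern data. The file name and thresholds below are assumptions for illustration; in practice the minimum support would be adjusted until enough well-supported rules emerge.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ProfileRules {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; Apriori requires nominal attributes.
        Instances data = DataSource.read("login_pattern.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(20);               // report the 20 best rules
        apriori.setLowerBoundMinSupport(0.1);  // require at least 10 % support
        apriori.setMinMetric(0.9);             // minimum confidence of 0.9
        apriori.buildAssociations(data);

        // A rule such as "Printer_ID=PR01 Host=H03 ==> User_ID=U07" would then
        // describe part of one user's characteristic usage pattern.
        System.out.println(apriori);
    }
}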

Analysis

The e-mail pattern data set contains 213 instances in total.

Correctly Classified Instances: 178 (83.5681 %)

Incorrectly Classified Instances: 35 (16.4319 %)

Observing the above figures, out of 213 instances, 178 (178 / 213 ≈ 83.57 %) are correctly classified, which is a good result, while 35 instances are incorrectly classified.

Results on Login Pattern

The login pattern data set contains 388 instances in total.

Correctly Classified Instances: 281 (72.4227 %)

Incorrectly Classified Instances: 107 (27.5773 %)

Observing the above figures, out of 388 instances, 281 (72.42 %) are correctly classified, which is an acceptable result, while 107 instances are incorrectly classified.

Results on Resource Pattern

The resource usage pattern data set contains 383 instances in total.

Correctly Classified Instances: 297 (77.5457 %)

Incorrectly Classified Instances: 86 (22.4543 %)

Observing the above figures, out of 383 instances, 297 (77.55 %) are correctly classified, which is a good result, while 86 instances are incorrectly classified.

According to the results, profiles for the e-mail pattern, resource usage pattern and login pattern are developed; they provide effective user profiles for a department while avoiding outliers and overfitting (Witten, Frank, Hall & Pal, 2017). The user profiles contain the login pattern, machine usage pattern, e-mail pattern, file access pattern, print usage pattern and program access pattern. The decision table is successfully determined and evaluates the total number of subsets in the user profiles. These results could be refined by adding, to each pattern, a list of words preferred by a specific user. Our goal was to integrate the prototype into an already existing personalization system. It could limit the effort required of a new user by getting her to the interesting part quickly, while still learning information useful for making good recommendations. We plan to apply information extraction techniques to discover information about a new user from the dialogue she has had with the agent. In addition, we are assessing the possibility of using ontologies to capture knowledge of user preferences, in order to obtain profiles that refer explicitly to the concepts of a standard ontology, and not only to a list of words.

Conclusion

This project successfully developed a profile for each user in a department. The provided data contains login and access data for all 20 users in the department. Based on this data, the users' profiles in the department are successfully developed using data mining security techniques.

References

Bifet, A. (2010). Adaptive stream mining. Amsterdam: IOS Press.

Bouckaert, R. (2004). Bayesian network classifiers in Weka. Hamilton, N.Z.: Dept. of Computer Science, University of Waikato.

Kaluza, B. (2013). Instant Weka How-to. Birmingham: Packt Publishing.

Stahlbock, R., Crone, S., & Lessmann, S. (2010). Data mining. New York: Springer.

Witten, I., Frank, E., Hall, M., & Pal, C. (2017). Data mining. Amsterdam: Morgan Kaufmann.