Assignment 3 Data Mining


 

Data mining and data warehouse

 IT446

 

Instructions:
·       You must submit two separate copies (one Word file and one PDF file) using the Assignment Template on Blackboard via the allocated folder. These files must not be in compressed format.

·       It is your responsibility to check and make sure that you have uploaded both files correctly.

·       Zero mark will be given if you try to bypass the SafeAssign (e.g. misspell words, remove spaces between words, hide characters, use different character sets or languages other than English or any kind of manipulation).

·       Email submission will not be accepted.

·       You are advised to make your work clear and well-presented. This includes filling your information on the cover page.

·       You must use this template; failure to do so will result in a zero mark.

·       You MUST show all your work, and text must not be converted into an image, unless specified otherwise by the question.

·       Late submission will result in ZERO mark.

·       The work should be your own, copying from students or other resources will result in ZERO mark.

·       Use Times New Roman font for all your answers.

 

     
Name: ###

 

CRN: ###

  ID: ###

 

     

 

 

1 Mark

Learning Outcome(s): LO3

Carry out recent data mining techniques and applications.

 

 

 

 

 

 

Question One

Define neural network pruning. Does pruning impose a tradeoff between model efficiency and quality?

Answer:

 

Neural network pruning is a model-compression method that removes weights or nodes (neurons) from a trained model. Model compression aims to reduce the size of models. Pruning can be carried out as weight-based pruning or node-based pruning.

Yes, pruning imposes a tradeoff between model efficiency and quality.

Justification:

Pruning increases efficiency (a smaller, faster model) while at the same time typically decreasing quality (accuracy).

In general, consider a neural-network classifier that labels a credit-card transaction as normal or fraudulent. A neural network has many layers, and each layer contains many neurons. More neurons are part of what makes deep learning models successful: they allow the model to generate more accurate outputs (higher accuracy).

However, larger models take more storage space and more time to train, which implies a high computational cost. Deploying many layers and nodes also requires more hardware, which is expensive. As a result, overall performance is poor.

In contrast, a pruned model avoids these problems, but it can reach a lower level of accuracy (prediction quality), because some of the information carried by the removed weights or neurons is lost.
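To make the weight-based variant concrete, here is a minimal, framework-free sketch of magnitude-based pruning (the function name and the threshold value are illustrative, not from any specific library): weights whose absolute value falls below a threshold are set to zero.

```python
# Minimal sketch of magnitude-based weight pruning (illustrative):
# weights with absolute value below a threshold are zeroed out.

def prune_weights(weights, threshold):
    """Return a pruned copy of a weight matrix (list of rows) and the fraction removed."""
    pruned = [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]
    total = sum(len(row) for row in weights)
    removed = sum(1 for row in pruned for w in row if w == 0.0)
    return pruned, removed / total

weights = [[0.9, -0.02, 0.4],
           [0.01, -0.7, 0.03]]
pruned, sparsity = prune_weights(weights, threshold=0.1)
print(pruned)    # [[0.9, 0.0, 0.4], [0.0, -0.7, 0.0]]
print(sparsity)  # 0.5 -- half the weights removed
```

The tradeoff discussed above shows up directly here: raising the threshold removes more weights (more efficiency) but discards more of the learned information (less accuracy).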

References

https://towardsdatascience.com/pruning-neural-networks-1bb3ab5791f9

https://arxiv.org/pdf/2003.03033.pdf

 

1.5 Marks

Learning Outcome(s): LO4

Apply a wide range of clustering, estimation, prediction, and classification algorithms

 

 

 

 

 

Question Two

Match the most appropriate feature descriptions to the clustering techniques (partitioning, hierarchical, density-based, and grid-based).

Feature of approach -> Clustering technique

Clusters are dense regions of objects in space that are separated by low-density regions. May filter out outliers. -> Density-based methods

Clustering decomposition at multiple levels. May incorporate other techniques such as micro-clustering or considering object linkages. -> Hierarchical methods

Effective for small to medium-size data sets. May use the mean or a medoid to represent the cluster center. -> Partitioning methods

Uses a multiresolution grid data structure. Fast processing time. -> Grid-based methods
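To make the partitioning row concrete, below is a tiny pure-Python k-means sketch (the data points and starting centers are made up for illustration) showing a partitioning method that represents each cluster center by the mean of its members:

```python
# Tiny 1-D k-means sketch (illustrative data): a partitioning method
# that represents each cluster center by the mean of its members.

def kmeans_1d(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], centers=[0.0, 5.0])
print(centers)  # roughly [1.0, 9.5]
```

A k-medoids variant would instead pick an actual data point as each center, matching the "mean or medoid" wording in the table.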

 

References

https://images-na.ssl-images-amazon.com/images/G/01/books/stech-ems/DataMining-ch-9780123814791._V155175544_.pdf

2 Marks
Learning Outcome(s): LO3

Carry out recent data mining techniques and applications.

 

 

 

 

 

 

Question Three

Given the transaction list and items’ price list in Table 1 and Table 2 respectively:

Table 1: Item list of names and prices.

Item ID Item name Price (SAR)
A Milk 6
B Bread 1
C Eggs 11
D Soda 3
E 1kg orange 5
F 1kg tomato 2
G 1kg banana 7

Table 2: Transaction list.

Transaction ID Items in transaction
1 A, B, F
2 A, C, F
3 A, D, E, F, G
4 B, C, D
5 C, D
6 C, F, G
7 C, D, F, G

 

  • Use the frequent pattern growth (FP-growth) algorithm to get all frequent patterns satisfying the following constraints:
  • minimum support = 2
  • average(price) >= 7

Answer:
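Before building the FP-tree by hand, the expected result can be cross-checked with a brute-force enumeration (this is not FP-growth itself, just a sanity check over the small data in Tables 1 and 2): enumerate all itemsets, keep those with support >= 2, then filter by the average-price constraint.

```python
# Brute-force cross-check (not FP-growth itself): enumerate itemsets
# over the transactions in Table 2, keep those with support >= 2, then
# filter by average(price) >= 7 using the prices in Table 1.
from itertools import combinations

price = {'A': 6, 'B': 1, 'C': 11, 'D': 3, 'E': 5, 'F': 2, 'G': 7}
transactions = [set('ABF'), set('ACF'), set('ADEFG'), set('BCD'),
                set('CD'), set('CFG'), set('CDFG')]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = []
items = sorted(price)
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = set(combo)
        if support(s) >= 2:
            frequent.append(s)

satisfying = [s for s in frequent
              if sum(price[i] for i in s) / len(s) >= 7]
print(satisfying)  # the patterns {C}, {G}, {C,D}, {C,G}
```

Any FP-growth derivation done by hand should therefore end with {C}, {G}, {C, D}, and {C, G} as the frequent patterns that meet both constraints.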

 

 

  • Show how you can find all frequent patterns while eliminating some candidates along the way, by leveraging the fact that the constraint average(price) >= 7 is strongly convertible. Give all the frequent patterns satisfying the same constraints above after converting and pushing the constraint.

 

 

 

1.5 Marks
Learning Outcome(s): LO4

Apply a wide range of clustering, estimation, prediction, and classification algorithms.

 

 

 

 

Question Four

A classification model may change dynamically along with the changes of training data streams. This is known as concept drift. Explain why decision tree induction may not be a suitable method for such dynamically changing data sets. Is naive Bayesian a better method on such data sets? Explain your reasoning.

Answer:

Concept drift changes the data on which a decision tree was built, and the tree cannot adapt to that change once it is constructed. The reason lies in the structure of decision tree induction.

Structure of decision tree induction.

A decision tree induction is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label.

Example:

Consider, for example, the classic decision tree for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer. Each internal node represents a test on an attribute, and each leaf node represents a class.

From such a tree, rules (in if-then form) are derived, such as:

If (age = young AND the customer is a student), then class = yes, i.e., the customer buys the computer.

The rules generated by a decision tree correspond to the specific data set on which the classifier was trained. When new data arrives, a new training set must be constructed, which yields a different tree and, consequently, a completely different set of rules. Therefore, decision tree induction cannot adapt to a data stream in real time (to the changing behavior of the data).

 

Yes, naive Bayes performs better on such data sets (dynamically changing data sets that exhibit concept drift). The reason lies in the structure of the naive Bayes classifier.

Structure of the naive Bayes classifier

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple.

Because of this, when new data arrives on top of the old data (i.e., when concept drift occurs), the naive Bayes classifier can cope: the class priors and the per-feature probabilities are simple counts that can be updated incrementally, so the model adapts to changes in real time.
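The adaptability argument can be sketched in code. The following is a minimal illustrative implementation (class and attribute names are made up, and the data echoes the buy_computer example): the whole model is just counts, so each instance arriving from the stream updates it in place, with no retraining from scratch.

```python
# Minimal sketch of an incrementally updatable naive Bayes (illustrative):
# the model is only class counts and per-class feature counts, so each new
# instance from the stream updates it in place -- no retraining needed.
from collections import defaultdict

class StreamingNaiveBayes:
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (class, feature, value) -> count

    def update(self, features, label):
        """Absorb one labelled instance from the stream."""
        self.class_counts[label] += 1
        for f, v in features.items():
            self.feature_counts[(label, f, v)] += 1

    def predict(self, features):
        """Pick the class maximizing P(class) * prod P(feature|class), add-one smoothed."""
        best, best_score = None, -1.0
        total = sum(self.class_counts.values())
        for c, n in self.class_counts.items():
            score = n / total
            for f, v in features.items():
                score *= (self.feature_counts[(c, f, v)] + 1) / (n + 2)
            if score > best_score:
                best, best_score = c, score
        return best

nb = StreamingNaiveBayes()
nb.update({'age': 'young', 'student': 'yes'}, 'buys')
nb.update({'age': 'old', 'student': 'no'}, 'not_buys')
nb.update({'age': 'young', 'student': 'no'}, 'buys')
print(nb.predict({'age': 'young', 'student': 'yes'}))  # buys
```

Contrast this with a decision tree: there is no cheap, general way to splice a new instance into an existing tree, which is exactly the weakness described above.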

References

http://alexisbondu.free.fr/blog/wp-content/uploads/surveyStream2015.pdf

 

 

 

 

 
