
Advanced Data Analytics


Assignment: Data Mining in Action


  • Due Dec 12 by 23:59
  • Points 100
  • Submitting a file upload
  • Available Oct 4 at 0:00 – Dec 19 at 23:59 3 months


This assignment is a practical data analytics project. You will act as a data scientist at a consulting company and make predictions on a dataset. The dataset can be found below.

You need to build classifiers using the techniques covered in the lectures to predict the class attribute. At the very minimum, you need to produce a classifier for each method we have covered. However, you should be able to get a better mark if you explore the problem thoroughly (as you should in industry): preprocessing the data, looking at different methods, choosing their best parameter settings and identifying the best classifier in a principled and explainable way. If you choose to use KNIME and you show ‘expert’ use (i.e. exploring multiple classifiers with different settings, choosing the best in a principled way and being able to explain why you built the model the way you did), this will attract a better mark. If you choose to use R or Python to build, optimise and test different models, this will also attract better marks.
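As one possible illustration of choosing parameter settings in a principled way, the sketch below runs a cross-validated grid search with scikit-learn. It is only a sketch under stated assumptions: the data is synthetic, and the choice of k-nearest neighbours, the parameter grid and the accuracy metric are all assumptions that are not taken from the brief.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the assignment dataset is not used here.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: every combination is scored with 5-fold cross-validation.
param_grid = {"n_neighbors": [3, 5, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# The best setting and its mean cross-validated accuracy can be reported
# alongside the full grid (search.cv_results_) to justify the choice.
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern can be repeated for each classifier type, so that the final choice rests on comparable cross-validated scores rather than on a single run.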

You need to write a report describing how you solved the problem and the results you found. See below for requirements.

Kaggle Competition

For this assignment, you will use the Kaggle website to submit your assignment solution. The report itself will be submitted through Canvas. Use the competition link to sign up to the competition on Kaggle. You need to use the link to access the project because it is a private project for this course only. Sharing the competition with anyone outside the subject is strictly prohibited. To submit to Kaggle, you will need to create a Kaggle login using your email address and set your display name (in My Profile -> Edit Profile -> Display Name) to UTS_xxxx, where xxxx is your full name. Submissions that do not meet these criteria will not be considered.

For direct access to the submission page on Kaggle, use the competition's submission link.


Below you will find three files: a training set for training your model (it contains the target values), a test set for testing the model (it does not have the target values; you need to predict them) and a submission sample, which shows what the file submitted to Kaggle should look like. In particular, you will need to set the column names in your submission file correctly, that is, “Row ID” and “Qualified”.
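As a rough illustration of the submission format, the sketch below writes a two-column CSV with the headers “Row ID” and “Qualified” using Python's standard library. The function name `write_submission`, the output file name and the example IDs and predictions are assumptions made for illustration; the provided sample submission file remains the reference for the exact format.

```python
import csv

def write_submission(row_ids, predictions, path):
    """Write predictions to a CSV file with the headers "Row ID" and "Qualified"."""
    if len(row_ids) != len(predictions):
        raise ValueError("row_ids and predictions must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Row ID", "Qualified"])  # column names given in the brief
        for row_id, pred in zip(row_ids, predictions):
            writer.writerow([row_id, int(pred)])  # 0 = not qualified, 1 = qualified

# Hypothetical usage: these IDs and predictions are made up.
write_submission(["Row0", "Row1", "Row2"], [1, 0, 1], "submission.csv")
```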


Assessment is real-time. This means that as soon as you submit the file, Kaggle will assess the performance of your classifier and provide you with the result. You can submit multiple times, but Kaggle limits the number of submissions you can make per day.

Do not treat the performance measure reported by Kaggle as your test error for the final competition, and do not optimise to it. Kaggle has two measures: a public measure, which it reports to you, and a private measure, which it keeps hidden. Instead, develop several models and estimate the test error yourself before submitting to Kaggle. Remember that your estimate of test error is just that: an estimate. The actual private measure will probably be a little different.
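One common way to estimate test error yourself is k-fold cross-validation. The sketch below is a minimal example with scikit-learn, assuming synthetic data, two arbitrarily chosen model types and accuracy as the metric; none of these choices comes from the brief, and the metric Kaggle uses for this competition may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (the real data is not used here).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hypothetical candidate models to compare.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold is held out once for scoring.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much the estimate itself may vary, which is worth keeping in mind when comparing it with the public score.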


Here are the training dataset, test dataset and submission sample:

Assignment-HousingDataset.csv (the training set)

Assignment-UnknownDataset.csv (the test set; note that it does not have the Qualified attribute)

Assignment-Kaggle-Submission-Random-Sample.csv (a sample submission file in the correct format, making random predictions)

Dataset_Attribute-Description-1.pdf (a brief description of the attributes of the dataset)

Classification task

Build a classifier that predicts the “QUALIFIED” attribute: 0 if the instance is not qualified (U) and 1 if it is qualified (Q). You can apply different data pre-processing steps and transformations (e.g. grouping values of attributes, converting them to binary, etc.), provided you explain why you have chosen them. You may need to split the training set into training, validation and test sets to set the parameters accurately and evaluate the quality of the classifier.
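The target encoding and the three-way split can be sketched as follows with pandas and scikit-learn. This is only an illustration: the data frame is made up, the feature columns are hypothetical, it assumes the target is stored as the strings "U" and "Q", and the 60/20/20 proportions are an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame standing in for the training file; only the
# "QUALIFIED" column name and its U/Q values come from the brief.
df = pd.DataFrame({
    "feature_a": range(20),                    # hypothetical attribute
    "feature_b": [i % 3 for i in range(20)],   # hypothetical attribute
    "QUALIFIED": ["Q", "U"] * 10,
})

# Encode the target: 0 = not qualified (U), 1 = qualified (Q).
y = df["QUALIFIED"].map({"U": 0, "Q": 1})
X = df.drop(columns=["QUALIFIED"])

# Hold out 20% as a test set, then split the remainder 75/25 into
# training and validation sets (60/20/20 overall), stratified by class.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Parameters would then be tuned against the validation set, with the held-out test set used once at the end to estimate the quality of the chosen classifier.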

You can use KNIME to build classifiers. Feel free to use any other tool, such as R, Weka, Python, Orange, scikit-learn or other software. If you do, please explain more about your classifier. You don’t need to limit yourself to the classifiers we used in class, but if you use other classifiers you need to describe them in your report. In either case, make sure you are producing valid results.

A hint: usually it’s not a case of having a ‘better’ classifier that will produce good results. Rather, it’s a case of identifying or generating good features that can be used to solve the problem.
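To make the hint concrete, the sketch below derives two new features from raw columns with pandas. Every column name and value here is hypothetical and chosen only for illustration; the actual attributes are listed in the dataset description, and whether derived features like these help would need to be checked by validation.

```python
import pandas as pd

# Made-up rows; every column name here is hypothetical.
df = pd.DataFrame({
    "year_built": [1950, 1990, 2010],
    "sale_year": [2015, 2016, 2018],
    "living_area": [120.0, 80.0, 200.0],
    "rooms": [6, 4, 8],
})

# Derived features can sometimes expose a relationship that the raw
# columns only carry implicitly.
df["age_at_sale"] = df["sale_year"] - df["year_built"]
df["area_per_room"] = df["living_area"] / df["rooms"]

print(df[["age_at_sale", "area_per_room"]])
```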

Assignment report and submission


Your report should include the following information:

  • A description of the data mining problem;
  • The data preprocessing and transformations you did (if any);
  • How you went about solving the problem;
  • Classification techniques used and summary of the results and parameter settings;
  • The best classifier that you selected – the type, its performance, how it solved the problem (if it makes sense for that type of classifier), and reasons for selecting it;
  • Reflection: One page reflecting on your learning in the assignment. What did you learn about data mining and yourself as a result of doing the assignment? How would you approach the problem differently if you were to do it again? The more incisive and thoughtful your reflection is, the better your mark.

The report should be a PDF (preferred) or MS Word document, with the filename xxxx.pdf or xxxx.docx, where xxxx is your full name.

The report should be around 15-20 pages, in 11 or 12 point Times or Arial font.

Submit your report using the link at the bottom of this page, after the Discussion.


The predictions on the test set should be submitted as a .csv file to the Kaggle competition.

On average each student will require between 24 and 36 hours to complete this assignment.


This assignment is assessed as individual work. Review the assessment criteria and marking scheme below. 

Assignment_Marking-Criteria-1.pdf

Marks and feedback

You will be notified when your assignment has been marked and you will be able to view your mark and feedback in the Marks section in the left-hand navigation. 
