Get CE888 Imbalanced Datasets Assignment Help
Imbalanced datasets are those in which the classes are not represented equally, which can lead classifiers to disregard the minority class. Typical ways of dealing with this issue involve resampling the dataset (either upsampling the minority class/es or downsampling the majority class/es). In this project, you will test a new approach to dealing with imbalanced datasets, based on a mixture of supervised and unsupervised learning.
Tasks for CE888 Imbalanced Datasets Assignment
- Choose 3 datasets from UCI, Kaggle, etc. You will make them imbalanced, so make sure they are not imbalanced from the beginning. Small imbalances (e.g., 55% of one class) are fine. Load, inspect, and clean the datasets. For each of them, create three versions/surrogates (in addition to the original one) by subsampling one of the classes:
- Low imbalance (65%)
- Medium imbalance (75%)
- High imbalance (90%)
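One way to build the three surrogates is sketched below. It assumes binary labels, reads each percentage as the share of the majority class after subsampling, and keeps every majority row while dropping minority rows at random. The function name `subsample_indices` and its parameters are illustrative, not part of the brief.

```python
import random

def subsample_indices(labels, minority_label, majority_fraction, seed=0):
    """Return sorted row indices so that non-minority rows make up roughly
    `majority_fraction` of the kept rows (binary labels assumed)."""
    if not 0.5 <= majority_fraction < 1.0:
        raise ValueError("majority_fraction must be in [0.5, 1)")
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Solve n_maj / (n_maj + n_min) = majority_fraction for n_min.
    n_min = round(len(majority) * (1.0 - majority_fraction) / majority_fraction)
    if n_min > len(minority):
        raise ValueError("not enough minority samples for this ratio")
    rng = random.Random(seed)  # fixed seed so the surrogate is reproducible
    kept_minority = rng.sample(minority, n_min)
    return sorted(majority + kept_minority)

# Toy balanced label vector standing in for a real dataset (assumption).
labels = [0] * 90 + [1] * 90
for frac in (0.65, 0.75, 0.90):
    idx = subsample_indices(labels, minority_label=1, majority_fraction=frac)
    share = sum(labels[i] == 0 for i in idx) / len(idx)
    print(frac, len(idx), round(share, 3))
```

The returned indices can be used to slice both the feature matrix and the label vector, so the original dataset stays untouched.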
- To establish a baseline, perform stratified cross-validation on each of the datasets and their surrogates and train a random forest. Report baseline results using appropriate metrics.
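A minimal sketch of the baseline step, assuming scikit-learn is available. The synthetic data from `make_classification` only stands in for one of your surrogates, and balanced accuracy and F1 are just two possible choices of "appropriate metrics"; the brief does not prescribe them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

def baseline_scores(X, y, n_splits=10, seed=0):
    """Return an array of shape (n_splits, 2): balanced accuracy and F1 per fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        rows.append((
            balanced_accuracy_score(y[test_idx], pred),
            # Label 1 is treated as the minority (positive) class here.
            f1_score(y[test_idx], pred, pos_label=1, zero_division=0),
        ))
    return np.array(rows)

# Synthetic stand-in for a 75%/25% surrogate (assumption, not real data).
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.75, 0.25], random_state=0)
scores = baseline_scores(X, y)
print("balanced accuracy: mean=%.3f sd=%.3f" % (scores[:, 0].mean(), scores[:, 0].std()))
print("minority-class F1: mean=%.3f sd=%.3f" % (scores[:, 1].mean(), scores[:, 1].std()))
```

Plain accuracy is usually a poor choice here, since a classifier that always predicts the majority class already scores 90% on the high-imbalance surrogate.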
- Create 10 stratified folds (to ensure the imbalance ratio remains the same in each fold) for each of the datasets.
- Using the data of 9 of these folds:
- Using the Elbow method and the Silhouette method, identify the number of clusters in the dataset. There should be some level of agreement between these indices (or at least you should be able to identify lower and upper bounds).
- Run k-means on the dataset using the identified number of clusters. As the final clustering, select the run with the lowest output criterion.
- For each cluster, identify its centroid and the number of samples of the minority class in that cluster (as per their labels). Save this information.
- Train a random forest for each of the clusters that contains samples from more than one class (i.e., if a cluster only has samples for one of the classes, you don’t need to train a classifier).
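The clustering and per-cluster training steps above could be sketched as follows, assuming scikit-learn. Here the "lowest output criterion" is taken to mean the lowest k-means inertia, which is what `KMeans` keeps when `n_init` is greater than 1. The helper names `choose_k` and `fit_cluster_models` and the dictionary layout are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score

def choose_k(X, k_values=range(2, 7), seed=0):
    """Collect inertia (for an elbow plot) and silhouette score for each k."""
    inertias, silhouettes = {}, {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        inertias[k] = km.inertia_
        silhouettes[k] = silhouette_score(X, km.labels_)
    # Pick the k with the best silhouette; inspect inertias by eye for the elbow.
    best_k = max(silhouettes, key=silhouettes.get)
    return best_k, inertias, silhouettes

def fit_cluster_models(X, y, k, minority_label=1, seed=0):
    """Cluster the training folds, record per-cluster info, train where needed."""
    # n_init=10 runs 10 initialisations and keeps the lowest-inertia result.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    clusters = []
    for c in range(k):
        mask = km.labels_ == c
        classes = np.unique(y[mask])
        info = {
            "centroid": km.cluster_centers_[c],
            "n_minority": int(np.sum(y[mask] == minority_label)),
            "model": None,   # set only for clusters with more than one class
            "label": None,   # set only for single-class clusters
        }
        if len(classes) > 1:
            info["model"] = RandomForestClassifier(
                n_estimators=50, random_state=seed).fit(X[mask], y[mask])
        else:
            info["label"] = classes[0]
        clusters.append(info)
    return km, clusters

# Synthetic stand-in for the 9 training folds (assumption, not real data).
X, y = make_classification(n_samples=300, n_features=6,
                           weights=[0.75, 0.25], random_state=0)
best_k, inertias, silhouettes = choose_k(X)
km, clusters = fit_cluster_models(X, y, best_k)
print(best_k, [c["n_minority"] for c in clusters])
```

Clustering uses the features only; the labels come in afterwards, when counting minority samples and deciding which clusters need a classifier.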
- Given a sample xi from the unseen fold (the one fold left out when the other 9 were used for clustering and training):
- Assign xi to its closest cluster.
- If this cluster has only instances of one class, assign to xi that label. Otherwise, use the model trained with data from that cluster to assign a label to xi.
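The prediction rule for an unseen sample can be sketched in plain Python. The cluster records below (a centroid plus either a fixed label or a fitted model) are a hypothetical layout, and `ConstantModel` is only a stand-in for a trained per-cluster random forest so that the example runs on its own.

```python
import math

class ConstantModel:
    """Stand-in for a per-cluster classifier; any object with predict() works."""
    def __init__(self, label):
        self.label = label

    def predict(self, rows):
        return [self.label for _ in rows]

def predict_one(x, clusters):
    """Assign x to its nearest centroid, then label it via that cluster."""
    nearest = min(clusters, key=lambda c: math.dist(x, c["centroid"]))
    if nearest["model"] is None:
        # Single-class cluster: every member shares this label.
        return nearest["label"]
    # Mixed cluster: defer to the classifier trained on that cluster's data.
    return nearest["model"].predict([x])[0]

clusters = [
    {"centroid": (0.0, 0.0), "label": 0, "model": None},
    {"centroid": (5.0, 5.0), "label": None, "model": ConstantModel(1)},
]
print(predict_one((0.5, -0.2), clusters))  # nearest to the single-class cluster -> 0
print(predict_one((4.0, 6.0), clusters))   # nearest to the mixed cluster -> 1
```

Euclidean distance is used because it matches the distance k-means itself minimises; if you scale features before clustering, apply the same scaling to the unseen sample.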
- Repeat the above so that each of the 10 folds serves once as the unseen fold (as in cross-validation), and present the average and standard deviation of the results for each of the datasets and their surrogates using appropriate metric/s.
- Compare your results with the random forest baseline results. A boxplot of the cross-validation results for each method should help you decide which method is best under which conditions. Are the results significantly better with the new method (e.g., as determined by a permutation test)? How does the data imbalance affect the results?
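For the significance question, one common option is a paired sign-flip permutation test on the per-fold scores, sketched below with the standard library only. It assumes both methods were evaluated on the same 10 folds. The score lists are made-up placeholders that show the call pattern; they are not results.

```python
import random
import statistics

def paired_permutation_test(a, b, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-fold scores."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Placeholder per-fold scores (illustrative only, NOT real results).
baseline = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.67, 0.72, 0.71]
clustered = [0.74, 0.73, 0.70, 0.75, 0.72, 0.74, 0.71, 0.70, 0.76, 0.73]

print("baseline  mean=%.3f sd=%.3f" % (statistics.mean(baseline), statistics.stdev(baseline)))
print("clustered mean=%.3f sd=%.3f" % (statistics.mean(clustered), statistics.stdev(clustered)))
print("p-value:", paired_permutation_test(clustered, baseline))
```

Cross-validation folds share training data, so the fold scores are not fully independent; treat the p-value as indicative rather than exact, and read it alongside the boxplots.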
Do you Need Assignment Help with CE888 Imbalanced Datasets?
Students can get help with the deliverables for the CE888 Imbalanced Datasets assignment from the Punjab Assignment Help team, which offers 100% plagiarism-free content, on-time delivery, and affordable prices. Our experts have deep knowledge of the Professional Communications subject. Contact us today for assignment help and get a winning edge over fellow students. Contact us at Punjabassignmenthelp@gmail.com or call +918053884564.