This example is an R Notebook on the Titanic data set, which is widely treated as a Data Science 101 data set. The notebook provides a good exploratory data analysis and also covers feature engineering, missing data imputation and modeling. It is also a Kaggle Kernel, and Kaggle Kernels are a very good source for this kind of example R code.
This is an early Exploratory Data Analysis for the “Personalized Medicine: Redefining Cancer Treatment” challenge. It uses ggplot2 and the tidyverse tools to study and visualise the structures in the data. The data comes in 4 different files, two csv files and two text files:
training/test variants: These are csv catalogues of the gene mutations together with the target value Class, which is the (manually) classified assessment of the mutation. The feature variables are Gene, the specific gene where the mutation took place, and Variation, the nature of the mutation. The test data of course doesn’t have the Class values; these are what we have to predict. Each of these two files is linked through an ID variable to a corresponding text file, namely:
training/test text: These contain an extensive description of the evidence that was used (by experts) to manually label the mutation classes.
The text information holds the key to the classification problem and will have to be understood/modelled well to achieve a useful accuracy.
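The link between the variants and text files through the ID variable can be sketched as a simple join. This is a minimal illustration in plain Python rather than the notebook's R; the rows, gene names and evidence strings below are invented for the example, and the real file layout may differ.

```python
import csv
import io

# Toy stand-in for a variants file. The rows are invented for illustration
# and are not taken from the competition data.
VARIANTS_CSV = """ID,Gene,Variation,Class
0,GENE_A,Truncating Mutations,1
1,GENE_B,W802*,2
2,GENE_B,Q249E,2
"""

# Toy stand-in for the text file: ID -> evidence text (also invented).
TEXTS = {
    "0": "Evidence text for the first mutation ...",
    "1": "Evidence text for the second mutation ...",
    "2": "Evidence text for the third mutation ...",
}


def join_on_id(variants_csv, texts):
    """Attach the evidence text to each variant row via the shared ID."""
    rows = []
    for row in csv.DictReader(io.StringIO(variants_csv)):
        merged = dict(row)
        # Rows without a matching text get an empty string, like a left join.
        merged["Text"] = texts.get(row["ID"], "")
        rows.append(merged)
    return rows


joined = join_on_id(VARIANTS_CSV, TEXTS)
for row in joined:
    print(row["ID"], row["Gene"], row["Variation"], row["Class"], len(row["Text"]))
```

Keeping every variant row even when no text matches mirrors a left join, which makes missing descriptions easy to spot before modelling.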
Instacart is an internet-based grocery delivery service with the slogan “Groceries Delivered in an Hour”. The purpose of this exercise is to analyze trends in customer buying patterns on Instacart and to suggest combinations of products that can be sold together under various offers.
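One simple way to look for products that could be sold together is to count how often pairs of products appear in the same order. The sketch below is in plain Python rather than the notebook's R, uses invented orders, and is not necessarily the method the notebook applies (it may rely on association rules instead).

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders; each inner list is one customer's basket.
ORDERS = [
    ["bananas", "milk", "bread"],
    ["bananas", "milk"],
    ["bread", "butter"],
    ["bananas", "milk", "butter"],
]


def pair_counts(orders):
    """Count how many orders contain each unordered pair of products."""
    counts = Counter()
    for basket in orders:
        # sorted(set(...)) counts each pair once per order and makes
        # (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(ORDERS)
for pair, n in counts.most_common(3):
    # Support is the share of all orders that contain the pair.
    print(pair, n, f"support={n / len(ORDERS):.2f}")
```

Pairs with high support are natural candidates for a combined offer, although raw counts favour products that are popular on their own.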
Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.
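A question like ‘compare average monthly income by education and attrition’ amounts to a grouped average. The following is a minimal sketch in plain Python rather than R; the column names and values are assumptions made for the example, not rows from the IBM data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical employee rows; column names and values are assumed.
EMPLOYEES = [
    {"Education": "Bachelor", "Attrition": "Yes", "MonthlyIncome": 4000},
    {"Education": "Bachelor", "Attrition": "No", "MonthlyIncome": 5000},
    {"Education": "Bachelor", "Attrition": "No", "MonthlyIncome": 6000},
    {"Education": "Master", "Attrition": "Yes", "MonthlyIncome": 5500},
    {"Education": "Master", "Attrition": "No", "MonthlyIncome": 7500},
]


def average_income(rows):
    """Average MonthlyIncome for each (Education, Attrition) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Education"], row["Attrition"])].append(row["MonthlyIncome"])
    return {key: mean(values) for key, values in groups.items()}


averages = average_income(EMPLOYEES)
for (education, attrition), income in sorted(averages.items()):
    print(education, attrition, income)
```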
Extensive Exploratory Data Analysis for the Google Text Normalization Challenge - English Language competition with tidy R, ggplot2, and tidytext.
The aim of this challenge is to “automate the process of developing text normalization grammars via machine learning” (see the challenge description). Text normalisation describes the process of transforming language into a specific, self-consistent grammar system with well-defined rules. Here, the task is to convert “written expressions into appropriate ‘spoken’ forms”.
The data comes in the shape of two files: ../input/en_train.csv and ../input/en_test.csv. Each row contains a single language element (such as words, letters, or punctuation) together with its associated identifiers. The evaluation metric is the total percentage of correctly translated tokens.
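The evaluation metric described above can be sketched as a token-level accuracy. This is an illustration in plain Python rather than R; it assumes that “correctly translated” means an exact string match per token, and the example tokens are invented.

```python
def token_accuracy(predicted, expected):
    """Percentage of tokens whose predicted form equals the expected form."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    # Assumption: a token counts as correct only on an exact string match.
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)


# Invented example tokens: one of the four predictions is wrong.
expected = ["twenty twenty", "dollars", ".", "the"]
predicted = ["twenty twenty", "dollar", ".", "the"]

score = token_accuracy(predicted, expected)
print(f"{score:.1f}% of tokens correct")
```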