Do A Data Science Project in 10 Days

Appendix: The R code of the entire project

The code for the entire project is stored in six files:

  1. TitanicDataAnalysis1_Understand_Data.R
  2. TitanicDataAnalysis2_Data_Preprocess.R
  3. TitanicDataAnalysis3_Model_Construction.R
  4. TitanicDataAnalysis4_Model_Cross_Validation.R
  5. TitanicDataAnalysis5_Model_Fine_Tune.R
  6. TitanicDataAnalysis6_Analyse_Report.R

They are also available online.