ML Kaggle Titanic Problem using LogisticRegrission

Last update: Oct 23, 2022

Overview

-ML-Kaggle-Titanic-Problem-using-LogisticRegrission

Here you will find a solution to the Titanic problem on Kaggle, with comments and step-by-step code.

Problem Overview

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

Table of Contents

Analyze and visualize the dataset
Clean and prepare the dataset for our ML model
Build and train our model
Calculate the accuracy of the model
Prepare the submission file to submit to Kaggle

Load & Analyze Our Dataset

First, we read the data from the CSV files

import pandas as pd

data_train = pd.read_csv('titanic/train.csv')
data_test = pd.read_csv('titanic/test.csv')

Visualize the given data

print(data_train.head())

   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S
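The `NaN` entries in the preview suggest that some columns have missing values. A minimal sketch of how one might count them per column, using a small hypothetical frame (the values are illustrative, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data_train (illustrative values only)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Embarked': ['S', 'C', 'S', None],
})

# Count the missing entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Running the same `isnull().sum()` call on `data_train` would show which columns need filling or dropping.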

## Note

```sh
The Survived column is what we’re trying to predict. We call this column the target, and the remaining columns are called features.
```

### Count the number of survivors and deaths

```py
data_train['Survived'].value_counts()
# (342 Survived) | (549 not survived)
```
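The two counts also give a useful baseline. The sketch below rebuilds the counts reported above as a Series (rather than loading the CSV) and converts them to proportions; the variable names are only for illustration.

```python
import pandas as pd

# Counts reported above: 549 did not survive (0), 342 survived (1)
counts = pd.Series({0: 549, 1: 342}, name='Survived')

# Convert the counts to proportions of the training set
proportions = counts / counts.sum()
survival_rate = proportions.loc[1]

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = proportions.max()

print(round(survival_rate, 3))      # 0.384
print(round(baseline_accuracy, 3))  # 0.616
```

Any model worth keeping should beat that majority-class baseline on held-out data.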

Plot the number of survivors and deaths

plt.figure(figsize=(5, 5))
plt.bar(list(data_train['Survived'].value_counts().keys()), (list(data_train['Survived'].value_counts())),
     color=['r', 'g'])

Analyze the age distribution

plt.figure(figsize=(5, 7))
plt.hist(data_train['Age'], color='Purple')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.show()

Note: Now that we have done some analysis, it's time to clean up our data. If you look at the available columns, you may notice that some of them (such as Ticket, Cabin, and Name) are not useful here and may hurt our model's performance.

Here we make our cleaning function

def clean(data):
    # Drop the columns we do not use as features
    data = data.drop(['Ticket', 'Cabin', 'Name'], axis=1)
    cols = ['SibSp', 'Parch', 'Fare', 'Age']

    # Fill the null values with the column mean
    # (assign the result back instead of using inplace=True on a column)
    for col in cols:
        data[col] = data[col].fillna(data[col].mean())

    # Fill the Embarked null values with 'U' (unknown)
    data['Embarked'] = data['Embarked'].fillna('U')
    return data

# now we call our function and start cleaning!

data_train = clean(data_train)
data_test = clean(data_test)
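To see what the cleaning step does, here is a small sketch that applies the same steps to a hypothetical three-row frame. The rows are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def clean(data):
    # Same steps as the cleaning function above, written with assignment
    data = data.drop(['Ticket', 'Cabin', 'Name'], axis=1)
    for col in ['SibSp', 'Parch', 'Fare', 'Age']:
        data[col] = data[col].fillna(data[col].mean())
    data['Embarked'] = data['Embarked'].fillna('U')
    return data

# Hypothetical rows with a few missing values
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Ticket': ['t1', 't2', 't3'],
    'Cabin': [np.nan, 'C85', np.nan],
    'SibSp': [1, 0, 0],
    'Parch': [0, 0, 1],
    'Fare': [7.25, 71.28, np.nan],
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['S', None, 'C'],
})

cleaned = clean(toy)
print(cleaned)
```

The missing age becomes 30.0 (the mean of 20 and 40), the missing port becomes 'U', and no null values remain.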

## Note: Now we need to convert the Sex feature into numeric values, such as [1] for male and [0] for female, and do the same for the Embarked feature

Here we use LabelEncoder from sklearn's preprocessing module to do this job

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
cols = ['Sex', 'Embarked']
for col in cols:
    # Fit on the training data, then reuse the same mapping for the test data
    data_train[col] = le.fit_transform(data_train[col])
    data_test[col] = le.transform(data_test[col])
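LabelEncoder assigns integer codes to the sorted class labels, which is why female maps to 0 and male to 1. A minimal sketch with made-up lists standing in for the Sex column, fitting on the training values and reusing that mapping for the test values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Hypothetical stand-ins for the Sex column of the train and test sets
train_sex = ['male', 'female', 'female', 'male']
test_sex = ['female', 'male']

# Learn the mapping from the training values, then apply it to the test values
train_codes = le.fit_transform(train_sex)
test_codes = le.transform(test_sex)

# classes_ holds the labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'female': 0, 'male': 1}
```

Fitting once and calling `transform` on the test data keeps the codes consistent between the two sets.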

## Now our data is ready! It's time to build our model

We select the target column ['Survived'], store it in [y], and drop it from the original data

y = data_train['Survived']
x = data_train.drop('Survived', axis=1)

Here we split our data into training and validation sets

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.02, random_state=10)
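It is worth checking how many rows a given `test_size` actually holds out. The sketch below uses 891 placeholder rows (the training set size implied by the counts above, 342 + 549) instead of the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 891 placeholder rows, matching the size of the training set (342 + 549)
x = np.arange(891).reshape(-1, 1)
y = np.zeros(891, dtype=int)

x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.02, random_state=10)

print(len(x_train), len(x_val))  # 873 18
```

With only 18 validation rows, a single misclassified passenger shifts the accuracy by more than five percentage points, so the estimate is noisy. A larger value such as `test_size=0.2` is a common alternative.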

Initialize the model

model = LogisticRegression(random_state=0, max_iter=10000)

Train our model and predict on the validation set

model.fit(x_train, y_train)
predictions = model.predict(x_val)

## Great! Our model is now trained and ready to use

It's time to check the accuracy for our model

print('Accuracy=', accuracy_score(y_val, predictions))

Output:

Accuracy=0.97777
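Accuracy is simply the fraction of predictions that match the true labels. A small sketch with made-up labels (not the model's real output) shows the calculation, along with a confusion matrix that breaks the errors down by class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for six passengers
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

# accuracy_score is the share of positions where the two arrays agree
acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(acc, manual)
print(cm)
```

Here four of six predictions are correct, and the matrix shows one mistake in each class.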

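Now we submit our predictions to Kaggle

test = pd.read_csv('titanic/test.csv')
# Assumption: submit_pred holds the model's predictions on the cleaned test set
submit_pred = model.predict(data_test)
df = pd.DataFrame({'PassengerId': test['PassengerId'].values, 'Survived': submit_pred})
df.to_csv('submit_this_file.csv', index=False)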
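Before uploading, it can help to confirm that the file has exactly the two expected columns and no index column. A minimal sketch that writes a few hypothetical predictions to an in-memory buffer instead of a file (the IDs and predictions are illustrative):

```python
import io
import pandas as pd

# Hypothetical IDs and predictions standing in for the real test set
passenger_ids = [892, 893, 894]
submit_pred = [0, 1, 0]

df = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': submit_pred})

# Write to an in-memory buffer so the output can be inspected directly
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines[0])  # PassengerId,Survived
```

With `index=False`, the header row contains only the two column names.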


Owner

Mahmoud Nasser Abdulhamed
