Let’s build a machine learning model to predict the presence of heart disease together. In this project, we are going to use real health survey data, train a model to recognize patterns connected to heart disease, and then test how well it can make predictions on new people it has never seen before. #machinelearning #python #datascience #computerscience #STEM
@lera_byte
Transcript
Let's build a machine learning model to predict the presence of heart disease together. Today we'll be building a logistic regression model in Python to predict the signs of heart disease as accurately as possible, and this is a project absolutely anyone can learn to do, so let's dive in.
The code starts by importing all the tools we'll need to build our model today. In this project we have pandas for data, NumPy for math, and scikit-learn for our machine learning toolbox. Next I'm looking for a data set, and I settled on this one from Kaggle. In the code, the data set is loaded in from the downloaded CSV file into a data frame. This is simply a table that Python can work with. In this data set, each row represents one individual person and each column is a detail about the health or lifestyle of that person.
After the data is loaded, the code creates the target variable, which is what the model is trying to predict. Since there's not a clean, specific column in the data set called heart disease, the code defines heart disease as someone who has had a heart attack in the past. If that is the case then the person is labeled with a one, and if not they're labeled with a zero. Next the code separates the data set into our features and our target. The features are those pieces of information that the model will use to make its prediction, such as BMI or smoking history. The target is the new heart disease column that we've created based on that information, and this is what the model will try to predict.
The code then identifies which columns are categorical and which ones are numerical. It's pretty straightforward: categorical columns contain groups or labels, while numerical columns simply contain numbers. With this information we can now dive into preprocessing so the model can actually use the data. For numerical columns, the missing values are filled in with the median of that column, and numbers are scaled so that big numbers do not overpower smaller ones.
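The steps described so far (building the target, separating features from the target, sorting columns by type, and preprocessing the numerical ones) might look roughly like the sketch below. This is not the code from the video: the tiny table stands in for the Kaggle CSV, and the column names `BMI`, `SleepHours`, `SmokerStatus`, and `HadHeartAttack` are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the downloaded CSV; with the real file this
# would be a pd.read_csv(...) call instead.
df = pd.DataFrame({
    "BMI": [22.5, 31.0, np.nan, 27.4, 35.2, 24.1],
    "SleepHours": [7.0, 5.0, 8.0, np.nan, 6.0, 7.0],
    "SmokerStatus": ["Never", "Current", "Former", "Never", np.nan, "Current"],
    "HadHeartAttack": ["No", "Yes", "No", "No", "Yes", "No"],
})

# Target: 1 if the person reported a past heart attack, otherwise 0.
df["HeartDisease"] = (df["HadHeartAttack"] == "Yes").astype(int)

# Features are everything except the target and the column it came from.
X = df.drop(columns=["HeartDisease", "HadHeartAttack"])
y = df["HeartDisease"]

# Sort the feature columns into numerical and categorical groups.
numerical_cols = X.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = X.select_dtypes(exclude=["number"]).columns.tolist()

# Numerical preprocessing: fill gaps with the column median, then scale.
numeric_preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_num = numeric_preprocess.fit_transform(X[numerical_cols])

print(numerical_cols, categorical_cols)
print(X_num.round(2))
```

Dropping the heart attack column from the features matters here: if it stayed in, the model could read the answer straight from its inputs.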
For categorical columns, though, the missing values are replaced with the most common category in that column, and categories are also converted into numbers so they can actually be understood by the code.
Now it's time to actually build the model as a pipeline. A pipeline is something that's going to connect our preprocessing and our training all into one simple workflow. This means the data is automatically cleaned, transformed, and passed into the classifier the exact same way every single time it's run. The classifier we're using here is logistic regression, which is a common one for yes or no predictions. Logistic regression works by comparing each person's health and lifestyle details to the heart disease column in the data set, and then it learns which patterns or details about a person are most strongly connected to them having heart disease. After training, it applies the patterns it has learned to new patient information and outputs both a prediction and a probability for whether that patient has heart disease.
Next comes one of the most important steps, which is splitting our data into training and testing sets. The training set is given to the model during its training stages so it can look at it and find patterns in the data. The test set, however, is kept top secret during the training stages and only shown to the model after it's already been trained. We use it so we can see how the model will perform on unseen data and gauge its performance in the real world, which can be very helpful for us as data scientists.
Next it's time for my favorite part, which is when the model is actually trained on the training set. This is when the logistic regression model is run, and after training the model makes its predictions on the test set. It predicts either a zero or a one, meaning no heart disease or heart disease, and also gives us a probability that tells us how confident it is in its prediction.
The evaluation section checks how well the model actually performed.
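The pipeline, the train/test split, and the training step described above could be sketched as follows. The data is synthetic and the columns `BMI` and `SmokerStatus` are hypothetical, so this only illustrates the workflow; it is not the video's code, and the settings (`test_size=0.2`, `max_iter=1000`, one-hot encoding for categories) are assumptions rather than choices confirmed in the video.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with a made-up pattern for the model to find.
rng = np.random.default_rng(0)
n = 400
bmi = rng.normal(28, 5, n)
smoker = rng.choice(["Never", "Former", "Current"], n)
logit = 0.25 * (bmi - 28) + 1.5 * (smoker == "Current") - 1.0
y = pd.Series((rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int),
              name="HeartDisease")
X = pd.DataFrame({"BMI": bmi, "SmokerStatus": smoker})
X.loc[::25, "BMI"] = np.nan           # sprinkle in missing values
X.loc[::40, "SmokerStatus"] = np.nan

numerical_cols = ["BMI"]
categorical_cols = ["SmokerStatus"]

# Preprocessing: median + scaling for numbers,
# most common category + one-hot encoding for categories.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numerical_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

# One workflow: preprocessing followed by the classifier.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Hold out 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # 0 or 1 for each person
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

print(y_pred[:5], y_prob[:5].round(2))
```

Because the imputers and the scaler live inside the pipeline, they are fitted on the training rows only, so nothing about the test set leaks into preprocessing.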
Accuracy just shows us the overall percentage of correct predictions, while ROC AUC shows us how well the model separated people with heart disease from people without heart disease. We also use a confusion matrix and a classification report to see what kinds of mistakes were being made.
Finally, we can actually test our model on a real person from the test set. The code runs that patient's information through the model, and then it outputs a probability and a final prediction. Looks like in this case the model predicted the patient had a 64% chance of having heart disease, which ultimately converted to a label of one.
And that is a fully complete logistic regression model coded out for you in front of your eyes in Python. So if you learned something from this video, then don't forget to save it for the next time you're working with logistic regression, and follow for more. See you in the next one.
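For anyone following along, the evaluation and single-patient steps could be sketched like this. To keep the example self-contained it trains a small stand-in model on data from scikit-learn's `make_classification`, which is an assumption for illustration and not the survey data or the model from the video.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and model (hypothetical, numeric features only).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Accuracy: share of correct predictions overall.
accuracy = accuracy_score(y_test, y_pred)
# ROC AUC: how well the probabilities rank class 1 above class 0.
roc_auc = roc_auc_score(y_test, y_prob)
# Confusion matrix rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}  ROC AUC: {roc_auc:.3f}")
print(cm)
print(report)

# Run one held-out "patient" through the model.
patient = X_test[[0]]                        # keep it 2-D: one row
patient_prob = model.predict_proba(patient)[0, 1]
patient_label = int(model.predict(patient)[0])
print(f"Probability: {patient_prob:.0%}, predicted label: {patient_label}")
```

By default `predict` labels a patient as one when the predicted probability is above 0.5, which is why a 64% probability ends up as a one.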