Step-by-Step Process of Implementing a Machine Learning Project
This article can be useful to everyone, especially neophytes (like me).
According to one research publication, the main takeaway from the 2020 State of Enterprise Machine Learning survey is that a growing number of companies are entering the early stages of ML development, but challenges in deployment, scaling, versioning, and other sophistication efforts still hinder teams from extracting value from their ML investments. Even so, budgets for ML programs are growing, most often by 25 percent, and the banking, manufacturing, and IT industries have seen the largest budget growth this year. Organizations are measuring ML success with both business-unit and statistical metrics, with a significant divide by job level. Besides this, I have personally noticed on LinkedIn that people from an IT background are increasingly inclined towards AI (Machine Learning, Deep Learning, Data Science, Neural Networks). So it would not be wrong to say that 'AI is the future.'
So if we want to shape our career in AI, we need to understand what AI actually is and how it can be useful. Those who plan to build a career in AI should have knowledge of ML, DL, DS, and Neural Networks, because these are all subsets of AI. Here I will share a mini data science project of mine that gives a clear understanding of how an ML model works.
1. Domain Knowledge :
Understanding the domain is the most important thing for kicking off any project. Sometimes we lack an understanding of what the business actually needs. As data scientists, it is our duty to convert those business problems into data science problems and provide meaningful answers.
2. Data Mining and Collecting the data :
This is where data scientists spend most of their time. In real life, data-source exploration and data collection are not simple tasks. In any organization, you will hardly find a single source containing all of your required data. Usually, you will find multiple disconnected sources, and some of them will be hard to access because of infrastructure issues or restricted access. Collecting data from external sources can also be tricky because of cost and technical challenges. I have used a very small dataset for this demo, which is about predicting which advertising platform drives the maximum sales. The dataset can be downloaded from https://www.kaggle.com/sazid28/advertising.csv
3. Exploratory Data Analysis (EDA) :
From the name itself you may get the idea that in EDA we explore our data and perform analysis. But before diving deep into EDA, it is very important to first have a look at the data: whether it is in a proper format, the size of the data, and most importantly the meaning of each feature present in the dataset that we will be using in our project.
Now it's time to get our hands dirty with Python code using the "Advertising and Sales" data. Since I have used a single dataset to keep this discussion simple, I have limited scope to illustrate all the aspects of the first look at the data mentioned here. The whole code can be executed in a Jupyter notebook or any other Python IDE of your choice. We will require several Python libraries to complete this regression exercise, so I shall first mention all the required libraries for our project.
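A typical set of imports for an exercise like this might look as follows (the exact list is my assumption, since the original shows the libraries in a screenshot):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")            # headless backend so plots work without a display
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import pickle
```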
It is always a good idea to check your current working directory. If required change it to your preferred one.
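Checking and changing the working directory is a one-liner with the standard `os` module (the path shown is purely hypothetical):

```python
import os

print(os.getcwd())                    # show the current working directory
# os.chdir("/path/to/your/project")   # hypothetical path: uncomment to change it
```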
Now, let’s look at our data that will be used to build our model. You will know how many variables you are dealing with and what those variables represent.
This is a small dataset with only two hundred observations and four variables, where the "Sales" column is our target variable (the one to be predicted). You should also check for blank spaces in variable names; if you find any, rename those variables, because blank spaces in names cause problems in scripts. This dataset, however, has no such issue.
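The first look at the data might be sketched like this. Since the Kaggle CSV may not be on disk, a small inline sample stands in for it here (the numbers are illustrative, not actual rows from the dataset):

```python
import pandas as pd

# With the real file this would be: data = pd.read_csv("Advertising.csv")
data = pd.DataFrame({
    "TV":        [230.1, 44.5, 17.2, 151.5],
    "Radio":     [37.8, 39.3, 45.9, 41.3],
    "Newspaper": [69.2, 45.1, 69.3, 58.5],
    "Sales":     [22.1, 10.4, 9.3, 18.5],
})

print(data.shape)        # (observations, variables)
print(data.head())       # a quick peek at the first rows

# Replace any blank spaces in column names, since they break scripts
data.columns = data.columns.str.strip().str.replace(" ", "_")
```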
Datatype of the variables :
Check the data type of each variable you are working with, because improper datatypes can cause issues. The best example is a date variable: in most cases a date variable is read in as "object", and we need to convert it into datetime format using "to_datetime". My current data does not contain any date variable. Also, numerical variables sometimes become "object" types when some of their values contain characters or are missing.
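Both conversions mentioned above can be sketched as follows (the "day" column is hypothetical, since this dataset has no date variable):

```python
import pandas as pd

data = pd.DataFrame({"TV": [230.1, 44.5], "Sales": [22.1, 10.4],
                     "day": ["2020-01-01", "2020-01-02"]})  # hypothetical column

print(data.dtypes)                     # "day" comes in as object, not datetime
data["day"] = pd.to_datetime(data["day"])
print(data.dtypes)                     # now datetime64[ns]

# A numeric column polluted with characters can be coerced back to numbers;
# unparseable entries become NaN, to be handled in the missing-value step.
dirty = pd.Series(["12.5", "bad", "7"])
clean = pd.to_numeric(dirty, errors="coerce")
```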
After dealing with the datatype of the variable, it’s time to see if we have any missing values in our dataset.
There are no null values in our dataset, but in the real world it won't be like that. There will be many null values, and it is our job as data scientists to handle them through various operations, such as imputing with statistics like the mean or median.
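A minimal sketch of the null check and one simple imputation strategy (median fill), using a tiny stand-in frame with deliberately missing values:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"TV": [230.1, np.nan, 17.2, 151.5],
                     "Sales": [22.1, 10.4, np.nan, 18.5]})

print(data.isnull().sum())             # count of nulls per column

# Median imputation: one simple strategy among several
data_filled = data.fillna(data.median(numeric_only=True))
print(data_filled.isnull().sum())      # all zeros after the fill
```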
After handling the null values, the next hurdle that almost all data scientists face is outliers. An outlier is an observation that lies at an abnormal distance from other values in a random sample from a population.
From the above observation, we can clearly see that there is an outlier in the "Newspaper" column, as the gap between the 75th percentile and the max is very large. Outliers can also be spotted with box plots.
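One common way to make the "gap between the 75th percentile and the max" precise is the 1.5 × IQR rule, sketched here on illustrative Newspaper-like values (not actual rows from the dataset):

```python
import pandas as pd

# Illustrative values with one abnormally large entry
newspaper = pd.Series([69.2, 45.1, 30.0, 58.5, 23.5, 200.0])

print(newspaper.describe())            # compare the 75% row with max

# Flag points beyond 1.5 * IQR from the quartiles
q1, q3 = newspaper.quantile(0.25), newspaper.quantile(0.75)
iqr = q3 - q1
outliers = newspaper[(newspaper < q1 - 1.5 * iqr) | (newspaper > q3 + 1.5 * iqr)]
print(outliers)

# A box plot shows the same thing visually:
# newspaper.plot.box()
```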
Now, let's check whether the data in the "Newspaper" column is normally distributed (a bell-shaped curve) or not. This check is generally done for variables that have outliers.
It can be observed from the above graph that the curve is not normally distributed: it is skewed to the right.
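Right skewness can also be confirmed numerically with pandas' `skew()`; here synthetic exponential data stands in for the Newspaper column, since the exponential distribution has the same long right tail:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed stand-in for the Newspaper column
newspaper = pd.Series(rng.exponential(scale=30.0, size=200))

print(newspaper.skew())                # > 0 indicates a right-skewed distribution

# A histogram makes the shape visible:
# newspaper.plot.hist(bins=20)
```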
As there are outliers in the "Newspaper" column, we need to drop the particular indices that do not fit our data.
Having dropped indices 16 and 101, we now have 198 rows and 4 columns, stored in the data2 variable. We will perform all further analysis on this data.
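Dropping rows by index is a single `drop` call; here a random 200-row frame stands in for the real dataset so the shapes match the article:

```python
import numpy as np
import pandas as pd

# Stand-in frame with 200 rows, mimicking the dataset's shape
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 4)),
                    columns=["TV", "Radio", "Newspaper", "Sales"])

# Drop the rows at index 16 and 101 flagged as outliers
data2 = data.drop(index=[16, 101])
print(data2.shape)                     # (198, 4)
```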
Now, let's check the correlation between x (the independent variables) and y (the dependent/target variable).
From the above correlation matrix, we can conclude that the target variable y (Sales) and x (TV) are highly correlated with each other (as observed from the scale).
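The correlation matrix comes from `DataFrame.corr()`. In this sketch, Sales is deliberately generated to depend mostly on TV, so the matrix reproduces the pattern described above (the data is synthetic, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
tv = rng.uniform(0, 300, 200)
# Sales is driven strongly by TV and not by the other channels (by construction)
data2 = pd.DataFrame({
    "TV": tv,
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
    "Sales": 0.05 * tv + rng.normal(0, 1.0, 200),
})

corr = data2.corr()
print(corr["Sales"].sort_values(ascending=False))
# A heatmap view (with seaborn installed): sns.heatmap(corr, annot=True)
```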
Model Building :
For building our model, we first need to split our data into a training set (for training the model) and a testing set (for testing it). In many cases we would also set aside data for cross-validation, but our dataset is too small for that, so I have not used cross-validation here.
The above picture depicts the shape (rows and columns) of our data after splitting into train and test set.
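The split itself is one call to scikit-learn's `train_test_split`; the 70/30 ratio and `random_state` below are my assumptions, and random arrays stand in for the real features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(198, 3))          # TV, Radio, Newspaper after dropping outliers
y = rng.normal(size=198)               # Sales

# 70/30 split; exact ratio and random_state are assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)     # (138, 3) (60, 3)
```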
Here, I will be using Linear Regression algorithm as our target variable has a continuous value.
It can be seen that the R² score of our model is approximately 0.90, i.e. the model explains about 90% of the variation in Sales, which is a decent score. If we want to improve it, we need to gather more data, do proper feature engineering and selection, etc.
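Fitting the model and reading off the score might look like this; note that `score()` on a scikit-learn regressor returns R², which is what the "90%" above refers to. The data here is synthetic, constructed so the score lands near the article's figure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.uniform(0, 300, size=(198, 1))            # TV spend (stand-in)
y = 0.05 * X[:, 0] + 7 + rng.normal(0, 1.5, 198)  # Sales with noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)       # R² on held-out data
print(round(r2, 2))
```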
Now, let’s compare our predicted and actual value.
Here, y_predict (index 0) is our predicted outcome and y_test (index 1) is the actual outcome.
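One way to build such a comparison table is to stack the two arrays as rows of a DataFrame, predictions at index 0 and actuals at index 1 (synthetic data again; in the article these would be the held-out test rows):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.uniform(0, 300, size=(60, 1))             # stand-in test features
y_test = 0.05 * X[:, 0] + 7 + rng.normal(0, 1.5, 60)

model = LinearRegression().fit(X, y_test)         # fit is illustrative only
y_predict = model.predict(X)

# Predictions at row index 0, actuals at row index 1
comparison = pd.DataFrame([y_predict, y_test])
print(comparison.iloc[:, :5])          # first five points side by side
```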
The next step is to see how much error (variance from the actual values) there is in our model.
These are some of the statistical methods for measuring how far our model's predictions deviate from the actual values; the values should be very low (near 0).
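The usual trio of such metrics is MAE, MSE, and RMSE from `sklearn.metrics` (the specific metrics are my assumption, since the original shows them in a screenshot; the arrays are illustrative):

```python
import numpy as np
from sklearn import metrics

y_test = np.array([10.2, 14.8, 19.6, 8.1, 12.3])      # illustrative actuals
y_predict = np.array([10.0, 15.1, 19.0, 8.5, 12.0])   # illustrative predictions

mae = metrics.mean_absolute_error(y_test, y_predict)  # average absolute deviation
mse = metrics.mean_squared_error(y_test, y_predict)   # average squared deviation
rmse = np.sqrt(mse)                                   # back in the units of Sales
print(mae, mse, rmse)                  # all near 0 for a good model
```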
As we have performed almost all the analysis, it's now time to visualize the data in the form of regression plots.
From the above three regression plots, we can observe that we get the best-fit line for TV, consistent with our earlier observation that the correlation between the TV and Sales columns was the highest.
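A regression plot like the TV-vs-Sales one can be sketched with plain matplotlib by fitting a least-squares line and overlaying it on the scatter (seaborn's `regplot` does the same in one call; the data here is synthetic):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                  # headless backend; no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
tv = rng.uniform(0, 300, 200)
sales = 0.05 * tv + 7 + rng.normal(0, 1.5, 200)

# Fit the best-fit line by least squares and overlay it on the scatter
slope, intercept = np.polyfit(tv, sales, 1)
plt.scatter(tv, sales, s=10)
plt.plot(tv, slope * tv + intercept, color="red")
plt.xlabel("TV")
plt.ylabel("Sales")
plt.savefig("tv_vs_sales.png")         # write the figure to disk
```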
Finally, we need to save our model using the pickle library, Python's built-in library for serializing objects.
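Saving and restoring the fitted model is a pair of `pickle.dump` / `pickle.load` calls (the file name is my choice; the tiny training frame is illustrative):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[230.1], [44.5], [17.2], [151.5]])  # illustrative TV spend
y = np.array([22.1, 10.4, 9.3, 18.5])             # illustrative Sales
model = LinearRegression().fit(X, y)

# Serialize the fitted model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and load it back later, e.g. at deployment time
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[100.0]]))     # same predictions as the original model
```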
After completing all the above steps, the only thing left to do is to deploy our model into a proper environment.
So this brings our project to an end. I hope it helps all the data scientists out there. I am also looking forward to receiving feedback about my project, and I will be pleased to get any comments, as that gives me an opportunity to learn from you all.
Thank You !!!
Happy Learning !!!!