Introduction to Machine Learning

Machine Learning is generating a lot of buzz in the industry, and now is the right time to get familiar with it. Let’s get the basics right and get started.

What is Machine Learning

What the heck is machine learning? 

If I had to define it in a single sentence, I would say, ‘Machine Learning is a way to find patterns in data to predict the future.’

The above is not the only definition of machine learning. There are many more definitions based on what you are trying to achieve. Still, it’s all about programming machines to learn from a continuous input of data and its variations to produce better output.

As an elementary example, voice recognition software should understand a language irrespective of the accent. When we design the system, we will not have data for every possible accent and its variations. Hence, the software needs to be programmed to learn the pattern of every new accent it hears, add it to its library, and keep doing so until it understands all possible accents and their variations.

Definitions of Machine Learning

Machine learning is a subset of artificial intelligence that enables a system to autonomously learn and improve using neural networks and deep learning, without being explicitly programmed, by feeding it large amounts of data (Google Cloud).

Similarly, IBM considers it a discipline within artificial intelligence that concentrates on devising systems capable of learning from data, enabling computers to unveil hidden insights without explicit programming (IBM).

A more detailed exploration of such definitions is as follows:

1. Google Cloud Platform:

“Machine learning is a data analytics technique that teaches computers to do what comes naturally to humans and animals: learn from experience. Machine learning algorithms use computational methods to “learn” information directly from data without relying on a predetermined equation as a model.” (Source)

2. IBM:

“Machine learning is a branch of artificial intelligence (AI) focused on building systems that learn from data. By using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.” (Source)

3. Microsoft Azure:

“Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.” (Source)

4. MIT Sloan School of Management:

“Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems.” (Source)

5. SAS:

“Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.” (Source)

Significance of Machine Learning

Machine learning has become pivotal due to several reasons:

1. Data Utilization:

  • Efficient Data Analysis: ML algorithms can analyze enormous datasets to derive insights and patterns that are practically impossible to discern manually.
  • Predictive Analysis: ML excels in utilizing historical data to predict future trends, which is vital for various sectors like finance, healthcare, and marketing.

2. Automation:

  • Improving Productivity: Automating routine tasks through ML reduces manual labor and enhances efficiency and productivity.
  • Real-time Decisions: ML enables systems to make real-time decisions without human intervention, like in autonomous vehicles or cybersecurity systems.

3. Personalization:

  • Enhanced User Experience: ML allows systems to learn from user behaviors and preferences, enabling personalized experiences.
  • Targeted Marketing: Businesses leverage ML to analyze customer behaviors and implement targeted marketing strategies, improving customer engagement and sales.

4. Innovation:

  • New Possibilities: ML paves the way for innovations in products and services, such as voice assistants, recommendation systems, and advanced diagnostics in healthcare.

Everyday Applications of Machine Learning

ML permeates various facets of our daily lives, sometimes without us even consciously realizing it:

1. E-Commerce:

  • Product Recommendations: Platforms like Amazon use ML to analyze your browsing and purchase history to recommend products.
  • Price Optimization: ML algorithms help dynamically adjust prices based on various factors like demand, competitor prices, and inventory.

2. Social Media:

  • Content Recommendation: Social media platforms, like Facebook and Instagram, use ML to curate your feed based on your interactions and preferences.
  • Face Recognition: Facebook uses ML-powered face recognition to suggest tags in photos.

3. Search Engines:

  • Search Result Optimization: Google utilizes ML to enhance its search results, offering the most relevant information based on your search history and patterns.

4. Personal Assistants:

  • Voice Recognition: Siri, Alexa, and Google Assistant use ML to understand and process your voice commands.
  • Personalized Responses: These assistants analyze your patterns and preferences to provide more personalized responses and actions.

5. Transportation:

  • Ride-Sharing Apps: Uber and Lyft employ ML to determine the price of your ride, optimize routes, and even match you with passengers or drivers.
  • Autonomous Vehicles: Self-driving cars use ML to interpret sensory data, making real-time decisions to navigate safely.

6. Healthcare:

  • Disease Identification and Prediction: ML helps identify and predict diseases by analyzing medical data and images.
  • Personalized Treatment: Algorithms analyze a patient’s genetic makeup to recommend personalized treatment plans.

7. Finance:

  • Fraud Detection: Banks and financial institutions leverage ML to detect unusual activities and prevent fraud.
  • Algorithmic Trading: ML algorithms analyze market conditions and execute trades at optimal prices.

The Process of Machine Learning

Suppose we have a large data set with some pattern that is impossible for a human brain to identify.

  1. We pass this data to a machine learning algorithm, which studies the pattern and produces a model.
  2. Applications can then feed new data to this model to check whether the learned pattern exists in that data.

The question is, do we always have data with some patterns?

The answer is NO. As data scientists, we are usually provided with raw data. Our role and responsibility is therefore to transform and manipulate that data using the appropriate tools. Once processed, the data can be used as input to machine learning algorithms.
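To make this concrete, here is a minimal sketch of the process using Python and scikit-learn. The dataset (iris) and the algorithm (a decision tree) are only illustrative assumptions; any prepared dataset and suitable algorithm would do.

```python
# Minimal sketch: pass data to an algorithm, get a model, then query the model.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)      # data that contains a pattern

model = DecisionTreeClassifier()       # the machine learning algorithm
model.fit(X, y)                        # the algorithm studies the pattern and produces a model

new_sample = [[5.1, 3.5, 1.4, 0.2]]    # new data passed by an application
print(model.predict(new_sample))       # the model predicts which class it belongs to
```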

We can summarize the steps for a machine learning project as follows:

1. Define the Problem

A. Identify Problem Type:

  • Classification, regression, clustering, etc.

B. Define Objectives:

  • Clearly state what you aim to achieve: predict, classify, recommend, etc.

C. Understand the Context:

  • Know the domain, target audience, and utility of the solution.

2. Collect and Prepare Data

A. Data Acquisition:

  • Utilize APIs, databases, web scraping, etc., to accumulate data.

B. Data Cleaning:

  • Handle missing values and outliers, and remove duplicates.

C. Data Preprocessing:

  • Normalize/standardize data, encode categorical variables, and handle imbalanced data.
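As a small, hedged illustration of steps B and C above, the sketch below cleans and preprocesses a tiny made-up DataFrame with pandas and scikit-learn; the column names and values are invented purely for the example.

```python
# Handle missing values, remove duplicates, scale numeric columns, encode a categorical column.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":    [25, None, 40, 35],          # contains a missing value
    "income": [40000, 55000, None, 65000], # contains a missing value
    "city":   ["Delhi", "Mumbai", "Delhi", "Pune"],
})
df = df.drop_duplicates()

num_cols = ["age", "income"]
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])  # impute with the mean
df[num_cols] = StandardScaler().fit_transform(df[num_cols])                # standardize (zero mean, unit variance)

df = pd.get_dummies(df, columns=["city"])  # one-hot encode the categorical column
print(df)
```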

3. Explore the Data

A. Descriptive Statistics:

  • Mean, median, mode, range, etc.

B. Data Visualization:

  • Utilize plots, charts, and graphs to understand data distributions and relationships.

C. Feature Engineering:

  • Derive new features, perform feature scaling, and select prominent features.

4. Choose a Model

A. Model Selection:

  • Choose a model based on the problem type and data nature.

B. Model Justification:

  • Justify why the chosen model(s) are apt, considering accuracy, interpretability, complexity, etc.

5. Train the Model

A. Split the Data:

  • Partition data into training, validation, and test sets.

B. Model Training:

  • Train using the training set and validate using the validation set.

C. Parameter Tuning:

  • Adjust hyperparameters to enhance model performance.

6. Evaluate the Model

A. Select Metrics:

  • Choose evaluation metrics suitable for the problem: accuracy, precision, recall, F1-score, etc.

B. Model Evaluation:

  • Assess the model using the test set and chosen metrics.

C. Analysis:

  • Analyze results, identify areas of improvement, and if required, iterate from a suitable previous step.
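To show what step 6 can look like in code, here is a brief sketch that computes common classification metrics with scikit-learn; the true labels and predictions are made up so the example stays self-contained.

```python
# Evaluate a (hypothetical) classifier on a held-out test set.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = [1, 0, 1, 1, 0, 1, 0, 0]   # assumed true labels of the test set
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # assumed predictions from a trained model

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
```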

7. Deploy the Model

A. Deployment Strategy:

  • Decide on deployment methods: cloud, on-premises, or hybrid.

B. Integration:

  • Integrate the model with the existing system, ensuring seamless data flow.

C. Monitor:

  • Keep track of the model’s performance and maintain it accordingly.

8. Maintain the Model

A. Continuous Monitoring:

  • Monitor model predictions and any shift in data patterns.

B. Update the Model:

  • Re-train the model with new data and refine it as per evolving requirements.

C. Feedback Loop:

  • Implement a feedback mechanism to gather data on model predictions and improve it constantly.

Types of Machine Learning

  1. Supervised Learning: In simple terms, supervision is possible only if the supervisor or teacher knows all the possible permutations and combinations. The teacher or supervisor, having a very diverse set of data or knowledge, keeps directing the student along the way. In this type of learning, the value or result we want to predict (the output) is present in the training data, and that value is known as the target value.
  2. Unsupervised Learning: In this type of learning, the value or result we want to predict is not present in the training data.
  3. Reinforcement Learning: This is the third type of machine learning. Here, an agent, part of the software logic, makes decisions based on rewards and punishments. It is about training models to make a sequence of decisions, and it is concerned with how intelligent agents take actions in an environment to maximize a cumulative reward.

Please check the article about the types of machine learning.

Categorizing Machine Learning Problems

Let’s delve into some basic categories in which machine learning problems can be classified. This will also include examples to give readers a practical understanding of each type.

1. Classification and Categorization

Definition:

Classification (or categorization) involves predicting discrete labels (categories) for given input data. It is a supervised learning task: we split our data into classes, and when new data comes in, we try to figure out which class it belongs to.

Examples:

  • Email Filtering: Classifying emails as spam or not spam.
  • Image Recognition: Identifying if an image contains a cat or a dog.
  • Customer Churn: Predicting whether a customer will churn (leave or stay).

Key Points to Consider:

  • Binary vs. Multiclass Classification: Binary involves two classes (e.g., spam or not), while multiclass involves more than two (e.g., categorizing fruits into apples, bananas, or grapes).
  • Imbalance Handling: Techniques to deal with imbalanced datasets where one class significantly outnumbers the others.
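Here is a minimal, illustrative classification sketch: a logistic regression model that separates synthetic data into two classes (think spam vs. not spam). The synthetic dataset is an assumption made only to keep the example runnable.

```python
# Binary classification on synthetic data with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # learn the two classes
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```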

2. Regression

Definition:

Regression is a supervised learning task. Here, we try to fit a line or curve to our training data. Regression deals with predicting a continuous value based on given input features.

Examples:

  • House Price Prediction: Estimating the price of a house based on features like location, size, and age.
  • Stock Price Forecasting: Predicting future stock prices based on historical data.
  • Sales Prediction: Estimating future sales based on past data and various influencing factors.

Key Points to Consider:

  • Linear vs. Nonlinear Regression: Deciding whether the relationship between variables is linear or not and choosing models accordingly.
  • Evaluation Metrics: Understanding metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared to evaluate regression models.
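A hedged regression sketch along the same lines: fitting a linear model to synthetic “house price”-style data and scoring it with the metrics mentioned above (MAE, MSE, and R-squared). The data is generated, not real.

```python
# Linear regression on synthetic data, evaluated with MAE, MSE and R-squared.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_train, y_train)   # fit a line (hyperplane) to the training data
y_pred = reg.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```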

3. Clustering

Definition:

Clustering involves grouping data points into clusters such that items within a cluster are more similar to each other than to items in other clusters. It is an unsupervised learning task: here, we group our data into clusters.

Examples:

  • Customer Segmentation: Grouping customers based on purchasing behavior to target marketing strategies.
  • Image Segmentation: Dividing an image into segments to analyze or modify each separately.
  • Document Clustering: Grouping documents into categories based on content similarity.

Key Points to Consider:

  • Choosing the Number of Clusters: Determining an optimal number of clusters, for example, using the Elbow Method.
  • Distance Metrics: Deciding on distance measures (Euclidean, Manhattan, etc.) for evaluating similarity between data points.
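An illustrative clustering sketch: grouping synthetic 2-D points with k-means. Choosing 3 clusters here is an assumption; in practice you would pick the number with something like the Elbow Method mentioned above.

```python
# K-means clustering on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)   # unlabeled points

kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print(kmeans.labels_[:10])       # cluster assigned to the first ten points
print(kmeans.cluster_centers_)   # coordinates of the learned cluster centers
```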

4. Dimensionality Reduction

Definition:

Dimensionality reduction involves reducing the number of input variables in a dataset, simplifying the dataset while retaining its essential features.

Examples:

  • Feature Selection: Choosing a subset of relevant features for use in model construction.
  • Image Compression: Reducing the dimensionality of images while preserving key information.
  • Noise Reduction: Eliminating non-essential features that might act as noise.

Key Points to Consider:

  • Preserving Variance: Selecting methods and components that capture most of the data variance.
  • Interpretability: Ensuring that reduced dimensions maintain interpretability for analysis.
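A small dimensionality-reduction sketch: projecting the 4-feature iris data down to 2 principal components with PCA while checking how much variance is preserved. PCA is just one common choice here, not the only option.

```python
# Principal Component Analysis: 4 features reduced to 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (150, 2): two components instead of four features
print(pca.explained_variance_ratio_.sum())  # fraction of the variance those two components keep
```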

You may also like to read about analyzing high-dimensional data in machine learning.

5. Anomaly Detection

Definition:

Anomaly detection involves identifying unusual patterns or outliers that deviate from expected behavior.

Examples:

  • Fraud Detection: Identifying suspicious activities in financial transactions.
  • Network Security: Detecting unusual traffic patterns or intrusions in a network.
  • Fault Detection: Identifying abnormal patterns in system operations, which could indicate a fault or defect.

Key Points to Consider:

  • Threshold Setting: Determining a boundary to differentiate between normal and abnormal behavior.
  • Feature Importance: Identifying which features are most relevant for detecting anomalies.
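Here is a hedged anomaly-detection sketch using an Isolation Forest; the data is synthetic, mostly “normal” points with a few injected outliers, so the numbers are illustrative only.

```python
# Isolation Forest flagging injected outliers in synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # expected behaviour
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # unusual points
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.03, random_state=42).fit(X)
labels = detector.predict(X)             # 1 = normal, -1 = anomaly
print("anomalies found:", (labels == -1).sum())
```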

Machine Learning Workflow

  1. Asking the right question
  2. Preparing data
  3. Selecting the algorithm
  4. Training the model
  5. Testing the model

Let’s see the workflow in detail:

Asking the right question

It is essential to know what you want from your data, and whether the data you have can actually give you the result you are looking for.

If the question you are asking is not the right one, chances are you won’t get the desired results once your model is ready. So, asking the right question is very important for making predictions from the data.

Preparing data

It is the most crucial step of the entire process. Data scientists spend most of their time preparing data; in most cases, data preparation takes up more than 50% of the project.

The significant steps of preparing data include loading, exploring, cleaning, imputing, and transforming the data.

Let me explain more about imputation options. Most of the time, the data has null or missing values, and how we deal with this situation can bias the results. There are various options:

  1. Ignore such data
  2. Delete the rows that have missing data
  3. Replace (impute) the missing values

It is easy to ignore or delete rows. But what if out of 1000 rows, 400 rows have missing data? Will it be okay to delete 400 rows? Indeed, the answer is NO.

We have to replace the values in such a case. One common way to replace a missing value is with the mean or median of the column.
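As a quick sketch of the replace/impute option, the snippet below fills missing values with the column mean or median using pandas; the tiny DataFrame is made up for the example.

```python
# Filling missing values with the mean or median of a column.
import pandas as pd

df = pd.DataFrame({"age": [25, 30, None, 45, None, 35]})

df["age_mean_filled"] = df["age"].fillna(df["age"].mean())
df["age_median_filled"] = df["age"].fillna(df["age"].median())
print(df)
```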

Selecting the algorithm

Selecting the correct algorithm is essential to getting the results we are looking for. First, we pass the training data prepared in the previous step to the algorithm, which learns from it and returns a model. Then, we pass the new data we want predictions for to that model.

More than 50 machine learning algorithms have been researched, so choosing the correct one is challenging. The algorithm is selected based on the following factors:

  1. The type of problem we are trying to solve.
  2. The data scientist’s judgment about which factors matter most for the problem at hand.
  3. Most importantly, experience plays a vital role in selecting the correct algorithm.

The general technique most data scientists follow to choose the correct algorithm is elimination. We can eliminate candidates and come closer to the right algorithm based on the following:

  1. Supervised learning or unsupervised learning
  2. Regression, or Classification, or Clustering
  3. Initially, it is safe to eliminate ensemble algorithms. These are algorithms that combine many child algorithms.
  4. We can also eliminate algorithms based on whether they are basic or enhanced. Enhanced algorithms are improvements over basic algorithms; as a beginner, we can start with the basic ones.

Training the model

Training our model is an important step. Usually, as our data changes with time, we need to retrain our model to predict the right results.

To train the model, we usually split our prepared data into:

  1. Training data
  2. Test data

Training data is the data we use to create the model. Test data is data for which we already know the expected result; it is passed to the model built from the training data to check the model’s accuracy.

Typically, 70% of the prepared data is used as training data, and the remaining 30% is used as test data.

The columns that we use to train our model are known as features. One way to improve the model’s performance is to train it with the minimum number of relevant features.

Additionally, the scikit-learn library is used for:

  1. Splitting data into training and test data
  2. Model training 
  3. Model tuning – Improving the performance of a model
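Here is a hedged end-to-end sketch of those three tasks: splitting the prepared data 70/30, training a model, and checking its performance. The dataset and the classifier are illustrative assumptions, not a recommendation.

```python
# Split 70/30, train, and compare training vs. test accuracy with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 1. Split the prepared data: 70% training, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train the model (feature scaling and logistic regression chained in a pipeline)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# 3. Quick sanity check: training and test accuracy should be close
print("training accuracy:", model.score(X_train, y_train))
print("test accuracy    :", model.score(X_test, y_test))
```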

Testing the model

Once we have our model ready, it’s time to test it. Remember the 30% of the prepared data we kept aside as test data? We will run our model on it and check its accuracy.

The accuracy on the test data and the training data should be close; only then can we say that our test is successful. But that is not always the case, and a common challenge we face while testing the model is overfitting. Let’s explore this next.

Overfitting of data

Overfitting means that the model learns the training data too closely, including its noise and quirks, instead of the general pattern. As a result, the accuracy on the test data falls well below the accuracy on the training data.

How can we fix overfitting issues?

We can control it with the help of a regularization hyperparameter. 

NOTE: This parameter goes by different names depending on the algorithm, so it is highly recommended to read the documentation to control overfitting better.

Another way to control overfitting is cross-validation.

Cross-validation is a technique that splits the training data into k folds; in each round, one fold is used as test data and the remaining folds are used as training data.

Some algorithms also have cross-validated versions, generally denoted by “<Algorithm name>CV”, e.g., “LogisticRegressionCV”.

NOTE: We can use both regularization hyperparameters and cross-validation simultaneously to control overfitting.
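To tie both ideas together, here is a brief sketch showing k-fold cross-validation scores for a model and the built-in cross-validated variant LogisticRegressionCV, which tunes its regularization strength internally; the dataset choice is an illustrative assumption.

```python
# Cross-validation and a cross-validated (regularized) algorithm variant.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale features so the solver converges easily

# 5-fold cross-validation: each fold takes one turn as the test data
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("accuracy per fold:", scores)

# Cross-validated variant: it also tunes the regularization strength C internally
model_cv = LogisticRegressionCV(cv=5).fit(X, y)
print("chosen regularization strength C:", model_cv.C_)
```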

I hope the machine learning workflow explanation is clear and you are excited and ready to train your first machine learning model.

Tavish lives in Hyderabad, India, and works as a results-oriented data scientist specializing in improving key business performance indicators.

He understands how data can be used for business excellence and is a focused learner who enjoys sharing knowledge.
