Beginner Tutorial · 15 min read

Everyone tells you
to start with Python.
They are wrong.

A practitioner's honest guide to machine learning, starting where it actually matters, which is your data, not your tools.

Louiza Boujida · March 2026 · Beginner to Production
Before anything else

What exactly is Machine Learning?

Machine Learning is a subset of Artificial Intelligence. But that sentence alone does not help much. Let us place it precisely in the landscape of fields it touches, because understanding where ML sits changes how you learn it.

The landscape of intelligent systems

Artificial Intelligence: any technique that lets machines simulate human intelligence. Includes rule-based systems, expert systems, and search algorithms.

Machine Learning: AI that learns from data instead of following explicit rules. The system improves its performance through experience.

Deep Learning: ML using neural networks with many layers. Excels at unstructured data: images, text, audio. A subset of ML, not a replacement.
∑ Mathematics

Linear algebra, statistics, and calculus form the foundation that ML is built on. You do not need to master it first, but you need to understand what it is doing.

🗄 Data Engineering

The infrastructure that delivers clean, reliable data to your model. ML without solid data engineering is a model built on sand.

📊 Data Science

The broader field that combines statistics, ML, domain expertise, and storytelling to extract insights from data. ML is one of its core tools.

⚙️ Software Engineering

What takes a model from a notebook to a system running in production. MLOps, the intersection of ML and engineering, is where real value is created.

The simplest definition of ML that actually holds up in practice: Machine Learning is the process of giving a computer examples instead of instructions, so it can figure out the rules itself.

Traditional programming: you write the rules, the computer applies them to data and produces output. Machine Learning: you provide the data and the expected output, and the computer finds the rules on its own.

Why this distinction matters

Traditional programming works when you can write the rules. You can write rules for calculating a tax invoice. You cannot write rules for recognizing a cat in a photo. There are too many variations. That is where ML takes over: when the rules are too complex to write by hand, but examples are plentiful.
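The contrast fits in a few lines of Python. A minimal sketch with invented numbers: the tax rate and the toy "cat vs dog" measurements are illustrative, not real data.

```python
from sklearn.linear_model import LogisticRegression

# Traditional programming: you write the rule yourself.
def tax_due(amount, rate=0.20):      # rate is an invented example value
    return amount * rate

# Machine learning: you provide examples, the computer finds the rule.
# Toy examples: [weight_kg, ear_length_cm], label 1 = cat, 0 = dog
X = [[4, 7], [5, 8], [3, 6], [30, 12], [25, 11], [35, 13]]
y = [1, 1, 1, 0, 0, 0]

model = LogisticRegression().fit(X, y)
print(model.predict([[4, 7]])[0])    # the learned rule classifies new inputs
```

The first function encodes a rule you could write down; the second learns its decision boundary entirely from the labelled examples.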


Open any "How to get started in Machine Learning" tutorial and you will find the same first instruction: install Python, install scikit-learn, run this notebook. Within ten minutes you are fitting a model on the Iris dataset and getting 95% accuracy. You feel like you are doing ML.

You are not. You are running pre-cleaned demo data through a pre-written pipeline. That is not machine learning. That is following a recipe with someone else's ingredients.

"In 24 years of building data systems, I have seen hundreds of ML projects fail. Almost none of them failed because of the model choice. They failed because the data was wrong, incomplete, biased, or misunderstood."
Louiza Boujida, AI & Data Architect

This tutorial is different. We start where real ML starts: with data. Tools come after understanding. Models come last.


The Real Foundation

The data pipeline that everyone skips

Before a single row reaches your model, your data travels through an entire pipeline. Every step introduces potential errors, biases, and information loss. Yet most tutorials start at the very last node.

1. 📥 Raw Data: CSV, databases, APIs, sensors
2. 🔍 Exploration: distributions, nulls, correlations
3. 🧹 Cleaning: fix types, handle missing values
4. ⚙️ Features: create meaningful signals
5. 🧠 Model: the part tutorials start with

In a real project, steps 1 through 4 take 80% of your time. The model training itself often takes less than an hour.
Practitioner's note

The model is the last 10% of the work. The data is the other 90%. A mediocre model trained on clean, well-understood data will outperform a sophisticated model trained on messy, misunderstood data almost every time. This is not an opinion. It is a pattern I have seen hundreds of times across enterprise projects.

Practice exercise 1

Explore a real dataset before any model

Find any CSV in your area of work. Not Iris, not Titanic. Open it in Excel or Google Sheets first. Without any code, answer these five questions:

1. How many rows and columns?
2. Are there empty cells?
3. What does the distribution of one numeric column look like?
4. Are there any obvious outliers?
5. What story is this data trying to tell?

Spending 30 minutes here will teach you more about ML than running ten notebooks on demo data.

Check your understanding
In a typical real-world ML project, what takes the most time?

The Honest Roadmap

Five steps, in the right order

Most roadmaps start at step three. Here is what should happen before that, and why it matters more than anything else.

01
Weeks 1 to 2

Learn to think about data, not models

Before any code: understand what makes a good dataset. What is a feature? What is a label? What does missing data really mean for a model? Spend a week with a real messy dataset from your own industry. Ask: what story does this data tell? What is it hiding?

CSV exploration · Excel · Domain knowledge first
02
Weeks 3 to 4

Learn Python for data, not for ML

Now learn Python, specifically for data manipulation. Pandas, NumPy, basic visualizations. The goal is not to learn programming. The goal is to ask questions of your data at speed. Can I see the distribution? Where are the nulls? How do these two variables relate?

Python · Pandas · NumPy · Matplotlib · Jupyter
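Those three questions translate almost word for word into Pandas. A quick sketch on an invented toy frame (the column names and values are made up for illustration):

```python
import pandas as pd

# Invented data standing in for your real CSV
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 29, 38],
    "income": [2800, 3500, 3100, 5200, 2900, 4800],
})

# Can I see the distribution?
print(df["income"].describe())

# Where are the nulls?
print(df.isnull().sum())              # one missing age

# How do these two variables relate?
print(df["age"].corr(df["income"]))   # positive here: older tends to earn more
```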
03
Month 2

Understand what ML actually does mathematically

You do not need a PhD. But you need to understand: what is a loss function? What does training mean? What is overfitting and why is it the most common mistake in production ML? Learn these with visuals before equations. The math will make more sense after you have seen the problems it solves.

Linear regression · Decision trees · Train/test split · Scikit-learn
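Overfitting is easiest to grasp by watching it happen. A sketch on synthetic noisy data (all values invented): an unlimited-depth tree memorizes the training set, and the gap to its test score is the overfit.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: the label depends on one feature plus random noise
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (2, None):  # a shallow tree vs an unlimited one
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# The unlimited tree scores 1.0 on training data but clearly worse on test
# data: it memorized the noise instead of learning the signal.
```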
04
Month 3

Build one real project end to end

Not five half-finished notebooks. One project, start to finish: collect data, clean it, explore it, build a simple model, evaluate it honestly, document what went wrong. The project does not need to be impressive. It needs to be real.

Your real data · GitHub · Honest evaluation · Documentation
05
Month 4 onward

Learn what makes ML fail in production

This is where most tutorials stop, and where real ML begins. A model that was 94% accurate in January can be silently wrong by June. You need data drift detection, model monitoring, retraining pipelines, and governance. Because nobody will tell you it is broken unless you build systems to catch it.

MLOps basics · Data drift · Model monitoring · AI Governance
Month 1 checklist
I explored a real dataset from my field without any code
I can explain the difference between a feature and a label
I installed Python and Jupyter and ran my first data exploration
I used Pandas to load a CSV and check for null values
I can explain what overfitting means in plain language
I trained my first simple model and evaluated it honestly

The Landscape

Three types of ML you actually need to know

Not a full taxonomy. Just the three approaches you will use 90% of the time, and when each one is appropriate.

Supervised

You have the answers. Train from them.

Your dataset has a label, which is a correct output for each input. The model learns to predict that output for new inputs it has never seen. This is where most business ML lives: predicting churn, detecting fraud, forecasting demand, classifying support tickets.

Prediction, classification, regression, fraud detection, demand forecasting
Unsupervised

No labels. Find the hidden structure.

No correct answers in your dataset. The model discovers patterns by itself: natural groupings, anomalies, similarities between items. Often more powerful than it looks, and very useful when you do not yet know what questions to ask of your data.

Customer segmentation, anomaly detection, recommendation systems, topic modeling
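A hedged sketch of what "finding hidden structure" looks like, using k-means on invented purchase data: two spending groups exist, but the algorithm never sees a label.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented behavior: [monthly_spend, orders_per_month], no labels anywhere
rng = np.random.default_rng(0)
low = rng.normal([20, 2], 3, size=(50, 2))
high = rng.normal([200, 15], 3, size=(50, 2))
X = np.vstack([low, high])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(np.bincount(kmeans.labels_))   # the two groups are recovered unsupervised
```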
Deep Learning

When patterns are too complex for rules.

Neural networks with many layers. Learns from raw data: images, text, audio. This is what powers ChatGPT, image recognition, and speech-to-text. Extremely powerful, but needs massive data and compute. Often complete overkill for business problems where a simple model works just as well.

Image recognition, natural language processing, speech, generative AI
Common trap

Many beginners jump straight to deep learning because it sounds impressive. In practice, a well-tuned logistic regression or gradient boosting model will outperform a neural network on most tabular business datasets, and it will be far easier to explain, debug, and maintain.

Practice exercise 2

Match your problem to the right ML type

For each scenario below, decide which type of ML is most appropriate and why. Write your answers before reading further.

Scenario A: You want to predict whether a bank customer will close their account next month. You have 3 years of account history with a flag for who actually closed their account.

Scenario B: You have 5 years of customer purchase data and no labels. You want to group customers by behavior to personalize marketing campaigns.

Scenario C: You want to automatically tag incoming customer support emails by topic, but you have no historical labels. You have 50,000 emails.


Hard-Won Lessons

Five mistakes I see constantly

After reviewing hundreds of ML projects across enterprise environments, these patterns appear every single time.

01 of 05

Starting with a complex model

Everyone wants to build a neural network. But a logistic regression you understand beats a transformer you cannot explain or debug. Complexity is a cost, not a feature. The simpler model is almost always the right starting point.

Start with logistic regression or a decision tree. Add complexity only when you can prove it is needed.
02 of 05

Treating accuracy as the only metric

A model with 95% accuracy sounds excellent. But if 95% of your dataset is negative cases, a model that always predicts "no" achieves 95% with zero intelligence. Accuracy is nearly meaningless for imbalanced datasets.

Learn precision, recall, F1-score, and AUC-ROC. Always look at the confusion matrix first.
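The "always predict no" trap is easy to reproduce. A sketch with invented labels mirroring the 95% split above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Invented imbalanced labels: 95% negative, 5% positive
y_true = np.array([0] * 950 + [1] * 50)

# A "model" with zero intelligence: it always predicts the majority class
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))   # 0.95, looks excellent
print(recall_score(y_true, y_pred))     # 0.0, it catches zero positive cases
```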
03 of 05

Ignoring data leakage

When information from your test set leaks into your training process, your model appears to work perfectly, then fails completely in production. It is the most dangerous and hardest to detect mistake in ML. I have seen it end projects.

Always split your data before any preprocessing. Build pipelines that enforce this rigorously and automatically.
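One reliable way to enforce that discipline is to put every preprocessing step inside a scikit-learn Pipeline, so it is fit only on training data within each fold. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Leaky: calling scaler.fit(X) on ALL rows before splitting lets test-set
# statistics influence training. Safe: the pipeline refits the scaler on the
# training portion of each cross-validation fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```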
04 of 05

Deploying and forgetting

A model trained in January on January data will silently degrade as the world changes. Most teams deploy and move on. This is how you end up with a working model that has been making wrong decisions for six months without anyone knowing.

Set up drift detection and scheduled re-evaluation from day one. Treat your model as a living system, not a finished artifact.
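A drift check can start very small. One common pattern, sketched here with invented numbers and SciPy's two-sample Kolmogorov-Smirnov test, is to compare a feature's training-time distribution against the data the model sees in production:

```python
import numpy as np
from scipy.stats import ks_2samp

# Invented feature: customer age at training time vs six months later
rng = np.random.default_rng(7)
train_ages = rng.normal(38, 8, size=1000)
live_ages = rng.normal(45, 8, size=1000)   # the population quietly shifted

stat, p_value = ks_2samp(train_ages, live_ages)
if p_value < 0.01:
    print("Drift detected: investigate and consider retraining")
```

Real monitoring systems track many features on a schedule, but the core idea is exactly this comparison.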
05 of 05

Skipping documentation and governance

Who trained this model? On what data? What are its known failure modes? Most teams cannot answer these questions. When something goes wrong, and it will, you have no way to investigate, explain, or fix it.

Document every model: training data, version, performance metrics, known limitations, owner. Build governance before you scale.
Check your understanding
A model achieves 97% accuracy on your test set. Before deploying, what should you check first?

Your First Real Project

Follow this step by step

This is a complete walkthrough. By the end, you will have a working ML model trained on real data, evaluated honestly, and published on GitHub. No demo datasets. No shortcuts.

The project

Predict which employees are likely to leave

IBM released a real HR dataset with 1,470 employees and 35 variables: age, salary, job satisfaction, overtime, distance from home, and more. For each employee, you know whether they actually left. Your goal is to build a model that predicts attrition before it happens.

This is a real business problem. Companies spend an average of 6 to 9 months of salary to replace an employee. A model that identifies at-risk employees early has direct financial value.

1,470 employees in the dataset
35 features per employee
16% attrition rate (imbalanced)
Free, available on Kaggle
Step 1: Get the data and set up your environment

Go to kaggle.com, create a free account, and search for "IBM HR Analytics Employee Attrition". Download the file WA_Fn-UseC_-HR-Employee-Attrition.csv.

Then install the tools you need. Open your terminal and run:

# Install required libraries
pip install pandas matplotlib seaborn scikit-learn jupyter

# Launch Jupyter Notebook
jupyter notebook

A browser window opens. Create a new notebook called attrition_model.ipynb. You are ready.

Step 2: Explore the data before anything else

Before writing a single line of model code, spend at least one hour understanding what you are working with. This is not optional. It is the most important step.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Basic facts about your dataset
print(df.shape)               # Should be (1470, 35)
print(df.isnull().sum().sum()) # Should be 0, this dataset is clean

# How balanced are your classes?
print(df['Attrition'].value_counts())
# Output: No=1233, Yes=237. Only 16% left. This is imbalanced.

# Who is more likely to leave?
df.groupby('Attrition')[['Age', 'MonthlyIncome', 'YearsAtCompany']].mean()

Look at the last output carefully. People who left tend to be younger, earn less, and have fewer years at the company. This is your first insight, and you have not trained anything yet.

Step 3: Prepare the data for modeling

ML models need numbers. You need to convert text columns and split your data into training and test sets. The split must happen before any other transformation.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Convert the target column to 0/1
df['Attrition'] = (df['Attrition'] == 'Yes').astype(int)

# Drop columns that add no signal
df = df.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis=1)

# Encode remaining text columns
df = pd.get_dummies(df, drop_first=True)

# Split BEFORE any further processing
X = df.drop('Attrition', axis=1)
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

The stratify=y parameter ensures that both your train and test sets have the same 16% attrition rate. Without it, you might accidentally get very different distributions.
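You can verify that claim in one line. A sketch with synthetic labels mirroring the dataset's 237/1233 split (placeholder features, invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same 237/1233 split as the IBM dataset
y = np.array([1] * 237 + [0] * 1233)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

_, _, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                    random_state=42, stratify=y)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # both close to 0.161
```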

Step 4: Train a simple model and evaluate it honestly

Start with logistic regression. Not because it is the most powerful, but because you can understand exactly what it is doing. Interpretability is more valuable than accuracy when you are learning.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Never just print accuracy. Read the full report.
print(classification_report(y_test, y_pred))

# Visualize what the model got right and wrong
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.show()
Read this before interpreting your results

The class_weight='balanced' parameter tells the model to pay more attention to the minority class (people who left). Without it, the model will learn to always predict "stayed" and achieve 84% accuracy while being completely useless. This is the imbalanced dataset problem in action.

Step 5: Understand what the model learned

This is the step most tutorials skip. Before declaring success, look inside the model. Which features mattered most?

import numpy as np

# Get the most important features
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': np.abs(model.coef_[0])
}).sort_values('coefficient', ascending=False)

print(feature_importance.head(10))
# You will likely see: OverTime, JobSatisfaction, MonthlyIncome

Ask yourself: do these top features make business sense? If overtime and low job satisfaction predict attrition, that is credible. If something unexpected appears in the top features, investigate it before trusting the model. One caveat: raw coefficient magnitudes are only comparable when features share a similar scale, so treat this ranking as a rough guide.

Step 6: Document and publish on GitHub

This step is not optional. It is what separates people who are learning from people who are building a portfolio. Create a GitHub repository and write a README.md that answers these questions:

README.md template
Project: What business problem does this solve?
Data: Where does it come from? How many rows? What time period?
Model: What algorithm? Why this one?
Results: Precision, recall, F1 on the test set. Not just accuracy.
Limitations: What does this model not do well? What data would improve it?
Next steps: What would you do differently with more time?

Publishing your honest analysis of what worked and what did not is far more impressive than a polished result that hides the problems. Every serious practitioner knows that real ML is messy. Show that you understand that.

What you have just done

You have completed every stage of a real ML pipeline: sourcing data, exploring it, preparing it correctly, training a model with the imbalanced dataset problem in mind, evaluating it beyond accuracy, interpreting what it learned, and documenting it honestly. That is the foundation of everything else in ML.


Go further

Curated resources, no noise

Datasets, courses, documentation, and tools, all carefully selected and grouped by topic. Kaggle datasets, Scikit-learn docs, video series, and more.

View all references →
Louiza Boujida
AI and Data Architect with 24 years building production systems. I write about what actually works. TheGovernAI exists because data, architecture, and governance are not separate subjects.