Statistics for Data Science — Basic Statistics
Statistics is a foundational component of data science, providing powerful tools and techniques for analyzing and interpreting data. Data scientists use statistical methods to extract meaningful insights from large and complex datasets, identify patterns and trends, and support informed business decisions. With a strong statistical foundation, a data scientist can better understand the behavior of data.
In this blog series, we will cover everything from foundational theories to advanced analytical techniques and explore their real-world applications.
What Is Statistics?
Statistics is the branch of applied mathematics that deals with the collection, organization, analysis, interpretation, and presentation of data.
It is widely used in science, economics, social sciences, business, and engineering to generate insights, make predictions, and guide decision-making. In simple terms, statistics helps us discover patterns, trends, and relationships in data.
Examples
- Calculating the average (mean) marks of students in an exam.
- Estimating the average height of all students in a school based on a sample of 100 students.
Key Concepts
Data
Data is any recorded information or fact, such as a number, a label, or a measurement, that can be collected and analyzed.
Examples: age, weight, score, income.
Population
A population is the complete set of individuals or items that share a common characteristic and are the subject of study.
Example: all students in a class.
Types of Population
- Finite population: Has a fixed number of members that can be counted and measured directly. Example: the people enrolled in a course.
- Infinite population: Has no fixed limit, or is so large that it cannot be fully counted in practice. Example: the ongoing stream of Google searches.
Sample
A sample is a subset of a population used to draw conclusions about the entire population.
Example: surveying 100 students to understand the study habits of all students in a school.
Parameter
A parameter is a numerical value that describes a population.
Example: if the true average height of all students in a school is 5.5 feet, that value is a parameter.
Statistic
A statistic is a numerical value that describes a sample.
Example: if 100 students are measured and their average height is 5.4 feet, that value is a statistic.
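To make the difference between a parameter and a statistic concrete, here is a minimal Python sketch. The heights are simulated, not real measurements, and the population size, the random seed, and the names `population` and `sample` are assumptions made only for illustration.

```python
import random
import statistics

# Hypothetical population: simulated heights (in feet) of 2,000 students.
# These values are generated for illustration and are not real data.
rng = random.Random(42)
population = [rng.gauss(5.5, 0.3) for _ in range(2000)]

# Parameter: computed from every member of the population.
population_mean = statistics.mean(population)

# Statistic: computed from a random sample of 100 students.
sample = rng.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean): {population_mean:.2f} ft")
print(f"Statistic (sample mean):     {sample_mean:.2f} ft")
```

The two values will usually be close but not identical. In practice the parameter is unknown, and the statistic is used as an estimate of it.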
Variable
A variable is any characteristic or quantity that can take different values.
Examples: age, length, height.
Types of Variables
- Qualitative (categorical) variable: Describes qualities or categories. Examples: color of a car, blood type, gender.
- Quantitative (numerical) variable: Represents measurable quantities. Examples: number of children, weight, income.
Types of Quantitative Data
- Discrete data: Takes specific, countable values (often integers). Example: number of students in a class (30, 31, 33).
- Continuous data: Can take any value within a range and is obtained by measuring rather than counting. Example: height of a person.
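As a rough illustration of these categories, the sketch below labels the fields of one made-up record. The record, its field names, and the helper `variable_type` are hypothetical, and classifying by Python type is only a heuristic: a numeric code such as a ZIP code is stored as a number but is still qualitative.

```python
# One hypothetical survey record; field names and values are made up.
record = {
    "blood_type": "O",     # qualitative (categorical)
    "num_children": 2,     # quantitative, discrete (counted)
    "height_cm": 172.4,    # quantitative, continuous (measured)
}


def variable_type(value):
    """Heuristic label based on the Python type of a single value."""
    # bool is checked before int because bool is a subclass of int.
    if isinstance(value, (str, bool)):
        return "qualitative"
    if isinstance(value, int):
        return "quantitative, discrete"
    if isinstance(value, float):
        return "quantitative, continuous"
    raise TypeError(f"unsupported value: {value!r}")


for name, value in record.items():
    print(f"{name}: {variable_type(value)}")
```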
Scales of Measurement
There are four primary scales of measurement:
- Nominal Scale
- Ordinal Scale
- Interval Scale
- Ratio Scale
Nominal Scale
The nominal scale classifies data into distinct categories with no inherent order.
Examples:
- Gender: Male, Female
- Blood Type: A, B, AB, O
- Marital Status: Single, Married, Divorced
Ordinal Scale
The ordinal scale ranks data in a meaningful order, but differences between ranks are not equal or precisely measurable.
Examples:
- Education Level: High School, Bachelor’s, Master’s, PhD
- Customer Satisfaction: Very Dissatisfied to Very Satisfied
- Economic Status: Low, Middle, High
Interval Scale
The interval scale has ordered values with equal intervals, but no true zero point.
Examples:
- Temperature: Celsius, Fahrenheit
- Calendar Years
- IQ Scores
With interval data, addition and subtraction are meaningful, but ratio statements are not. For example, 20 °C is not twice as warm as 10 °C, because 0 °C is an arbitrary reference point rather than the absence of temperature.
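The temperature example can be checked with a short Python sketch. It converts both readings to kelvin, a scale with a true zero, and compares the ratios. The helper name `celsius_to_kelvin` and the variable names are chosen here for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to kelvin (ratio scale, true zero)."""
    return celsius + 273.15


warm, cool = 20.0, 10.0

# Differences are meaningful on an interval scale.
difference = warm - cool  # 10.0 degrees

# The naive ratio of the Celsius readings suggests "twice as warm"...
naive_ratio = warm / cool  # 2.0

# ...but measured from absolute zero the ratio is only about 1.035.
kelvin_ratio = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print(difference, naive_ratio, round(kelvin_ratio, 3))
```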
Ratio Scale
The ratio scale has all interval scale properties plus an absolute zero point, allowing meaningful ratio comparisons.
Examples:
- Height
- Weight
- Age
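The scale of measurement determines which summaries are meaningful. The sketch below shows one common reading of that idea: the mode for nominal data, the median for ordinal data, and a ratio for ratio data. All values are made up for illustration.

```python
import statistics

# Nominal: categories have no order, so only counting is meaningful (mode).
blood_types = ["A", "O", "B", "O", "AB", "O"]
most_common_type = statistics.mode(blood_types)

# Ordinal: categories have an order, so the median is meaningful.
levels = ["Low", "Middle", "High"]  # ordered from lowest to highest
responses = ["Low", "High", "Middle", "Middle", "High"]
ranks = [levels.index(r) for r in responses]
median_level = levels[statistics.median_low(ranks)]

# Ratio: a true zero exists, so "twice as heavy" is a valid statement.
weights_kg = [40.0, 80.0]
weight_ratio = weights_kg[1] / weights_kg[0]

print(most_common_type, median_level, weight_ratio)
```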
Types of Statistics
There are two major types of statistics:
- Descriptive Statistics
- Inferential Statistics
Descriptive Statistics
Descriptive statistics summarizes and presents data in a meaningful way so that we can understand it quickly.
Key Components
- Measures of central tendency
  - Mean: average value
  - Median: middle value in ordered data
  - Mode: most frequent value
- Measures of dispersion (variability)
  - Range: highest minus lowest value
  - Variance: average squared deviation from the mean
  - Standard Deviation: square root of variance
- Frequency distribution
  - Shows how often each value appears (tables, histograms, pie charts)
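The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on a small set of made-up exam marks. It uses the population formulas (`pvariance`, `pstdev`), which match the "average squared deviation" definition given here. For a sample, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.

```python
import statistics
from collections import Counter

# Hypothetical exam marks, made up for illustration.
marks = [72, 85, 85, 90, 64, 78, 85, 90]

# Measures of central tendency
mean = statistics.mean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)

# Measures of dispersion
data_range = max(marks) - min(marks)
variance = statistics.pvariance(marks)  # average squared deviation from the mean
std_dev = statistics.pstdev(marks)      # square root of the variance

# Frequency distribution: how often each mark appears
frequency = Counter(marks)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance:.2f}, std_dev={std_dev:.2f}")
for mark, count in sorted(frequency.items()):
    print(mark, "#" * count)
```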
Inferential Statistics
Inferential statistics uses sample data to draw conclusions or make predictions about a population.
Key Components
- Hypothesis Testing
  - Null Hypothesis (H0): no effect or no difference
  - Alternative Hypothesis (H1): effect or difference exists
- Confidence Intervals
  - A range of values, computed from the sample, that is expected to contain the true population parameter at a stated confidence level (for example, 95%)
- Regression Analysis
  - Studies relationships between variables and supports prediction
- Statistical Tests
  - t-tests, chi-square tests, ANOVA
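As a small illustration of inference, the sketch below builds a 95% confidence interval for a mean and tests a null hypothesis about it. It is a sketch under stated assumptions, not a definitive method: the sample is simulated, the hypothesized mean of 5.5 feet is chosen only for the example, and it uses a large-sample normal (z) approximation rather than a t-test, because Python's standard library does not include the t-distribution.

```python
import math
import random
import statistics

# Hypothetical sample: simulated heights (in feet) of 100 students.
rng = random.Random(0)
sample = [rng.gauss(5.4, 0.3) for _ in range(100)]

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# 95% confidence interval using the normal approximation (z is about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

# Two-sided z-test of H0: the population mean equals 5.5 feet.
h0_mean = 5.5
z_stat = (sample_mean - h0_mean) / standard_error
p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z_stat)))

print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) leads to rejecting H0. With this two-sided test, that happens exactly when the hypothesized mean falls outside the 95% confidence interval.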
Thanks for reading.
“Your network is your net worth.” - Tim Sanders
Connect on LinkedIn: md-sawrab
GitHub: md-sawrab