Studying Together: Understanding Measures of Central Tendency
Hi fellow Data Alchemists, I’m writing here with the goal of studying together the essentials we need to know if we want to become Data Scientists or work with Machine Learning. I’m considering creating a series of posts covering Statistics, Probability, and maybe even some of the Math needed for Machine Learning. I’m not an expert, so I’ll gather information from the web, write the post, and then ask ChatGPT to correct me. My idea is to create some accountability for myself while sharing my studies with all of you.

Today, I thought I’d start with the basics: Measures of Central Tendency. As the title suggests, "Central Tendency" should already give us a hint about what this is about, right? If you’re not sure, it’s simply a fancy term for the mean, median, and mode.

So, what are Measures of Central Tendency? They’re key statistical tools that summarize a dataset with a single value describing its central point. In data science they give us a quick first read on how the data is distributed and what a "typical" value looks like. The three primary measures, mean, median, and mode, each give us a different perspective on this “center.”

- Mean: The mean is calculated by summing all data points and dividing by the number of points. It’s most useful when the data is roughly symmetric, where it sits close to the middle of the distribution and serves as an estimate of the expected value. However, it’s sensitive to outliers, so in skewed distributions it might not accurately represent the center of the data.
- Median: The median is the middle value when the data is ordered (with an even number of points, it’s the average of the two middle values). It’s especially valuable in skewed distributions or when there are outliers. Since it depends on the position of the values rather than their magnitude, it’s often a more robust measure of central tendency than the mean in non-normal distributions.
- Mode: The mode is the most frequently occurring value. It’s useful for categorical data, where the mean and median aren’t defined, and for multimodal distributions, which have more than one peak. It offers insight into the most common category or value in the dataset, which can be particularly important for understanding customer preferences, product popularity, or common patterns in discrete data.
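To make the three measures concrete, here is a minimal sketch using Python's built-in `statistics` module. The `salaries` and `colors` lists are made-up values chosen only for illustration (they don't come from any real dataset); the salary list includes one deliberate outlier so we can see how the mean and the median react differently.

```python
from statistics import mean, median, multimode

# Hypothetical data: salaries in thousands, with one large outlier (250)
salaries = [30, 32, 35, 35, 38, 40, 250]

salary_mean = mean(salaries)        # sum of values / number of values
salary_median = median(salaries)    # middle value of the sorted data
salary_modes = multimode(salaries)  # list of the most frequent value(s)

print("mean:  ", salary_mean)    # pulled well above most values by the outlier
print("median:", salary_median)  # stays with the bulk of the data
print("modes: ", salary_modes)

# The mode also works on categorical data, where mean and median make no sense
colors = ["red", "blue", "red", "green", "blue", "red"]  # hypothetical survey answers
color_modes = multimode(colors)
print("most common color(s):", color_modes)

# A multimodal example: two values tie for most frequent
tied_modes = multimode([1, 1, 2, 2, 3])
print("tied modes:", tied_modes)
```

I used `multimode` rather than `mode` here because it returns every value that ties for most frequent, which matches the multimodal case mentioned above. In this toy example a single outlier is enough to pull the mean far above the median, which is exactly the sensitivity described in the Mean bullet.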