Statistical Analysis with Excel-Chapter12

hajra Begum
Mar 30, 2022

Data visualization (charts) and descriptive statistics are important in analysis: they give the analyst a feel for the data and for how to approach it. But they are not sufficient to interpret the data or draw conclusions from the analysis.

That is why we need inferential statistics.

To get good at this, I would suggest going back and doing a knowledge check on probability and probability distributions.

I did that, so I can list some important items:

  • Data types:

1. Categorical/Qualitative data: labels.

2. Numerical/Quantitative data: discrete (counted) and continuous (measured).

  • Probability:

Probability = number of favorable outcomes / total number of possible outcomes (assuming all outcomes are equally likely)

Properties of probability:

  1. Total probability of all possible outcomes=1
  2. The value of probability is always between 0 and 1.
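To make the definition and its two properties concrete, here is a minimal sketch in Python. The fair six-sided die and the helper name `probability` are my own assumptions for illustration, not part of the chapter.

```python
from fractions import Fraction

# Assumed example: the sample space for one roll of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]

def probability(event, sample_space):
    """Classical probability: favorable outcomes / total possible outcomes."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

# Probability of rolling an even number.
p_even = probability(lambda o: o % 2 == 0, outcomes)

# Probability of each individual face.
p_each = [probability(lambda o, k=k: o == k, outcomes) for k in outcomes]

print(p_even)       # 1/2
print(sum(p_each))  # 1, the total probability of all possible outcomes
```

Every value in `p_each` lies between 0 and 1, and together they add up to 1, which matches the two properties listed above.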

Random: the outcome involves chance and cannot be predicted with certainty.

Random variable: a variable whose value is the numerical outcome of a random experiment.

Random variables are used throughout probability and statistics to quantify the outcomes of random occurrences.

P(X=T) = probability that the random variable X takes the value T.

http://people.stern.nyu.edu/adamodar/pdfiles/papers/probabilistic.pdf

A probability distribution is a function that gives the probability of each possible value that a random variable can take.

Based on discrete and continuous data types, a probability distribution can be classified as a discrete probability distribution or a continuous probability distribution.

When the random variable takes discrete values, its distribution is described by a probability mass function (PMF). The probability mass function, f(x), evaluated at x, is given as follows:

f(x) = P(X = x)

The cumulative distribution function of a discrete random variable is given by the formula F(x) = P(X ≤ x).

Examples: Bernoulli distribution, binomial distribution, Poisson distribution.
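As a rough illustration of f(x) = P(X = x) and F(x) = P(X ≤ x), here is a small Python sketch for the binomial distribution. The function names and the choice of 10 fair coin flips are assumptions made for this example. In Excel, the same two numbers should come from BINOM.DIST(5, 10, 0.5, FALSE) and BINOM.DIST(5, 10, 0.5, TRUE).

```python
from math import comb

def binomial_pmf(k, n, p):
    """f(k) = P(X = k) for a binomial random variable with n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """F(k) = P(X <= k): the PMF summed over 0, 1, ..., k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

n, p = 10, 0.5  # assumed example: 10 flips of a fair coin

pmf_5 = binomial_pmf(5, n, p)  # probability of exactly 5 heads
cdf_5 = binomial_cdf(5, n, p)  # probability of at most 5 heads

print(pmf_5)  # 0.24609375
print(cdf_5)  # 0.623046875
```

The CDF is just a running total of the PMF, which is why `cdf_5` is larger than `pmf_5`.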

When the random variable takes continuous values, the distribution is described by a probability density function (PDF), f(x). The probability that a random variable X lies between points a and b is given by:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

The cumulative distribution function F(x) for a continuous rv X is defined for every number x by

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(y) dy

Using F(x) to compute probabilities:

P(a ≤ X ≤ b) = F(b) − F(a)

The expected (or mean) value of a continuous rv X with pdf f(x) is:

μ = E(X) = ∫_{−∞}^{∞} x f(x) dx

The variance of a continuous random variable X with pdf f(x) and mean value μ is

σ² = V(X) = ∫_{−∞}^{∞} (x − μ)² f(x) dx

The standard deviation (SD) of X is

σ = √V(X)
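The continuous definitions above can be checked numerically. The sketch below assumes a simple example density, f(x) = 2x on [0, 1], and a hand-written midpoint-rule integrator named `integrate`. Both are my own choices for illustration, not something the chapter prescribes.

```python
from math import sqrt

def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    width = (b - a) / steps
    return sum(g(a + (i + 0.5) * width) for i in range(steps)) * width

def pdf(x):
    # Assumed example density: f(x) = 2x on [0, 1], zero elsewhere.
    return 2 * x if 0 <= x <= 1 else 0.0

total_area = integrate(pdf, 0, 1)        # a valid pdf integrates to 1
prob = integrate(pdf, 0.25, 0.5)         # P(0.25 <= X <= 0.5)
mean = integrate(lambda x: x * pdf(x), 0, 1)
variance = integrate(lambda x: (x - mean) ** 2 * pdf(x), 0, 1)
sd = sqrt(variance)

print(total_area, prob, mean, variance, sd)
```

For this density the CDF is F(x) = x², so the exact probability is F(0.5) − F(0.25) = 0.1875, the exact mean is 2/3 and the exact variance is 1/18. The numerical results should agree with these to several decimal places.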

Continuous models often approximate real-world situations very well, and continuous mathematics (calculus) is frequently easier to work with than the mathematics of discrete variables and distributions.

We can visualize these distributions through charts and graphs. To get a probability from them we use the area under the curve over an interval, which in turn is obtained through calculus (integration).

All of the above are basics for moving forward. The main aim is inferential statistics, i.e. estimating population characteristics from sample statistics.
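Here is a minimal sketch of what estimating a population characteristic from a sample looks like. The population (100,000 simulated heights with mean 170 and standard deviation 10), the sample size of 200 and the fixed seed are all assumptions made up for this illustration.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Assumed population: 100,000 simulated heights in cm.
population = [random.gauss(170, 10) for _ in range(100_000)]

# In practice we only get to see a sample, not the whole population.
sample = random.sample(population, 200)

population_mean = mean(population)  # the characteristic we want to know
sample_mean = mean(sample)          # the statistic we can actually compute

print(round(population_mean, 2), round(sample_mean, 2))
```

The sample mean will not match the population mean exactly, but it should land close to it. How close, and how confident we can be about that, is what inferential statistics tries to answer.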

To make this inference I need assumptions. I will not go into the math here, just the reasoning behind the concept of inferring.

For example, an attribute is often distributed throughout a population so that most people have an average or near-average amount of it, and progressively fewer people have increasingly extreme amounts.

This happens so often that it becomes an assumption, and capturing that assumption graphically gives a bell curve. It is formally called the normal distribution. The normal distribution has two parameters: the mean (center of the distribution) and the standard deviation (spread of the distribution).

A random variable that follows a normal distribution is denoted as

X ~ N(μ, σ²)

Here, μ is the mean and σ² is the variance, and they form the parameters of the normal distribution.

Let us quickly go over the properties of the normal distribution:

  1. Symmetric about the mean.
  2. Empirical rule

a. P [ μ − σ ≤ X ≤ μ + σ ] ≈ 68 %

The first formula says that the probability that the variable falls between μ − σ and μ + σ is about 68 %, which means about 68 % of the data points of the random variable X fall within one standard deviation of the mean.

b. P [ μ − 2σ ≤ X ≤ μ + 2σ ] ≈ 95 %

The second formula says that the probability that the variable falls between μ − 2σ and μ + 2σ is about 95 %, which means about 95 % of the data points of the random variable X fall within two standard deviations of the mean.

c. P [ μ − 3σ ≤ X ≤ μ + 3σ ] ≈ 99.7 %

The third formula says that the probability that the variable falls between μ − 3σ and μ + 3σ is about 99.7 %, which means about 99.7 % of the data points of the random variable X fall within three standard deviations of the mean.

  3. Mean = median = mode.
  4. The area under the curve of the normal distribution = 1.
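The empirical rule and the symmetry property can be checked with the `NormalDist` class from Python's standard `statistics` module. The mean of 100 and standard deviation of 15 are assumed values for illustration; any other pair should give the same percentages. In Excel, the cumulative probabilities used here correspond to NORM.DIST(x, mean, standard_dev, TRUE).

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)  # assumed example parameters

def within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma), computed from the CDF."""
    low = X.mean - k * X.stdev
    high = X.mean + k * X.stdev
    return X.cdf(high) - X.cdf(low)

for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# Symmetry about the mean: half of the area lies on each side.
half = X.cdf(X.mean)
print(half)  # 0.5
```

The three results are the more precise values behind the rounded 68 %, 95 % and 99.7 % figures in the empirical rule.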
