Statistical Analysis with Excel-Chapter8
When you are doing Analysis in Statistics you have two main categories :
a.Graphical Analysis(Data Visualization)
b.Descriptive and Inferential Statistics
I already went through all the visualizations possible through the chart types, but as an analyst that is only giving you a bigger picture of the data and is not going to give deeper insights into the data for decision making. So we have more tools in statistics. Therefore we will move to study Descriptive statistics.
In descriptive Statistics we have
a.Measures of Frequency(count and Frequency)
b.Measures of Central tendency(Mean,Median,Mode)
c.Measures of Dispersion(Variance, Standard deviation, and Range)
d.Measures of position. (percentiles)
We already practiced measures of Frequency when formatting Pivot tables for Charts.
So let us study the Measures of Central Tendency:
Mean: The statistical term for Average is mean.
We all calculate the average many times but why it is important to Data Analysts?
We practically do testing on Sample and get results to estimate the effects of the procedure on the population that produced the sample. The mean of the sample is the best estimate of the population mean.
I don’t need to tell you the formula but I’m going to revise it anyway.
Mean=Sum of numbers/Amount of numbers we added up
Population measures are called Parameters.
Sample measures are called Statistics.
population mean is (pronounced ‘Myou”)
Sample mean(pronounced “X bar”)
download the Equipment losses and personnel losses data and save on your desktop.
Data tab-Get data-From file-choose the Russia_equipemnt_losses file and load in Excel sheet.
We will go through
Average
AverageA
Averageif
AverageIFS
Trimmean
Geometric mean
Harmonic mean
I cannot just go and calculate average,it is important to check if our data is cleaned and is in the form we need.
Firstly I see that Day 1 is not listed in the table so add a row and put values for all columns as 0.You will understand why I am doing this in the next step.
Secondly write N/A in all rows of special equipment column and 0 for row 21,10 on row 22, 0 on row 23, 1 on row 24, 0 on row 25,1on row 26,1 on row 27 and 2 on row 28.
Finally I observed that the values are aggregated over the 27 days but I need daily losses for equipments.
So now I need to transform my table to pivot table.
Then choose Rows(day),Values(Sum of aircraft,sum of helicopter,sum of tank) (Right click and choose value filed settings as show values as difference from previous.
Do the same for helicopter and tank ,value field settings.
Because I am taking difference I need the Day 1 values otherwise I wont have Day 2 values after the difference.
Formulas tab-more functions-statistical-Average
Do the same for three columns.
** Average ignores the cells with text/true/false values***
Quickly check how AverageA works.In the table itself as you already did the changes we calculate average of special equipment and no need for pivot table.
AverageIf:
Conditional averages are very helpful in statistics.I will now take
Russia _personnel losses table.
follow same steps
Add row 1 in table with 0 value in all columns.
Convert the table into Pivot table.
Choose pivot fields as
then value field setting of Values (sum of personnel).
pivot table look
Formulas tab-more functions-statistical-AverageIF
Getting the average of personnel lossess only if the lossess are more than 1000.
Averageifs:
Getting the average of personnel lossess based on two criteria ≥1000 and day≤15.
TRIMMean:
In any data we need to look for statistical Outliers.Outlier is an extreme value in a set.These outliers can get in the way of conclusions based on the averaging of the data.
It is often recommended to eliminate them before we calculate the mean.
*** But in advanced Analytics of Data science we study the effects of Outliers before removing***
But in here we use Trimmean for trimming the mean ,to be able to use when needed.
I removed only 0.2% of outliers from the mean calculation.
Median:
Why median when we already use mean, because mean is sensitive to extreme values(outliers).
The mean value will change drastically when we have outliers and at that time mean would not be a good statistic to measure.
Median is the middle value in the group of numbers after arranging them in descending order. It is also the 50th percentile value, meaning half of the values fall below and another half of the values fall above the median value.
we will use the same russia_personnel_lossess table.
Mode:
Mode is the value that occurs most frequently in the group .
Use Mode. multi if you have two/more modes in the array.