Introduction to data science
Chapter 5: Summarizing data - the basics.
Data summarization is the science and art of conveying information more effectively and efficiently. Data summarization is typically numerical, visual or a combination of the two. It is a key skill in data analysis - we use it to provide insights both to others and to ourselves. Data summarization is also an integral part of exploratory data analysis.
In this chapter we focus on the basic techniques for univariate and bivariate data. Visualization and more advanced data summarization techniques will be covered in later chapters.
We will be using R and ggplot2 but the contents of this chapter are meant to be tool-agnostic. Readers should use the programming language and tools that they are most comfortable with. However, do not sacrifice expressiveness or professionalism for the sake of convenience - if your current toolbox limits you in any way, learn new tools!
5.1 Descriptive statistics for univariate distributions
We humans are not particularly good at thinking in multiple dimensions, so in practice there is a tendency to look at individual variables and dimensions. That is, most of the time we will be summarizing univariate distributions.
Univariate distributions come from various sources. It might be a theoretical distribution, an empirical distribution of a data sample, a probabilistic opinion from a person, a posterior distribution of a parameter from a Bayesian model, and many others. Descriptive statistics apply to all of these cases in the same way, regardless of the source of the distribution.
Before we proceed with introducing the most commonly used descriptive statistics, we discuss their main purpose. The main purpose of any data summarization technique is to (a) reduce the time and effort of delivering information to the reader while (b) losing as little relevant information as possible. That is, to compress the information.
All summarization methods do (a) but we must be careful to choose an appropriate method so that we also get (b). Summarizing out relevant information can lead to misleading summaries, as we will illustrate with several examples.
5.1.1 Central tendency
The most common first summary of a distribution is its typical value, also known as the location or central tendency of a distribution.
The most common summaries of the location of a distribution are:
- the mean (the mass centre of the distribution),
- the median or 2nd quartile (the value such that half of the mass is on one side and half on the other),
- the mode (the most probable value or the value with the highest density).
Given a sample of data, the estimate of the mean is the easiest to compute (we compute the average), but the median and mode are more robust to outliers - extreme and possibly unrepresentative values.
In the case of unimodal approximately symmetrical distributions, such as the univariate normal distribution, all these measures of central tendency will be similar and all will be an excellent summary of location. However, if the distribution is asymmetrical (skewed), they will differ. In such cases it is our job to determine what information we want to convey and which summary of central tendency is the most appropriate, if any.
For example, observe the Gamma(1.5, 0.1) distribution and its mean, median and mode:
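A minimal sketch of this comparison in R (the sample size and seed are arbitrary choices, not taken from the original example):

```r
# Mean, median and mode of a sample from Gamma(shape = 1.5, rate = 0.1)
set.seed(1)
x <- rgamma(10000, shape = 1.5, rate = 0.1)

mean(x)              # close to the theoretical mean, 1.5 / 0.1 = 15
median(x)            # noticeably smaller than the mean for this skewed distribution
d <- density(x)
d$x[which.max(d$y)]  # the mode, estimated as the peak of a kernel density estimate
```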
In the case of multi-modal distributions, no single measure of central tendency will adequately summarize the distribution - they will all be misleading. For example, look at this bimodal distribution:
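As an illustration, here is a hypothetical bimodal sample (an equal mixture of two normal distributions; the parameters are invented for the example):

```r
# A 50/50 mixture of N(-3, 1) and N(3, 1): two clear modes, around -3 and 3
set.seed(1)
x <- c(rnorm(5000, mean = -3), rnorm(5000, mean = 3))

mean(x)    # close to 0
median(x)  # also close to 0, a region with almost no probability mass
```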
5.1.2 Dispersion
Once location is established, we are typically interested in whether the values of the distribution cluster close to the location or are spread far from the location.
The most common ways of measuring such dispersion (or spread or scale) of a distribution are:
- variance (the mean of squared distances from the mean) or, more commonly, standard deviation (the square root of the variance, which puts us back on the same scale as the measurement),
- median absolute deviation (the median of absolute distances from the median),
- quantile-based intervals, in particular the inter-quartile range (IQR) (the interval between the 1st and 3rd quartiles; 50% of the mass/density lies in this interval).
Standard deviation is the most commonly used of these, while median absolute deviation is more robust to outliers.
In the case of distributions that are approximately normal, the mean and standard deviation will be the optimal choice for summarization, because they correspond directly to the two parameters of the normal distribution. That is, they completely summarize the distribution without loss of information. We also know that approximately 95% (99%) of the normal density lies within 2 (3) standard deviations from the mean. Standard deviation is useful even if the distribution is not approximately normal: combined with the sample size (producing the standard error), it still provides some information on how certain we can be in our estimate of the mean.
But, as before, the more we deviate from normality, the less meaningful standard deviation becomes and it makes more sense to use quantile-based intervals. For example, if we estimate the mean and \(\pm\) 2 standard deviations for samples from the Gamma distribution from before, we get the following:
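In R, the computation might look like the sketch below (a simulated sample, so the exact numbers will differ from the original example):

```r
# Mean +/- 2 standard deviations for a Gamma(1.5, 0.1) sample
set.seed(1)
x <- rgamma(10000, shape = 1.5, rate = 0.1)
c(lower = mean(x) - 2 * sd(x), upper = mean(x) + 2 * sd(x))  # the lower bound is negative
```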
That is, the 95% interval estimated this way also includes negative values, which is misleading and absurd - Gamma distributed variables are positive. Computing the IQR or the 95% range interval provides a more sensible summary of this skewed distribution and, together with the mean, also serves as an indicator that the distribution is skewed (the mean is not the centre of the intervals):
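For example, quantile-based summaries of the same simulated sample can be obtained as follows:

```r
# Quantile-based intervals for the same Gamma(1.5, 0.1) sample
set.seed(1)
x <- rgamma(10000, shape = 1.5, rate = 0.1)
quantile(x, c(0.25, 0.75))    # inter-quartile range endpoints
quantile(x, c(0.025, 0.975))  # central 95% interval; mean(x) is not at its centre
```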
And, again, for multi-modal distributions, we can adequately summarize them only by identifying the modes visually and/or describing each mode individually.
5.1.3 Skewness and kurtosis
As mentioned above, ranges can be used to indicate a distribution's asymmetry (skewness) or fat-tailedness (kurtosis). Although less commonly used, there exist numerical summaries of skewness and kurtosis that can be used instead.
The following example shows the kurtosis and skewness for a gamma, normal, logistic and bimodal distribution. Observe how the standard way of calculating kurtosis fails for the bimodal distribution and assigns it the lowest kurtosis:
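One way to reproduce such a comparison is sketched below; it assumes the moments package (an assumption on our part - any package providing sample skewness and kurtosis would do):

```r
# Skewness and kurtosis for gamma, normal, logistic and bimodal samples
library(moments)  # install.packages("moments") if needed
set.seed(1)
samples <- list(
  gamma    = rgamma(10000, shape = 1.5, rate = 0.1),
  normal   = rnorm(10000),
  logistic = rlogis(10000),
  bimodal  = c(rnorm(5000, -3), rnorm(5000, 3))
)
sapply(samples, skewness)
sapply(samples, kurtosis)  # the bimodal sample gets the lowest kurtosis
```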
5.1.4 Nominal variables
Nominal variables are typically represented with relative frequencies or probabilities, numerically or visually. Note that the methods discussed so far in this chapter apply to numerical variables (ratio, interval and, to some extent, ordinal) but not nominal variables, because the notions of location and distance (dispersion) do not exist in the nominal case. The only exception to this is the mode, which is the level of the nominal variable with the highest relative frequency or probability.
One summary that is often useful for summarizing the dispersion or the uncertainty associated with a nominal variable is entropy. Observe the following example:
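A small sketch in R (the entropy() helper is defined here for illustration; it is not a base R function):

```r
# Entropy (in bits) of a coin flip for different probabilities of heads
entropy <- function(p) {
  p <- p[p > 0]          # by convention, 0 * log(0) = 0
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))  # fair coin: 1 bit
entropy(c(0.8, 0.2))  # biased coin: less than 1 bit
entropy(c(1.0, 0.0))  # two-headed coin: 0 bits
```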
A fair coin has exactly 1 bit of entropy - we receive 1 bit of information by observing the outcome of a flip. This is also the maximum achievable entropy for a binary variable. A biased coin has lower entropy - we receive less information. In the extreme case of a coin with heads on both sides, the entropy is 0 - the outcome of a flip brings no new information, as we already know it will be heads.
When we want to compare entropy across variables with different numbers of levels/categories, we can normalize it by dividing it by the maximum achievable entropy. For example, observe a fair coin and a fair 6-sided die - in absolute terms, the 6-sided die has higher entropy due to having more possible values. However, relative to the maximum achievable entropy, both represent maximally uncertain distributions:
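Continuing the sketch from above (again using the illustrative entropy() helper):

```r
# Normalized entropy: divide by the maximum achievable entropy, log2(number of levels)
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

coin <- rep(1/2, 2)
die  <- rep(1/6, 6)
entropy(die) > entropy(coin)        # TRUE: more levels, higher absolute entropy
entropy(coin) / log2(length(coin))  # 1: maximally uncertain
entropy(die)  / log2(length(die))   # 1: also maximally uncertain
```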
Note that entropy can easily be calculated for any discrete random variable. Entropy also has a continuous analogue - differential entropy, which we will not discuss here.
5.1.5 Testing the shape of a distribution
Often we want to check if the distribution that underlies our data has the shape of some hypothesized distribution (for example, the normal distribution) or if two samples come from the same distribution.
Here, we will present two of the most common methods used: the Kolmogorov-Smirnov test and the Chi-squared goodness-of-fit test. Both of these are Null-hypothesis significance tests (NHST), so, before we proceed, be aware of two things. First, do not use NHST blindly, without a good understanding of their properties and how to interpret their results. And second, if you are more comfortable with thinking in terms of probabilities of hypotheses as opposed to significance and p-values, there always exist Bayesian alternatives to NHST.
The Kolmogorov-Smirnov test (KS) is a non-parametric test for testing the equality of two cumulative distribution functions (CDF). These can be two empirical CDFs or an empirical CDF and a theoretical CDF. The KS test statistic is the maximum distance between the two corresponding CDFs. That is, we compute the distribution of this statistic under the null-hypothesis that the CDFs are the same and then observe how extreme the maximum distance is on the sample.
To illustrate the KS test, we use it to test the normality of the underlying distributions for two samples - one from a logistic distribution and one from a standard normal distribution - and then to test whether the two samples come from the same distribution:
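A sketch of these three tests in R (the sample sizes and seed are arbitrary assumptions, so the p-values will not match the original example exactly):

```r
# Kolmogorov-Smirnov tests for normality and for equality of two samples
set.seed(1)
x1 <- rlogis(50)      # sample from a logistic distribution
x2 <- rnorm(50)       # sample from a standard normal distribution

ks.test(x1, "pnorm")  # x1 against the standard normal CDF
ks.test(x2, "pnorm")  # x2 against the standard normal CDF
ks.test(x1, x2)       # are x1 and x2 from the same distribution?
```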
So, at a 5% significance level (95% confidence), we would reject the null hypothesis that our sample x1 is from a standard normal distribution. On the other hand, we would not reject that our sample x2 is from a standard normal distribution. Finally, we would not reject the null hypothesis that x1 and x2 come from the same distribution. The only guarantee that comes with these results is that, in the long run, we will falsely reject a true null hypothesis at most 5% of the time. It says very little about our overall performance, because we do not know the proportion of cases in which the null hypothesis is true.
This example also illustrates the complexity of interpreting NHST results, or rather the tempting traps laid out for us. Based on the high p-value, we might be tempted to conclude that x2 does indeed come from a standard normal distribution, but that leads us to the odd predicament of claiming that x1 is not standard normal and x2 is, while being less sure that x1 and x2 have different underlying distributions.
Note that typical implementations of the KS test assume that the underlying distributions are continuous and ties are therefore impossible. However, the KS test can be generalized to discrete and mixed distributions (see the R package KSgeneral).
Differences between distributions can also be assessed visually through the QQ-plot, a plot that compares the quantiles of the two distributions. If the distributions have the same shape, their quantiles, plotted together, should lie on a line. The samples from the logistic distribution obviously deviate from the theoretical quantiles of a normal distribution:
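For example, with ggplot2 (stat_qq() compares against the normal distribution by default; the simulated sample is only illustrative):

```r
# QQ-plot of a logistic sample against theoretical normal quantiles
library(ggplot2)
set.seed(1)
x1 <- rlogis(50)

ggplot(data.frame(x = x1), aes(sample = x)) +
  stat_qq() +      # sample quantiles vs. theoretical normal quantiles
  stat_qq_line()   # reference line; heavy tails bend away from it
```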
The Chi-squared goodness-of-fit (CHISQ) test is a non-parametric test for testing the equality of two categorical distributions. The CHISQ test can also be used on discrete or even continuous data, if there is a reasonable way of binning the data into a finite number of bins. The test statistic is based on a similar idea as the KS test statistic, but instead of observing just the maximum difference, we sum the squared differences between the observed and expected frequencies (normalized by the expected frequencies) across all bins.
We illustrate the CHISQ test by testing the samples from a biased coin against a theoretical fair coin and the samples from an unbiased 6-sided die against a theoretical fair 6-sided die.
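A sketch of both tests in R (the counts below are illustrative stand-ins, not the original samples):

```r
# Chi-squared goodness-of-fit tests against fair theoretical distributions
coin_flips <- c(heads = 33, tails = 17)  # hypothetical flips of a biased coin
chisq.test(coin_flips, p = c(0.5, 0.5))  # test against a fair coin

die_rolls <- c(9, 11, 10, 8, 12, 10)     # hypothetical rolls of a fair die
chisq.test(die_rolls, p = rep(1/6, 6))   # test against a fair die
```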
So, with a 5% significance level (95% confidence), we would reject the null hypothesis that our coin is fair, but only in the case with 40 samples. Because fair or close-to-fair coins have high entropy, we typically require a lot of samples to distinguish between their underlying probabilities. We would not reject the null hypothesis that the die is fair.
For a more real-world example, let us take the exit-poll data for the 2016 US Presidential election, broken down by gender, taken from here:
So, at any reasonable confidence level, we would reject the null hypothesis and conclude that there is a difference in how men and women voted. In fact, we do not even need a test, because the difference is so obvious and the sample size so large. The differences between those who earned less and those who earned more than $100k, however, appear smaller, so a test makes more sense:
Still, at most typical confidence levels we can reject the null hypothesis and conclude that there is a pattern here as well.
5.2 Descriptive statistics for bivariate distributions
When dealing with a joint distribution of two variables (that is, paired samples), the first thing we are typically interested in is dependence between the two variables or the lack thereof. If two distributions are independent, we can summarize each separately without loss of information. If they are not, then the distributions carry information about each other. The predictability of one variable from another is another (equivalent) way of looking at the dependence of variables.
The most commonly used numerical summary of dependence is the Pearson correlation coefficient or Pearson’s \(\rho\). It summarizes linear dependence, with \(\rho = 1\) and \(\rho = -1\) indicating perfect collinearity (increasing or decreasing) and \(\rho = 0\) indicating linear independence. As such, Pearson’s \(\rho\) is directly related to the coefficient of determination \(R^2\) (the square of \(\rho\)), a goodness-of-fit measure for linear models and the proportion of variance in one variable explained by the other. An important consideration is that the statement that linear independence implies independence is not true in general (the converse implication is). One notable exception where this implication does hold is the multivariate normal distribution, where the dependence structure is expressed through linear dependence only.
Two of the most popular alternatives to Pearson’s \(\rho\) are Spearman’s \(\rho\) and Kendall’s \(\tau\). The former measures the degree to which one variable can be expressed as a monotonic function of the other. The latter is based on the proportion of concordant pairs among all possible pairs (a pair of observations (x1, y1) and (x2, y2) is concordant if their orderings agree, that is, if x1 > x2 then y1 > y2). As such, they can capture non-linear dependence and are more appropriate for data with outliers or data where distance might have no meaning, such as ordinal data. Spearman’s \(\rho\) and Kendall’s \(\tau\) are more robust but do not have as clear an interpretation as Pearson’s \(\rho\). Kendall’s \(\tau\) is also computationally more expensive.
Below are a few examples of bivariate samples that illustrate the strengths and limitations of the above correlation coefficients:
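One such comparison, sketched in R with a monotone but strongly non-linear relationship (the data-generating process is invented for illustration):

```r
# Pearson understates non-linear (but monotone) dependence; Spearman and Kendall do not
set.seed(1)
x <- runif(200)
y <- exp(10 * x) + rnorm(200)   # monotone in x, but far from linear

cor(x, y, method = "pearson")   # well below 1
cor(x, y, method = "spearman")  # close to 1: near-perfect monotone dependence
cor(x, y, method = "kendall")   # also high: most pairs are concordant
```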
Like similar questions about other parameters of interest, the question of what counts as a strong correlation is a practical question. Unless the correlation is 0 (no correlation) or 1/-1 (perfectly correlated - it can’t be more correlated than this), the meaning of the magnitude of correlation depends on the practical setting and its interpretation depends on some reference level. Even a very low correlation, such as 0.001 (if we are reasonably sure that it is around 0.001), can be practically meaningful. For example, if it is the correlation between consecutive numbers in sequences generated by a uniform random number generator (RNG), that would be more than enough correlation to stop using this RNG.
5.3 Further reading and references
- For a more comprehensive treatment of the most commonly used summarization techniques see: Holcomb, Z. C. (2016). Fundamentals of descriptive statistics. Routledge.
- More on the practice of summarization techniques and hypothesis testing: Bruce, P., & Bruce, A. (2017). Practical statistics for data scientists: 50 essential concepts. O’Reilly Media. (Chapters 1 and 3)
5.4 Learning outcomes
Data science students should work towards obtaining the knowledge and the skills that enable them to:
- Reproduce the techniques demonstrated in this chapter using their language/tool of choice.
- Recognize when a type of summary is appropriate and when it is not.
- Apply data summarization techniques to obtain insights from data.
- Once introduced to the bootstrap and other estimation techniques, combine descriptive statistics with a quantification of uncertainty, such as confidence intervals.
5.5 Practice problems
- Download the Football Manager Players dataset or use a similarly rich dataset with numerical, binary and categorical variables. With Python or R demonstrate the application and interpretation of results for each of the summarization techniques from this chapter.
- Find one or more real-world examples (data sets) where a standard summary of univariate or bivariate data fails. That is, where important information is lost in the summary.
Summarising Data
When you want to measure something in the natural world you usually have to take several measurements. This is because things are variable, so you need several results to get an idea of the situation. Once you have these measurements you need to summarize them in some way because sets of raw numbers are not easily interpreted by most people.
There are four key areas to consider when summarizing a set of numbers:
- Centrality – the middle value or average.
- Dispersion – how spread out the values are from the average.
- Replication – how many values there are in the sample.
- Shape – the data distribution, which relates to how “evenly” the values are spread either side of the average.
You need to present the first three summary statistics in order to summarize a set of numbers adequately. There are different measures of centrality and dispersion – the measures you select are based on the last item, shape (or data distribution).
An average is a measure of the middle point of a set of values. This central tendency (centrality) is an important measure and is usually what you are comparing when looking at differences between samples for example.
There are three main kinds of average:
- Mean – the arithmetic mean, the sum of the values divided by the replication.
- Median – the middle value when all the numbers are ranked in order.
- Mode – the most frequent value(s) in a sample.
Of these three, the mean and the median are most commonly used in statistical analysis. The most appropriate average depends on the shape of the data sample.
The arithmetic mean is calculated by adding together the values in the sample. The sum is then divided by the number of items in the sample (the replication).
The formula is x̄ = ∑x / n. The ∑ symbol represents “sum of”. The n represents the replication. The final mean is indicated using an overbar. This shows that the mean is your estimate of the true mean. This is because you usually measure only some of the items in a “population”; this is called a sample. If you measured everything then you would be able to calculate the true mean, which would be indicated by giving it a µ symbol.
The mean should only be used when the shape of the sample is appropriate. When the data are normally distributed the mean is a good summary of the average. If the data are not normally distributed the mean is not a good summary and you should use the median instead.
The median is the middle value, taken when you arrange your numbers in order (rank). This measure of the average does not depend on the shape of the data. The “formula” for working out the median depends on the ranks of the values: you want the value whose rank is (n/2) + 0.5, like so:
If you have an odd number of values in your sample the median is simply the middle value like so:
The median is 7 in this case.
When you have an even number of values the middle will fall between two items:
What you do is use a value mid-way between the two items in the middle. In this case mid-way between 4 and 7, which gives 5.5.
The median is a good general choice for an average because it is not dependent on the shape of the data. When the data are normally distributed the mean and the median are coincident (or very close).
The Mode is the most frequent value in a sample. It is calculated by working out how many there are of each value in your sample. The one with the highest frequency is the mode. It is possible to get tied frequencies, in which case you report both values. The sample is then said to be bimodal. You might get more than two modal values!
The mode is not commonly used in statistical analysis. It tends to be used most often when you have a lot of values, and where you have integer values (although it can be calculated for any sample).
The mode is not dependent on the shape of your sample. Generally speaking you would expect your mode and median to be close, regardless of the sample distribution. If the sample is normally distributed the mode will usually also be close to the mean.
The dispersion of a sample refers to how spread out the values are around the average. If the values are close to the average, then your sample has low dispersion. If the values are widely scattered about the average your sample has high dispersion.
The example figure shows samples that are normally distributed, that is, they are symmetrical around the average (mean). As far as dispersion goes, the principle is the same regardless of the shape of the data. However, different measures of dispersion will be more appropriate for different data distributions.
There are various measures of dispersion, such as:
- Standard deviation
- Standard Error
- Confidence Interval
- Inter-Quartile Range
The choice of measurement depends largely on the shape of the data and what you want to focus on. In general, with normally distributed data you use the standard deviation. If the data are not normally distributed, you use the inter-quartile range.
The standard deviation is used when the data are normally distributed. You can think of it as a sort of “average deviation” from the mean. The general formula for calculating standard deviation is s = √( ∑(x − x̄)² / (n − 1) ).
To work out standard deviation follow these steps:
- Subtract the mean from each value in the sample.
- Square the results from step 1 (this removes negative values).
- Add together the squared differences from step 2.
- Divide the summed squared differences from step 3 by n-1, which is the number of items in the sample (replication) minus one.
- Take the square root of the result from step 4.
The final result is called s, the standard deviation. In most cases you will have taken a sample of values from a larger “population”, so your value of s is your estimate of standard deviation (the sample standard deviation). This is also why you used n-1 as the divisor in the formula. If you measured the entire population you can use n as the divisor. You would then have σ, which is the “true” standard deviation (called the population standard deviation).
In effect the -1 is a compensation factor. As n gets larger and therefore closer to the entire population, subtracting 1 has a smaller and smaller effect on the result. In most statistical analyses you will use sample standard deviation (and so n-1).
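For readers following along in R rather than Excel, a minimal sketch of the calculation (the sample values are invented for illustration):

```r
# The five steps above, written out, match the built-in sd() function
x <- c(21, 19, 23, 17, 25, 22, 20)    # a small illustrative sample
n <- length(x)
sqrt(sum((x - mean(x))^2) / (n - 1))  # steps 1 to 5 by hand
sd(x)                                 # same result (sample standard deviation)
```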
Inter-Quartile range
The inter-quartile range (IQR) is a useful measure of the dispersion of data that are not normally distributed (see shape). You start by working out the median; this effectively splits the data into two chunks, with an equal number of values in each part. For each half you can now work out the value that is half-way between the median and the “end” (the maximum or minimum). This gives you the two quartiles (lower and upper). The difference between them is the IQR, which you usually express as a single value.
The IQR essentially “knocks off” the most extreme portions of the data sample, leaving you with a core 50% of your original data. A small IQR denotes a small dispersion and a large IQR a large dispersion.
As a by-product of working out the IQR you’ll usually end up with five values:
- Minimum – the 0th quartile (or 0% quantile).
- Lower quartile – the 1st quartile (or 25% quantile).
- Median – the 2nd quartile (or 50% quantile).
- Upper quartile – the 3rd quartile (or 75% quantile).
- Maximum – the 4th quartile (or 100% quantile).
These 5 values split the data sample into four parts, which is why they are called quartiles. You can calculate the quartiles from the ranks of the data values like so:
- Rank the values in ascending order. Use the mean rank for tied values.
- The median corresponds to the item that has rank 0.5n + 0.5 (where n = replication).
- The lower quartile corresponds to the item that has rank 0.25n + 0.75.
- The upper quartile corresponds to the item that has rank 0.75n + 0.25.
If you are using Excel you can compute the quartiles using the QUARTILE function.
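In R you can use the quantile() function instead; note that its type argument selects among several conventions, so results may differ slightly from the rank formulas above (the data values are invented for illustration):

```r
# Quartiles and IQR in R
x <- c(17, 21, 24, 19, 28, 22, 25, 30, 18, 26)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))  # the five values listed above
IQR(x)                                         # upper quartile minus lower quartile
```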
The range is simply the difference between the maximum and the minimum values. It is quite a crude measure and not very useful, because it depends only on the two most extreme values. The inter-quartile range is much more useful because it ignores these extremes and describes the spread of the central half of the data.
Replication
This is the simplest of the summary statistics but it is still important. The replication is simply how many items there are in your sample (that is, the number of observations).
The value n, the replication, is used in calculating other summary statistics, such as standard deviation and IQR, but it is also helpful in its own right. You should look at the dispersion and replication together. A certain value for dispersion might be considered “high” if n is small but quite “low” if n is very large.
The shape of the data affects the type of summary statistics that best summarize them. The “shape” refers to how the data values are distributed across the range of values in the sample. Generally you expect there to be a “cluster” of values around the average. It is important to know if the values are more or less symmetrically arranged around the average, or if there are more values to one side than the other.
There are two main ways to explore the shape (distribution) of a sample of data values:
- Graphically – using frequency histograms or tally plots to draw a picture of the sample shape.
- Shape statistics – such as skewness and kurtosis. These give values to how central the average is and how clustered around the average the data are.
The ultimate goal is to determine what kind of distribution your data forms. If you have normal distribution you have a wide range of options when it comes to data summary and subsequent analysis.
Types of data distribution
There are many “shapes” of data, commonly encountered ones are:
- Normal (also called Gaussian)
In general, your aim is to work out if you have normal distribution or not. If you do have normal distribution you can use mean and standard deviation for summary. If you do not have normal distribution you need to use median and IQR instead.
The normal distribution (also called Gaussian) has well-explored characteristics and such data are usually described as parametric. If data are not parametric they can be described as skewed or non-parametric.
Drawing the distribution
There are two main ways to visualize the shape of your data:
- Tally plots
- Histograms
In both cases the idea is to make a frequency plot. The data values are split into frequency classes, usually called bins. You then determine how many data items are in each bin. There is little difference between a tally plot and a histogram; they show the same information but are constructed in slightly different ways.
A tally plot is a kind of frequency graph that you can sketch in a notebook. This makes it a very useful tool for times when you haven’t got a computer to hand.
To draw a tally plot follow these steps:
- Determine the size classes (bins); you want around 7 bins.
- Draw a vertical line (axis) and write the values for the bins to the left.
- For each datum, determine which size class it fits into and add a tally mark to the right of the axis, opposite the appropriate bin.
You will now be able to assess the shape of the data sample you’ve got.
The tally plot in the preceding figure shows a normal (parametric) distribution. You can see that the shape is more or less symmetrical around the middle. So here the mean and standard deviation would be good summary values to represent the data. The original dataset was:
The first bin, labelled 18, contains values up to 18. There are two in the dataset (17, and 16). The next bin is 21 and therefore contains items that are >18 but not greater than 21 (there are three: 21, 19 and 21).
The following dataset is not normally distributed:
These data produce a tally plot like so:
Note that the same bins were used for the second dataset. The range for both samples was 16-36. The data in the second sample are clearly not normally distributed. The tallest size class is not in the middle and there is a long “tail” towards the higher values. For these data the median and inter-quartile range would be appropriate summary statistics.
A histogram is like a bar chart. The bars represent the frequency of values in the data sample that correspond to various size classes (bins). Generally the bars are drawn without gaps between them to highlight the fact that the x-axis represents a continuous variable. There is little difference between a tally plot and a histogram but the latter can be produced easily using a computer (you can sketch one in a notebook too).
To make a histogram you follow the same general procedure as for a tally plot but with subtle differences:
- Determine the size classes.
- Work out the frequency for each size class.
- Draw a bar chart using the size classes as the x-axis and the frequencies on the y-axis.
You can draw a histogram by hand or use your spreadsheet. The following histograms were drawn using the same data as for the tally plots in the preceding section. The first histogram shows normally distributed data.
The next histogram shows a non-parametric distribution.
In both these examples the bars are shown with a small gap, more properly the bars should be touching. The x-axis shows the size classes as a range under each bar. You can also show the maximum value for each size class. Ideally your histogram should have the labels at the divisions between size classes like so:
Note that this histogram uses slightly different size classes to the earlier ones.
Shape statistics
Visualizing the shape of your data samples is usually your main goal. However, it is possible to characterize the shape of a data distribution using shape statistics. There are two, which are used in conjunction with each other:
- Skewness – a measure of how central the average is in the distribution.
- Kurtosis – a measure of how pointy the distribution is (think of it as how clustered the values are around the middle).
If you are producing a numerical data summary these two values are useful statistics.
The skewness of a sample is a measure of how central the average is in relation to the overall spread of values. The formula to calculate skewness uses the number of items in the sample (the replication, n) and the standard deviation, s.
In practice you’ll use a computer to calculate skewness; Excel has a SKEW function that will compute it for you.
A positive value indicates that the bulk of the values sits towards the left, with a long “tail” of more positive values stretching to the right. A negative value indicates the opposite. The larger the magnitude of the value, the more skewed the sample is.
The kurtosis of a sample is a measure of how pointed the distribution is (see drawing the distribution). It is also a way to think about how clustered the values are around the middle. The formula to calculate kurtosis uses the number of items in the sample (the replication, n) and the standard deviation, s.
In practice you’ll use a computer to calculate kurtosis; Excel has a KURT function that will compute it for you.
A positive result indicates a pointed distribution, which will probably also have a low dispersion. A negative result indicates a flat distribution, which will probably have high dispersion. The higher the value the more extreme the pointedness or flatness of the distribution.
You should always summarize a sample of data values to make them more easily understood (by you and others). At the very least you need to show:
- Middle value – centrality, that is, an average.
- Dispersion – how spread out the data are around the average.
- Replication – how large the sample is.
The shape of the data (its distribution) is also important because the shape determines which summary statistics are most appropriate to describe the sample. Your data may be normally distributed (i.e. with a symmetrical, bell-shaped curve) and so parametric, or they may be skewed and therefore non-parametric.
You can explore and describe the shape of data using graphs:
- Tally plots – a simple frequency plot.
- Histograms – a frequency plot like a bar chart.
You can also use shape statistics:
- Skewness – how central the average is.
- Kurtosis – how pointed the distribution is.
The shape of the data also leads you towards the most appropriate ways of analyzing the data, that is, which statistical tests you can use.
Data summaries #
Many approaches to data analysis may be viewed as data “summarization”. The most immediate effect of summarizing data is to take data that may be overwhelming to work with, and reduce it to a few key summary values that can be viewed, often in a table or plot.
As we have emphasized before, data analysis should always aim to address specific and explicit research questions. This principle continues to hold when working with data summaries. There are many data summaries in common use, and new approaches to summarizing data continue to be developed. However not all data summaries are equally relevant for addressing a particular question. When conducting a data analysis, it is important to identify data summaries that are informative for addressing the question at hand.
Previously, we have introduced the key notions of samples and populations. Our data (the “sample”) are reflective of a population that we cannot completely observe. The goal of a data analysis is always to learn about the population, not only about the sample. When performing a data analysis involving data summaries, we will calculate the summaries using the data for our sample (since that is all we have to work with), but draw conclusions (with appropriate uncertainty quantification) about the population.
There does not exist (and probably never will exist) a recipe or algorithm that automatically tells us which data summaries can be used to address a particular question. Therefore, it is important to understand the motivation and theoretical properties behind a number of different summarization approaches. With experience, you can develop the ability to effectively summarize data in a way that brings insight regarding your research aims.
One meaning of the term statistic is equivalent to the idea of a data summary. That is, a statistic is a value derived from data that tells us something in summary form about the data. Here we will introduce some of the main types of data summaries, or statistics. We will continue to learn about many more types of data summaries later in the course. As noted above, many data summaries are best viewed in graphical form, but here we will focus on simple numerical data summaries that can be appreciated without graphing. We will consider graphical summaries later.
Data summaries based on frequencies #
We have learned that a nominal variable takes on a finite set of unordered values. The most basic summarization of a nominal variable is its frequency table . For example, suppose we are interested in the employment status of working-age adults, and we categorize people’s employment status as follows: (i) employed full time, (ii) employed part time, looking for full time work, (iii) employed part time, not looking for full time work, (iv) not employed, looking for work, and (v) not employed, not looking for work. If we obtain the employment status for 1000 people, then our “raw data” is a list of 1000 values. The frequency table summarizes this data as five counts, the number of people who give each of the possible responses, and the corresponding proportions (which must sum to 1).
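As a rough sketch in R (the response labels are abbreviated and the proportions are invented for illustration):

```r
# Frequency table for a hypothetical sample of 1000 employment-status responses
set.seed(1)
status <- sample(
  c("full time", "part time, seeking", "part time, not seeking",
    "not employed, looking", "not employed, not looking"),
  size = 1000, replace = TRUE,
  prob = c(0.55, 0.10, 0.15, 0.08, 0.12)
)
table(status)              # counts for the five categories
prop.table(table(status))  # proportions, which sum to 1
```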
Data summaries based on quantiles and tail proportions #
For ordinal and quantitative data, a tail proportion is defined to be the proportion of the data that falls at or below a given value. For example, suppose we ask “what proportion of the days in August does the daily maximum temperature exceed 35C?”, or “what proportion of workers in Mexico earn 1000 USD or less per month?”. The first of these examples can be called a right tail proportion , because it refers to the proportion of observations that are greater than a given value (35C), and the second of these examples can be called a left tail proportion , because it refers to the proportion of observations that are less than or equal to a given value (1000 USD).
Tail proportions are calculated from a sample of data. The analogous characteristic of a population is called a tail probability . We will not emphasize this distinction here, but will return to it later in the course.
Tail proportions can be very useful if there is a good reason for choosing a particular threshold from which to define the tail. For example, a temperature of 35 C may be considered dangerously warm, or a monthly income of 1000 USD may be just above the poverty level in Mexico. While such thresholds can often be stated, in some settings any choice of a threshold seems arbitrary. To circumvent this, we may use a closely related summary statistic called a quantile . A quantile is essentially the inverse of a tail proportion. Instead of starting with a threshold (say 1000 USD for monthly income), we start with the proportion of the sample (say 0.25), and find a value X such that the given proportion (e.g. 0.25) of the data is less than or equal to X. Note that quantiles are always defined in terms of left tail proportions. These proportions may be called probability points in this context.
In principle, a quantile exists for every probability point between 0 and 1. But technical difficulties can arise when the sample size is small, or when the number of distinct observable values is small. Suppose we only observe four values {1, 3, 7, 9}. Half of the data are less than or equal to 3, but it is also true that half of the data are less than or equal to 6.99. Any value in the interval [3, 7) could be taken as the median. There are several conventions for resolving this difficulty. However in this course, we will generally be working with fairly large data sets, where the minor differences between various approaches for calculating quantiles are not consequential.
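For example, in R the quantile() function exposes several of these conventions through its type argument:

```r
# Different conventions give different "medians" for a tiny sample
x <- c(1, 3, 7, 9)
quantile(x, 0.5, type = 1)  # 3: inverse of the empirical CDF
quantile(x, 0.5, type = 7)  # 5: R's default, interpolates between 3 and 7
median(x)                   # 5: the usual convention of averaging the middle pair
```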
In common parlance, people refer to “percentiles” more often than “quantiles”. But these two notions are equivalent – the 65th percentile is the same as the 0.65 quantile (for example). In most settings, it is fine to use either term.
Certain quantiles have special names. The median is the quantile corresponding to a proportion of 0.5. The deciles are the quantiles for proportions 0.1, 0.2, …, 0.9. The quartiles are the quantiles for proportions 0.25, 0.5, and 0.75. Tertiles are the quantiles for proportions 1/3 and 2/3, and quintiles are the quantiles for proportions 0.2, 0.4, 0.6, and 0.8. Note that to divide the real line into n parts requires n-1 points. So there are 9 deciles, 3 quartiles, etc.
People may also state that a value “falls into” a quartile (or a tertile, etc.). A value that falls into, say, the second quartile of a dataset (or distribution) is a value that falls between the 25th and 50th percentiles.
Order statistics #
Order statistics and quantiles are related in the following manner. The order statistics are simply the sorted data values, written \(x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}\). If \(n\) is the sample size and we wish to estimate the quantile for probability point \(p\), one convention is to take the order statistic \(x_{(j)}\), where \(j\) is the least integer greater than or equal to \(p \cdot n\).
The most important thing to keep in mind about quantiles is that the \(p^\textrm{th}\) quantile \(Q_p\) has the property that approximately a fraction of \(p\) of the data falls at or below \(Q_p\) , and approximately a fraction of \(1-p\) of the data falls at or above \(Q_p\) .
The median as a measure of central tendency #
A very commonly-encountered quantile is the median , which is the quantile corresponding to a proportion of 0.5. That is, the median is a point \(Q_{0.5}\) such that approximately half of the data are less than or equal to \(Q_{0.5}\) , and approximately half of the data are greater than or equal to \(Q_{0.5}\) . For a large dataset, the median can be computed by sorting the data, and taking the value in the middle of the sorted list.
The median is the most commonly-encountered quantile, and is the easiest to interpret. It is sometimes said to be a measure of location or central tendency . These terms mean that the median is used to reflect the “most typical”, “most representative” or “most central” value in a dataset. However there are many other measures of central tendency besides the median, and in some cases different measures of central tendency can give very different results. This is a topic that we will return to later.
Example: Quantiles of household income #
Suppose we are looking at the distribution of income for households in the United States. The median is currently around 59,000 USD per year. The 10th and 90th percentiles of US household income are currently around 14,600 USD and 184,000 USD, respectively. These values tell us about the living standard of the middle 80% of the United States population. We can take this a step further by looking at a series of income quantiles within three US states in 2018 (Michigan, Texas, and Maryland).
(source: ACS/PUMS)
There are several interesting observations that can be made from the above table. One is that the differences between the states grow larger as we move from the lower quantiles to the upper quantiles. Another is that the states are always ordered in the same way – every quantile for Texas is greater than the corresponding quantile for Michigan, and every quantile for Maryland is greater than the corresponding quantile for Texas. If this is true for every possible quantile given by a probability point \(0 \le p \le 1\), then we could say that incomes in Texas are stochastically greater than incomes in Michigan, and incomes in Maryland are stochastically greater than incomes in Texas. Since the table above only shows five quantiles, we can only say that the table is consistent with these “stochastic ordering” relationships holding.
We have learned that quantitative data typically have measurement units. The quantiles based on a sample of quantitative data always have the same measurement units as the data themselves. The quantiles in the above table have United States dollars (USD) as their units, since that is the units of the data.
Extreme quantiles #
The term extreme quantile refers to any quantile that represents a point that is far into the tail of a distribution. Extreme quantiles play an important role in assessing risk. For example, if someone is debating whether to build an expensive house next to a river, they should be concerned about whether the house will be destroyed by a flood. The flood stage for an average year, or the median flood stage, is not relevant in such a setting. Since a house should last for 100 years or more under normal conditions, in order to assess the risk of losing the house due to flooding one should consider an upper quantile such as the 99th percentile of the flood stage.
The most extreme quantiles are the maximum (the 100th percentile) and the minimum (the 0th percentile). However these two quantiles are seldom used in statistical data analysis, because they are very unstable when calculated using a sample of data. We will define this notion of “instability” more precisely later, but intuitively, if you have incomes for 1000 households in a state with 5 million households, the maximum income in your data will generally be far less than the maximum income in the state (since it is unlikely that your sample includes the richest person in the state). Moreover, if two people obtain distinct datasets containing 1000 households from the state, the maximum values in these two samples are likely to be quite different. This instability makes it hard to interpret the minimum and maximum of the sample, since their values are likely to be quite different from the corresponding minimum and maximum values in the population. With large samples it may be useful to examine the 99th or even 99.9th percentile, but for a small sample, it is uncommon to examine quantiles corresponding to proportions greater than 0.9, or less than 0.1.
Measures of dispersion derived from quantiles #
Some useful summary statistics are formed by taking “linear combinations” of quantiles. These are sometimes referred to as L-statistics. The most common statistic of this type is the interquartile range, or IQR. The IQR is simply the difference between the 75th percentile and the 25th percentile. The interdecile range (IDR) is the difference between the 90th percentile and the 10th percentile. The IQR and the IDR are both measures of dispersion, also known as measures of scale. Recall by contrast that the median (which is also an L-statistic) is a “measure of location”. Measures of location and measures of scale are highly complementary to each other and describe very different aspects of the data and population.
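A minimal sketch in R (the simulated incomes are purely illustrative):

```r
# IQR and IDR for a right-skewed sample of simulated "incomes"
set.seed(1)
income <- rlnorm(1000, meanlog = 10.8, sdlog = 0.8)

IQR(income)                            # 75th percentile minus 25th percentile
diff(quantile(income, c(0.10, 0.90)))  # interdecile range (IDR)
```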
Measures of dispersion describe the degree of variation or spread in the data. Roughly speaking, “dispersion” tells us whether observations tend to fall far from each other, and far from the central value of the dataset (this would be a highly dispersed dataset), or whether they tend to fall close to each other, and close to the central value of the dataset (this would be a dataset with low dispersion).
As an example, suppose that we sample people from three locations and determine each person’s age. In each location, we calculate the 10th, 25th, 50th, 75th, and 90th percentiles, and the IQR and IDR of the ages. The three locations are an elementary school, a nursing home, and a grocery store. We may obtain results like this:
The elementary school has a small median and a small IQR – the central half of the people in this elementary school are children between the ages of 6 and 8. At least 10 percent but not more than 25 percent of the people in this elementary school are much older, these could be teachers and other adult workers. The presence of these older people impacts the 90th percentile, but not the 75th percentile of the data. Thus the IQR is not impacted by the presence of adults in the school (as long as they comprise less than 25% of the school population), but the IDR is affected, as long as at least 10% of the school population consists of adults. As a result, the IDR in this elementary school is much greater than the IQR.
The residents of a nursing home are mostly rather old, but will generally have a greater dispersion of ages compared to the students in an elementary school, as reflected in the greater IQR for the nursing home compared to the elementary school. The nursing home staff likely consists of people who are much younger than the nursing home residents. This leads to the 10th percentile of the ages of people in the nursing home being much less than the other reported quantiles.
Compared to the elementary school and the nursing home, the grocery store has an intermediate median age. It also has a very wide dispersion of ages, which is evident from its reported IQR.
This example shows us that the median and the IQR are distinct characteristics – they can be similar or different, there is no linkage between them. On the other hand, the IQR and IDR are somewhat related – in particular the IDR cannot be smaller than the IQR. Data samples with a high IQR will tend to have a high IDR, and data samples with a low IQR will tend to have a low IDR. But these measures still capture distinct information, and it is possible to have a sample with an IDR that is much greater than its IQR, or to have a data sample in which the IDR is equal to the IQR.
Yet another measure of dispersion that has a form that is similar to the IQR and the IDR would be the range of the data, defined to be the maximum value minus the minimum value. This statistic has the advantage of being very easy to explain, and superficially is easy to interpret. However it is generally not a good measure of dispersion for the same reasons that the maximum and minimum are somewhat problematic quantiles to interpret, as discussed above.
The IQR and IDR are linear combinations of quantiles (L-statistics), and hence have the same units as the quantiles that they are derived from. If we measure the income of US households in dollars, then the IQR and IDR also have US dollars as their units.
Median residuals and the MAD #
Another useful data summary that can be derived using quantiles is the MAD, or median absolute deviation. The MAD, like the IQR and IDR, is a measure of dispersion. The MAD is calculated in three steps, and is not directly a function of the quantiles. To calculate the MAD, we first median center the data. “Centering operations”, including the median centering operation that we discuss here, arise very commonly in data science. To median center the data, we simply calculate the median of the data, then subtract this median from each data value. The median centered data values are sometimes called median residuals.
The median residuals have an important property, which is that if we calculate the median of the median residuals, we are guaranteed to get a result of zero. That is, median centering removes the median from the data, and recenters the data around zero. Note also that the median centering operation does not reduce or summarize the data – if we have 138 data points to begin with, we still have 138 data points after median centering.
After calculating the median residuals, the second step in calculating the MAD is to take the absolute value of each median residual. These are called, not surprisingly, the absolute median residuals . The third and final step of calculating the MAD is to take the median of the absolute median residuals. This third step is where the summarization takes place, as we end up with only one number after taking the median of the absolute median residuals.
The MAD, like the IQR and IDR, is a measure of dispersion. It tells us how far a typical data value falls from the median of the data. Note that the word “far” implies distance, and distances are not signed (i.e. they are always non-negative). That is the reason that we take the absolute value in the second step of computing the MAD. In fact, if we skipped this step, we would always get a value of zero following the third step.
The MAD is a measure of dispersion, but it is not equal to the IQR or IDR in general. However data sets with large values for the IQR and/or IDR will tend to have larger values for the MAD. In practice, the IQR, IDR, and MAD are all useful measures of dispersion.
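The three steps can be written out directly; note that R’s built-in mad() multiplies by a consistency constant (about 1.4826) by default, so we set constant = 1 to match the plain definition used here (the simulated incomes are again illustrative):

```r
# The MAD, step by step and via the built-in function
set.seed(1)
income <- rlnorm(1000, meanlog = 10.8, sdlog = 0.8)

med_resid <- income - median(income)  # step 1: median residuals
abs_resid <- abs(med_resid)           # step 2: absolute median residuals
median(abs_resid)                     # step 3: the MAD as defined here
mad(income, constant = 1)             # same value
```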
Measures of skew derived from quantiles #
Another important statistic that can be derived from quantiles is the quantile skewness , which is defined to be: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1), which simplifies to (Q3 - 2*Q2 + Q1) / (Q3 - Q1). Note that this is a ratio of two L-statistics, but is not an L-statistic itself. Here, Q2 is always the median, and Q1, Q3 are two quantiles representing symmetric proportions around the median. Most commonly, Q1 is the 25th percentile and Q3 is the 75th percentile. But alternatively, we could specify Q1 to be the 10th percentile and Q3 to be the 90th percentile. For example, the quantile skew for household income in Michigan using the 25th and 75th percentiles, based on the data given above, is
(98000 - 2*55500 + 29000) / (98000 - 29000) = 0.231
The skewness tells us whether the p’th quantile and the (1-p)‘th quantile are equally far from the median. In a right skewed distribution, like that of income in most places, the upper quantiles are much further from the median than the lower quantiles. That is, the difference between the 50th and 75th percentiles is much greater than the difference between the 25th and 50th percentiles. A right skewed data set has a positive value for its quantile skewness, as seen above for the Michigan household income data.
In a left skewed distribution, the difference between the lower quantiles and the median is greater than the difference between the upper quantiles and the median. Left skewed distributions are less commonly encountered than right skewed distributions, but they do occur. A left skewed data set will have a negative value for its quantile skewness.
The quantile skewness has another property that is common to many statistics, namely that it is scale invariant . If we multiply all our data by a positive constant factor, say 100, then the numerator and denominator of the quantile skewness will both change by a factor of 100, and since the statistic is a ratio, this constant factor will cancel out. Thus, the quantile skewness is not impacted by scaling the data by a positive constant. If we report income in dollars, in hundreds of dollars, or in pennies, we will get the same value when calculating the quantile skew. This property of being scale invariant is sometimes referred to as being “dimension-free” or “dimensionless”.
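A small helper for the quantile skewness might look like this (quantile_skew() is defined here for illustration; it is not a standard function):

```r
# Quantile skewness: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)
quantile_skew <- function(x, p = 0.25) {
  q <- quantile(x, c(p, 0.5, 1 - p), names = FALSE)
  ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
}

set.seed(1)
income <- rlnorm(1000, meanlog = 10.8, sdlog = 0.8)
quantile_skew(income)        # positive: right skewed
quantile_skew(income * 100)  # identical: scale invariant
```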
Data summaries based on moments #
A completely different class of data summaries is that based on moments, rather than on quantiles. A moment is a data summary formed by averaging. The most basic moment is the mean , which is simply the average of the data. The mean can be used as a measure of location or central tendency (like the median). If our data are \(x_1, x_2, \ldots, x_n\) , then the mean may be written \(\bar{x}\) . The concept of a moment is much more general than just the familiar average value. We can produce many other moments by transforming the data, and taking the mean of the transformed data. For example, \((x_1^2 + \cdots + x_n^2)/n\) is also a moment. We will see several useful examples of moments below.
Resistance #
Moments and quantiles have overlapping use-cases. In many settings, it is reasonable to use either a quantile-based or a moment-based summary statistic. But it is important to keep the differing properties of quantile-based and moment-based statistics in mind when deciding which to use, and when interpreting your findings. A prominent difference between quantile-based and moment-based statistics is captured by the notion of resistance . A resistant statistic is one that is not impacted by changing a few values, even if they are changed to an extreme degree. For example, if we have the incomes for 100 people, and change the greatest income to one billion dollars, the mean will change dramatically, but the median will not change at all. In general, quantile-based statistics are more resistant than moment-based statistics.
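The income example can be made concrete with a short simulation (the numbers are invented; only the qualitative behaviour matters):

```r
# Resistance: changing one extreme value moves the mean but not the median
set.seed(1)
income  <- rlnorm(100, meanlog = 10.8, sdlog = 0.8)
income2 <- income
income2[which.max(income2)] <- 1e9  # replace the largest income with one billion

c(mean(income), mean(income2))      # the mean changes dramatically
c(median(income), median(income2))  # the median does not change at all
```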
Mean residuals, variance, and the standard deviation #
Above we discussed median centering the data, and the concept of a median residual. Analogously, we can mean center the data by subtracting the mean value from each data value. The resulting centered values are called mean residuals, or just residuals. The mean residual is the most commonly used residual (it is more commonly encountered than the median residual).
A very important measure of dispersion called the variance can be calculated from the mean residuals. To calculate the variance, we simply average the squares of the mean residuals. In words, the variance is the “average squared distance from each data value to the mean value”. As with the MAD, here we want to summarize unsigned, “distance-like” values. That is why we square the residuals prior to averaging. It should be easy to see that if all the values in our dataset are zero, the variance is zero (as would be the MAD). On the other hand, if the data values are well spread out, the variance and the MAD will both be large.
One key difference between the variance and the IQR, IDR, or MAD is the units. As noted above, the MAD has the same units as the data. This is due to the fact that taking the absolute value does not change the units. However, squaring the data also results in the units becoming squared. If the data are measured in years, then the variance has “years-squared” as units. If the data are measured in meters, the variance has “meters-squared” as units (even though it does not represent an area).
Since it is often desirable to have our summary statistics be either dimensionless, or have the same units as the data, it is very common to work with a statistic called the standard deviation rather than working directly with the variance. The standard deviation is simply obtained by taking the square root of the variance. The standard deviation, like the IQR, IDR, and MAD, has the same units as the data.
The variance and standard deviation are often defined slightly differently than we are doing here – specifically, the variance is scaled by a factor of n/(n-1) relative to what we define here, where n is the number of data points. This is sometimes an important distinction, but often has little impact on the result. At this point in the course, we will ignore this minor difference.
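The following R sketch, using a small made-up sample, computes the variance as the average of the squared mean residuals and contrasts it with R's built-in var() and sd(), which apply the n/(n-1) rescaling just mentioned:

    # A small hypothetical sample to illustrate mean residuals and the variance.
    x <- c(4, 8, 6, 5, 3, 7, 9, 6)

    r <- x - mean(x)     # mean residuals
    mean(r^2)            # variance as defined here (average of squared residuals)
    sqrt(mean(r^2))      # the corresponding standard deviation

    # R's var() and sd() apply the n/(n-1) rescaling mentioned above,
    # so they are slightly larger for small samples.
    var(x)               # equals mean(r^2) * length(x) / (length(x) - 1)
    sd(x)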
Properties of different measures of dispersion
Many people find the IQR or MAD to be more intuitive measures of dispersion than the standard deviation. The standard deviation is sometimes incorrectly described as being the “average distance from a data value to the mean” – it is actually “the square root of the average squared distance of a data value to the mean”. On the other hand, the MAD can be correctly described as being “the median distance from a data value to the median”. In this sense, the MAD can be described more concisely in everyday language. The IQR is also easy to define and interpret, since it is directly obtained from the sorted data.
The preference for the standard deviation is partly for historical reasons, but it is also easier to mathematically derive properties of the standard deviation, and of the variance, compared to quantile-based statistics. The reason for this is that we can expand the square: if \(x_i\) is a data value and \(\bar{x}\) is the overall mean, then \((x_i - \bar{x})^2 = x_i^2 - 2\bar{x} x_i + \bar{x}^2\) . With some algebra that we will not cover here, we can arrive at an identity for the variance: the variance is equal to the “mean of the squares minus the square of the mean”: \((x_1^2 + \cdots + x_n^2)/n - \bar{x}^2\) .
The “mean of the squares” is obtained by squaring each data value, and averaging the results, i.e., \((x_1^2 + x_2^2 + \cdots + x_n^2)/n\). The “square of the mean” is obtained by calculating the mean in the usual way, and squaring it, i.e., \(\bar{x}^2 = ((x_1 + x_2 + \cdots + x_n)/n)^2\). The “mean of the squares” is an example of an uncentered moment. We can take the mean of the squared data, as we do here, or the mean of the cubed data, and so on. These are all examples of uncentered moments. A centered moment is calculated from residuals. As stated above, the variance is the mean of the squared residuals. Thus, the variance is the most prominent example of a centered moment.
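A quick numerical check of this identity in R, on simulated data (the specific distribution is arbitrary):

    # Numerical check of the identity: variance (divide-by-n version) equals
    # the mean of the squares minus the square of the mean.
    set.seed(3)
    x <- rnorm(1000, mean = 5, sd = 2)

    mean((x - mean(x))^2)    # centered second moment
    mean(x^2) - mean(x)^2    # uncentered moments combined via the identity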
The IQR and MAD do not have any algebraic identities such as we just obtained for the variance. There is no way to algebraically “expand” an absolute value in the same way that we can expand a square. This lack of algebraic tractability is what made quantile-based statistics less popular historically. With computers, it is just as easy to work with quantile-based statistics as it is to work with moment-based statistics. Both types of data summaries are useful, and they have somewhat complementary properties. We will use both types of statistics throughout this course.
The notion of “resistance” stated above in regard to the median also applies to the IQR and the MAD (which, like the median, are based on quantiles). If we include a few extreme values in our data, the IQR and MAD will be minimally impacted, whereas the standard deviation and variance may change dramatically. Arguably, this resistance is a good thing, because extreme outliers (a term that is often used but impossible to define in a meaningful way) may be measurement errors or data anomalies that are not related to what we are trying to study. However, it is certainly possible to have too much resistance – if a statistic is resistant to almost any changes to the data, it cannot be used to learn from the data.
Assessing skew using moments
Above we discussed the “quantile skew”. Its moment-based counterpart, known simply as the “skew” or coefficient of skewness, is obtained as follows. First, “Z-score” the data, which means that we replace each value \(x_i\) with \(z_i = (x_i-\bar{x})/\hat{\sigma}\). Here, \(\bar{x}\) is the mean of the data, and \(\hat{\sigma}\) is the standard deviation of the data. We will be using Z-scores throughout the course and will discuss them in more detail later. Once the Z-scores are obtained, the coefficient of skewness is simply their third moment: \((z_1^3 + \cdots + z_n^3)/n\). Note that this is both the central and noncentral third moment, since the \(z_i\) have mean equal to zero.
The logic behind this moment-based definition of skew is as follows: raising data to a high power accentuates the largest values over values of moderate or small magnitude, and raising the residuals to an odd power retains the sign of each residual. Thus, cubing the residuals gives us a new data set that has exactly the same signs as the residuals, but with the largest values exaggerated (i.e., they become relatively larger in magnitude while maintaining their sign). The average of the cubed residuals therefore tells us whether the largest residuals tend to be positive, tend to be negative, or balance out, corresponding to the average cubed residual being positive, negative, or zero, respectively.
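A minimal R sketch of the coefficient of skewness, assuming we are content to use R's sd() (which divides by n-1) when forming the Z-scores:

    # Moment-based coefficient of skewness via Z-scores. sd() uses the n-1
    # divisor, so this differs very slightly from a divide-by-n definition.
    coef_skew <- function(x) {
      z <- (x - mean(x)) / sd(x)
      mean(z^3)
    }

    set.seed(4)
    coef_skew(rgamma(10000, shape = 1.5, rate = 0.1))   # right skewed: positive
    coef_skew(rnorm(10000))                             # symmetric: near zero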
Moments are one of the most powerful tools in statistics, and there are many other summary statistics beyond what we have discussed here that can be derived using moments. We will return to the notion of using moments in data analysis throughout the course.
Summarizing Data: Descriptive Statistics
Introduction
The first step in solving problems in public health and making evidence-based decisions is to collect accurate data and to describe, summarize, and present it in such a way that it can be used to address problems. Information consists of data elements or data points which represent the variables of interest. When dealing with public health problems the units of measurement are most often individual people, although if we were studying differences in medical practice across the US, the subjects, or units of measurement, might be hospitals. A population consists of all subjects of interest, in contrast to a sample, which is a subset of the population of interest. It is generally not possible to gather information on all members of a population of interest. Instead, we select a sample from the population of interest, and generalizations about the population are based on the assumption that the sample is representative of the population from which it was drawn.
Learning Objectives
After completing this module, the student will be able to:
- Distinguish among dichotomous, ordinal, categorical, and continuous variables.
- Identify appropriate numerical and graphical summaries for each variable type.
- Compute a mean, median, standard deviation, quartiles, and range for a continuous variable.
- Construct a frequency distribution table for dichotomous, categorical, and ordinal variables.
- Give an example of when the mean is a better measure of central tendency (location) than the median.
- Interpret the standard deviation of a continuous variable.
- Generate and interpret a box plot for a continuous variable.
- Generate and interpret side-by-side box plots.
- Differentiate between a histogram and a bar chart.
Types of Variables
Procedures to summarize data and to perform subsequent analysis differ depending on the type of data (or variables) that are available. As a result, it is important to have a clear understanding of how variables are classified.
There are three general classifications of variables:
1) Discrete Variables: variables that assume only a finite number of values, for example, race categorized as non-Hispanic white, Hispanic, black, Asian, other. Discrete variables may be further subdivided into:
- Dichotomous variables
- Categorical variables (or nominal variables)
- Ordinal variables
2) Continuous Variables: these are sometimes called quantitative or measurement variables; they can take on any value within a range of plausible values. Total serum cholesterol level, height, weight, and systolic blood pressure are examples of continuous variables.
3) Time to Event Variables: these reflect the time to a particular event, such as a heart attack, cancer remission, or death.
Numerical Summaries for Discrete Variables
Frequency distribution tables are a common and useful way of summarizing discrete variables. Representative examples are shown below.
Frequency Distribution Tables for Dichotomous Variables
In the offspring cohort of the Framingham Heart Study 3,539 subjects completed the 7th examination between 1998 and 2001, which included an extensive physical examination. One of the variables recorded was sex as summarized below in a frequency distribution table.
Table 1 - Frequency Distribution Table for Sex
Note that the third column contains the relative frequencies, which are computed by dividing the frequency in each response category by the sample size (e.g., 1,625/3,539 = 0.459). With dichotomous variables the relative frequencies are often expressed as percentages (by multiplying by 100).
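In R, such a table can be produced with table() and prop.table(); the split of the 3,539 participants into 1,625 and 1,914 follows the relative frequency quoted above, but which count belongs to which sex is assumed here for illustration.

    # Frequency and relative-frequency table for a dichotomous variable.
    # The assignment of counts to the sexes is an assumption for illustration.
    sex <- factor(c(rep("Men", 1625), rep("Women", 1914)))

    table(sex)                        # frequencies
    round(prop.table(table(sex)), 3)  # relative frequencies (1625/3539 = 0.459)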
The investigators also recorded whether or not the subjects were being treated with antihypertensive medication, as shown below.
Table 2 - Frequency Distribution Table for Treatment with Antihypertensive Medication
Missing Data
Note in the table above that there are only n=3,532 valid responses, although the sample size was n=3,539. This indicates that seven individuals had missing data on this particular question. Missing data occur in studies for a variety of reasons. If there is extensive missing data or if there is a systematic pattern of missing responses, the results of the analysis may be biased (see the module on Bias for EP713 for more detail). There are techniques for handling missing data, but these are beyond the scope of this course.
Sometimes it is of interest to compare two or more groups on the basis of a dichotomous outcome variable. For example, suppose we wish to compare the extent of treatment with antihypertensive medication in men and women, as summarized in the table below.
Table 3 - Treatment with Antihypertensive Medication in Men and Women
Here, both sex and treatment status are dichotomous variables. Because the numbers of men and women are unequal, the relative frequency of treatment for each sex must be calculated by dividing the number on treatment by the sample size for the sex. The numbers of men and women being treated (frequencies) are almost identical, but the relative frequencies indicate that a higher percentage of men are being treated than women. Note also that the sum of the rightmost column is not 100.0% as it was in previous examples, because it indicates the relative frequency of treatment among all participants (men and women) combined.
Frequency Distribution Tables for Categorical Variables
Recall that categorical variables are those with two or more distinct responses that are unordered. Some examples of categorical variables measured in the Framingham Heart Study include marital status, handedness (right or left) and smoking status. Because the responses are unordered, the order of the responses or categories in the summary table can be changed, for example, presenting the categories alphabetically or perhaps from the most frequent to the least frequent.
Table 4 below summarizes data on marital status from the Framingham Heart Study. The mutually exclusive and exhaustive categories are shown in the first column of the table. The frequencies, or numbers of participants in each response category, are shown in the middle column and the relative frequencies, as percentages, are shown in the rightmost column.
Table 4 - Frequency Distribution Table for Marital Status
There are n=3,530 valid responses to the marital status question (9 participants did not provide marital status data). The majority of the sample is married (73.1%), and approximately 10% of the sample is divorced. Another 10% are widowed, 6% are single, and 1% are separated.
Frequency Distribution Tables for Ordinal Variables
Some discrete variables are inherently ordinal. In addition to inherently ordered categories (e.g., excellent, very good, good, fair, poor), investigators will sometimes collect information on continuously distributed measures, but then categorize these measurements because doing so simplifies clinical decision making. For example, the NHLBI (National Heart, Lung, and Blood Institute) and the American Heart Association use the following classification of blood pressure:
- Normal: systolic blood pressure <120 and diastolic blood pressure <80
- Pre-hypertension: systolic blood pressure between 120-139 or diastolic blood pressure between 80-89
- Stage I hypertension: systolic blood pressure between 140-159 or diastolic blood pressure between 90-99
- Stage II hypertension: systolic blood pressure of 160 or more or diastolic blood pressure of 100 or more
The American Heart Association uses the following classification for total cholesterol levels:
- Desirable: total cholesterol <200 mg/dL,
- Borderline high risk: total cholesterol between 200–239 mg/dL and
- High risk: total cholesterol of 240 mg/dL or greater
Body mass index (BMI) is computed as the ratio of weight in kilograms to the square of height in meters, and the following categories are often used (a short code sketch follows the list):
- Underweight: BMI <18.5
- Normal weight: BMI between 18.5-24.9
- Overweight: BMI between 25-29.9
- Obese: BMI of 30 or greater
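Here is the code sketch referred to above: a minimal R example using cut() to turn made-up BMI values into the ordinal categories just listed.

    # Categorizing a continuous measure (BMI) into the ordinal groups listed above.
    # The BMI values below are made up for illustration.
    bmi <- c(17.9, 22.4, 24.9, 25.0, 28.3, 31.7)

    bmi_cat <- cut(bmi,
                   breaks = c(-Inf, 18.5, 25, 30, Inf),
                   right  = FALSE,   # intervals are [lower, upper)
                   labels = c("Underweight", "Normal weight", "Overweight", "Obese"))
    table(bmi_cat)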
These are all examples of common continuous measures that have been categorized to create ordinal variables. The table below is a frequency distribution table for the ordinal blood pressure variable. The mutually exclusive and exhaustive categories are shown in the first column of the table. The frequencies, or numbers of participants in each response category, are shown in the middle column and the relative frequencies, as percentages, are shown in the rightmost columns. The key summary statistics for ordinal variables are relative frequencies and cumulative relative frequencies.
Table 5 - Frequency Distribution for Blood Pressure Category
Note that the cumulative frequencies reflect the number of patients at the particular blood pressure level or below . For example, 2,658 patients have normal blood pressure or pre-hypertension. There are 3,311 patients with normal, pre-hypertension or Stage I hypertension. The cumulative relative frequencies are very useful for summarizing ordinal variables and indicate the proportion (between 0-1) or percentage (between 0%-100%) of patients at a particular level or below. In this example, 75.2% of the patients are NOT classified as hypertensive (i.e., they have normal blood pressure or pre-hypertension). Notice that for the last (highest) blood pressure category, the cumulative frequency is equal to the sample size (n=3,533) and the cumulative relative frequency is 100% indicating that all of the patients are at the highest level or below.
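In R, cumulative relative frequencies for an ordinal factor can be obtained by applying cumsum() to the relative frequencies; the counts in this sketch are invented purely to show the mechanics.

    # Frequencies, relative frequencies, and cumulative relative frequencies for
    # an ordinal variable. The category counts are made up for illustration.
    bp <- factor(rep(c("Normal", "Pre-hypertension", "Stage I HTN", "Stage II HTN"),
                     times = c(40, 35, 18, 7)),
                 levels = c("Normal", "Pre-hypertension", "Stage I HTN", "Stage II HTN"))

    freq    <- table(bp)
    relfreq <- prop.table(freq)

    data.frame(Frequency          = as.vector(freq),
               Relative           = round(as.vector(relfreq), 3),
               CumulativeRelative = round(cumsum(as.vector(relfreq)), 3),
               row.names          = names(freq))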
Table 6 - Frequency Distribution Table for Smoking Status
Graphical Summaries for Discrete Variables
Bar Charts for Dichotomous and Categorical Variables
Graphical displays are very useful for summarizing data, and both dichotomous and non-ordered categorical variables are best summarized with bar charts. The response options (e.g., yes/no, present/absent) are shown on the horizontal axis and either the frequencies or relative frequencies are plotted on the vertical axis. Figure 1 below is a frequency bar chart which corresponds to the tabular presentation in Table 1 above.
Figure 1 - Frequency Bar Chart
Note that for dichotomous and categorical variables there should be a space in between the response options. The analogous graphical representation for an ordinal variable does not have spaces between the bars in order to emphasize that there is an inherent order.
In contrast, figure 2 below illustrates a relative frequency bar chart of the distribution of treatment with antihypertensive medications. This graphical representation corresponds to the tabular presentation in the last column of Table 2 above.
Figure 2 - Relative Frequency Bar Chart
A frequency bar chart for marital status might look like Figure 3 below.
Consider the graphical representation of the data in Table 3 above, comparing the relative frequency of antihypertensive medications between men and women. It would appropriately look like the figure shown below. Note that a range of 0 - 40 was chosen for the vertical axis.
For the example above, the relative frequencies are 31.8% and 37.7%, so scaling the vertical axis from 0 to 40% is appropriate to accommodate the data. However, one can visually mislead the reader regarding the comparison by using a vertical scale that is either too expansive or too restrictive. Consider the two bar charts below (Figures 5 & 6).
These bar charts display the same relative frequencies, i.e., 31.8% and 37.7%. However, the bar chart on the left minimizes the difference, because the vertical scale is too expansive, ranging from 0 - 100%. On the other hand, the bar chart on the right visually exaggerates the difference, because the vertical scale is too restrictive, ranging from 30 - 40%.
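Sketched in ggplot2 (the percentages are those discussed above; assigning 37.7% to men follows the earlier Table 3 comparison), a relative frequency bar chart with a deliberately chosen 0-40% axis might look like this:

    library(ggplot2)

    # Relative frequency bar chart of treatment by sex, using the percentages
    # discussed above; the men/women assignment is taken from the Table 3 text.
    trt <- data.frame(sex = c("Men", "Women"),
                      pct_treated = c(37.7, 31.8))

    ggplot(trt, aes(x = sex, y = pct_treated)) +
      geom_col(width = 0.5) +                   # gaps between bars for unordered categories
      scale_y_continuous(limits = c(0, 40)) +   # a 0-40% scale suits these values
      labs(x = NULL, y = "Percent treated (%)")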
Histograms for Ordinal Variables
A distinguishing feature of bar charts for dichotomous and non-ordered categorical variables is that the bars are separated by spaces to emphasize that they describe non-ordered categories. When dealing with ordinal variables, however, the appropriate graphical format is a histogram. A histogram is similar to a bar chart, except that adjacent bars abut one another in order to reinforce the idea that the categories have an inherent order. The frequency histogram below summarizes the blood pressure data that were presented in tabular format in Table 5 above. Note that the vertical axis displays the frequencies or numbers of participants classified in each category.
Figure 7 - Frequency Histogram for Blood Pressure
This histogram immediately conveys the message that the majority of participants are in the lower two categories of the distribution. A small number of participants are in the Stage II hypertension category. The histogram below is a relative frequency histogram for the same data. Note that the figure is the same, except for the vertical axis, which is scaled to accommodate relative frequencies instead of frequencies.
Figure 8 - Relative Frequency Histogram for Blood Pressure
Descriptive Statistics for Continuous Variables
In order to provide a detailed description of the computations used for numerical and graphical summaries of continuous variables, we selected a small subset (n=10) of participants in the Framingham Heart Study. The data values for these ten participants are shown in the table below. The rightmost column contains the body mass index (BMI) computed using the height and weight measurements.
Table 8 - Data Values for a Small Sample
The first summary statistic that is important to report for a continuous variable (as well as for any discrete variable) is the sample size (in the example here, sample size is n=10). Larger sample sizes produce more precise results and therefore carry more weight. However, there is a point at which increasing the sample size will not materially increase the precision of the analysis. Sample size computations will be discussed in detail in a later module.
Because this sample is small (n=10), it is easy to summarize the sample by inspecting the observed values, for example, by listing the diastolic blood pressures in ascending order:
62 63 64 67 70 72 76 77 81 81
Diastolic blood pressures <80 mm Hg are considered normal, and we can see that the last two exceed the upper limit just barely. However, for a large sample, inspection of the individual data values does not provide a meaningful summary, and summary statistics are necessary. The two key components of a useful summary for a continuous variable are:
- a description of the center or 'average' of the data (i.e., what is a typical value?) and
- an indication of the variability in the data.
Sample Mean
In biostatistics, the term 'average' is a very general term that can be addressed by several statistics. The one that is most familiar is the sample mean, which is computed by summing all of the values and dividing by the sample size. For the sample of diastolic blood pressures in the table above, the sample mean is computed as follows:
Sample mean = (62+63+64+67+70+72+76+77+81+81) /10 = 71.3
To simplify the formulas for sample statistics (and for population parameters), we usually denote the variable of interest as "X". X is simply a placeholder for the variable being analyzed. Here X=diastolic blood pressure.
The general formula for the sample mean is: \(\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}\)
The X with the bar over it represents the sample mean, and it is read as "X bar". The Σ indicates summation (i.e., sum of the X's or sum of the diastolic blood pressures in this example).
When reporting summary statistics for a continuous variable, the convention is to report one more decimal place than the number of decimal places measured. Systolic and diastolic blood pressures, total serum cholesterol, and weight were measured to the nearest integer, so the summary statistics are reported to the nearest tenth. Height was measured to the nearest quarter inch (hundredths place), so the summary statistics are reported to the nearest thousandth. Body mass index was computed to the nearest tenth, so summary statistics are reported to the nearest hundredth.
A second measure of the "average" value is the sample median, which is the middle value in the ordered data set, or the value that separates the top 50% of the values from the bottom 50%. When there is an odd number of observations in the sample, the median is the value that holds as many values above it as below it in the ordered data set. When there is an even number of observations in the sample (e.g., n=10) the median is defined as the mean of the two middle values in the ordered data set. In the sample of n=10 diastolic blood pressures, the two middle values are 70 and 72, and thus the median is (70+72)/2 = 71. Half of the diastolic blood pressures are above 71 and half are below. In this case, the sample mean and the sample median are very similar.
The mean and median provide different information about the average value of a continuous variable. Suppose the sample of 10 diastolic blood pressures looked like the following:
62 63 64 67 70 72 76 77 81 140
In this case, the sample mean \(\bar{x}\) = 772/10 = 77.2, but this does not strike us as a "typical" value, since the majority of diastolic blood pressures in this sample are below 77.2. The extreme value of 140 is affecting the computation of the mean. For this same sample, the median is 71. The median is unaffected by extreme or outlying values. For this reason, the median is preferred over the mean when there are extreme values (either very small or very large values relative to the others). When there are no extreme values, the mean is the preferred measure of a typical value, in part because each observation is considered in the computation of the mean. When there are no extreme values in a sample, the mean and median of the sample will be close in value. Below we provide a more formal method to determine when values are extreme and thus when the median should be used.
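In R, the mean and median of this small sample, with and without the extreme value, can be checked directly:

    # Sample mean and median of the n=10 diastolic blood pressures.
    dbp <- c(62, 63, 64, 67, 70, 72, 76, 77, 81, 81)
    mean(dbp)     # 71.3
    median(dbp)   # (70 + 72) / 2 = 71

    # With the extreme value of 140 replacing the largest observation:
    dbp2 <- c(62, 63, 64, 67, 70, 72, 76, 77, 81, 140)
    mean(dbp2)    # 77.2, pulled upward by the outlier
    median(dbp2)  # still 71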
Table 9 displays the sample means and medians for each of the continuous measures for the sample of n=10 in Table 8.
Table 9 - Means and Medians of Variables in Subsample of Size n=10
For each continuous variable measured in the subsample of n=10 participants, the means and medians are not identical but are relatively close in value suggesting that the mean is the most appropriate summary of a typical value for each of these variables. (If the mean and median are very different, it suggests that there are outliers affecting the mean.)
A third measure of a "typical" value for a continuous variable is the mode, which is defined as the most frequent value. In Table 8 above, the mode of the diastolic blood pressures is 81, the mode of the total cholesterol levels is 227, and the mode of the heights is 70.00, because these values each appear twice while the other values appear only once. For each of the other continuous variables there are 10 distinct values, and thus there is no mode, since no value appears more frequently than any other.
Suppose the diastolic blood pressures had been:
62 63 64 64 70 72 76 77 81 81
In this sample there are two modes: 64 and 81. The mode is a useful summary statistic for a continuous variable. It is not presented instead of either the mean or the median, but rather in addition to the mean or median.
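Base R has no function that returns the most frequent value (its mode() reports an object's storage type), so a table-based approach is a common workaround; a minimal sketch using the two-mode sample above:

    # Finding the mode(s) of a sample via table().
    dbp <- c(62, 63, 64, 64, 70, 72, 76, 77, 81, 81)

    tab <- table(dbp)
    as.numeric(names(tab)[tab == max(tab)])   # both modes: 64 and 81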
Range
The second aspect of a continuous variable that must be summarized is the variability in the sample. A relatively crude, yet important, measure of variability in a sample is the sample range. The sample range is computed as follows:
Sample Range = Maximum Value – Minimum Value
Table 10 displays the sample ranges for each of the continuous measures in the subsample of n=10 observations.
Table 10 Ranges of Variables in Subsample of Size n=10
The range of a variable depends on the scale of measurement. The blood pressures are measured in millimeters of mercury; total cholesterol is measured in milligrams per deciliter, weight in pounds, and so on. The range of total serum cholesterol is large with the minimum and maximum in the sample of size n=10 differing by 125 units. In contrast, the heights of participants are more homogeneous with a range of 11.25 inches. The range is an important descriptive statistic for a continuous variable, but it is based only on two values in the data set. Like the mean, the sample range can be affected by extreme values and thus it must be interpreted with caution. The most widely used measure of variability for a continuous variable is called the standard deviation, which is illustrated below.
Variance and Standard Deviation
If there are no extreme or outlying values of a variable, the mean is the most appropriate summary of a typical value, and to summarize variability in the data we specifically estimate the variability in the sample around the sample mean. If all of the observed values in a sample are close to the sample mean, the standard deviation will be small (i.e., close to zero), and if the observed values vary widely around the sample mean, the standard deviation will be large. If all of the values in the sample are identical, the sample standard deviation will be zero.
When discussing the sample mean, we found that the sample mean for diastolic blood pressure was 71.3. The table below shows each of the observed values along with its respective deviation from the sample mean.
Table 11 - Diastolic Blood Pressures and Deviation from the Sample Mean
The deviations from the mean reflect how far each individual's diastolic blood pressure is from the mean diastolic blood pressure. The first participant's diastolic blood pressure is 4.7 units above the mean while the second participant's diastolic blood pressure is 7.3 units below the mean. What we need is a summary of these deviations from the mean, in particular a measure of how far, on average, each participant is from the mean diastolic blood pressure. If we compute the mean of the deviations by summing the deviations and dividing by the sample size we run into a problem. The sum of the deviations from the mean is zero. This will always be the case as it is a property of the sample mean, i.e., the sum of the deviations below the mean will always equal the sum of the deviations above the mean. However, the goal is to capture the magnitude of these deviations in a summary measure. To address this problem of the deviations summing to zero, we could take absolute values or square each deviation from the mean. Both methods would address the problem. The more popular method to summarize the deviations from the mean involves squaring the deviations (absolute values are difficult in mathematical proofs). Table 12 below displays each of the observed values, the respective deviations from the sample mean and the squared deviations from the mean.
The squared deviations are interpreted as follows. The first participant's squared deviation is 22.09, meaning that his/her diastolic blood pressure is 22.09 units squared from the mean diastolic blood pressure, and the second participant's diastolic blood pressure is 53.29 units squared from the mean diastolic blood pressure. A quantity that is often used to measure variability in a sample is called the sample variance, and it is essentially the mean of the squared deviations. The sample variance is denoted \(s^2\) and is computed as follows: \(s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}\)
In this sample of n=10 diastolic blood pressures, the sample variance is \(s^2\) = 472.10/9 = 52.46. Thus, on average, diastolic blood pressures are 52.46 units squared from the mean diastolic blood pressure. Because of the squaring, the variance is not particularly interpretable. The more common measure of variability in a sample is the sample standard deviation, defined as the square root of the sample variance: \(s = \sqrt{s^2}\). Here, \(s = \sqrt{52.46} \approx 7.2\).
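These calculations can be verified in R; note that var() and sd() use the n-1 denominator, matching the computation above:

    # Sample variance and standard deviation of the n=10 diastolic blood pressures.
    dbp <- c(62, 63, 64, 67, 70, 72, 76, 77, 81, 81)

    dev <- dbp - mean(dbp)   # deviations from the mean; they sum to (essentially) zero
    sum(dev)
    sum(dev^2)               # 472.10

    var(dbp)                 # 472.10 / 9 = 52.46 (n - 1 in the denominator)
    sd(dbp)                  # sqrt(52.46), approximately 7.2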
InterQuartile Range (IQR)
When a data set has outliers or extreme values, we summarize a typical value using the median as opposed to the mean. When a data set has outliers, variability is often summarized by a statistic called the interquartile range, which is the difference between the first and third quartiles. The first quartile, denoted \(Q_1\), is the value in the data set that holds 25% of the values below it. The third quartile, denoted \(Q_3\), is the value in the data set that holds 25% of the values above it. The quartiles can be determined following the same approach that we used to determine the median, but we now consider each half of the data set separately. The interquartile range is defined as follows:
Interquartile Range = \(Q_3 - Q_1\)
With an Even Sample Size:
For the sample (n=10) the median diastolic blood pressure is 71 (50% of the values are above 71, and 50% are below). The quartiles can be determined in the same way we determined the median, except we consider each half of the data set separately.
Figure 9 - Interquartile Range with Even Sample Size
There are 5 values below the median (the lower half); the middle value of the lower half is 64, which is the first quartile. There are 5 values above the median (the upper half); the middle value of the upper half is 77, which is the third quartile. The interquartile range is 77 – 64 = 13; the interquartile range is the range of the middle 50% of the data.
With an Odd Sample Size:
When the sample size is odd, the median and quartiles are determined in the same way. Suppose in the previous example, the lowest value (62) were excluded, and the sample size was n=9. The median and quartiles are indicated below.
Figure 10 - Interquartile Range with Odd Sample Size
When the sample size is 9, the median is the middle number, 72. The quartiles are determined in the same way, looking at the lower and upper halves, respectively. There are 4 values in the lower half, so the first quartile is the mean of the 2 middle values in the lower half ((64+67)/2 = 65.5). The same approach is used in the upper half to determine the third quartile ((77+81)/2 = 79).
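In R, fivenum() reproduces this "median of each half" style of quartile for the n=10 sample; quantile() uses a different interpolation rule by default and can give slightly different values.

    # fivenum() returns Tukey's five-number summary (minimum, lower hinge, median,
    # upper hinge, maximum); for this sample the hinges agree with the quartiles
    # computed above. quantile() uses a different interpolation rule by default.
    dbp <- c(62, 63, 64, 67, 70, 72, 76, 77, 81, 81)

    fivenum(dbp)                        # 62 64 71 77 81
    fivenum(dbp)[4] - fivenum(dbp)[2]   # interquartile range: 77 - 64 = 13

    quantile(dbp, c(0.25, 0.75))        # 64.75 and 76.75 under the default rule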
Outliers and Tukey Fences:
When there are no outliers in a sample, the mean and standard deviation are used to summarize a typical value and the variability in the sample, respectively. When there are outliers in a sample, the median and interquartile range are used to summarize a typical value and the variability in the sample, respectively.
Table 13 displays the means, standard deviations, medians, quartiles and interquartile ranges for each of the continuous variables in the subsample of n=10 participants who attended the seventh examination of the Framingham Offspring Study.
Table 13 - Summary Statistics on n=10 Participants
Table 14 displays the observed minimum and maximum values along with the limits to determine outliers using the quartile rule for each of the variables in the subsample of n=10 participants. Are there outliers in any of the variables? Which statistics are most appropriate to summarize the average or typical value and the dispersion?
Table 14 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants
1. Determined by \(Q_1 - 1.5(Q_3 - Q_1)\)
2. Determined by \(Q_3 + 1.5(Q_3 - Q_1)\)
Since there are no suspected outliers in the subsample of n=10 participants, the mean and standard deviation are the most appropriate statistics to summarize average values and dispersion, respectively, of each of these characteristics.
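A small R sketch of the quartile (Tukey fence) rule applied to the n=10 diastolic blood pressures, using the quartiles computed earlier:

    # Tukey fences for flagging potential outliers. Quartiles are taken from
    # fivenum() to match the approach above.
    dbp <- c(62, 63, 64, 67, 70, 72, 76, 77, 81, 81)

    fn  <- fivenum(dbp)
    q1  <- fn[2]
    q3  <- fn[4]
    iqr <- q3 - q1

    lower <- q1 - 1.5 * iqr   # 64 - 19.5 = 44.5
    upper <- q3 + 1.5 * iqr   # 77 + 19.5 = 96.5

    dbp[dbp < lower | dbp > upper]   # no values flagged in this subsample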
The Full Framingham Cohort
For clarity, we have so far used a very small subset of the Framingham Offspring Cohort to illustrate calculations of summary statistics and determination of outliers. For your interest, Table 15 displays the means, standard deviations, medians, quartiles and interquartile ranges for each of the continuous variables displayed in Table 13, in the full sample (n=3,539) of participants who attended the seventh examination of the Framingham Offspring Study.
Table 15 - Summary Statistics on Sample of (n=3,539) Participants
Table 16 displays the observed minimum and maximum values along with the limits to determine outliers using the quartile rule for each of the variables in the full sample (n=3,539).
Table 16 - Limits for Assessing Outliers in Characteristics Presented in Table 15
Box-Whisker Plots for Continuous Variables
A popular graphical display for a continuous variable is a box-whisker plot. Outliers or extreme values can also be assessed graphically with box-whisker plots. For the subsample of n=10 Framingham participants whom we considered previously, we computed the following summary statistics on diastolic blood pressures: minimum = 62, first quartile = 64, median = 71, third quartile = 77, and maximum = 81.
These are sometimes referred to as quantiles or percentiles of the distribution. A specific quantile or percentile is a value in the data set that holds a specific percentage of the values at or below it. The first quartile, for example, is the 25th percentile, meaning that it holds 25% of the values at or below it. The median is the 50th percentile, the third quartile is the 75th percentile, and the maximum is the 100th percentile (i.e., 100% of the values are at or below it).
A box-whisker plot is a graphical display of these percentiles. Figure 11 is a box-whisker plot of the diastolic blood pressures measured in the subsample of n=10 participants described above in Table 14. The horizontal lines represent (from the top) the maximum, the third quartile, the median (also indicated by the dot), the first quartile and the minimum. The shaded box represents the middle 50% of the distribution (between the first and third quartiles). A box-whisker plot is meant to convey the distribution of a variable at a quick glance. We determined that there were no outliers in the distribution of diastolic blood pressures in the subsample of n=10 participants who attended the seventh examination of the Framingham Offspring Study.
Figure 11 - Box-Whisker Plot of Diastolic Blood Pressures in Subsample of n=10.
Figure 12 is a box-whisker plot of the diastolic blood pressures measured in the full sample (n=3,539) of participants. Recall that in the full sample we determined that there were outliers both at the low and the high end (See Table 16). In Figure 12 the outliers are displayed as horizontal lines at the top and bottom of the distribution. At the low end of the distribution, there are 5 values that are considered outliers (i.e., values below 47.5 which was the lower limit for determining outliers). At the high end of the distribution, there are 12 values that are considered outliers (i.e., values above 99.5 which was the upper limit for determining outliers). The "whiskers" of the plot (boldfaced horizontal brackets) are the limits we determined for detecting outliers (47.5 and 99.5).
Figure 12 - Box-Whisker Plot of Diastolic Blood Pressures with Full Sample (n=3,539) of Participants
Box-whisker plots are very useful for comparing distributions. Figure 13 below shows side-by-side box-whisker plots of the distributions of weights, in pounds, for men and women in the Framingham Offspring Study. The figure clearly shows a shift in the distributions with men having much higher weights. In fact, the 25th percentile of the weights in men is approximately 180 pounds and equal to the 75th percentile in women. Specifically, 25% of the men weigh 180 or less as compared to 75% of the women. There are many outliers at the high end of the distribution among both men and women. There are two outlying low values among men.
Figure 13 - Side-by-Side Box-Whisker Plots of Weights in Men and Women in the Framingham Offspring Study
Because men are generally taller than women (see Figure 14 below), it is not surprising that men have higher weights than women.
Figure 14 - Side-by-Side Box-Whisker Plots of Heights in Men and Women in the Framingham Offspring Study
Because men are taller, a more appropriate comparison is of body mass index, see Figure 15 below.
Figure 15 - Side-by-Side Box-Whisker Plots of Body Mass Index in Men and Women in the Framingham Offspring Study
The distributions of body mass index are similar for men and women. There are again many outliers in the distributions in both men and women. However, when taking height into account (by comparing body mass index instead of comparing weights alone), we see that the most extreme outliers are among the women.
In the box-whisker plots, outliers are values which either exceed \(Q_3 + 1.5 \times IQR\) or fall below \(Q_1 - 1.5 \times IQR\). Some statistical computing packages use the following to determine outliers: values which either exceed \(Q_3 + 3 \times IQR\) or fall below \(Q_1 - 3 \times IQR\), which would result in fewer observations being classified as outliers [7,8]. The rule using 1.5 IQR is the more commonly applied rule to determine outliers.
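A ggplot2 sketch of side-by-side box-whisker plots; the weights here are simulated stand-ins rather than the actual Framingham values, and geom_boxplot() applies the 1.5 IQR rule by default:

    library(ggplot2)

    # Side-by-side box-whisker plots of a continuous variable by group.
    # The weights below are simulated for illustration; with the real data
    # you would supply the actual data frame instead.
    set.seed(5)
    dat <- data.frame(
      sex    = rep(c("Men", "Women"), each = 200),
      weight = c(rnorm(200, mean = 200, sd = 30),
                 rnorm(200, mean = 160, sd = 30))
    )

    ggplot(dat, aes(x = sex, y = weight)) +
      geom_boxplot() +   # points beyond 1.5*IQR from the hinges are drawn as outliers
      labs(x = NULL, y = "Weight (pounds)")

    # geom_boxplot(coef = 3) applies the more conservative 3*IQR rule mentioned above.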
The first important aspect of any statistical analysis is an appropriate summary of the key analytic variables. This involves first identifying the type of variable being analyzed. This step is extremely important as the appropriate numerical and graphical summaries depend on the type of variable being analyzed. Variables are dichotomous, ordinal, categorical or continuous. The best numerical summaries for dichotomous, ordinal and categorical variables involve relative frequencies. The best numerical summaries for continuous variables include the mean and standard deviation or the median and interquartile range, depending on whether or not there are outliers in the distribution. The mean and standard deviation or the median and interquartile range summarize central tendency (also called location) and dispersion, respectively. The best graphical summary for dichotomous and categorical variables is a bar chart and the best graphical summary for an ordinal variable is a histogram. Both bar charts and histograms can be designed to display frequencies or relative frequencies, with the latter being the more popular display. Box-whisker plots provide a very useful and informative summary for continuous variables. Box-whisker plots are also useful for comparing the distributions of a continuous variable among mutually exclusive (i.e., non-overlapping) comparison groups.
The following table summarizes key statistics and graphical displays organized by variable type.
References
1. National Heart, Lung, and Blood Institute. Available at: http://www.nhlbi.nih.gov.
2. Sullivan LM. Repeated Measures. Circulation (in press).
3. Little RJ, Rubin DB. Statistical Analysis with Missing Data. New York, NY: John Wiley and Sons, Inc.; 1987.
4. American Heart Association. Available at: http://www.americanheart.org.
5. The Expert Panel. Expert panel on detection, evaluation and treatment of high blood cholesterol in adults: summary of the second report of the NCEP expert panel (Adult Treatment Panel II). Journal of the American Medical Association. 1993; 269: 3015-3023.
6. Hoaglin DC. John W. Tukey and data analysis. Statistical Science. 2003; 18(3): 311-318.
7. SAS version 9.1 © 2002-2003 by SAS Institute Inc., Cary, NC.
8. S-PLUS version 7.0 © 1999-2006 by Insightful Corp., Seattle, WA.