This lab discusses the basics of visualizing data, probability, the normal distribution, and z scores. The following packages are required for this lab: sfsmisc, psych, car, and tidyverse.
Published on October 23, by Pritha Bhandari. Revised on January 19. In a normal distribution, data is symmetrically distributed with no skew. When plotted on a graph, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the center. Normal distributions are also called Gaussian distributions or bell curves because of their shape.
Recall that histograms are used to visualize continuous data. Histograms are not used to visualize categorical data.
Instead, a bar plot is advised for categorical data. The following is an example of creating a histogram of the age variable within the ds data set. The histogram displays the frequency of age for given bins. Alternatively, the density of age can be shown instead of frequency by making a slight change in the visualization.
The shape of the plot is the same for the frequency and density histograms; however, the y-axis is measured in different units. The tallest bars mark the age ranges on the x-axis into which the largest share of respondents fall.
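The frequency and density versions of a histogram can be compared with a minimal base R sketch. The lab's ds data set is not reproduced here, so the age vector below is simulated and purely hypothetical, and base graphics stand in for whatever plotting code the lab originally used.

```r
# Hypothetical stand-in for ds$age: simulated ages (the lab's real data is not shown)
set.seed(42)
age <- round(rnorm(2500, mean = 55, sd = 15))

# Compute the bins once without drawing, to inspect both y-axis scales
h <- hist(age, breaks = 20, plot = FALSE)

bin_width   <- diff(h$breaks)
total_count <- sum(h$counts)              # frequencies add up to the number of observations
total_area  <- sum(h$density * bin_width) # density bars have a total area of 1

# Draw both versions: same shape, different y-axis units
hist(age, breaks = 20, freq = TRUE,  main = "Frequency of age", xlab = "Age")
hist(age, breaks = 20, freq = FALSE, main = "Density of age",   xlab = "Age")
```

Because the density bars have a total area of 1, the density version is the one that can share an axis with a normal curve.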
Data is organized into ranges, known as bins, to compose the x-axis. A common rule of thumb sets the number of bins near the square root of n. The square root of n for the current data set is a little over 50, so the number of bins is set accordingly. Using various functions along with the histogram function, the visualization is improved with more meaningful information.
Several functions can help. Data approximated by the normal distribution can define probabilities. Comparing the histogram to a generated normal distribution curve may prove difficult unless both are drawn on the same plot, together with a density line for the data. The two shapes can then be compared visually to judge whether the age data can be approximated by the normal distribution. The combination of the histogram, curve, and density line is improved by adding limits and labels to the x-axis and y-axis, defining a number of bins, and adding a chart title.
Including fill and outline colors for the histogram can also make it more readable. R supports a number of distributions; however, for the purpose of these labs we will focus primarily on the normal and binomial distributions. View the Distributions help page (?Distributions) to explore the distributions supported by R. Note: The value returned by the dnorm function is not the probability associated with the occurrence of the x value! The default mean and standard deviation for the dnorm function are 0 and 1, respectively.
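The histogram, density line, and normal curve described above might be combined roughly as follows. This is a sketch under assumptions: the age values are simulated rather than taken from ds, the bin count uses the square-root rule of thumb, and base graphics are used in place of ggplot.

```r
# Hypothetical stand-in for ds$age
set.seed(42)
age <- rnorm(2500, mean = 55, sd = 15)

# Square-root rule of thumb for the number of bins (50 for n = 2500)
n_bins <- ceiling(sqrt(length(age)))
breaks <- seq(min(age), max(age), length.out = n_bins + 1)

# Density-scaled histogram with fill and outline colors, labels, and a title
hist(age, breaks = breaks, freq = FALSE, col = "grey85", border = "white",
     main = "Age compared to a normal curve", xlab = "Age", ylab = "Density")

# Density line of the data (solid) and the normal curve with the same mean and sd (dashed)
lines(density(age), lwd = 2)
curve(dnorm(x, mean = mean(age), sd = sd(age)), add = TRUE, lty = 2, lwd = 2)
```

If the solid and dashed lines nearly coincide, a normal approximation of the variable is plausible.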
The dnorm function, used in conjunction with the age variable from the ds data set, can find the height of the probability density function. In the following example, the dnorm function finds the height of the probability density function for a given age. Similar to previous examples, an argument exists to ignore NA and missing values.
The dnorm function returns the height of the probability density function at that point. Note: By itself, this value is not meaningful.
The dnorm function returns the relative likelihood, which can lead to determining a probability; however, understanding this value further requires some calculus.
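A short sketch may make the behavior of dnorm more concrete. The mean, standard deviation, and x value below are hypothetical, not taken from the ds data set.

```r
# Height of the standard normal density at its peak (defaults: mean = 0, sd = 1)
peak_height <- dnorm(0)                      # equals 1 / sqrt(2 * pi)

# dnorm agrees with the normal density formula evaluated by hand
x <- 65; mu <- 55; sigma <- 15               # hypothetical values
by_hand    <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
from_dnorm <- dnorm(x, mean = mu, sd = sigma)

# With a small standard deviation the height exceeds 1,
# which shows that it cannot be a probability
tall <- dnorm(0, mean = 0, sd = 0.1)
```

The last line is the reason the returned height should be read as a relative likelihood rather than as a probability.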
For continuous data, the probability of any single exact value is effectively zero, so instead the approach should be to find the probability that a value occurs within a specified range. The probability of a value occurring within a specified range is equal to the area under the probability density function between the two points. In calculus this is defined as the integral of the probability density function over that range. The probability associated with an age between 65 and 66 in the age variable is found this way.
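The area between two points can be sketched numerically with base R's integrate function. The mean and standard deviation are hypothetical placeholders for the values that would be computed from the age variable.

```r
mu <- 55; sigma <- 15    # hypothetical mean and sd for the age variable

# Integrate the normal density between 65 and 66
area <- integrate(dnorm, lower = 65, upper = 66, mean = mu, sd = sigma)
prob_integral <- area$value

# The same probability as a difference of two cumulative probabilities
prob_pnorm <- pnorm(66, mean = mu, sd = sigma) - pnorm(65, mean = mu, sd = sigma)
```

Both approaches give the same area, which is why the cumulative function is usually preferred over explicit integration.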
Similarly, the pnorm function calculates probabilities associated with a given x value: it returns the area under the curve to the left of x. The following example uses the pnorm function with the ds data set to find the probability that a respondent is 65 years old or younger. To calculate the probability associated with an age of 65 or greater, the upper tail is used instead of the lower tail. This is equal to the difference between 1 and the lower tail probability previously calculated. The qnorm function is the inverse of the pnorm function.
Given a probability, mean, and standard deviation, the qnorm function will return an x value from the probability density function. The following example finds the upper bound x value associated with a given probability, or area under the curve. The rnorm function generates random values from a normal distribution; here the random values are stored to the rvalues object. Note: The discussed functions are the normal distribution functions provided by R. R includes similar functions for other distributions, with equivalent functionality.
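The lower and upper tail probabilities can be sketched as follows, using the lower.tail argument of pnorm. The mean and standard deviation are hypothetical stand-ins for the values from ds.

```r
mu <- 55; sigma <- 15    # hypothetical mean and sd for the age variable

# Probability that a respondent is 65 or younger (lower tail, the default)
p_lower <- pnorm(65, mean = mu, sd = sigma)

# Probability that a respondent is older than 65 (upper tail)
p_upper <- pnorm(65, mean = mu, sd = sigma, lower.tail = FALSE)

# The upper tail is the complement of the lower tail
p_complement <- 1 - p_lower
```

For a continuous distribution, "65 or greater" and "greater than 65" have the same probability, so the upper tail covers both readings.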
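A brief sketch of qnorm as the inverse of pnorm, followed by rnorm. The probability of 0.90, the mean, the standard deviation, and the sample size are all hypothetical choices for illustration.

```r
mu <- 55; sigma <- 15    # hypothetical mean and sd
p  <- 0.90               # hypothetical area under the curve

# x value below which 90% of the distribution lies
x_upper <- qnorm(p, mean = mu, sd = sigma)

# Feeding that x value back into pnorm recovers the probability
round_trip <- pnorm(x_upper, mean = mu, sd = sigma)

# Random draws from the same normal distribution, stored in rvalues
set.seed(42)
rvalues <- rnorm(1000, mean = mu, sd = sigma)
```

Other distributions follow the same d/p/q/r naming pattern, for example dbinom, pbinom, qbinom, and rbinom for the binomial distribution.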
Thus far the normal distribution has been discussed without visualization. When graphed, data that follow a normal distribution resemble a bell shaped curve. There are many packages that will generate a density curve of your data and a projected normal distribution for comparison, but building all of the visualizations in ggplot provides both an intuitive and informative method of doing so.
Start by creating a density plot of the randomly generated data. Given that the random values were generated by the rnorm function, it is unsurprising that this distribution resembles the normal distribution. The following code generates a density line for the age variable from the ds data set and a projected normal distribution given the mean and standard deviation of the variable. The shape of the density line closely resembles a normal distribution; however, note the slight skew.
Another method to check whether data follows a normal distribution is the QQ plot, available via the qqPlot function. A QQ plot plots the quantiles of the provided variable against the quantiles that would be expected if the data were normally distributed. Data that follows the normal distribution should fall along a straight line.
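The comparison of a density line with a projected normal curve might look like the sketch below. It uses base graphics rather than ggplot, and the age values are simulated and hypothetical, so no skew should be expected here.

```r
# Hypothetical stand-in for ds$age
set.seed(42)
age <- rnorm(2500, mean = 55, sd = 15)

# Kernel density estimate of the data
d <- density(age)

# Density line (solid) and normal curve with the sample mean and sd (dashed)
plot(d, lwd = 2, main = "Density of age vs. projected normal", xlab = "Age")
curve(dnorm(x, mean = mean(age), sd = sd(age)), add = TRUE, lty = 2, lwd = 2)

# Largest vertical gap between the two curves, as a rough numeric summary
gap <- max(abs(d$y - dnorm(d$x, mean = mean(age), sd = sd(age))))

# The estimated density integrates to roughly 1 (Riemann sum over the grid)
area <- sum(d$y) * diff(d$x)[1]
```

With real survey data, a visible offset between the solid and dashed lines is what would reveal skew.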
The following example generates a QQ plot of the age variable. To further inspect the normality, a diagonal line can be generated that will visualize what the slope of the data should be if it were normally distributed.
The slope and intercept of the line must also be calculated. Doing so requires a small amount of basic algebra. Find the first and third quartiles of the age variable and assign them to an object named y, then find the theoretical first and third quartiles for normally distributed data and assign them to x. Calculate the slope by taking the difference in y over the difference in x and assign it to an object named slope.
Then solve for the intercept and assign it to an object named intercept. In the graphic above, the solid blue line shows where data should fall if it follows a normal distribution, and the dashed blue lines represent confidence intervals.
The individual circles represent data points from the variable. If the data points fall within the intervals, then the data likely follows a normal distribution.
The interpretation of this QQ plot is that the data likely follows a normal distribution, as expected given the data was generated via the rnorm function. The QQ plot confirms the density plot. Note that some values are outside the confidence interval. These are the points associated with the skew previously observed. A final method to inspect whether data follows a normal distribution is the box plot.
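The slope and intercept calculation for the reference line can be sketched in base R as follows. The age values are simulated and hypothetical, and qqnorm is used here in place of the qqPlot function so that the sketch needs no extra packages.

```r
# Hypothetical stand-in for ds$age
set.seed(42)
age <- rnorm(2500, mean = 55, sd = 15)

# Observed first and third quartiles of the variable
y <- unname(quantile(age, probs = c(0.25, 0.75), na.rm = TRUE))

# Theoretical first and third quartiles of the standard normal distribution
x <- qnorm(c(0.25, 0.75))

# Line through the two quartile points: rise over run, then solve for the intercept
slope     <- diff(y) / diff(x)
intercept <- y[1] - slope * x[1]

# QQ plot of the data with the reference line added
qqnorm(age)
abline(a = intercept, b = slope, lty = 2)
```

For data that is close to normal, the slope lands near the standard deviation and the intercept near the mean, which gives a quick sanity check on the line.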
Box plots provide quartiles and the median, and show outlying values individually beyond the ends of the whiskers. The following code generates a box plot of the age variable. Note the middle line is the second quartile, or the median. The distance between the median line and the line below it (the bottom of the box) spans the second quarter of the data. Above the median, the distance between the median line and the next line above it (the top of the box) spans the third quarter, and the area above that, and below the top whisker, is the fourth quarter.
Notice that for the box plot of our randomly generated data, the distance between quartiles is relatively even. For the age variable, the distance between quartiles is also relatively even; however, note the difference from the previously generated box plot of the rnorm data. Standardizing, or scaling, data provides conveniences in discussing data.
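The quartiles behind a box plot can be inspected with a short base R sketch. The age values are simulated and hypothetical, so the result corresponds to the randomly generated case rather than to the survey data.

```r
# Hypothetical stand-in for ds$age
set.seed(42)
age <- rnorm(2500, mean = 55, sd = 15)

# First quartile, median, and third quartile
q <- unname(quantile(age, probs = c(0.25, 0.50, 0.75)))

# Draw the box plot and keep the numbers behind it
b <- boxplot(age, ylab = "Age", main = "Box plot of age")

# b$stats rows: lower whisker, bottom of box, median, top of box, upper whisker
median_from_box <- b$stats[3, 1]

# Compare the width of the second and third quarters of the data
lower_half_of_box <- q[2] - q[1]
upper_half_of_box <- q[3] - q[2]
ratio <- upper_half_of_box / lower_half_of_box

# Points drawn individually beyond the whiskers
outliers <- b$out
```

A ratio near 1 means the median sits in the middle of the box, as expected for symmetric data; a skewed variable would push it away from 1.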
For instance, discussing how many standard deviations a particular value occurs from the mean is more meaningful than purely the distance. Scaling data in terms of z-scores provides the number of standard deviations a value is from the mean.
The following example employs the scale function to calculate the z-score for each data point and assigns the results to a newly created variable. To do this, use the mutate function, which is a tidyverse function that creates new variables or modifies existing variables. The scale function is enclosed by the c function to ensure the result is a one-dimensional vector. Using a filter approach, the following example finds the z-score associated with women younger than 19 years old.
First filter the data with the preferred stipulations, then use the select verb from the dplyr package (part of the tidyverse) to keep the columns of interest. The result shows that, within the ds data set, there is one respondent that identified as a woman under 19 years old. The z-score for this respondent is well below zero, which indicates that the mean age of respondents is much higher than the age of the 18 year old woman. The difference between the mean age of respondents and the woman is the product of the z-score and standard deviation.
The following calculates the standard deviation of the age variable. The sd function returns the standard deviation of data. The product of the standard deviation and the calculated z-score is the difference between the 18 year old woman and the mean age of respondents.
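The z-score steps can be sketched in base R, which stands in here for the mutate, filter, and select verbs so that no packages are required. The six-row data frame, the gender coding, and the column name z_age are all hypothetical.

```r
# Tiny hypothetical data set; the real ds data is not reproduced here
ds <- data.frame(
  age    = c(18, 34, 47, 52, 61, 70),
  gender = c("woman", "man", "woman", "man", "woman", "man")
)

# scale() returns a one-column matrix; c() flattens it to a plain vector
ds$z_age <- c(scale(ds$age))

# Keep women younger than 19 and only the columns of interest
young_women <- ds[ds$gender == "woman" & ds$age < 19, c("age", "z_age")]

# Standard deviation of age, ignoring missing values
age_sd <- sd(ds$age, na.rm = TRUE)

# z-score times standard deviation gives the distance from the mean age
diff_from_mean <- young_women$z_age * age_sd
```

The product is negative because the respondent is younger than the mean, and adding it to the mean age returns her age of 18.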
The following packages are required for this lab: sfsmisc, psych, car, tidyverse.
Histogram: Compare to normal distribution
You may have noticed that numerical data is often summarized with the average value. For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation. For example, you might read a report stating that scores were plus or minus 50 (the standard deviation). The report has summarized an entire vector of scores with just two numbers.
Understanding normal distributions
When examining data, it is often best to create a graphical representation of the distribution. Graphs such as histograms make it easy to see a few very important characteristics of the data, such as its overall pattern, striking deviations from that pattern, and its shape, center, and spread. A histogram is particularly useful when there is a large number of observations. Histograms break the range of values into classes, and display only the count or percent of the observations that fall into each class. This chapter will focus specifically on probability histograms, which are an idealization of the relative frequency distribution.