R: How to remove outliers from a smoother in ggplot2?

0

Issue

I have the following data set that I am trying to plot with ggplot2, it is a time series of three experiments A1, B1 and C1 and each experiment had three replicates.

I am trying to add a stat which detects and removes outliers before returning a smoother (mean and variance?). I have written my own outlier function (not shown) but I expect there is already a function to do this, I just have not found it.

I’ve looked at stat_sum_df(“median_hilow”, geom = “smooth”) from some examples in the ggplot2 book, but I didn’t understand the help doc from Hmisc to see if it removes outliers or not.

Is there a function to remove outliers like this in ggplot, or where would I amend my code below to add my own function?

EDIT: I just saw this (How to use Outlier Tests in R Code) and notice that Hadley recommends using a robust method such as rlm. I am plotting bacterial growth curves, so I don’t think a linear model is best, but any advice on other models or using or using robust models in this situation would be appreciated.

library (ggplot2)  

data = data.frame (day = c(1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7), od = 
c(
0.1,1.0,0.5,0.7
,0.13,0.33,0.54,0.76
,0.1,0.35,0.54,0.73
,1.3,1.5,1.75,1.7
,1.3,1.3,1.0,1.6
,1.7,1.6,1.75,1.7
,2.1,2.3,2.5,2.7
,2.5,2.6,2.6,2.8
,2.3,2.5,2.8,3.8), 
series_id = c(
"A1", "A1", "A1","A1",
"A1", "A1", "A1","A1",
"A1", "A1", "A1","A1",
"B1", "B1","B1", "B1",
"B1", "B1","B1", "B1",
"B1", "B1","B1", "B1",
"C1","C1", "C1", "C1",
"C1","C1", "C1", "C1",
"C1","C1", "C1", "C1"),
replicate = c(
"A1.1","A1.1","A1.1","A1.1",
"A1.2","A1.2","A1.2","A1.2",
"A1.3","A1.3","A1.3","A1.3",
"B1.1","B1.1","B1.1","B1.1",
"B1.2","B1.2","B1.2","B1.2",
"B1.3","B1.3","B1.3","B1.3",
"C1.1","C1.1","C1.1","C1.1",
"C1.2","C1.2","C1.2","C1.2",
"C1.3","C1.3","C1.3","C1.3"))

> data
   day   od series_id replicate
1    1 0.10        A1      A1.1
2    3 1.00        A1      A1.1
3    5 0.50        A1      A1.1
4    7 0.70        A1      A1.1
5    1 0.13        A1      A1.2
6    3 0.33        A1      A1.2
7    5 0.54        A1      A1.2
8    7 0.76        A1      A1.2
9    1 0.10        A1      A1.3
10   3 0.35        A1      A1.3
11   5 0.54        A1      A1.3
12   7 0.73        A1      A1.3
13   1 1.30        B1      B1.1
... etc...

This is what I have so far and is working nicely, but outliers are not removed:

r <- ggplot(data = data, aes(x = day, y = od))
r + geom_point(aes(group = replicate, color = series_id)) + # add points
   geom_line(aes(group = replicate, color = series_id)) + # add lines
   geom_smooth(aes(group = series_id))  # add smoother, average of each replicate

EDIT: I just added two charts below showing examples of the outlier problems that I’m having from the real data rather than the example data above.

The first plots shows series p26s4 and around day 32 something really weird went on in two of the replicates, showing 2 outliers.

The second plots shows series p22s5 and on day 18, something weird went on with the reading that day, likely machine error I think.

At the moment I am eye-balling the data, to check that the growth curves look OK. After taking Hadley’s advice and setting family = “symmetric”, I am confident that the loess smoother does a decent job of ignoring the outliers.

p26s4 shows around day 32 something really weird went on in two of the replicates, showing 2 outliers

p22s5 shows that on day 18, something weird went on with the reading that day, likely machine error I think

@Peter/@hadley, the next thing I’d like to do is to try and fit a logistic, gompertz or richard’s growth curve to this data instead of a loess and calculate the growth rate in the exponential stage. Eventually I plan to use the grofit package in R (http://cran.r-project.org/web/packages/grofit/index.html), but for now I’d like to plot these manually using ggplot2 if possible. If you have any pointers then it would be much appreciated.

Solution

Have you tried the family = "symmetric" argument to geom_smooth (which will in turn get passed on to loess)? This will make the loess smooth resistant to outliers.

The syntax would be:

geom_smooth(method = loess, method.args = list(family = "symmetric"))

However, looking at your data, why do you think a linear fit is not adequate? You only have 4 x values, and there certainly doesn’t seem to be strong evidence for a departure from linearity.

Answered By – hadley

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave A Reply

Your email address will not be published.

This website uses cookies to improve your experience. We'll assume you're ok with this, but you can opt-out if you wish. Accept Read More