Wrong results when date-indexed xts with duplicate index values is subset by its own index #275

joshuaulrich · 2018-10-10T12:12:06Z

Given an xts object with a date index that contains duplicate dates, incorrect results are returned if the object is subset with its own date index.

Thanks to scs for their Stack Overflow question.

require(xts)
Data <- structure(
  list(
    timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
                  "2014-07-31 23:00:00", "2014-08-09 17:00:00",
                  "2014-08-14 20:00:00", "2014-08-14 22:00:00",
                  "2014-08-16 15:00:00", "2014-08-19 02:00:00", 
                  "2014-12-28 18:00:00", "2015-01-17 17:00:00"),
    user = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4)),
  .Names = c("timestamp", "user"),
  row.names = c("220667", "331481", "422653", "629430", "378111", "646137",
                "558638", "151641", "599370", "482750"),
  class = "data.frame")   

(x <- xts(Data$user, as.Date(Data$timestamp)))
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
x[index(x)]  # Different (wrong) data. Why?
#            [,1]
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4

Session Info

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.11-2 zoo_1.8-3 

loaded via a namespace (and not attached):
[1] compiler_3.5.1  grid_3.5.1      lattice_0.20-35

pverspeelt · 2018-10-10T13:00:57Z

I did some more checking.

# remove rownames from data
rownames(Data) <- NULL

xts with as.Date test

test_xts_date <- xts(x=Data[, -1], order.by = as.Date(Data$timestamp))

# check if identical xts and xts with index
identical(test_xts_date, test_xts_date[index(test_xts_date)])
[1] FALSE

test_date <- data.frame(timestamp = as.character(index(test_xts_date)), user = coredata(test_xts_date), stringsAsFactors = FALSE)
Data2 <- data.frame(timestamp = as.character(as.Date(as.POSIXct(Data$timestamp))), user = Data$user, stringsAsFactors = FALSE)

# Check if input equals output
identical(Data2, test_date)
[1] TRUE

So the issue arises purely when indexing / selecting records within the xts object. If you compare the result of subsetting the xts object by its own index with the input data, you can see that the first 2 records have disappeared and the two 2014-08-14 records appear twice.

test_xts_date[index(test_xts_date)]
           [,1]
2014-07-31    2
2014-08-09    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-16    3
2014-08-19    3
2014-12-28    4
2015-01-17    4

But the issue is not specific to using index(). Subsetting with the dates derived from Data$timestamp shows the same problem.

dates <- as.Date(Data$timestamp)
test_xts_date[dates]
           [,1]
2014-07-31    2
2014-08-09    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-16    3
2014-08-19    3
2015-01-17    4
2015-01-17    4

I also checked what happens if the timestamp is passed as POSIXct instead of as a Date. Then it works correctly, but only if there are no duplicate timestamps. As soon as you introduce a duplicate record, the issue appears.

I will do some more tests later.

pverspeelt · 2018-10-11T11:09:51Z

I have narrowed the issue down to the function fill_window_dups_rev. When window_idx is called to subset the xts object, all the firstlast objects, and the tmp object just before the call to fill_window_dups_rev, contain 10 records. The res object returned by fill_window_dups_rev contains 12 records: two are NA, and positions 5 and 6 (the ones with date "2014-08-14") appear twice. The reversed res becomes the final firstlast, so the NAs end up in the first two places. Hence you get duplicated records for index positions 5 and 6, and index positions 1 and 2 are dropped.

debugging:

if(usr_idx && !is.null(firstlast)) {
  # Translate from user .index to xts index
  # We get back upper bound of index as per findInterval
  tmp <- base_idx[firstlast]

  res <- .Call("fill_window_dups_rev", tmp, .index(x), PACKAGE = "xts")
  firstlast <- rev(res)
}

result of the objects:

tmp
 [1]  1  2  3  4  6  6  7  8  9 10

res
 [1] 10  9  8  7  6  5  6  5  4  3 NA NA

firstlast 
[1] NA NA  3  4  5  6  5  6  7  8  9 10

Once the NAs are removed from the firstlast object, the remaining positions produce the incorrect result shown above.
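The effect of those values can be checked in plain R without xts. This is only a sketch that starts from the res vector printed above; the names kept, dropped and repeated are made up for illustration and do not appear in the xts source.

```r
# Values copied from the debugging output above.
res <- c(10L, 9L, 8L, 7L, 6L, 5L, 6L, 5L, 4L, 3L, NA, NA)

# Same reversal as in the quoted code.
firstlast <- rev(res)

# Hypothetical helper names: drop the NAs, then compare with rows 1..10.
kept     <- firstlast[!is.na(firstlast)]
dropped  <- setdiff(seq_len(10), kept)   # rows that are never selected
repeated <- kept[duplicated(kept)]       # rows that are selected twice

dropped
repeated
```

With these inputs, dropped holds rows 1 and 2 and repeated holds rows 5 and 6, which matches the rows missing and doubled in the wrong output.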

joshuaulrich · 2018-10-11T11:40:08Z

Thanks for the detailed investigation @pverspeelt! The fill_window_dups_rev() function was introduced as part of the fix for #251. I used the previous commit (1d707c5) to check whether this was a regression caused by the fix, or if it existed prior to the fix as well.

The output below shows the results prior to introducing the fill_window_dups_rev() function.

require(xts)
Data <- structure(
  list(
    timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
                  "2014-07-31 23:00:00", "2014-08-09 17:00:00",
                  "2014-08-14 20:00:00", "2014-08-14 22:00:00",
                  "2014-08-16 15:00:00", "2014-08-19 02:00:00", 
                  "2014-12-28 18:00:00", "2015-01-17 17:00:00"),
    user = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4)),
  .Names = c("timestamp", "user"),
  row.names = c("220667", "331481", "422653", "629430", "378111", "646137",
                "558638", "151641", "599370", "482750"),
  class = "data.frame")   

(x <- xts(Data$user, as.Date(Data$timestamp)))
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
x[index(x)]  # Different (wrong) data. Why?
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
packageVersion("xts")
# [1] '0.11.0'

These results look correct. There are duplicate values for 2014-08-14, but that isn't obviously incorrect. The index contains duplicate values for that date, and the i in the subset also contains duplicate values for that date. So you could argue that all observations with that date should be returned for each value of that date in i.
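To make that reading concrete, here is a small base R sketch of the "return every matching row for every value in i" rule. It is an illustration of the semantics only, not the xts implementation, and it makes no claim about the order in which xts returns the rows; idx, i and pos are made-up names.

```r
# A short index with one duplicated date, subset by itself.
idx <- as.Date(c("2014-08-09", "2014-08-14", "2014-08-14", "2014-08-16"))
i   <- idx

# For each value in i, collect every position in idx with that date.
pos <- unlist(lapply(seq_along(i), function(k) which(idx == i[k])))

pos
length(pos) > length(idx)
```

Each of the two 2014-08-14 values in i matches both 2014-08-14 rows, so pos has six entries for a four-row index. The same rule gives 12 rows for the 10-row example above.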

In situations where behavior is unclear, I prefer to defer to consistent behavior with zoo. That's not possible in this case, because zoo does not attempt to handle objects with duplicate index values. For example:

z <- as.zoo(x)
z[index(z)]
# 2013-03-06 2014-07-06 2014-07-31 2014-08-09 2014-08-14 2014-08-14 2014-08-16 
#          1          2          2          3          3          3          3 
# 2014-08-19 2014-12-28 2015-01-17 
#          3          4          4 
# Warning message:
# In zoo(rval, index(x)[i]) :
#   some methods for "zoo" objects do not work if the index entries in 'order.by' are not unique

We may consider what zoo does when you subset a zoo object with an i that contains duplicate values, although we still need to keep in mind that zoo may behave this way because it doesn't try to handle objects with duplicate index values.

z[-5,][index(z)]  # Remove one of the 2014-08-14 rows
# 2013-03-06 2014-07-06 2014-07-31 2014-08-09 2014-08-14 2014-08-16 2014-08-19 
#          1          2          2          3          3          3          3 
# 2014-12-28 2015-01-17 
#          4          4

The result of subsetting an object with an 'i' that contains duplicates may be longer than 'i' and/or the object. But the code assumed the subset result would never be longer than the object index. This worked correctly prior to 1d707c5, when the fill_window_dups_rev() function was added. Check for when the subset output length reaches the length of the initially allocated result object. Increase the length of the result object by twice the remaining elements in 'i' if we run out of space. Fixes #275.
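The growth strategy described in that message can be sketched in R. This is an assumption-laden illustration of the idea and not the actual fix: the function name expand_matches and its linear which() lookup are hypothetical, chosen only to show a result buffer that starts at the length of the index and grows by twice the remaining elements of 'i' when it fills up.

```r
# Hypothetical sketch: collect all matching positions, growing the buffer on demand.
expand_matches <- function(idx, i) {
  res <- integer(length(idx))  # initial allocation: same length as the index
  n <- 0L                      # number of slots filled so far
  for (k in seq_along(i)) {
    hits <- which(idx == i[k])
    while (n + length(hits) > length(res)) {
      # Out of space: extend by twice the number of elements of i still to process.
      remaining <- length(i) - k + 1L
      length(res) <- length(res) + 2L * remaining
    }
    res[n + seq_along(hits)] <- hits
    n <- n + length(hits)
  }
  res[seq_len(n)]              # trim unused slots
}

expand_matches(c(1, 2, 2, 3), c(1, 2, 2, 3))
```

In this call the buffer starts with four slots and has to grow once, when the second duplicate value in 'i' is processed. Without that growth step the result would not fit in the initial allocation.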

joshuaulrich added the bug label Oct 10, 2018

joshuaulrich closed this as completed in 02fe358 Nov 3, 2018

joshuaulrich added this to the 0.11-2 milestone Jul 4, 2019
