Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong results when date-indexed xts with duplicate index values is subset by its own index #275

Closed
joshuaulrich opened this issue Oct 10, 2018 · 3 comments
Labels
Milestone

Comments

@joshuaulrich
Copy link
Owner

Given a xts object with a date index that contains duplicate dates, incorrect results are returned if the object is subset with its own date index.

Thanks to scs for their Stackoverflow question.

require(xts)
Data <- structure(
  list(
    timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
                  "2014-07-31 23:00:00", "2014-08-09 17:00:00",
                  "2014-08-14 20:00:00", "2014-08-14 22:00:00",
                  "2014-08-16 15:00:00", "2014-08-19 02:00:00", 
                  "2014-12-28 18:00:00", "2015-01-17 17:00:00"),
    user = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4)),
  .Names = c("timestamp", "user"),
  row.names = c("220667", "331481", "422653", "629430", "378111", "646137",
                "558638", "151641", "599370", "482750"),
  class = "data.frame")   

(x <- xts(Data$user, as.Date(Data$timestamp)))
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
x[index(x)]  # Different (wrong) data. Why?
#            [,1]
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4

Session Info

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.11-2 zoo_1.8-3 

loaded via a namespace (and not attached):
[1] compiler_3.5.1  grid_3.5.1      lattice_0.20-35
@pverspeelt
Copy link
Contributor

I did some more checking.

# remove rownames from data
rownames(Data) <- NULL

xts with as.Date test

test_xts_date <- xts(x=Data[, -1], order.by = as.Date(Data$timestamp))

# check if identical xts and xts with index
identical(test_xts_date, test_xts_date[index(test_xts_date)])
[1] FALSE

test_date <- data.frame(timestamp = as.character(index(test_xts_date)), user = coredata(test_xts_date), stringsAsFactors = FALSE)
Data2 <- data.frame(timestamp = as.character(as.Date(as.POSIXct(Data$timestamp))), user = Data$user, stringsAsFactors = FALSE)

# Check if input equals output
identical(Data2, test_date)
[1] TRUE

So the issue is purely when indexing / selecting records within the xts object. Now if you look at the outcome of the xts with index compared to the input data, you can see that the first 2 records disappeared.

test_xts_date[index(test_xts_date)])
           [,1]
2014-07-31    2
2014-08-09    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-16    3
2014-08-19    3
2014-12-28    4
2015-01-17    4

But the issue is not as such with the index. If I change the index with the dates from the Data$timestamp you have the same issue.

dates <- as.Date(Data$timestamp)
test_xts_date[dates]
           [,1]
2014-07-31    2
2014-08-09    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-16    3
2014-08-19    3
2015-01-17    4
2015-01-17    4

I also checked what happened if the timestamp was not passed as a date, but as Posixct. Then it works correctly, but only if there are no duplicate timestamps. As soon as you introduce a duplicate record, the issue appears.

I will do some more tests later.

@pverspeelt
Copy link
Contributor

I have narrowed the issue down to the function fill_window_dups_rev. When window_idx is called to subset the xts object, all the firstlast objects and the tmp object just before calling fill_window_dups_rev contain 10 records. The res object after calling fill_window_dups_rev suddenly contains 12 records with the first 2 as NA and index 5 and 6 (the ones with date "2014-08-14") are duplicated in the res object. This is passed into the last filling of firstlast and hence you get duplicated records for index position 5 and 6 and index positions 1 and 2 are dropped.

debugging:

if(usr_idx && !is.null(firstlast)) {
    # Translate from user .index to xts index
    # We get back upper bound of index as per findInterval
    tmp <- base_idx[firstlast]

    res <- .Call("fill_window_dups_rev", tmp, .index(x), PACKAGE = "xts")
    firstlast <- rev(res)
  }

result of the objects:

tmp
 [1]  1  2  3  4  6  6  7  8  9 10

res
 [1] 10  9  8  7  6  5  6  5  4  3 NA NA

firstlast 
[1] NA NA  3  4  5  6  5  6  7  8  9 10

Which when the NA's are removed from the firstlast object the outcome results in the issue specified above.

@joshuaulrich
Copy link
Owner Author

Thanks for the detailed investigation @pverspeelt! The fill_window_dups_rev() function was introduced as part of the fix for #251. I used the previous commit (1d707c5) to check whether this was a regression caused by the fix, or if it existed prior to the fix as well.

The output below shows the results prior to introducing the fill_window_dups_rev() function.

require(xts)
Data <- structure(
  list(
    timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
                  "2014-07-31 23:00:00", "2014-08-09 17:00:00",
                  "2014-08-14 20:00:00", "2014-08-14 22:00:00",
                  "2014-08-16 15:00:00", "2014-08-19 02:00:00", 
                  "2014-12-28 18:00:00", "2015-01-17 17:00:00"),
    user = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4)),
  .Names = c("timestamp", "user"),
  row.names = c("220667", "331481", "422653", "629430", "378111", "646137",
                "558638", "151641", "599370", "482750"),
  class = "data.frame")   

(x <- xts(Data$user, as.Date(Data$timestamp)))
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
x[index(x)]  # Different (wrong) data. Why?
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
packageVersion("xts")
# [1] '0.11.0'

These results look correct. There are duplicate values for 2014-08-14, but that isn't obviously incorrect. The index contains duplicate values for that date, and the i in the subset also contains duplicate values for that date. So you could argue that all observations with that date should be returned for each value of that date in i.

In situations where behavior is unclear, I prefer to defer to consistent behavior with zoo. That's not possible in this case, because zoo does not attempt to handle objects with duplicate index values. For example:

z <- as.zoo(x)
z[index(z)]
# 2013-03-06 2014-07-06 2014-07-31 2014-08-09 2014-08-14 2014-08-14 2014-08-16 
#          1          2          2          3          3          3          3 
# 2014-08-19 2014-12-28 2015-01-17 
#          3          4          4 
# Warning message:
# In zoo(rval, index(x)[i]) :
#   some methods for "zoo" objects do not work if the index entries in 'order.by' are not unique

We may consider what zoo does when you subset a zoo object with and i that contains duplicate values. Although we still need to keep in mind that zoo may behave this way because it doesn't try to handle objects with duplicate index values.

z[-5,][index(z)]  # Remove one of the 2014-08-14 rows
# 2013-03-06 2014-07-06 2014-07-31 2014-08-09 2014-08-14 2014-08-16 2014-08-19 
#          1          2          2          3          3          3          3 
# 2014-12-28 2015-01-17 
#          4          4

joshuaulrich added a commit that referenced this issue Oct 15, 2018
The result of subsetting an object with 'i' that contains duplicates,
may be longer than 'i' and/or the object. But the code assumed the
subset result would never be longer than the object index. This worked
correctly prior to 1d707c5, when the
fill_window_dups_rev() function was added.

Check for when the subset output length reaches the length of the
initially allocated result object. Increase the length of the result
object by twice the remaining elements in 'i' if we run out of space.

Fixes #275.
@joshuaulrich joshuaulrich added this to the 0.11-2 milestone Jul 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants