Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve time-of-day subsetting performance #193

Closed
joshuaulrich opened this issue Jun 10, 2017 · 7 comments
Closed

Improve time-of-day subsetting performance #193

joshuaulrich opened this issue Jun 10, 2017 · 7 comments
Assignees
Labels
enhancement Enhancement to existing feature
Milestone

Comments

@joshuaulrich
Copy link
Owner

A StackOverflow user wrote a function that is 200x faster than xts' current time-of-day subset implementation. Included below, byte-compiled, and in case the StackOverflow page is not available sometime in the future.

cut_time_of_day <- compiler::cmpfun(function(x, t_str_begin, t_str_end) {
  tstr_to_sec <- function(t_str) {
    #"09:00:00" to sec of day
    as.numeric(as.POSIXct(paste("1970-01-01", t_str), "UTC")) %% 86400L
  }

  #handle tzone
  tz <- indexTZ(x)
  sec_of_day = {
    lt = as.POSIXlt(index(x), tz = tz)
    lt$hour *60*60 + lt$min*60 + lt$sec
  }
  sec_begin  = tstr_to_sec(t_str_begin)
  sec_end    = tstr_to_sec(t_str_end)

  return(x[ sec_of_day >= sec_begin & sec_of_day <= sec_end,])
})

And the relevant benchmark:

require(microbenchmark)
require(xts)

n <- 100000
dtime <- seq(ISOdate(2001,1,1), by = 60*60, length.out = n)
attributes(dtime)$tzone <- "America/Chicago"
x <- xts(1:n, order.by = dtime)

f1 <- function(x) { cut_time_of_day(x, "07:00:00", "09:00:00") }
f2 <- function(x) { x["T07:00:00/T09:00:00"] }

microbenchmark(f1(x), f2(x), times=2)
# Unit: milliseconds
#   expr        min         lq       mean     median         uq        max neval cld
#  f1(x)   13.72245   13.72245   14.66432   14.66432   15.60619   15.60619     2  a 
#  f2(x) 3013.40674 3013.40674 3056.50941 3056.50941 3099.61208 3099.61208     2   b

identical(f1(x), f2(x))
# [1] TRUE
@joshuaulrich joshuaulrich added the enhancement Enhancement to existing feature label Jun 10, 2017
@ghost
Copy link

ghost commented Jun 10, 2017

ok, but what if index:

  1. contains negative values
  2. is out of bound (duplicated?)
  3. is not ordered
  4. is NA (duplicated?)

Daniel

@braverock
Copy link
Contributor

All valid xts indices must be ordered, so checking for an unordered index is nice defensive programming, but the user should expect an error if they are constructing an index (e.g. in C/C++) which is not ordered.

Negative values may be avoided by an appropriate choice of origin.

For intraday data, one should probably use make.index.unique to avoid duplicated indices.

As for NA or out of bounds data, I think this could be contained in a try block, and we could handle any errors in the catch.

All these checks in the existing code seem nice, but none of them seem strictly necessary. A well-constructed object should not have these problems, except possibly duplicated indices. The normal constructors will prevent the rest.

@joshuaulrich
Copy link
Owner Author

@silentbits All your questions point to the i argument of [.xts, not the index of the xts object. Therefore, none of your questions are relevant for the function I included in the original post. The function only subsets an xts by a logical i that is always the same length as the index. Perhaps it's possible that this logical vector has some NA values, but that's handled as a special case in the xts subsetting functionality.

@joshuaulrich
Copy link
Owner Author

Need unit tests around daylight saving time. This should include DST starting and ending inside the desired interval, as well as when the interval contains a time affected by DST. For example, the interval is something like "T02:00:00/T19:00:00" when DST starts (no 0200) and ends (0200 occurs twice).

@ckatsulis
Copy link

Did the improved sub-setting ever get released?

@joshuaulrich
Copy link
Owner Author

@ckatsulis No, and in general, an issue is not released if was opened prior to a release and remains open after the release. I also try to ensure commits reference the issues they address, and GitHub will display those commits inline with issue comments. No work has been done on this issue, so there are no commits that reference it.

You (or anyone else) could help get this issue included in the next release by providing some unit test cases. I'm very reluctant to release code changes that are not accompanied by relevant tests. That is especially true in this case, since it is essentially a re-write of the existing functionality.

@joshuaulrich joshuaulrich self-assigned this Jan 13, 2019
joshuaulrich added a commit that referenced this issue Jan 13, 2019
This restores the behavior that existed before the prior commit to
improve performance.

Yes, this actually makes sense and is useful for financial markets.
For example, some futures markets open at 18:00 and close at 16:00 the
next day.

See #193.
@joshuaulrich
Copy link
Owner Author

@ckatsulis This is now in master. I'd appreciate it if you could use this development version and provide any feedback before the next release.

joshuaulrich added a commit that referenced this issue May 12, 2019
Conflicts in R/toperiod.R related to:
 - #53 time-based to Date-based index causes duplicate index values)
 - #277 to.daily producing duplicates

Conflicts in R/xts.methods.R related to:
 - #193 time-of-day subset performance

Fixes #245.
@joshuaulrich joshuaulrich added this to the 0.12-0 milestone Jul 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement to existing feature
Projects
None yet
Development

No branches or pull requests

3 participants