split.xts() repeats list element names #392

Closed
ocallaghanm opened this issue Feb 25, 2023 · 5 comments
@ocallaghanm

ocallaghanm commented Feb 25, 2023

Description

I have been trying to split my dataset by month with split.xts(). The split points (endpoints) are correct, but the names of the elements in the returned list are not: in some cases the same month appears twice in a row, even though each element holds the data for the correct month. Updating xts and R itself did not change anything. The problem does not occur consistently, however.

Expected behavior

In the code below, I use a full-scale (testdat_large.csv) and a reduced (testdat.csv) version of my original data. baz, from the reduced data, has 3 elements named Sep 2021, Oct 2021 and Nov 2021, as expected. But qux, from the full data, has Sep 2021, Sep 2021 and Nov 2021, even though the splits themselves are correct.

Thinking it might be a problem with my data, I used sample_matrix. The resulting foo, however, has Jan 2007, Jan 2007, Feb 2007, Mar 2007, Apr 2007, May 2007 instead of the expected Jan through Jun. Again, the endpoints are correct (e.g. the second element named Jan 2007 has 28 days, so it is in fact February).

I thought it might be because these datasets all have multiple columns, so I tried extracting a single column into dat.sing. But the resulting bar shows exactly the same problem. So the issue seems to be that, for some reason, some list elements receive the wrong name and the names of subsequent elements are offset.

Minimal, reproducible example

library("xts")
data("sample_matrix")
dat <- as.xts(sample_matrix)

dat.sing <- dat$Open
dat2 <- read.delim.zoo("~/testdat.csv", 
                       format = "%d.%m.%Y %H:%M", 
                       tz = "CET", 
                       sep = ";", 
                       header = TRUE) |> as.xts()
dat3 <- read.delim.zoo("~/testdat_large.csv", 
                       format = "%d.%m.%Y %H:%M", 
                       tz = "CET", 
                       sep = ";", 
                       header = TRUE) |> as.xts()

foo <- split(dat, f = "months")
bar <- split(dat.sing, f = "months")
baz <- split(dat2, f = "months")
qux <- split(dat3, f = "months")

Session Info

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=French_Switzerland.utf8  LC_CTYPE=French_Switzerland.utf8    LC_MONETARY=French_Switzerland.utf8
[4] LC_NUMERIC=C                        LC_TIME=French_Switzerland.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.0     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0    
 [8] tibble_3.1.8    ggplot2_3.4.1   tidyverse_2.0.0 xts_0.13.0      zoo_1.8-11     

loaded via a namespace (and not attached):
 [1] rstudioapi_0.14  magrittr_2.0.3   hms_1.1.2        tidyselect_1.2.0 munsell_0.5.0    timechange_0.2.0 colorspace_2.1-0
 [8] lattice_0.20-45  R6_2.5.1         rlang_1.0.6      fansi_1.0.4      tools_4.2.2      grid_4.2.2       gtable_0.3.1    
[15] utf8_1.2.3       cli_3.4.1        withr_2.5.0      ellipsis_0.3.2   lifecycle_1.0.3  tzdb_0.3.0       vctrs_0.5.2     
[22] glue_1.6.2       stringi_1.7.12   compiler_4.2.2   pillar_1.8.1     generics_0.1.3   scales_1.2.1     pkgconfig_2.0.3 
@joshuaulrich
Owner

Thanks for the report and reproducible example!

I can't replicate this behavior. My hunch is that it's a timezone issue. Can you provide the output of:

names(foo)
## [1] "Jan 2007" "Feb 2007" "Mar 2007" "Apr 2007" "May 2007" "Jun 2007"
names(bar)
## [1] "Jan 2007" "Feb 2007" "Mar 2007" "Apr 2007" "May 2007" "Jun 2007"
names(baz)
## [1] "Sep 2021" "Oct 2021" "Nov 2021"
names(qux)
## [1] "Sep 2021" "Sep 2021" "Nov 2021"

Sys.getenv("TZ")  # "" for me
Sys.timezone()    # "America/Chicago" for me

@ocallaghanm
Author

ocallaghanm commented Feb 25, 2023

Thanks for your prompt reply! That might be it indeed, though I don't understand why it would cause the names to repeat themselves... Here is the desired info:

names(foo)
[1] "janv. 2007" "janv. 2007" "févr. 2007" "mars 2007"  "avr. 2007"  "mai 2007"  
names(bar)
[1] "janv. 2007" "janv. 2007" "févr. 2007" "mars 2007"  "avr. 2007"  "mai 2007"  
names(baz)
[1] "sept. 2021" "oct. 2021"  "nov. 2021" 
names(qux)
[1] "sept. 2021" "sept. 2021" "nov. 2021" 
Sys.getenv("TZ")
[1] ""
Sys.timezone()
[1] "Europe/Berlin"

Edit: I just tested setting tz = "UTC" in the declaration of dat3 and re-running from there. In that case, names(qux) yields "sept. 2021" "oct. 2021" "nov. 2021", i.e. as expected. So you must be right. But passing everything off as UTC seems like a bit of a dirty trick. Do you have an idea how to work around this?

@joshuaulrich
Owner

joshuaulrich commented Feb 25, 2023

Thanks for the info. I can replicate if I set Sys.setenv(TZ = "Europe/Berlin"). I'll investigate further.

Sys.setenv(TZ = "Europe/Berlin")
library("xts")
data("sample_matrix")
dat <- as.xts(sample_matrix)
foo <- split(dat, f = "months")
names(foo)
## [1] "Jan 2007" "Jan 2007" "Feb 2007" "Mar 2007" "Apr 2007" "May 2007"

joshuaulrich self-assigned this Feb 25, 2023
joshuaulrich added this to the 0.13.1 milestone Feb 25, 2023
@joshuaulrich
Owner

joshuaulrich commented Feb 25, 2023

This happens with sample_matrix because as.xts() creates an xts object with a POSIXct index, and uses Sys.getenv("TZ") as the timezone by default. You can force a Date index by setting dateFormat = "Date" in the call to as.xts(). This is related to #192.

Sys.setenv(TZ = "Europe/Berlin")
library("xts")
data("sample_matrix")
dat <- as.xts(sample_matrix, dateFormat = "Date")
foo <- split(dat, f = "months")
names(foo)
## [1] "Jan 2007" "Feb 2007" "Mar 2007" "Apr 2007" "May 2007" "Jun 2007"

Something else is going on with your use case. I'm looking into that.


EDIT: the issue in your actual case is that as.yearmon.POSIXt() always sets tz = "GMT" in its call to as.POSIXlt(). That converts your Europe/Berlin times to GMT before converting them to a yearmon object. Europe/Berlin is 1 or 2 hours ahead of GMT, so a timestamp shortly after midnight on the first day of a month falls on the last day of the previous month in GMT, and it gets the previous month's name.

For example:

x <- structure(c(1632481200, 1633042800, 1635724800),
    class = c("POSIXct", "POSIXt"),
    tzone = "Europe/Berlin")
x
## [1] "2021-09-24 13:00:00 CEST" "2021-10-01 01:00:00 CEST" "2021-11-01 01:00:00 CET" 

as.yearmon(x)
## [1] "Sep 2021" "Sep 2021" "Nov 2021"

# as.yearmon.POSIXt() calls as.yearmon(with(as.POSIXlt(x, tz = "GMT"), 1900 + year + mon/12))
as.yearmon(with(as.POSIXlt(x, tz = "GMT"), 1900 + year + mon/12))
## [1] "Sep 2021" "Sep 2021" "Nov 2021"

# setting tz in as.POSIXlt() gives the correct answer
as.yearmon(with(as.POSIXlt(x, tz = tzone(x)), 1900 + year + mon/12))
## [1] "Sep 2021" "Oct 2021" "Nov 2021"

# as.Date() has the same behavior and solution
as.Date(x)
## [1] "2021-09-24" "2021-09-30" "2021-11-01"

as.Date(x, tz = tzone(x))
## [1] "2021-09-24" "2021-10-01" "2021-11-01"
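
To see which index values are affected, one option is to compare the year-month of each timestamp in its own time zone with the year-month after conversion to GMT. This is only a diagnostic sketch in base R, not code from xts or zoo; the names ym_local, ym_gmt and affected are made up for illustration, and "%Y-%m" is used so the result does not depend on the locale.

```r
# Same three timestamps as in the example above.
x <- as.POSIXct(c("2021-09-24 13:00", "2021-10-01 01:00", "2021-11-01 01:00"),
                tz = "Europe/Berlin")

# Year-month as seen in the index's own time zone vs. in GMT.
ym_local <- format(x, "%Y-%m", tz = attr(x, "tzone"))
ym_gmt   <- format(x, "%Y-%m", tz = "GMT")

# TRUE where the GMT conversion lands in a different month.
affected <- ym_local != ym_gmt
affected
## [1] FALSE  TRUE FALSE
```

Only the 01:00 CEST timestamp on 1 October is flagged, because it is 23:00 on 30 September in GMT. The 01:00 CET timestamp on 1 November is exactly midnight GMT, so it stays in November.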

@zeileis and @ggrothendieck cc'ing you to make sure you're aware of this behavior in as.yearmon() so you can decide whether or not it's desired.

@ocallaghanm I'll make split.xts() more careful about converting the index times into names for the result. Thanks again for the report!
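
Until that change lands, a possible workaround (a sketch only, not the fix that goes into xts) is to keep the time zone as it is and rebuild the names of the split result from each piece's own index. The data below are hypothetical and reuse the three timestamps from the example; the "%Y-%m" name format is an arbitrary choice that avoids locale-dependent month abbreviations.

```r
library(xts)

# Hypothetical data: one observation in each of three months, Europe/Berlin index.
idx <- as.POSIXct(c("2021-09-24 13:00", "2021-10-01 01:00", "2021-11-01 01:00"),
                  tz = "Europe/Berlin")
dat <- xts(1:3, order.by = idx, tzone = "Europe/Berlin")

parts <- split(dat, f = "months")

# Rebuild the names from the first timestamp of each piece, formatted in the
# index's own time zone instead of GMT.
names(parts) <- vapply(
  parts,
  function(p) format(index(p)[1], "%Y-%m", tz = tzone(p)),
  character(1),
  USE.NAMES = FALSE
)
names(parts)
## [1] "2021-09" "2021-10" "2021-11"
```

This leaves the split points untouched and only replaces the labels, so it can be applied to the result of any existing split() call on an xts object with a POSIXct index.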

@ocallaghanm
Author

Thanks a bunch!
