@nalimilan we have the following related problems in the core design of DataFrames.jl. What do you think we should do about it?
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> sdf = @view df[:, 2:1]
0×0 SubDataFrame
julia> sdf[[true, true, true], :]
0×0 DataFrame
julia> sdf[Bool[], :]
ERROR: BoundsError: attempt to access 3-element Base.OneTo{Int64} at index [0-element Vector{Bool}]
The issue: view allows to set its :rows
field to whatever parent allows, but with 0 columns it should have 0 rows.
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> sdf = @view df[:, :]
3×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> dfr = df[1, :]
DataFrameRow
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
julia> select!(df)
0×0 DataFrame
julia> sdf[:, :]
ERROR: BoundsError: attempt to access 0×0 DataFrame at index [1:3, 1:0]
julia> dfr
Error showing value of type DataFrameRow{DataFrame, DataFrames.Index}:
ERROR: BoundsError: attempt to access 0×0 DataFrame at index [[1], 1:0]
The issue here is that if we use :
as column selector it is a valid operation to drop/add columns to a data frame.
However, if we move past 0-columns case we might even change number of rows in a data frame.
If something is not clear please let me know. In general I was not sure how to best handle it (there are several possible options, but in general it shows a hole in the design so I was not sure what is best).
Results of some additional thinking:
Regarding the second issue (changing set of columns in a parent of a view
with :
as column selector). While it is irritating, this is strictly speaking not an issue. If we drop all columns in the parent we change its number of rows - and this is disallowed when working with views. So we can leave things as is
Regarding the first issue (0-column view of a data frame with some rows). I was thinking of two solutions:
rows
argument when sub data frame is created to be empty;view
and getindex
so that they work correctly in this case.The first (fixing rows
) is easier to do, but it will give an incorrect parentindices
result (i.e. you cannot recover the row indices used when creating a SubDataFrame
. Though it is not strictly required that parentindices
should allow for this. Also fixing rows
would be technically breaking.
So I am leaning to changing view
and getindex
implementations, but keeping rows
field untouched. What do you think?
I do not see any (except for complexity). See https://github.com/JuliaData/DataFrames.jl/pull/3273 for an implementation.