Are dataframe[ ,-1] and dataframe[-1] the same?
Almost.
[-1] uses the fact that a data.frame is a list, so dataframe[-1] returns another data.frame (list) without the first element (i.e. column).
[ ,-1] uses the fact that a data.frame can be indexed like a two-dimensional array, so dataframe[, -1] gives you the sub-array that does not include the first column.
At first glance they sound the same, but the second form also tries by default to reduce the dimension of the sub-array it returns. So, depending on the dimensions of your dataframe, you may get a data.frame or a vector. For example:
> data <- data.frame(a = 1:2, b = 3:4)
> class(data[-1])
[1] "data.frame"
> class(data[, -1])
[1] "integer"
You can use drop = FALSE to override that behavior:
> class(data[, -1, drop = FALSE])
[1] "data.frame"
What's the difference between [1], [1,], [,1], [[1]] for a dataframe in R?
In R, operators are not tied to a single data type. They can be overloaded for whatever data type you like (including S3/S4 classes), and that is the case for data.frames.
Since data.frames are lists, [i], [[i]] and $ show list-like behaviour.
Row and column indices have an intuitive meaning for tables, and data.frames look like tables. That is probably why the [i, j] methods were defined for data.frame.
You can even look at the definitions; they are coded in the S3 system (so the name is methodname.class):
> `[.data.frame`
and
> `[[.data.frame`
(the backticks quote the function name, otherwise R would try to use the operator and end up with a syntax error)
Calculate similarity of 1-row dataframe and a large dataframe with the same columns in Python?
I usually don't do matrix manipulation with DataFrame but with numpy.array, so I will first convert them:
df_npy = df.values
input_npy = input.values
I also don't want to use scipy.spatial.distance.cosine, so I do the calculation myself. The first step is to normalize each of the vectors:
df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)
And then matrix multiply them together
df_npy @ input_npy.T
which will give you
array([[0.213],
       [0.524],
       [0.431]])
The reason I don't want to use scipy.spatial.distance.cosine is that it handles only one pair of vectors at a time, whereas the approach shown here handles all of them at once.
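The normalize-then-multiply steps above can be put together in one small function. This is only a sketch: the frames df and query and the helper name cosine_to_row are made up for illustration, and it assumes no row is all zeros (a zero-norm row would produce nan).

```python
import numpy as np
import pandas as pd

# Hypothetical data: three rows to compare against one query row.
df = pd.DataFrame({"x": [1.0, 0.0, 1.0], "y": [0.0, 1.0, 1.0]})
query = pd.DataFrame({"x": [1.0], "y": [0.0]})

def cosine_to_row(frame, row):
    """Cosine similarity between every row of `frame` and the single row in `row`."""
    a = frame.to_numpy(dtype=float)
    b = row.to_numpy(dtype=float)
    # Scale every row to unit length; keepdims=True keeps the shape (n, 1)
    # so the division broadcasts row by row.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # (n, d) @ (d, 1) -> (n, 1); flatten to one similarity per row of `frame`.
    return (a @ b.T).ravel()

sims = cosine_to_row(df, query)
print(sims)
```

Returning a flat array makes it easy to attach the result as a new column, for example df["similarity"] = sims.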
Find difference between two data frames
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update:
The above method only works for those data frames that don't already have duplicates themselves. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It produces the output below, which is wrong.
Wrong output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3
Correct output (the rows of df1 that do not appear in df2):
Out[656]:
   A  B
1  2  3
2  3  4
3  3  4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4
Method 2: merge with indicator
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
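As a rough sketch, the merge-with-indicator idea can be wrapped in a reusable helper. The function name rows_only_in_left is hypothetical. It assumes both frames share the same columns, and it de-duplicates the right-hand frame first so that repeated rows there cannot multiply rows from the left.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

def rows_only_in_left(left, right):
    """Return the rows of `left` that have no exact match in `right`."""
    # With no `on=` argument, merge joins on all columns the frames share.
    merged = left.merge(right.drop_duplicates(), how='left', indicator=True)
    # Keep unmatched rows and drop the helper column added by indicator=True.
    return merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

only_in_df1 = rows_only_in_left(df1, df2)
print(only_in_df1)
```

Unlike the drop_duplicates approach, this keeps rows that are duplicated inside df1 itself.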
What does axis in pandas mean?
It specifies the axis along which the means are computed. By default axis=0. This is consistent with numpy.mean when axis is specified explicitly (in numpy.mean, axis=None by default, which computes the mean over the flattened array): axis=0 runs along the rows (namely, the index in pandas) and axis=1 runs along the columns. So axis=0 collapses the rows and yields one value per column, while axis=1 yields one value per row. For added clarity, one may specify axis='index' (instead of axis=0) or axis='columns' (instead of axis=1).
+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+--------+
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓
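A tiny example may make the two directions concrete. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical 2x2 frame.
df = pd.DataFrame({'A': [1.0, 3.0], 'B': [2.0, 6.0]})

# axis=0 (or axis='index'): move down the rows, one mean per column.
col_means = df.mean(axis=0)

# axis=1 (or axis='columns'): move across the columns, one mean per row.
row_means = df.mean(axis=1)

print(col_means)
print(row_means)
```

Here col_means is indexed by the column labels A and B, while row_means is indexed by the row labels 0 and 1.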
How to merge multiple dataframes
Below is the cleanest, most comprehensible way to merge multiple dataframes when complex queries aren't involved.
Simply merge on the DATE column using the outer method (to keep all the data).
import pandas as pd
from functools import reduce
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df3 = pd.read_csv('file3.csv')
Now put all the data frames you have loaded into a list, then merge them using pd.merge together with functools.reduce.
# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]
Note: you can add as many data frames as you like to the list above. That is the good part about this method: no complex queries are involved.
To keep the values that belong to the same date on one row, merge on DATE:
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames)
# to fill values that are missing after the merge, chain fillna with the string you want
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames).fillna('void')
- Now the output will have the values from the same date on the same line.
- You can fill the data that is missing from some frames, for different columns, using fillna().
Then write the merged data to a csv file if desired.
df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)
This should give you
DATE VALUE1 VALUE2 VALUE3 ....
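To try the reduce-based merge without any files on disk, here is a minimal sketch with three small in-memory frames. The dates, the values, and the helper name merge_all are invented for illustration. The result is sorted explicitly because the row order of an outer merge should not be relied on.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that share a DATE column but cover different dates.
df1 = pd.DataFrame({'DATE': ['2020-01-01', '2020-01-02'], 'VALUE1': [1, 2]})
df2 = pd.DataFrame({'DATE': ['2020-01-02', '2020-01-03'], 'VALUE2': [10, 20]})
df3 = pd.DataFrame({'DATE': ['2020-01-01', '2020-01-03'], 'VALUE3': [100, 200]})

def merge_all(frames, key='DATE'):
    """Outer-merge a list of frames pairwise on `key`."""
    return reduce(lambda left, right: pd.merge(left, right, on=key, how='outer'),
                  frames)

df_merged = merge_all([df1, df2, df3]).sort_values('DATE').reset_index(drop=True)
print(df_merged)
```

Dates that are missing from one of the inputs show up as NaN in that input's column, which is where fillna() or the na_rep argument of to_csv comes in.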
Extract first and last row of a dataframe in pandas
I think the simplest way is .iloc[[0, -1]].
df = pd.DataFrame({'a':range(1,5), 'b':['a','b','c','d']})
df2 = df.iloc[[0, -1]]
print(df2)
   a  b
0  1  a
3  4  d
How do I create test and train samples from one dataframe with pandas?
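I would just use numpy's random functions: randn to build some example data, and rand to build a boolean mask that is True for roughly 80% of the rows: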
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to see this has worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
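The mask approach gives a split that is only approximately 80/20, and it changes on every run. If an exact, repeatable split is wanted, one possible alternative (not the method shown above) is DataFrame.sample with a fixed random_state. The column names and the seed below are arbitrary choices for the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed so the example data is repeatable
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=['x', 'y'])

# Draw exactly 80% of the rows at random, without replacement.
train = df.sample(frac=0.8, random_state=0)
# Everything that was not drawn becomes the test set.
test = df.drop(train.index)

print(len(train), len(test))
```

Dropping by train.index assumes the index labels of df are unique, which holds for the default RangeIndex used here.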
Comparing two dataframes and getting the differences
This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any differences are found, even in the order of the columns or indices.
If I understand you correctly, you want to find not the changes but the symmetric difference. For that, one approach is to concatenate the dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by
>>> df_gpby = df.groupby(list(df.columns))
get index of unique records
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
filter
>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
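The original input frames are not shown here, so the following self-contained sketch runs the same concatenate, group, filter steps on two small made-up frames. Note that groupby drops rows with NaN in the grouping columns by default, so this sketch assumes there are no missing values.

```python
import pandas as pd

# Hypothetical inputs: the Orange row differs between the two frames.
df1 = pd.DataFrame({'Fruit': ['Apple', 'Orange'], 'Num': [22.1, 8.6]})
df2 = pd.DataFrame({'Fruit': ['Apple', 'Orange'], 'Num': [22.1, 9.0]})

# Stack both frames and give every row a unique label.
df = pd.concat([df1, df2]).reset_index(drop=True)

# Group identical rows together; `groups` maps each distinct row to its labels.
groups = df.groupby(list(df.columns)).groups

# A row that occurs exactly once belongs to only one of the two frames.
idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)
diff = df.loc[idx]
print(diff)
```

With these inputs the Apple row appears in both frames and is dropped, while both versions of the Orange row are kept.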