Are dataframe[, -1] and dataframe[-1] the same?

Almost.

[-1] uses the fact that a data.frame is a list, so when you do dataframe[-1] it returns another data.frame (list) without the first element (i.e. column).

[, -1] uses the fact that a data.frame can also be indexed like a two-dimensional array, so when you do dataframe[, -1] you get the sub-array that does not include the first column.

At first glance they look equivalent, but the second form also tries, by default, to reduce the dimensions of the sub-array it returns. So, depending on how many columns remain, you may get a data.frame or a plain vector. For example:

> data <- data.frame(a = 1:2, b = 3:4)
> class(data[-1])
[1] "data.frame"
> class(data[, -1])
[1] "integer"

You can use drop = FALSE to override that behavior:

> class(data[, -1, drop = FALSE])
[1] "data.frame"

What's the difference between [1], [1,], [,1], [[1]] for a dataframe in R?

In R, operators are not used for one data type only. Operators can be overloaded for whatever data type you like (e.g. also S3/S4 classes).

In fact, that's the case for data.frames.

  • As data.frames are lists, [i] and [[i]] (and $) show list-like behaviour.

  • Row and column indices have an intuitive meaning for tables, and data.frames look like tables. That is probably why the [i, j] methods were defined for data.frame.

You can even look at the definitions. They are written in the S3 system (so the names follow the pattern methodname.class):

> `[.data.frame`

and

> `[[.data.frame`

(the backticks quote the function name, otherwise R would try to use the operator and end up with a syntax error)

Calculate similarity of 1-row dataframe and a large dataframe with the same columns in Python?

I usually don't do matrix manipulation with DataFrame but with numpy.array. So I will first convert them

df_npy = df.values
input_npy = input.values

And then I don't want to use scipy.spatial.distance.cosine so I will take care of the calculation myself, which is to first normalize each of the vectors

df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)

And then matrix multiply them together

df_npy @ input_npy.T

which will give you

array([[0.213],
[0.524],
[0.431]])

The reason I don't want to use scipy.spatial.distance.cosine is that it handles only one pair of vectors at a time, whereas the approach shown here handles all rows at once.
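
The steps above can be put together as a small self-contained sketch. The data and the names cosine_to_row and query are made up for illustration (query stands in for the one-row input frame); it assumes no row is all zeros, since that would divide by zero.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three rows to compare against a single query row.
df = pd.DataFrame({"x": [1.0, 0.0, 1.0], "y": [0.0, 1.0, 1.0]})
query = pd.DataFrame({"x": [1.0], "y": [0.0]})


def cosine_to_row(frame, row):
    """Cosine similarity between every row of `frame` and the one-row `row`."""
    a = frame.to_numpy(dtype=float)
    b = row.to_numpy(dtype=float)
    # Normalize each row to unit length (assumes no all-zero rows).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities.
    return (a @ b.T).ravel()


sims = cosine_to_row(df, query)
print(sims)
```

The result has one similarity per row of df, so it can be attached back to the frame with df.assign(similarity=sims) if that is convenient.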

Find difference between two data frames

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update:

The method above only works when neither data frame already contains duplicate rows. For example:

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

It produces the output below, which is wrong.

Wrong output:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3

Correct output:

Out[656]: 
   A  B
1  2  3
2  3  4
3  3  4


How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
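
As a runnable variant of Method 2, here is a minimal sketch that wraps the merge in a helper and drops the _merge column afterwards. The function name rows_only_in_left is made up for illustration. Deduplicating the right-hand frame first is an added precaution, so that a left row is not repeated when the right frame contains the same row several times.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})


def rows_only_in_left(left, right):
    """Return the rows of `left` that have no exact match in `right`."""
    # With no `on=` argument, merge joins on all shared column names.
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    only_left = merged.loc[merged["_merge"] == "left_only"]
    return only_left.drop(columns="_merge")


diff = rows_only_in_left(df1, df2)
print(diff)
```

Duplicate rows inside df1 are preserved, which is the behaviour the "Correct output" above asks for.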

What does axis in pandas mean?

It specifies the axis along which an operation (here, the mean) is computed. By default, axis=0. This matches numpy.mean when axis is given explicitly (in numpy.mean, axis=None by default, which computes the mean over the flattened array): axis=0 runs along the rows (the index in pandas), giving one result per column, and axis=1 runs along the columns, giving one result per row. For added clarity, you can pass axis='index' instead of axis=0, or axis='columns' instead of axis=1.

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+--------+
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓
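
A small sketch with made-up numbers shows the two directions side by side:

```python
import pandas as pd

# Hypothetical 2x2 frame chosen so the means are easy to check by hand.
df = pd.DataFrame({"A": [1.0, 3.0], "B": [2.0, 6.0]})

# axis=0 (or axis='index'): collapse the rows, one mean per column.
col_means = df.mean(axis=0)

# axis=1 (or axis='columns'): collapse the columns, one mean per row.
row_means = df.mean(axis=1)

print(col_means)  # indexed by column label: A, B
print(row_means)  # indexed by row label: 0, 1
```

Here col_means holds 2.0 for A and 4.0 for B, while row_means holds 1.5 for row 0 and 4.5 for row 1.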

How to merge multiple dataframes

Below is the cleanest, most comprehensible way to merge multiple dataframes when complex queries aren't involved.

Simply merge on the DATE column using an outer join (to keep all the data).

import pandas as pd
from functools import reduce

df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')

Now load all the files you have as data frames into a list, then merge them using merge together with the reduce function.

# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]

Note: you can add as many data frames as you like to the list above. This is the good part about this method: no complex queries are involved.

To keep the values that belong to the same date, merge on DATE:

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames)

# To fill values that are missing from the merged dataframe, chain fillna() with the placeholder you want:

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames).fillna('void')

  • The output now has the values from the same date on the same line.
  • You can fill data that is missing from some frames with fillna().

Then write the merged data to a CSV file if desired.

df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)

This should give you

DATE VALUE1 VALUE2 VALUE3 ....
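
To try the pattern without any files on disk, here is a minimal sketch with three small in-memory frames. The data and the helper name merge_all are made up for illustration; the merge itself is the same reduce over pd.merge shown above.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that only partly overlap on DATE.
df1 = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-02"], "VALUE1": [1, 2]})
df2 = pd.DataFrame({"DATE": ["2020-01-02", "2020-01-03"], "VALUE2": [20, 30]})
df3 = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-03"], "VALUE3": [100, 300]})


def merge_all(frames, key):
    """Outer-merge a list of frames pairwise on `key`."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=key, how="outer"),
        frames,
    )


df_merged = (
    merge_all([df1, df2, df3], "DATE")
    .sort_values("DATE")
    .reset_index(drop=True)
)
print(df_merged)
```

Dates that are missing from a frame show up as NaN in that frame's columns, which is where fillna() or na_rep comes in.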

Extract first and last row of a dataframe in pandas

I think the simplest way is .iloc[[0, -1]].

df = pd.DataFrame({'a':range(1,5), 'b':['a','b','c','d']})
df2 = df.iloc[[0, -1]]

print(df2)

   a  b
0  1  a
3  4  d

How do I create test and train samples from one dataframe with pandas?

I would just use numpy's rand to build a random boolean mask:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
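
For a reproducible version, here is a minimal sketch that seeds the generator; the seed value 0 is arbitrary. Note that the mask gives roughly, not exactly, an 80/20 split. DataFrame.sample followed by drop is shown as an alternative that yields an exact fraction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame(rng.standard_normal((100, 2)))

# Mask approach: each row lands in train with probability 0.8.
msk = rng.random(len(df)) < 0.8
train = df[msk]
test = df[~msk]

# Alternative: sample exactly 80% of the rows, keep the rest for testing.
train_exact = df.sample(frac=0.8, random_state=0)
test_exact = df.drop(train_exact.index)

print(len(train), len(test))
print(len(train_exact), len(test_exact))
```

In both variants every row ends up in exactly one of the two sets.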

Comparing two dataframes and getting the differences

The approach df1 != df2 works only for dataframes with identical row and column labels. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or indices.

If I understand correctly, you want the symmetric difference rather than the changes. One approach is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
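
The same idea can be written as a compact helper. This is a sketch on made-up data, and the name symmetric_difference is hypothetical. It differs from the steps above in two ways: each frame is deduplicated first, so that repeats inside one frame are not mistaken for matches, and it returns the unique rows themselves instead of looking them up by index.

```python
import pandas as pd

# Hypothetical frames that differ in one row.
df1 = pd.DataFrame({"Fruit": ["Apple", "Orange"], "Num": [22.1, 8.6]})
df2 = pd.DataFrame({"Fruit": ["Apple", "Orange"], "Num": [22.1, 9.0]})


def symmetric_difference(a, b):
    """Rows that appear in exactly one of the two frames."""
    combined = pd.concat(
        [a.drop_duplicates(), b.drop_duplicates()], ignore_index=True
    )
    # Group on every column: a group of size 1 is a row seen in one frame only.
    sizes = combined.groupby(list(combined.columns)).size()
    return sizes[sizes == 1].index.to_frame(index=False)


result = symmetric_difference(df1, df2)
print(result)
```

Because groupby drops NaN keys by default, rows containing missing values would be skipped by this sketch.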

