Why Should I Make a Copy of a Data Frame in Pandas

why should I make a copy of a data frame in pandas

This expands on Paul's answer. In pandas, indexing a DataFrame returns a reference to the original DataFrame, so changing the subset changes the original. You therefore want to make a copy whenever you need to guarantee that the original DataFrame stays unchanged. Consider the following code:

import pandas as pd

df = pd.DataFrame({'x': [1, 2]})
df_sub = df[0:1]
df_sub.x = -1
print(df)

You'll get:

   x
0 -1
1  2

In contrast, the following leaves df unchanged:

df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1

Note: the behavior shown above is the classic one. In newer versions of pandas, with copy-on-write enabled, a slice no longer writes back to its parent; see the pandas docs on copy-on-write.
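A minimal sketch of the newer behavior, assuming pandas >= 2.0 with copy-on-write switched on:

import pandas as pd

pd.options.mode.copy_on_write = True  # opt in to copy-on-write

df = pd.DataFrame({'x': [1, 2]})
df_sub = df[0:1]
df_sub['x'] = -1  # the write triggers a copy of the slice's data
print(df)         # df is left unchanged: x is still [1, 2]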

How does the copy method work on a pandas DataFrame?

But for df = df.copy(), does the new df override the old df? I mean, in RAM, after df = df.copy(), how many DataFrames do I have?

This is not a question about Pandas or the DataFrame class. It is a question about the = operator in Python.

df.copy() creates a new object, which happens to be a new instance of the DataFrame class. That's all you have to know. (You do have to know this, because functions can return objects that already existed.) It will do this exactly the same way whether you write dg = df.copy() or df = df.copy() - it could not possibly matter, because there is no way for the method to know that the assignment is even going to happen.

Assignment causes a name to refer to some particular object. That's it. dg = df.copy() means "when you get the object back from df.copy(), let dg be a name for that object". df = df.copy() means "when you get the object back from df.copy(), let df (stop being a name for what it was naming before, and) be a name for that object".

Objects persist for as long as they have a name.

When you write dg = df.copy(), the df name is still a name for the original DataFrame, so now you necessarily have two DataFrames in memory.

When you write df = df.copy(), the df name is not a name for that original DataFrame any more, because it was changed to be a name for the new one. So now the old one may or may not still be in memory.

It will definitely still be in memory if it has any other names (or other references - for example, being an element of a list somewhere).

In the reference implementation, it will be freed up if that was the last remaining name for the object. This happens because the reference implementation uses reference-counting-based garbage collection. Other implementations (for example, Jython) may not do this; they may use any sort of garbage collection technique.
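To make the two cases concrete, here is a small sketch (the values are illustrative):

import pandas as pd

df = pd.DataFrame({'x': [1, 2]})

dg = df.copy()    # copy() returns a brand-new object...
print(dg is df)   # False; df still names the original, so two DataFrames are alive

df = df.copy()    # df now names the newest copy; if nothing else referenced the
                  # previous object, CPython's reference counting frees it right away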

why should I make a *shallow* copy of a dataframe?

A shallow copy allows you to:

  1. access the frame's data without copying it (a memory optimization, among other things);
  2. modify the frame's structure without reflecting the changes in the original dataframe.

In backtesting, the developer changes the index to datetime format (line 640) and adds a new column 'Volume' with np.nan values if it's not already in the dataframe, and those changes are not reflected in the original dataframe.

Example

>>> a = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['i', 's'])
>>> b = a.copy(deep=False)
>>> a
   i  s
0  1  a
1  2  b
>>> b
   i  s
0  1  a
1  2  b
>>> b.index = pd.to_datetime(b.index)
>>> b['volume'] = 0
>>> b
                               i  s  volume
1970-01-01 00:00:00.000000000  1  a       0
1970-01-01 00:00:00.000000001  2  b       0
>>> a
   i  s
0  1  a
1  2  b

Of course, if you don't create a shallow copy (i.e., you just assign b = a), any changes to the dataframe's structure will be reflected in the original one.
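For contrast, here is a quick sketch (with illustrative names) of plain assignment, which copies nothing at all, next to a shallow copy:

import pandas as pd

a = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['i', 's'])

b = a                         # plain assignment: b and a name the same object
b['volume'] = 0               # so the structural change shows up in a as well
print('volume' in a.columns)  # True

c = a.copy(deep=False)        # shallow copy: shares the data, owns its structure
c['extra'] = 1
print('extra' in a.columns)   # False: a's columns are untouched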

why is blindly using df.copy() a bad idea to fix the SettingWithCopyWarning

Here are my two cents on this, with a very simple example of why the warning is important.

Suppose I create a df like this:

x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
print(x)
   a  b
0  0  0
1  1  1
2  2  2
3  3  3

Now I want to create a new dataframe based on a subset of the original and modify it:

q = x.loc[:, 'a']

Now, this is a slice of the original, and whatever I do to it will affect x:

q += 2
print(x) # checking x again, wow! it changed!
   a  b
0  2  0
1  3  1
2  4  2
3  5  3

This is what the warning is telling you: you are working on a slice, so everything you do to it will be reflected in the original DataFrame.

Now, using .copy(), q won't be a slice of the original, so doing an operation on q won't affect x:

x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
print(x)
   a  b
0  0  0
1  1  1
2  2  2
3  3  3

q = x.loc[:, 'a'].copy()
q += 2
print(x) # oh, x did not change because q is a copy now
   a  b
0  0  0
1  1  1
2  2  2
3  3  3

And by the way, a copy just means that q will be a new object in memory, whereas a slice shares the same original object in memory.

IMO, using .copy() is very safe. As an example, df.loc[:, 'a'] returns a slice, but df.loc[df.index, 'a'] returns a copy. Jeff told me this was unexpected behavior, and that : and df.index should behave the same as indexers in .loc[]; but using .copy() on both will return a copy, so better be safe. Use .copy() if you don't want to affect the original dataframe.

Using .copy() returns a deep copy of the DataFrame, which is a very safe approach to avoid the phone call you are talking about.

But using df.is_copy = None is just a trick that does not copy anything, which is a very bad idea: you will still be working on a slice of the original DataFrame.

One more thing that people tend not to know:

df[columns] may return a view.

df.loc[indexer, columns] may also return a view, but it almost always does not in practice.

Emphasis on the may here.
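Since whether a given indexing expression returns a view or a copy is an implementation detail that has shifted between pandas versions, the only portable way to guarantee independence is the explicit copy. A minimal sketch (illustrative names):

import pandas as pd

df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})

q1 = df.loc[:, 'a']         # may be a view of df, depending on version and layout
q2 = df.loc[:, 'a'].copy()  # always an independent object

q2 += 2
print(df)                   # guaranteed unchanged: q2 shares nothing with df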

How to create modified copy of dataframe rows based on conditions in Pandas?

The most straightforward way is a cross merge; see the code below.

m = df['Interval.'] == 0

new = (
    df[['Temperature.', 'Pressure.']]             # subset all temps and pressures
    .merge(df[m].reset_index(),                   # cross merge with the subset of df where Interval is 0
           how='cross', suffixes=('', '_y'))
    .drop_duplicates()
    .drop(columns=['Temperature._y', 'Pressure._y'])
    .append(df[~m].reset_index())                 # append back the rows of the original df whose Interval was not 0
    .sort_values(by=['index'])                    # sort values by the preserved index
)

Outcome

   Temperature.  Pressure. index  ColXYZ.  Interval.  ColCDE.
0          25.0       60.0    A.    121.0        0.0    0.195
2          40.0       50.0    A.    121.0        0.0    0.195
0          40.0       50.0    B.    246.0        4.0    0.350
1          25.0       60.0    C.    241.0        0.0    0.133
3          40.0       50.0    C.    241.0        0.0    0.133
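Note that DataFrame.append was removed in pandas 2.0; here is a sketch of the same chain using pd.concat instead (same assumed df and column names as above):

import pandas as pd

m = df['Interval.'] == 0

new = (
    df[['Temperature.', 'Pressure.']]
    .merge(df[m].reset_index(), how='cross', suffixes=('', '_y'))
    .drop_duplicates()
    .drop(columns=['Temperature._y', 'Pressure._y'])
)
new = pd.concat([new, df[~m].reset_index()]).sort_values(by=['index'])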

An alternative is to create repeated rows and insert them, as detailed below.
Original df

    ColXYZ  Interval  Temperature  Pressure  ColCDE
A.     121         0           25        60   0.195
B.     246         4           40        50   0.350
C.     241         0           40        50   0.133

import numpy as np
import pandas as pd

# Generate lists of Temps and Pressures
df = df.reset_index()        # to preserve the index
m = df['Interval'] == 0      # selection criterion
s = df['Temperature'].agg(list)
s1 = df['Pressure'].agg(list)

# Duplicate the Interval == 0 rows of df
df1 = pd.DataFrame(np.repeat(df[m].values, len(df), axis=0), columns=df.columns)

# Distribute the values of Temperature and Pressure so that each unique value
# in the original df is represented in each unique ColXYZ
df1['Temperature'] = np.tile(s, int(len(df1) / len(s)))
df1['Pressure'] = np.tile(s1, int(len(df1) / len(s1)))

# Drop duplicates and stitch the untouched rows back on
df1 = df[~m].append(df1.drop_duplicates()).sort_values(by=['index'])

Outcome

  index  ColXYZ  Interval  Temperature  Pressure  ColCDE
0    A.     121         0           25        60   0.195
1    A.     121         0           40        50   0.195
1    B.     246         4           40        50    0.35
3    C.     241         0           25        60   0.133
4    C.     241         0           40        50   0.133

Truly deep copying Pandas DataFrames

One way is to convert df_in to a Python dictionary, which works better with copy.deepcopy (the example df, with a column of Python sets, is assumed from the original question):

import copy
import pandas as pd

df = pd.DataFrame({'sets': [{1, 2, 3}]})  # assumed input: a column holding Python sets

def pop(df_in):
    df = pd.DataFrame(copy.deepcopy(df_in.to_dict()))
    print(df['sets'].apply(lambda x: set([x.pop()])))

for i in range(3): pop(df)

Output:

0    {1}
Name: sets, dtype: object
0    {1}
Name: sets, dtype: object
0    {1}
Name: sets, dtype: object
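The reason the detour through to_dict is needed: df.copy(deep=True) copies the DataFrame's data, but (as the pandas docs note) it does not recursively copy Python objects stored inside cells. A minimal sketch, reusing the assumed 'sets' column from above:

import copy
import pandas as pd

df = pd.DataFrame({'sets': [{1, 2, 3}]})

half_deep = df.copy(deep=True)   # the column data is copied, but the set object is shared
half_deep['sets'][0].pop()
print(df['sets'][0])             # the original set lost an element too

truly_deep = pd.DataFrame(copy.deepcopy(df.to_dict()))
truly_deep['sets'][0].pop()
print(df['sets'][0])             # unchanged by this second pop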

