Pandas column of lists, create a row for each list element
UPDATE: the solution below was helpful for older Pandas versions, because DataFrame.explode() wasn't available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().
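A minimal sketch of the modern approach (the small sample frame is an assumption, not the asker's data):

```python
import pandas as pd

# toy frame: one list-valued column (sample values assumed)
df = pd.DataFrame({'subject': [1, 1],
                   'trial_num': [1, 2],
                   'samples': [[0.10, -0.20], [0.25, 1.32]]})

# one row per list element; reset_index gives a clean 0..n-1 index
r = df.explode('samples').reset_index(drop=True)
print(r)
```

Note that `explode` leaves the exploded column with object dtype; cast it with `.astype(float)` if you need numerics downstream.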
import numpy as np
import pandas as pd

lst_col = 'samples'
r = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)
}).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]
Result:
In [103]: r
Out[103]:
samples subject trial_num
0 0.10 1 1
1 -0.20 1 1
2 0.05 1 1
3 0.25 1 2
4 1.32 1 2
5 -0.17 1 2
6 0.64 1 3
7 -0.22 1 3
8 -0.71 1 3
9 -0.03 2 1
10 -0.65 2 1
11 0.76 2 1
12 1.77 2 2
13 0.89 2 2
14 0.65 2 2
15 -0.98 2 3
16 0.65 2 3
17 -0.30 2 3
PS: here you may find a slightly more generic solution.
UPDATE: some explanations. IMO the easiest way to understand this code is to execute it step by step.
In the following line we repeat the values in one column N times, where N is the length of the corresponding list:
In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
This can be generalized to all columns containing scalar values:
In [11]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: )
Out[11]:
trial_num subject
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
.. ... ...
11 1 2
12 2 2
13 2 2
14 2 2
15 3 2
16 3 2
17 3 2
[18 rows x 2 columns]
Using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:
In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])
putting all this together:
In [13]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
trial_num subject samples
0 1 1 -1.04
1 1 1 -0.58
2 1 1 -1.32
3 2 1 0.82
4 2 1 -0.59
5 2 1 -0.34
6 3 1 0.25
.. ... ... ...
11 1 2 0.68
12 2 2 0.55
13 2 2 -0.56
14 2 2 0.65
15 3 2 -0.04
16 3 2 0.36
17 3 2 -0.31
[18 rows x 3 columns]
Finally, indexing the resulting DataFrame with [df.columns] guarantees that we select the columns in their original order.
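The steps above can be combined into a self-contained sketch (the small sample frame is an assumption):

```python
import numpy as np
import pandas as pd

# sample frame with one list-valued column (values assumed)
df = pd.DataFrame({'subject': [1, 2],
                   'trial_num': [1, 1],
                   'samples': [[0.10, -0.20, 0.05], [-0.03, -0.65, 0.76]]})

lst_col = 'samples'
r = pd.DataFrame({
        # repeat every scalar column N times, N = length of that row's list
        col: np.repeat(df[col].values, df[lst_col].str.len())
        for col in df.columns.drop(lst_col)
    }).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]
print(r)
```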
Compare each element in a list with a column of lists in a dataframe python
In order to have a working example, I assumed that your data are lists of strings:
df = pd.DataFrame({
'fruits':[['apple','orange','berry'],['orange']]
})
ml = ['apple', 'orange', 'banana']
Then I created a function that returns the list of matching elements, or 0 if there is no match:
def matchFruits(row):
    result = []
    for fruit in row['fruits']:
        if fruit in ml:
            result.append(fruit)
    return result if len(result) > 0 else 0
For list comprehension aficionados, the loop can be replaced with:
result = [fruit for fruit in row['fruits'] if fruit in ml]
Finally, I called this function on the whole DataFrame with axis=1 to add a new column to the initial DataFrame:
df["match"] = df.apply(matchFruits, axis = 1)
The output is the following; it differs from your example, since your expected result was 0 even though 'orange' was in both lists. If that is not the requested behavior, please edit your question.
fruits match
0 [apple, orange, berry] [apple, orange]
1 [orange] [orange]
How do I iterate through a df column (where each row is a list), looking for elements in a different list?
A better approach would be to use set intersection (assuming you're trying to count unique matches, i.e., you're not interested in how many times "apple" is mentioned in a review, only that it is mentioned, period).
This should get you what you want, again, assuming you want to count unique matches and assuming your lemmatized
column values are indeed lists of strings:
df["lemmatized"].apply(lambda r: len(set(r) & set(menu_items)))
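A minimal, self-contained sketch of that idea (the frame and menu_items are assumptions):

```python
import pandas as pd

menu_items = ['apple', 'orange', 'banana']  # hypothetical menu
df = pd.DataFrame({'lemmatized': [['apple', 'apple', 'pie'],
                                  ['orange', 'banana', 'smoothie'],
                                  ['steak']]})

# count unique menu items mentioned in each review;
# duplicates in a review ('apple' twice) count once thanks to set()
df['n_matches'] = df['lemmatized'].apply(lambda r: len(set(r) & set(menu_items)))
print(df)
```

For large frames, building `set(menu_items)` once outside the lambda avoids recreating it per row.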
Pandas column tolist() while each row data being list of strings?
Your problem lies with the saving method: CSV files cannot natively store lists, so you have to parse them yourself after reading.
Could you save time and effort by using another format instead? JSON natively supports lists and is also a format that can be easily read by humans.
Here is an obligatory snippet for you:
import pandas as pd
df = pd.DataFrame([{"sentence":['aa', 'bb', 'cc']},{"sentence":['dd', 'ee', 'ff']}])
df.to_json("myfile.json")
df2 = pd.read_json("myfile.json")
Giving the following result:
>>> df2
sentence
0 [aa, bb, cc]
1 [dd, ee, ff]
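If you are stuck with an existing CSV, a common workaround (a sketch of my own, not part of the answer above) is to parse the stringified lists back with ast.literal_eval:

```python
import ast
import io
import pandas as pd

# simulate a CSV in which the list column was saved as its string repr
csv_data = io.StringIO('sentence\n"[\'aa\', \'bb\', \'cc\']"\n"[\'dd\', \'ee\', \'ff\']"\n')
df = pd.read_csv(csv_data)

# each cell is the string "['aa', 'bb', 'cc']"; turn it back into a real list
df['sentence'] = df['sentence'].apply(ast.literal_eval)
print(df['sentence'].tolist())
```

ast.literal_eval only evaluates literal Python structures, so it is safe to use on untrusted strings, unlike eval().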
how to create columns for each element given a column of lists pandas
The fastest way is to initialise the new dataframe with the list of values obtained using Series.tolist():
df1 = pd.DataFrame(df['omega'].tolist())
Example:
df = pd.DataFrame({'omega': [np.arange(12).tolist()]* 5})
print(df)
omega
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
4 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
df1 = pd.DataFrame(df['omega'].tolist()).add_prefix('col')
Result:
# print(df1)
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11
0 0 1 2 3 4 5 6 7 8 9 10 11
1 0 1 2 3 4 5 6 7 8 9 10 11
2 0 1 2 3 4 5 6 7 8 9 10 11
3 0 1 2 3 4 5 6 7 8 9 10 11
4 0 1 2 3 4 5 6 7 8 9 10 11
Create a new column from two columns of a dataframe where rows of each column contains list in string format
IIUC, first do some data clean-up by removing intra-string single quotes.
Then use the yaml library to convert the string in each cell into an actual list with applymap. Lastly, apply explode to the dataframe twice, once for each column you want to expand.
import yaml
import pandas as pd
df = pd.read_csv('Downloads/nodes_list.csv', index_col=[0])
df['Opp1'] = df['Opp1'].str.replace("[\'\"]s",'s', regex=True)
df['Opp2'] = df['Opp2'].str.replace("[\'\"]s",'s', regex=True)
df = df.applymap(yaml.safe_load)
df_new = df.explode('Opp1').explode('Opp2').apply(list, axis=1)
df_new
Output:
0 [KingdomofPoland, Georgia]
0 [GrandDuchyofLithuania, Georgia]
1 [NorthernYuanDynasty, Georgia]
2 [SpanishEmpire, ChechenRepublic]
2 [CaptaincyGeneralofChile, ChechenRepublic]
...
3411 [SyrianOpposition, SpanishEmpire]
3412 [UnitedStates, SpanishEmpire]
3412 [UnitedKingdom, SpanishEmpire]
3412 [SaudiArabia, SpanishEmpire]
3413 [Turkey, Russia]
Length: 31170, dtype: object
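The double-explode step on its own can be illustrated with a toy frame (the name values are assumed, echoing the output above):

```python
import pandas as pd

# toy frame: each cell already holds a real list of opponents (values assumed)
df = pd.DataFrame({'Opp1': [['KingdomofPoland', 'GrandDuchyofLithuania']],
                   'Opp2': [['Georgia']]})

# explode each list column in turn -> one row per (Opp1, Opp2) pair
pairs = df.explode('Opp1').explode('Opp2').apply(list, axis=1)
print(pairs.tolist())
```

Chaining two explodes produces the cross product of the two lists in each row, which is why the final Series is much longer than the input frame.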
How to add column(s) for each classification that contains values in a list
You can store the conditions/codes in a dictionary, loop over it, and then use isin + any(axis=1) to check whether any codes from each condition appear in each row of the dataframe:
all_codes = {
'dementia': ['F01', 'F02', 'F03', 'F051', 'G30', 'G311'],
'solid_tumour': ['C77', 'C78', 'C79', 'C80'],
}
for condition, codes in all_codes.items():
df[condition + '_yn'] = df.isin(codes).any(axis=1).astype(int)
Output:
>>> df
patient_num DIAGX1 DIAGX2 DIAGX3 DIAGX4 dementia_yn solid_tumour_yn
0 pat1 C77 F01 M32 M315 1 1
1 pat2 I099 I278 M05 F01 1 0
2 pat3 N057 N057 N058 N057 0 0