Split Column of Semicolon-Separated Values and Duplicate Row with each Value in Pandas
You can use pandas.Series.str.split to turn each string into a list of players, then pandas.DataFrame.explode to expand each list into multiple rows:
play_by_play['players'] = play_by_play['players'].str.split(';')
play_by_play = play_by_play.explode('players').reset_index(drop=True)
# Output :
print(play_by_play)
             players  down  to_go  play_type  yards_gained  pass_attempt  complete_pass  rush_attempt
0          Tom Brady     1     10       pass             8             1              1             0
1         Mike Evans     1     10       pass             8             1              1             0
2      Tristan Wirfs     1     10       pass             8             1              1             0
3  Leonard Fournette     1     10       pass             8             1              1             0
4       Chris Godwin     1     10       pass             8             1              1             0
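For a self-contained illustration of the same split-then-explode pattern, here is a minimal sketch. The two-row frame, its column values, and the names plays and exploded are made up for the example; only str.split and explode come from the answer above.

```python
import pandas as pd

# Hypothetical sample data: one row per play, players joined by ';'
plays = pd.DataFrame({
    "players": ["Tom Brady;Mike Evans", "Leonard Fournette"],
    "down": [1, 2],
    "play_type": ["pass", "run"],
})

# Split each string into a list, then emit one row per list element.
# The other columns are repeated for every element of the list.
plays["players"] = plays["players"].str.split(";")
exploded = plays.explode("players").reset_index(drop=True)

print(exploded)
```

reset_index(drop=True) is optional; without it, the exploded rows keep the index of the row they came from, which can be handy for tracing them back.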
R semicolon delimited a column into rows
You could try unnest from tidyr after splitting the "PolId" column, then keep the unique rows:
library(dplyr)
library(tidyr)
unnest(setNames(strsplit(df$PolId, ';'), df$Description),
Description) %>% unique()
Or using base R with stack/strsplit/duplicated. Split the "PolId" column by the delimiter (;) with strsplit, name the output list elements with the "Description" column, stack the list to get a 'data.frame', and use duplicated to remove the duplicate rows.
df1 <- stack(setNames(strsplit(df$PolId, ';'), df$Description))
setNames(df1[!duplicated(df1),], names(df))
# PolId Description
#1 ABC123 TEST1
#2 ABC456 TEST1
#3 ABC789 TEST1
#10 AAA123 TEST1
#11 AAA123 TEST2
#12 ABB123 TEST3
#13 ABC123 TEST3
Or another option without using strsplit
v1 <- with(df, tapply(PolId, Description, FUN= function(x) {
x1 <- paste(x, collapse=";")
gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*);', '', x1, perl=TRUE)}))
library(stringr)
Description <- rep(names(v1), str_count(v1, '\\w+'))
PolId <- scan(text=gsub(';+', ' ', v1), what='', quiet=TRUE)
data.frame(PolId, Description)
# PolId Description
#1 ABC123 TEST1
#2 ABC456 TEST1
#3 ABC789 TEST1
#4 AAA123 TEST1
#5 AAA123 TEST2
#6 ABB123 TEST3
#7 ABC123 TEST3
Removing duplication sub-strings separated by semicolon from the string
After a little modification, the following code works:
import pandas as pd
pipe_data = pd.read_excel('/content/sample_data/aff.xlsx', sheet_name='Sheet2')
df = pd.DataFrame(pipe_data)
df.dropna(inplace = True)
df['RepStr'] = df['RepStr'].str.split("; ")
df['RepStr'] = df['RepStr'].map(pd.unique).str.join("; ")
# mode='a' appends a new sheet to the existing workbook (needs the openpyxl engine)
with pd.ExcelWriter('/content/sample_data/aff.xlsx', mode='a') as writer:
    df.to_excel(writer, sheet_name='Sheet3')
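To see what the split / unique / join steps do to a single cell, the sketch below applies the same idea to a plain string, without Excel or pandas. The function name dedupe_substrings and the sample string are assumptions for illustration, and dict.fromkeys is used here as an order-preserving stand-in for pd.unique.

```python
def dedupe_substrings(text, sep="; "):
    """Drop repeated sub-strings, keeping the first occurrence of each."""
    parts = text.split(sep)
    # dict keys keep insertion order, so repeats are removed in place
    return sep.join(dict.fromkeys(parts))

# Hypothetical cell value with repeated entries
sample = "alpha; beta; alpha; gamma; beta"
result = dedupe_substrings(sample)
print(result)  # alpha; beta; gamma
```

Note that the separator is "; " (semicolon plus space), as in the answer above; values separated by a bare ";" would not be split by this sketch.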
Format output if column 2 field has more than one value
Could you please try the following. It removes duplicate values from the 2nd field and creates two output files: lines whose 2nd field is left with a single value go to out_file_two_cols, and all other lines go to out_file_more_than_two_cols. You can change the file names as needed.
awk '
BEGIN{
FS=OFS=":"
}
{
delete a
val=""
num=split($2,array,";")
for(j=1;j<=num;j++){
if(!a[array[j]]++){
val=(val?val ";":"")array[j]
}
}
$2=val
num=split($2,array,";")
}
num==1{
print > ("out_file_two_cols")
next
}
{
print > ("out_file_more_than_two_cols")
}
' Input_file
Explanation: the BEGIN section sets both the field separator and the output field separator to : for every line of Input_file. In the main block, array a and variable val are reset first so that values left over from the previous line do not leak into the current one.
The 2nd field is then split on ; into array, and the number of elements is stored in num. A for loop runs from 1 to num to visit every element of the 2nd field.
For each element, if it is not yet present in array a, it is appended to val; this keeps only the first occurrence of each value.
The value of val is then assigned back to the 2nd field, and the field is split again so that num holds the number of elements that remain.
Finally, if num is 1 (the edited 2nd field has a single element), the line is printed to out_file_two_cols; otherwise it is printed to out_file_more_than_two_cols.
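The same per-line logic can be sketched in Python for readers who want to step through it. This is only an illustration under assumptions: the function name dedupe_field, the sample lines, and the use of two lists instead of two output files are all invented here, and every line is assumed to have at least two :-separated fields.

```python
def dedupe_field(line, field_sep=":", value_sep=";"):
    """Return the line with duplicate values removed from field 2,
    plus the number of unique values left in that field."""
    fields = line.split(field_sep)
    # dict.fromkeys keeps the first occurrence of each value, in order
    unique = list(dict.fromkeys(fields[1].split(value_sep)))
    fields[1] = value_sep.join(unique)
    return field_sep.join(fields), len(unique)

# Hypothetical input lines in the "key:v1;v2;..." layout
lines = ["id1:a;a;a", "id2:a;b;a"]

single, multiple = [], []
for line in lines:
    cleaned, count = dedupe_field(line)
    # Mirror the awk script: one destination when a single value remains
    (single if count == 1 else multiple).append(cleaned)

print(single)    # ['id1:a']
print(multiple)  # ['id2:a;b']
```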
Split content of field and duplicate row
You can use the code below to split the data and write it to a different sheet.
Sheet1 contains the input and Sheet2 receives the output, as you requested.
Dim i As Integer
Dim j As Integer
Dim k As Integer
Dim x As Integer
Dim y As Integer
i = 1 'Row
j = 1 'Col
'Destination Row & Col
x = 1
y = 1
While (Trim(ThisWorkbook.Sheets("Sheet1").Cells(i, j).Value) <> "")
Dim CellValue1 As String
Dim CellValue2 As String
Dim CellValue3 As String
Dim ValArray() As String
Dim arrayLength As Integer
CellValue1 = Trim(ThisWorkbook.Sheets("Sheet1").Cells(i, j).Value)
CellValue2 = Trim(ThisWorkbook.Sheets("Sheet1").Cells(i, (j + 1)).Value)
CellValue3 = Trim(ThisWorkbook.Sheets("Sheet1").Cells(i, (j + 2)).Value)
ValArray = Split(CellValue1, ";")
arrayLength = UBound(ValArray, 1) - LBound(ValArray, 1) + 1
k = 0
While (k < arrayLength)
'MsgBox ((ValArray(k) & CellValue2 & CellValue3))
ThisWorkbook.Sheets("Sheet2").Cells(x, y).Value = ValArray(k)
y = y + 1
ThisWorkbook.Sheets("Sheet2").Cells(x, y).Value = CellValue2
y = y + 1
ThisWorkbook.Sheets("Sheet2").Cells(x, y).Value = CellValue3
x = x + 1
y = 1
k = k + 1
Wend
i = i + 1
Wend
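For comparison, the nested loop above (outer loop over input rows, inner loop over the split values of the first column) can be sketched in Python on in-memory tuples. The name split_rows and the sample rows are hypothetical, and the three-column layout is assumed from the macro.

```python
def split_rows(rows, sep=";"):
    """Yield one output row per value in the first column,
    repeating the second and third columns unchanged."""
    for first, second, third in rows:
        for value in first.split(sep):
            yield (value, second, third)

# Hypothetical three-column input, like Sheet1 in the macro
source = [("a;b;c", "x1", "y1"), ("d", "x2", "y2")]
expanded = list(split_rows(source))

for row in expanded:
    print(row)
```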
Script for removing rows based on entries from specific column in CSV file
To output the row where Header_2 never contains an entry from all Header_1 values, you can do the following:
Windows PowerShell:
$data = Import-Csv file.csv -Delimiter "`t"
($data | where Header_1 -notin $data.Header_2 |
ConvertTo-Csv -NoType -Delimiter "`t") -replace '^"|"$|"(\t)"','$1' |
Set-Content file.csv
PowerShell 7:
$data = Import-Csv file.csv -Delimiter "`t"
$data | where Header_1 -notin $data.Header_2 |
Export-Csv file.csv -NoType -Delimiter "`t" -UseQuotes AsNeeded
I feel like what you want to do is output rows where Header_2 has not yet appeared as a Header_1 value, which means you are ignoring future Header_1 values.
$list = [system.collections.generic.list[string]]@()
(Import-Csv file.csv -delimiter "`t" | Foreach-Object {
$list.Add($_.Header_1)
if ($_.Header_2 -notin $list) {
$_
}
} | ConvertTo-Csv -NoType -Delimiter "`t") -replace '^"|"$|"(\t)"','$1' |
Set-Content file.csv
You can go a route without using *-Csv commands, and then you don't have to deal with qualifying text for PowerShell non-core versions.
$list = [system.collections.generic.list[string]]@()
# Parentheses read the whole file first, so Set-Content can overwrite it
(Get-Content file.csv) | Foreach-Object {
$h1,$h2 = $_ -split '\t'
$list.Add($h1)
if ($h2 -notin $list) {
$_
}
} | Set-Content file.csv
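The "seen so far" filter used in the last two variants can be sketched in Python as follows. The function name filter_rows and the sample rows are assumptions for illustration, and the comparison here is case-sensitive, whereas PowerShell's -notin is case-insensitive by default.

```python
def filter_rows(rows):
    """Keep rows whose second column has not yet appeared in the first
    column, counting the current row's own first column as already seen."""
    seen = set()
    for h1, h2 in rows:
        seen.add(h1)
        if h2 not in seen:
            yield (h1, h2)

# Hypothetical (Header_1, Header_2) pairs, without the header row
rows = [("A", "B"), ("B", "A"), ("C", "C"), ("D", "E")]
kept = list(filter_rows(rows))
print(kept)
```

Because the current row's Header_1 is added before the check, a row whose two columns are equal is always dropped, just as in the PowerShell loop.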