List Only Duplicate Lines Based on One Column from a Semi-Colon Delimited File

Split Column of Semicolon-Separated Values and Duplicate Row with each Value in Pandas

You can use pandas.Series.str.split to turn each players string into a list, then pandas.DataFrame.explode to give each player its own row:

play_by_play['players'] = play_by_play['players'].str.split(';')
play_by_play = play_by_play.explode('players').reset_index(drop=True)

# Output :

print(play_by_play)

              players  down  to_go play_type  yards_gained  pass_attempt  complete_pass  rush_attempt
0           Tom Brady     1     10      pass             8             1              1             0
1          Mike Evans     1     10      pass             8             1              1             0
2       Tristan Wirfs     1     10      pass             8             1              1             0
3   Leonard Fournette     1     10      pass             8             1              1             0
4        Chris Godwin     1     10      pass             8             1              1             0
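
To try the split-then-explode pattern without the original data, here is a minimal self-contained sketch. The sample rows and the reduced column set are made up for illustration; only the str.split and explode calls come from the answer above.

```python
import pandas as pd

# Hypothetical sample data: one row per play, player names joined by ';'
play_by_play = pd.DataFrame({
    "players": ["Tom Brady;Mike Evans", "Leonard Fournette"],
    "down": [1, 2],
    "play_type": ["pass", "run"],
})

# Turn each 'players' string into a list of names
play_by_play["players"] = play_by_play["players"].str.split(";")

# Emit one row per list element; the other columns are repeated as-is
exploded = play_by_play.explode("players").reset_index(drop=True)
print(exploded)
```

The reset_index(drop=True) call matters because explode keeps the original row label for every generated row, so labels would otherwise repeat.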

R: split a semicolon-delimited column into rows

You could try unnest from tidyr after splitting the "PolId" column, then keep only the unique rows:

library(dplyr)
library(tidyr)
unnest(setNames(strsplit(df$PolId, ';'), df$Description),
       Description) %>% unique()

Or use base R with stack, strsplit, and duplicated: split the "PolId" column on the delimiter (;) with strsplit, name the resulting list elements after the "Description" column, stack the list into a data.frame, and use duplicated to drop the repeated rows.

df1 <- stack(setNames(strsplit(df$PolId, ';'), df$Description))
setNames(df1[!duplicated(df1),], names(df))
#     PolId Description
#1  ABC123       TEST1
#2  ABC456       TEST1
#3  ABC789       TEST1
#10 AAA123       TEST1
#11 AAA123       TEST2
#12 ABB123       TEST3
#13 ABC123       TEST3

Another option, without using strsplit:

v1 <- with(df, tapply(PolId, Description, FUN = function(x) {
    x1 <- paste(x, collapse = ";")
    gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*);', '', x1, perl = TRUE)
}))
library(stringr)
Description <- rep(names(v1),  str_count(v1, '\\w+'))
PolId <- scan(text=gsub(';+', ' ', v1), what='', quiet=TRUE)
data.frame(PolId, Description)
#   PolId Description
#1 ABC123       TEST1
#2 ABC456       TEST1
#3 ABC789       TEST1
#4 AAA123       TEST1
#5 AAA123       TEST2
#6 ABB123       TEST3
#7 ABC123       TEST3
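
For comparison, the same split, stack, and de-duplicate steps can be sketched in pandas. This is not part of the R answer; the sample data below is hypothetical and only borrows the PolId and Description column names.

```python
import pandas as pd

# Hypothetical input shaped like the R example: ';'-joined IDs per description
df = pd.DataFrame({
    "PolId": ["ABC123;ABC456;ABC123", "AAA123", "ABC456;ABC789"],
    "Description": ["TEST1", "TEST2", "TEST1"],
})

result = (
    df.assign(PolId=df["PolId"].str.split(";"))  # string -> list of IDs
      .explode("PolId")                          # one row per ID
      .drop_duplicates()                         # drop repeated (PolId, Description) pairs
      .reset_index(drop=True)
)
print(result)
```

drop_duplicates compares whole rows here, so the same PolId is kept once per Description, which mirrors what duplicated does on the stacked data.frame.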

Removing duplicate substrings separated by semicolons from a string

After a little modification, the following code works:

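import pandas as pd

# read_excel already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_excel('/content/sample_data/aff.xlsx', sheet_name='Sheet2')
df.dropna(inplace=True)
# Split on "; ", drop repeats while keeping first-seen order, then re-join
df['RepStr'] = df['RepStr'].str.split("; ")
df['RepStr'] = df['RepStr'].map(lambda parts: list(dict.fromkeys(parts))).str.join("; ")
# Append the result as a new sheet; append mode needs the openpyxl engine
with pd.ExcelWriter('/content/sample_data/aff.xlsx', mode='a', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Sheet3')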
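
The de-duplication step can be checked on its own, without an Excel file. Below is a small sketch using only the standard library; the helper name dedupe_substrings is made up for illustration, and it assumes the separator is exactly "; " as in the code above.

```python
def dedupe_substrings(text, sep="; "):
    """Remove repeated substrings from a sep-joined string, keeping first-seen order."""
    parts = text.split(sep)
    # dict keys are unique and preserve insertion order
    return sep.join(dict.fromkeys(parts))


print(dedupe_substrings("alpha; beta; alpha; gamma; beta"))
```

A function like this could also be passed to Series.map on a column of strings, in which case the separate str.split and str.join steps are not needed.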

Format output if column 2 field has more than one value

Could you please try the following. It creates two output files: one receives the lines whose 2nd column holds a single value once duplicates are removed, and the other receives the lines whose 2nd column still holds more than one value. The output files are named out_file_two_cols and out_file_more_than_two_cols; change the names as needed.

awk '
BEGIN{
  FS=OFS=":"
}
{
  delete a
  val=""
  num=split($2,array,";")
  for(j=1;j<=num;j++){
    if(!a[array[j]]++){
       val=(val?val ";":"")array[j]
    }
  }
  $2=val
  num=split($2,array,";")
}
num==1{
  print > ("out_file_two_cols")
  next
}
{
  print > ("out_file_more_than_two_cols")
}
' Input_file

Explanation: The BEGIN section sets both the field separator and the output field separator to : for every line of Input_file. In the main block, the array a and the variable val are reset at the start of each line so that values left over from the previous line cannot leak into the current one.

The 2nd field is split on ; into array, and num holds the number of elements. A for loop then runs from 1 to num to visit each element of the 2nd field.

Each element is appended to val only if it has not yet been seen in a for the current line, so val ends up holding the unique elements in their original order.

val is then assigned back to the 2nd field, and the field is split again so that num now holds the number of unique elements.

Finally, if num is 1 (the edited 2nd field has a single element), the line is written to out_file_two_cols; otherwise it is written to out_file_more_than_two_cols.
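
The same logic can be sketched in Python to make the steps easier to follow. This is an illustration rather than a replacement for the awk script: the function names and sample lines are made up, the results go into two lists instead of two files, and every line is assumed to have at least two :-separated fields.

```python
def dedupe_field(value, sep=";"):
    """Keep the first occurrence of each element, in original order."""
    return sep.join(dict.fromkeys(value.split(sep)))


def route_lines(lines):
    """Return (single, multiple): lines whose deduplicated 2nd field has one value vs. several."""
    single, multiple = [], []
    for line in lines:
        fields = line.rstrip("\n").split(":")
        fields[1] = dedupe_field(fields[1])
        # One unique value left means no ';' remains in the field
        target = multiple if ";" in fields[1] else single
        target.append(":".join(fields))
    return single, multiple


sample = ["k1:a;a;a", "k2:a;b;a", "k3:c"]  # hypothetical input lines
single, multiple = route_lines(sample)
print(single)
print(multiple)
```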

Split content of field and duplicate row

You can use the code below to split the data and write it to a different sheet.

Sheet1 contains the input and Sheet2 receives the output, as you requested.

'Long rather than Integer, so the row counters cannot overflow past row 32,767
Dim i As Long
Dim j As Long
Dim k As Long
Dim x As Long
Dim y As Long

i = 1  'Row
j = 1  'Col

'Destination Row & Col
x = 1
y = 1

While (Trim(ThisWorkbook.Sheets("Sheet1").Cells(i, j).Value) <> "")
    Dim CellValue1 As String
    Dim CellValue2 As String
    Dim CellValue3 As String
    Dim ValArray() As String
    Dim arrayLength As Integer

    CellValue1 = Trim(ThisWorkbook.Sheets("Sheet1").Cells(i, j).Value)
    CellValue2 = Trim(ThisWorkbook.Sheets("Sheet1").Cells(i, (j + 1)).Value)
    CellValue3 = Trim(ThisWorkbook.Sheets("Sheet1").Cells(i, (j + 2)).Value)
    ValArray = Split(CellValue1, ";")
    arrayLength = UBound(ValArray, 1) - LBound(ValArray, 1) + 1

    k = 0
    While (k < arrayLength)
        'MsgBox ((ValArray(k) & CellValue2 & CellValue3))
        ThisWorkbook.Sheets("Sheet2").Cells(x, y).Value = ValArray(k)
        y = y + 1
        ThisWorkbook.Sheets("Sheet2").Cells(x, y).Value = CellValue2
        y = y + 1
        ThisWorkbook.Sheets("Sheet2").Cells(x, y).Value = CellValue3
        x = x + 1
        y = 1
        k = k + 1
    Wend
    i = i + 1
Wend
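
As a cross-check of what the macro does, here is a rough Python sketch of the same row expansion on in-memory tuples. It assumes three columns per row with the semicolon-separated values in the first one, as in the macro; the function name and sample rows are hypothetical.

```python
def expand_rows(rows, sep=";"):
    """Repeat each row once per value found in its first column."""
    expanded = []
    for first, second, third in rows:
        # Trim the cell as a whole, then split it into individual values
        for value in first.strip().split(sep):
            expanded.append((value, second, third))
    return expanded


rows = [("a;b;c", "x", 1), ("d", "y", 2)]  # hypothetical sheet contents
for row in expand_rows(rows):
    print(row)
```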

Script for removing rows based on entries from specific column in CSV file

To output only the rows whose Header_1 value never appears anywhere in the Header_2 column, you can do the following:

Windows PowerShell:

$data = Import-Csv file.csv -Delimiter "`t"
($data | where Header_1 -notin $data.Header_2 |
    ConvertTo-Csv -NoType -Delimiter "`t") -replace '^"|"$|"(\t)"','$1' |
        Set-Content file.csv

PowerShell 7:

$data = Import-Csv file.csv -Delimiter "`t"
$data | where Header_1 -notin $data.Header_2 |
    Export-Csv file.csv -NoType -Delimiter "`t" -UseQuotes AsNeeded

I feel like what you want to do is output rows where Header_2 has not yet appeared as a Header_1 value, which means you are ignoring future Header_1 values.

$list = [system.collections.generic.list[string]]@()
(Import-Csv file.csv -delimiter "`t" | Foreach-Object {
    $list.Add($_.Header_1)
    if ($_.Header_2 -notin $list) { 
        $_ 
    }
} | ConvertTo-Csv -NoType -Delimiter "`t") -replace '^"|"$|"(\t)"','$1' |
        Set-Content file.csv

You can also skip the *-Csv commands entirely; then you don't have to strip the quotes that Windows PowerShell (non-Core) adds around every field.

$list = [system.collections.generic.list[string]]@()
# Parentheses read the whole file first, so Set-Content can overwrite it afterwards
(Get-Content file.csv) | Foreach-Object {
    $h1,$h2 = $_ -split '\t'
    $list.Add($h1)
    if ($h2 -notin $list) { 
        $_ 
    }
} | Set-Content file.csv
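
The difference between the two filters (checking against every value in the file versus only the values seen so far) is easy to miss, so here is a small Python sketch of both. The function names and sample rows are made up, and the comparison is case-sensitive here, whereas PowerShell's -notin is case-insensitive by default.

```python
def filter_against_all(rows):
    """Keep rows whose Header_1 never appears in the Header_2 column."""
    all_h2 = {row["Header_2"] for row in rows}
    return [row for row in rows if row["Header_1"] not in all_h2]


def filter_against_seen(rows):
    """Keep rows whose Header_2 has not yet appeared as a Header_1 value."""
    seen = set()
    kept = []
    for row in rows:
        seen.add(row["Header_1"])
        if row["Header_2"] not in seen:
            kept.append(row)
    return kept


rows = [  # hypothetical file contents
    {"Header_1": "a", "Header_2": "b"},
    {"Header_1": "b", "Header_2": "c"},
    {"Header_1": "d", "Header_2": "a"},
]
print(filter_against_all(rows))
print(filter_against_seen(rows))
```

With this sample the first filter keeps only the last row, while the second keeps the first two, because "a" has already been seen as a Header_1 value by the time the third row is read.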