How to Extract Column Value Within Square Brackets in Pyspark

Pyspark - Regex - Extract value from last brackets

To extract the substring between parentheses (with no other parentheses inside) at the end of the string, you may use

tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^()]+)\)$", 1))

Here, col and regexp_extract come from pyspark.sql.functions.

Details

  • \( - matches (
  • ([^()]+) - captures into Group 1 any 1+ chars other than ( and )
  • \) - a ) char
  • $ - at the end of the string.

The third argument, 1, tells regexp_extract to return the value of Group 1.
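Spark's regexp_extract uses Java's regex flavor, but this particular pattern behaves the same under Python's re module, so it can be sanity-checked outside Spark. A minimal sketch with made-up inputs (the helper name extract_last_parens is ours):

```python
import re

# Same pattern as the Spark snippet above: a final (...) group
# containing no nested parentheses, anchored at the end of the string.
pattern = r"\(([^()]+)\)$"

def extract_last_parens(text):
    """Return group 1 on a match, else '' (mimicking regexp_extract,
    which returns an empty string when the pattern does not match)."""
    m = re.search(pattern, text)
    return m.group(1) if m else ""

print(extract_last_parens("some value (extract me)"))  # -> extract me
print(extract_last_parens("no trailing group"))        # -> (empty string)
print(extract_last_parens("nested (a (b)) tail (c)"))  # -> c
```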


NOTE: To allow trailing whitespace, add \s* right before $: r"\(([^()]+)\)\s*$"

NOTE 2: To match the last occurrence of such a substring anywhere in a longer string, with otherwise the same code as above, use

r"(?s).*\(([^()]+)\)"

The greedy .* first consumes the whole string; backtracking then gives up characters from the end until the last ( ... ) group can match.
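That backtracking behavior can likewise be checked with Python's re (a sketch with invented inputs; the helper name is ours, and group 1 again holds the result):

```python
import re

# (?s) lets . also match newlines, so .* spans the whole input.
pattern = r"(?s).*\(([^()]+)\)"

def extract_last_occurrence(text):
    # The greedy .* consumes everything first, then backtracks just far
    # enough for the final \(...\) to match, i.e. the LAST such group.
    m = re.search(pattern, text)
    return m.group(1) if m else ""

print(extract_last_occurrence("(first) middle (second) end"))  # -> second
print(extract_last_occurrence("a (x)\nb (y) tail"))            # -> y
```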

Extract text between brackets and create rows for each bit of text

If you can assign the output of Series.str.findall back to the column, use DataFrame.explode; finally, for a unique index, use DataFrame.reset_index with drop=True:

df2['text'] = df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])")

df4 = df2.explode('text').reset_index(drop=True)
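A minimal end-to-end sketch of this approach, assuming a hypothetical df2 with a bracketed text column:

```python
import pandas as pd

# Hypothetical sample data resembling the question's layout.
df2 = pd.DataFrame({
    "studyid": [101, 102],
    "text": ["[Bananas][oranges]", "[Grapes]"],
})

# findall returns a list of matches per row; explode turns each list
# element into its own row, and reset_index(drop=True) renumbers.
df2["text"] = df2["text"].str.findall(r"(?<=\[)([^]]+)(?=\])")
df4 = df2.explode("text").reset_index(drop=True)
print(df4["text"].tolist())  # -> ['Bananas', 'oranges', 'Grapes']
```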

An alternative solution uses Series.str.extractall; then remove the second level of the resulting MultiIndex and use DataFrame.join to append the extracted values to the original DataFrame:

s = (df2.pop('text').str.extractall(r"(?<=\[)([^]]+)(?=\])")[0]
.reset_index(level=1, drop=True)
.rename('text'))

df4 = df2.join(s).reset_index(drop=True)

print(df4)
   studyid Question       text
0      101       Q1    Bananas
1      101       Q1    oranges
2      101       Q1       figs
3      101       Q2     Apples
4      102       Q1     Grapes
5      103       Q3  Mandarins
6      103       Q3    oranges
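The extractall variant can be sketched the same way (again with invented sample data; extractall yields one row per match with an index × match-number MultiIndex):

```python
import pandas as pd

df2 = pd.DataFrame({
    "studyid": [101, 103],
    "Question": ["Q1", "Q3"],
    "text": ["[Bananas][figs]", "[Mandarins]"],
})

# [0] selects the single capture group's column; dropping the "match"
# index level lets join replicate each original row once per match.
s = (df2.pop("text").str.extractall(r"(?<=\[)([^]]+)(?=\])")[0]
       .reset_index(level=1, drop=True)
       .rename("text"))
df4 = df2.join(s).reset_index(drop=True)
print(df4["text"].tolist())  # -> ['Bananas', 'figs', 'Mandarins']
```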

How to extract content from the regex output which has square bracket in python

You can use str.strip if the values are strings:

print(type(df.at[0, 'email']))
<class 'str'>

df['email'] = df.email.str.strip("[]'")
print(df)
              email
0    jsaw@yahoo.com
1  jfsjhj@yahoo.com
2    jwrk@yahoo.com
3   rankw@yahoo.com
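Sketched end to end with a made-up frame: strip removes any of the listed characters ([, ], ') from both ends of each string, which is exactly what a stringified one-element list needs:

```python
import pandas as pd

# Hypothetical data: each cell is the *string* repr of a list.
df = pd.DataFrame({"email": ["['jsaw@yahoo.com']", "['jwrk@yahoo.com']"]})

# Strip the bracket and quote characters from both ends of each value.
df["email"] = df["email"].str.strip("[]'")
print(df["email"].tolist())  # -> ['jsaw@yahoo.com', 'jwrk@yahoo.com']
```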

If the values are lists, apply pd.Series:

print(type(df.at[0, 'email']))
<class 'list'>

df['email'] = df.email.apply(pd.Series)
print(df)
              email
0    jsaw@yahoo.com
1  jfsjhj@yahoo.com
2    jwrk@yahoo.com
3   rankw@yahoo.com

EDIT: If some cells hold multiple values in the list, you can use:

df1 = df['email'].apply(pd.Series).fillna('')
print(df1)
                  0                  1                 2
0    jsaw@yahoo.com
1  jfsjhj@yahoo.com
2    jwrk@yahoo.com
3   rankw@yahoo.com  fsffsnl@gmail.com
4   mklcu@yahoo.com   riserk@gmail.com  funkdl@yahoo.com
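Both cases can be sketched together, assuming cells that hold real Python lists of varying length:

```python
import pandas as pd

df = pd.DataFrame({"email": [
    ["jsaw@yahoo.com"],
    ["rankw@yahoo.com", "fsffsnl@gmail.com"],
]})

# apply(pd.Series) spreads each list across numbered columns;
# shorter lists get NaN, which fillna('') blanks out.
df1 = df["email"].apply(pd.Series).fillna("")
print(df1[0].tolist())  # -> ['jsaw@yahoo.com', 'rankw@yahoo.com']
print(df1[1].tolist())  # -> ['', 'fsffsnl@gmail.com']
```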

How to remove square brackets from dataframe

Try with apply, explode and groupby:

>>> df.apply(lambda x: x.explode().astype(str).groupby(level=0).agg(", ".join))
  column1 column2                                            column3
0   data1   data1                                              data1
1     nan   data2                                              data2
2   data2   data3  data3, data3, testing how are you guys hope yo...
3   data3   data3        data4, dummy text to test to test test test
4     nan   data4                                              data5
  1. Use Series.explode() to transform each list element into its own row, replicating index values.
  2. Then groupby the identical index values and aggregate with ", ".join.
  3. Use apply to run the same function on every column of the DataFrame.
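The three steps above can be sketched on a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "column1": [["data1"], ["data2", "data3"]],
    "column2": [["data1"], ["data2"]],
})

# Per column: explode lists to rows (index values repeat), then group
# the repeated index values back together, joining with ", ".
out = df.apply(lambda x: x.explode().astype(str).groupby(level=0).agg(", ".join))
print(out["column1"].tolist())  # -> ['data1', 'data2, data3']
print(out["column2"].tolist())  # -> ['data1', 'data2']
```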

Pyspark - Regex_Extract value between forward slash (/)

How about split?

Using pyspark.sql.functions.split, you can:

.withColumn("Acode", split("column1", "/")[0])
.withColumn("Bcode", split("column1", "/")[1])
.withColumn("Ccode", split("column1", "/")[2])
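If you are in pandas rather than Spark, a rough analogue of the snippet above (with invented sample values) is Series.str.split with expand=True:

```python
import pandas as pd

# Hypothetical column1 values shaped like "A/B/C".
df = pd.DataFrame({"column1": ["A1/B1/C1", "A2/B2/C2"]})

# expand=True spreads the split parts into separate columns,
# mirroring split("column1", "/")[i] in Spark.
parts = df["column1"].str.split("/", expand=True)
df["Acode"] = parts[0]
df["Bcode"] = parts[1]
df["Ccode"] = parts[2]
print(df.loc[0, ["Acode", "Bcode", "Ccode"]].tolist())  # -> ['A1', 'B1', 'C1']
```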

How do I get the value without the square brackets

first() returns a Row object, and you can index into the Row to extract its element as a string:

sigh.select("accountId").first()[0]

(In Scala, the equivalent is sigh.select("accountId").first.getString(0).)

