How to Extract Column Value Within Square Brackets in Pyspark

Pyspark - Regex - Extract value from last brackets

To extract the substring between parentheses (with no other parentheses inside) at the end of the string, you may use

tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^()]+)\)$", 1))

Here, col and regexp_extract come from pyspark.sql.functions.

Details

  • \( - matches (
  • ([^()]+) - captures into Group 1 any 1+ chars other than ( and )
  • \) - a ) char
  • $ - at the end of the string.

The third argument, 1, tells regexp_extract to return the value of Group 1.
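Spark's regexp_extract uses Java's regex flavor, but this particular pattern behaves the same under Python's re module, so it can be sanity-checked outside Spark. A minimal sketch with made-up inputs (the helper name extract_last_parens is ours):

```python
import re

# Same pattern as the Spark snippet above: a final (...) group
# containing no nested parentheses, anchored at the end of the string.
pattern = r"\(([^()]+)\)$"

def extract_last_parens(text):
    """Return group 1 on a match, else '' (mimicking regexp_extract,
    which returns an empty string when the pattern does not match)."""
    m = re.search(pattern, text)
    return m.group(1) if m else ""

print(extract_last_parens("some value (extract me)"))  # -> extract me
print(extract_last_parens("no trailing group"))        # -> (empty string)
print(extract_last_parens("nested (a (b)) tail (c)"))  # -> c
```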


NOTE: To allow trailing whitespace, add \s* right before $: r"\(([^()]+)\)\s*$"

NOTE 2: To match the last occurrence of such a substring anywhere in a longer string, with otherwise the same code as above, use

r"(?s).*\(([^()]+)\)"

The greedy .* first consumes the whole string; backtracking then gives up characters from the end until the last ( ... ) group can match.
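That backtracking behavior can likewise be checked with Python's re (a sketch with invented inputs; the helper name is ours, and group 1 again holds the result):

```python
import re

# (?s) lets . also match newlines, so .* spans the whole input.
pattern = r"(?s).*\(([^()]+)\)"

def extract_last_occurrence(text):
    # The greedy .* consumes everything first, then backtracks just far
    # enough for the final \(...\) to match, i.e. the LAST such group.
    m = re.search(pattern, text)
    return m.group(1) if m else ""

print(extract_last_occurrence("(first) middle (second) end"))  # -> second
print(extract_last_occurrence("a (x)\nb (y) tail"))            # -> y
```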

Extract text between brackets and create rows for each bit of text

If you can assign the output of Series.str.findall back to the column, use DataFrame.explode; finally, for a unique index, use DataFrame.reset_index with drop=True:

df2['text'] = df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])")

df4 = df2.explode('text').reset_index(drop=True)
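A minimal end-to-end sketch of this approach, assuming a hypothetical df2 with a bracketed text column:

```python
import pandas as pd

# Hypothetical sample data resembling the question's layout.
df2 = pd.DataFrame({
    "studyid": [101, 102],
    "text": ["[Bananas][oranges]", "[Grapes]"],
})

# findall returns a list of matches per row; explode turns each list
# element into its own row, and reset_index(drop=True) renumbers.
df2["text"] = df2["text"].str.findall(r"(?<=\[)([^]]+)(?=\])")
df4 = df2.explode("text").reset_index(drop=True)
print(df4["text"].tolist())  # -> ['Bananas', 'oranges', 'Grapes']
```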

An alternative solution uses Series.str.extractall; then remove the second level of the resulting MultiIndex and use DataFrame.join to append the extracted values to the original DataFrame:

s = (df2.pop('text').str.extractall(r"(?<=\[)([^]]+)(?=\])")[0]
.reset_index(level=1, drop=True)
.rename('text'))

df4 = df2.join(s).reset_index(drop=True)

print(df4)
   studyid Question       text
0      101       Q1    Bananas
1      101       Q1    oranges
2      101       Q1       figs
3      101       Q2     Apples
4      102       Q1     Grapes
5      103       Q3  Mandarins
6      103       Q3    oranges
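The extractall variant can be sketched the same way (again with invented sample data; extractall yields one row per match with an index × match-number MultiIndex):

```python
import pandas as pd

df2 = pd.DataFrame({
    "studyid": [101, 103],
    "Question": ["Q1", "Q3"],
    "text": ["[Bananas][figs]", "[Mandarins]"],
})

# [0] selects the single capture group's column; dropping the "match"
# index level lets join replicate each original row once per match.
s = (df2.pop("text").str.extractall(r"(?<=\[)([^]]+)(?=\])")[0]
       .reset_index(level=1, drop=True)
       .rename("text"))
df4 = df2.join(s).reset_index(drop=True)
print(df4["text"].tolist())  # -> ['Bananas', 'figs', 'Mandarins']
```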

How to extract content from the regex output which has square bracket in python

You can use str.strip if the values are strings:

print(type(df.at[0, 'email']))
<class 'str'>

df['email'] = df.email.str.strip("[]'")
print(df)
              email
0    jsaw@yahoo.com
1  jfsjhj@yahoo.com
2    jwrk@yahoo.com
3   rankw@yahoo.com
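Sketched end to end with a made-up frame: strip removes any of the listed characters ([, ], ') from both ends of each string, which is exactly what a stringified one-element list needs:

```python
import pandas as pd

# Hypothetical data: each cell is the *string* repr of a list.
df = pd.DataFrame({"email": ["['jsaw@yahoo.com']", "['jwrk@yahoo.com']"]})

# Strip the bracket and quote characters from both ends of each value.
df["email"] = df["email"].str.strip("[]'")
print(df["email"].tolist())  # -> ['jsaw@yahoo.com', 'jwrk@yahoo.com']
```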

If the values are lists, apply pd.Series:

print(type(df.at[0, 'email']))
<class 'list'>

df['email'] = df.email.apply(pd.Series)
print(df)
              email
0    jsaw@yahoo.com
1  jfsjhj@yahoo.com
2    jwrk@yahoo.com
3   rankw@yahoo.com

EDIT: If some cells hold multiple values in the list, you can use:

df1 = df['email'].apply(pd.Series).fillna('')
print(df1)
                  0                  1                 2
0    jsaw@yahoo.com
1  jfsjhj@yahoo.com
2    jwrk@yahoo.com
3   rankw@yahoo.com  fsffsnl@gmail.com
4   mklcu@yahoo.com   riserk@gmail.com  funkdl@yahoo.com
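Both cases can be sketched together, assuming cells that hold real Python lists of varying length:

```python
import pandas as pd

df = pd.DataFrame({"email": [
    ["jsaw@yahoo.com"],
    ["rankw@yahoo.com", "fsffsnl@gmail.com"],
]})

# apply(pd.Series) spreads each list across numbered columns;
# shorter lists get NaN, which fillna('') blanks out.
df1 = df["email"].apply(pd.Series).fillna("")
print(df1[0].tolist())  # -> ['jsaw@yahoo.com', 'rankw@yahoo.com']
print(df1[1].tolist())  # -> ['', 'fsffsnl@gmail.com']
```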

How to remove square brackets from dataframe

Try with apply, explode and groupby:

>>> df.apply(lambda x: x.explode().astype(str).groupby(level=0).agg(", ".join))
  column1 column2                                            column3
0   data1   data1                                              data1
1     nan   data2                                              data2
2   data2   data3  data3, data3, testing how are you guys hope yo...
3   data3   data3        data4, dummy text to test to test test test
4     nan   data4                                              data5
  1. Use Series.explode() to transform each list element into its own row, replicating index values.
  2. Then groupby the identical index values and aggregate with ", ".join.
  3. Use apply to run the same function on every column of the DataFrame.
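The three steps above can be sketched on a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "column1": [["data1"], ["data2", "data3"]],
    "column2": [["data1"], ["data2"]],
})

# Per column: explode lists to rows (index values repeat), then group
# the repeated index values back together, joining with ", ".
out = df.apply(lambda x: x.explode().astype(str).groupby(level=0).agg(", ".join))
print(out["column1"].tolist())  # -> ['data1', 'data2, data3']
print(out["column2"].tolist())  # -> ['data1', 'data2']
```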

Pyspark - Regex_Extract value between forward slash (/)

How about split?

Using pyspark.sql.functions.split, you can:

.withColumn("Acode", split("column1", "/")[0])
.withColumn("Bcode", split("column1", "/")[1])
.withColumn("Ccode", split("column1", "/")[2])
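If you are in pandas rather than Spark, a rough analogue of the snippet above (with invented sample values) is Series.str.split with expand=True:

```python
import pandas as pd

# Hypothetical column1 values shaped like "A/B/C".
df = pd.DataFrame({"column1": ["A1/B1/C1", "A2/B2/C2"]})

# expand=True spreads the split parts into separate columns,
# mirroring split("column1", "/")[i] in Spark.
parts = df["column1"].str.split("/", expand=True)
df["Acode"] = parts[0]
df["Bcode"] = parts[1]
df["Ccode"] = parts[2]
print(df.loc[0, ["Acode", "Bcode", "Ccode"]].tolist())  # -> ['A1', 'B1', 'C1']
```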

How do I get the value without the square brackets

first() returns a Row object, and you can index into the Row to extract its element as a string:

sigh.select("accountId").first()[0]

(In Scala, the equivalent is sigh.select("accountId").first.getString(0).)

