PySpark - Pass list as parameter to UDF
from pyspark.sql.functions import udf, col

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

def cate(label, feature_list):
    if feature_list == 0:
        return label[4]
    else:  # without an 'else' branch, rows that do not match would get null
        return 'I am not sure!'

def udf_score(label_list):
    # label_list is captured by the closure; the UDF itself only receives the column value
    return udf(lambda l: cate(label_list, l))

a.withColumn("category", udf_score(label_list)(col("distances"))).show()
Output is:
+------+---------+--------------+
|Letter|distances|      category|
+------+---------+--------------+
|     A|       20|I am not sure!|
|     B|       30|I am not sure!|
|     D|       80|I am not sure!|
+------+---------+--------------+
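In the example above every row falls into the fallback branch, so the list never influences the result. As a minimal sketch of a variant where it does, the code below maps each distance to a label by bucket. The `thresholds` values and the extra `bounds` parameter are assumptions made for illustration, not part of the original answer. The PySpark import sits inside the factory so the plain-Python logic can be checked without a Spark session.

```python
from bisect import bisect_left

label_list = ["Great", "Good", "OK", "Please Move", "Dead"]
# Hypothetical inclusive upper bounds for the first four labels;
# anything above the last bound falls into "Dead".
thresholds = [10, 25, 50, 75]

def cate(labels, bounds, distance):
    """Return the label of the bucket that contains `distance`."""
    if distance is None:
        return "I am not sure!"
    return labels[bisect_left(bounds, distance)]

def udf_score(labels, bounds):
    """Build a UDF that closes over both lists."""
    # Imported here so cate() stays usable without PySpark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(lambda d: cate(labels, bounds, d), StringType())

# Usage with the DataFrame `a` from the example above:
# a.withColumn("category", udf_score(label_list, thresholds)(col("distances"))).show()
```

With these assumed thresholds, the three sample distances 20, 30 and 80 would be labelled "Good", "OK" and "Dead".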
Passing a data frame column and external list to udf under withColumn
The cleanest solution is to pass additional arguments using a closure:
def make_topic_word(topic_words):
    return udf(lambda c: label_maker_topic(c, topic_words))

df = sc.parallelize([(["union"], )]).toDF(["tokens"])

(df.withColumn("topics", make_topic_word(keyword_list)(col("tokens")))
    .show())
This doesn't require any changes to keyword_list or to the function you wrap with the UDF. You can also use this method to pass an arbitrary object, for example a list of sets for efficient lookups.
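A minimal sketch of the list-of-sets idea follows. The body of `label_maker_topic` is not shown in the original, so the version here is a hypothetical stand-in that returns the indices of the keyword sets sharing at least one token with the row. `keyword_sets` is likewise made-up sample data.

```python
def label_maker_topic(tokens, topic_words):
    """Hypothetical stand-in: indices of keyword sets that overlap `tokens`."""
    token_set = set(tokens)
    # Set intersection keeps each membership check O(1) on average.
    return [i for i, words in enumerate(topic_words) if token_set & words]

def make_topic_word(topic_words):
    """Build a UDF that closes over `topic_words` (a list of sets)."""
    # Imported here so label_maker_topic() can be tested without PySpark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType
    return udf(lambda c: label_maker_topic(c, topic_words),
               ArrayType(IntegerType()))

# Made-up keyword sets for illustration.
keyword_sets = [{"union", "strike"}, {"election", "vote"}]

# Usage with the DataFrame `df` from the example above:
# df.withColumn("topics", make_topic_word(keyword_sets)(col("tokens"))).show()
```

Because the sets are captured by the closure, they are serialized together with the function and never have to be expressed as Spark columns.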
If you want to use your current UDF and pass topic_words directly, you'll have to convert it to a column literal first:
from pyspark.sql.functions import array, lit
ks_lit = array(*[array(*[lit(k) for k in ks]) for ks in keyword_list])
df.withColumn("ad", topicWord(col("tokens"), ks_lit)).show()
Depending on your data and requirements, there can be alternative, more efficient solutions that don't require UDFs (explode + aggregate + collapse) or lookups (hashing + vector operations).
Pyspark pass function as a parameter to UDF
You are not far from the solution. Here is how I would do it:
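from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType

def foo_fun_udf(func):
    # func is captured by the closure, so it never has to be passed as a column
    def foo_fun(row) -> str:
        return 'a' + func()
    out_udf = udf(foo_fun, StringType())
    return out_udf

df_to_test.withColumn(
    'foo',
    foo_fun_udf(bar_fun)(struct([df_to_test[x] for x in df_to_test.columns]))
).show()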
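The same factory pattern can also hand the row to the function that is passed in. The sketch below is one possible variant, not the original answer's code: `make_foo_fun` and this `bar_fun` are hypothetical, and it assumes the passed function takes the row struct and returns a string.

```python
def make_foo_fun(func):
    """Return a plain Python function that prefixes 'a' to func(row)."""
    def foo_fun(row) -> str:
        return "a" + func(row)
    return foo_fun

def foo_fun_udf(func):
    """Wrap the closure in a UDF that returns a string."""
    # Imported here so make_foo_fun() can be tested without PySpark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(make_foo_fun(func), StringType())

def bar_fun(row):
    # Hypothetical example: use the first field of the row.
    return str(row[0])

# Usage with the DataFrame `df_to_test` from the example above:
# df_to_test.withColumn(
#     "foo",
#     foo_fun_udf(bar_fun)(struct([df_to_test[x] for x in df_to_test.columns]))
# ).show()
```

Keeping the wrapped logic in `make_foo_fun` means it can be checked with ordinary tuples before it is ever run on a cluster.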
Pass list to udf in dataframe with Colum
The dynamic column approach below might solve your problem.
from pyspark.sql.functions import concat
# Creating an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
'''
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+
'''
col_list = ['col1','col2']
df = df.withColumn('concatenated_cols2',concat(*col_list))
col_list = ['col1','col2','col3']
df = df.withColumn('concatenated_cols3',concat(*col_list))
col_list = ['col1','col2','col3','col4']
df = df.withColumn('concatenated_cols4',concat(*col_list))
df.show()
'''
+----+----+----+----+------------------+------------------+------------------+
|col1|col2|col3|col4|concatenated_cols2|concatenated_cols3|concatenated_cols4|
+----+----+----+----+------------------+------------------+------------------+
|  A1|  11|  A3|  A4|              A111|            A111A3|          A111A3A4|
|  B1|  22|  B3|  B4|              B122|            B122B3|          B122B3B4|
|  C1|  33|  C3|  C4|              C133|            C133C3|          C133C3C4|
+----+----+----+----+------------------+------------------+------------------+
'''