Spark dataframe get column value into a string variable
col("name") gives you a column expression. If you want to extract the data from column "name", do the same thing without col("name"):
val names = test.filter(test("id").equalTo("200"))
  .select("name")
  .collectAsList() // returns a java.util.List[Row]
Then, for each Row in that list, you can get the name as a String with:
val name = row.getString(0)
Extract column value based on another column in Pandas
You could use loc to get the series that satisfies your condition, and then iloc to get the first element:
In [2]: df
Out[2]:
    A  B
0  p1  1
1  p1  2
2  p3  3
3  p2  4
In [3]: df.loc[df['B'] == 3, 'A']
Out[3]:
2    p3
Name: A, dtype: object
In [4]: df.loc[df['B'] == 3, 'A'].iloc[0]
Out[4]: 'p3'
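One thing to watch for: .iloc[0] raises an IndexError when no row satisfies the condition. Below is a minimal sketch of a guarded lookup using the same data as above. The helper name first_match and its default argument are hypothetical, introduced only for illustration, and are not part of pandas.

```python
import pandas as pd

# Same data as the example above.
df = pd.DataFrame({"A": ["p1", "p1", "p3", "p2"], "B": [1, 2, 3, 4]})


def first_match(frame, value, default=None):
    """Return the first 'A' where 'B' == value, or default if nothing matches.

    Hypothetical helper for illustration only.
    """
    matches = frame.loc[frame["B"] == value, "A"]
    return matches.iloc[0] if not matches.empty else default


found = first_match(df, 3)     # a row with B == 3 exists
missing = first_match(df, 99)  # no such row, so the default is returned
```

Checking matches.empty first avoids wrapping the lookup in a try/except.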
How to efficiently grab data based on string value of a row
What you are looking for is groupby.
Suppose that you have the following DataFrame:
julia> df = DataFrame(Country=rand([:A,:B,:C],7), year=rand(2000:2020,7), tax=rand(7))
7×3 DataFrame
│ Row │ Country │ year │ tax │
│ │ Symbol │ Int64 │ Float64 │
├─────┼─────────┼───────┼──────────┤
│ 1 │ A │ 2014 │ 0.913118 │
│ 2 │ C │ 2003 │ 0.894182 │
│ 3 │ A │ 2018 │ 0.917585 │
│ 4 │ C │ 2011 │ 0.869531 │
│ 5 │ A │ 2011 │ 0.45841 │
│ 6 │ B │ 2001 │ 0.808954 │
│ 7 │ B │ 2008 │ 0.969813 │
You can collect information by each country:
dfg = groupby(df, :Country);
and now:
julia> dfg[1]
3×3 SubDataFrame
│ Row │ Country │ year │ tax │
│ │ Symbol │ Int64 │ Float64 │
├─────┼─────────┼───────┼──────────┤
│ 1 │ A │ 2014 │ 0.913118 │
│ 2 │ A │ 2018 │ 0.917585 │
│ 3 │ A │ 2011 │ 0.45841 │
julia> dfg[2]
2×3 SubDataFrame
│ Row │ Country │ year │ tax │
│ │ Symbol │ Int64 │ Float64 │
├─────┼─────────┼───────┼──────────┤
│ 1 │ C │ 2003 │ 0.894182 │
│ 2 │ C │ 2011 │ 0.869531 │
julia> dfg[3]
2×3 SubDataFrame
│ Row │ Country │ year │ tax │
│ │ Symbol │ Int64 │ Float64 │
├─────┼─────────┼───────┼──────────┤
│ 1 │ B │ 2001 │ 0.808954 │
│ 2 │ B │ 2008 │ 0.969813 │
Note that for faster search it is better to use Symbols than Strings. You can always use the vectorized Symbol.() constructor to convert any column of Strings.
How to extract part of a string in Pandas column and make a new column
Use str.extract with a regex, and str.replace to rename the values:
dff['Version_short'] = dff['Name'].str.extract(r'_(V\d+)$', expand=False).fillna('')
dff['Version_long'] = dff['Version_short'].str.replace('V', 'Version ')
Output:
>>> dff
    col1  col3            Name        Date  Version_short  Version_long
0      1     1  2a df a1asd_V1  2021-06-13             V1     Version 1
1      2    22    xcd a2asd_V3  2021-06-13             V3     Version 3
2      3    33   23vg aabsd_V1  2021-06-13             V1     Version 1
3      4    44  dfgdf_aabsd_V0  2021-06-14             V0     Version 0
4      5    55       a3as d_V1  2021-06-15             V1     Version 1
5     60    60       aa bsd_V3  2021-06-15             V3     Version 3
6      0     1         aasd_V4  2021-06-13             V4     Version 4
7      0     5        aabsd_V4  2021-06-16             V4     Version 4
8      6     6   aa_adn sd_V15  2021-06-13            V15    Version 15
9      3     3             NaN  2021-06-13
10     2     2        aasd_V12  2021-06-13            V12    Version 12
11     4     4      aasd120Abs  2021-06-16
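To see what the pattern captures without pandas, the same regex can be tried with Python's re module. This is a small sketch using a few of the Name values shown above. The helper short_version is a hypothetical name, and treating non-string input as "no match" is an assumption made here to mimic the fillna('') step.

```python
import re

# The pattern used with str.extract above: an underscore, then a captured
# "V" followed by one or more digits, anchored at the end of the string.
VERSION_RE = re.compile(r"_(V\d+)$")


def short_version(name):
    """Return the captured version tag, or '' when there is no match."""
    if not isinstance(name, str):  # missing values such as None or NaN
        return ""
    match = VERSION_RE.search(name)
    return match.group(1) if match else ""


names = ["aasd_V4", "aa_adn sd_V15", "aasd120Abs", None]
short = [short_version(n) for n in names]
long_ = [s.replace("V", "Version ") for s in short]
```

The names without a trailing _V<digits> suffix end up as empty strings in both lists, matching the blank cells in the output above.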
How would I get everything before a : in a string Python
Just use the split method. It returns a list, so you can keep the first element:
>>> s1 = 'Username: How are you today?'
>>> s1.split(':')
['Username', ' How are you today?']
>>> s1.split(':')[0]
'Username'
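If the string may contain more than one colon, or none at all, a couple of variants are worth knowing. The sketch below uses only built-in str methods; the value of s1 is inferred from the output shown above.

```python
s1 = "Username: How are you today?"  # value inferred from the output above

# Limit the split to the first colon so the rest of the text stays in one piece.
before = s1.split(":", 1)[0]

# str.partition always returns a 3-tuple: (head, separator, tail).
head, sep, tail = s1.partition(":")

# With no colon present, partition returns the whole string as the head.
no_colon = "no separator here".partition(":")[0]
```

partition is convenient when you also need to know whether the separator was found: sep is an empty string when it was not.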
Create list based on column value and use that list to extract words from string column in df without overwriting row value with for loop
Adding another answer to show a shorter/simpler way to do what you wanted. (The first one was just to fix what was not working in your code.)
Using .apply(), you can call a modified version of your function per row of df and then do the checking with the street names in df2.
import re

def extract_street(row):
    # street names known for this row's municipality
    street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
    streets_regex = r'\b(' + '|'.join(street_list_mun) + r')\b'
    streets_found = set(re.findall(streets_regex, row['text']))
    return ', '.join(streets_found)
    ## or if you want this to return a list of streets
    # return list(streets_found)

df['street_match'] = df.apply(extract_street, axis=1)
df
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof
Note:
There's an issue with your regex: the join part of the expression generates strings like Plantage\b|Pollux. That gives a match if (a) the last street name is at the beginning of another word or (b) any street name except the last is at the end of another word. For example, "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep" will match both streets, but it shouldn't. Instead, the word boundary \b should be at both ends of the list of options, with parentheses to group them. It should generate strings like \b(Plantage|Pollux)\b. This won't match "Polluxsss" or "NotPlantage". I've made that change in the code above.
I'm using set to get a unique list of street matches. If the line was "I'm going to Pollux, Pollux, Pollux", it would have given the result 3 times instead of just once.
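The difference between the two patterns can be checked directly with re. This is a minimal sketch using the street names and the sentence from the note. Wrapping each name in re.escape is an extra precaution added here, not something in the answer's code, in case a street name contains regex metacharacters.

```python
import re

streets = ["Plantage", "Pollux"]
text = "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep"

# Pattern described in the note as problematic: Plantage\b|Pollux
loose = r"\b|".join(streets)

# Corrected pattern: \b(Plantage|Pollux)\b
strict = r"\b(" + "|".join(re.escape(s) for s in streets) + r")\b"

loose_hits = re.findall(loose, text)    # matches inside the longer words
strict_hits = re.findall(strict, text)  # whole words only, so nothing matches

# On the original sentence the corrected pattern still finds both streets.
clean_hits = re.findall(strict, "I'm going to Plantage, Pollux and Oostvaardersdiep")
```

Because the corrected pattern contains one capturing group, re.findall returns the captured street names rather than the full matches, which is what the set() call in the function relies on.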
How to extract first 8 characters from a string in pandas
You are close; you need indexing with str, which is applied to each value of the Series:
data['Order_Date'] = data['Shipment ID'].str[:8]
For better performance, if there are no NaN values:
data['Order_Date'] = [x[:8] for x in data['Shipment ID']]
print(data)
        Shipment ID  Order_Date
0  20180504-S-20000    20180504
1  20180514-S-20537    20180514
2  20180514-S-20541    20180514
3  20180514-S-20644    20180514
4  20180514-S-20644    20180514
5  20180516-S-20009    20180516
6  20180516-S-20009    20180516
7  20180516-S-20009    20180516
8  20180516-S-20009    20180516
If you omit str, the code selects from the column by position instead, i.e. the first N values:
print(data['Shipment ID'][:2])
0    20180504-S-20000
1    20180514-S-20537
Name: Shipment ID, dtype: object
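The caveat about NaN values matters because slicing a missing value (NaN or None) raises a TypeError in the list comprehension, whereas .str[:8] passes missing values through. Below is a small sketch with a made-up frame; the data is illustrative and not from the question.

```python
import pandas as pd

# Illustrative data with one missing value (not from the original question).
data = pd.DataFrame({"Shipment ID": ["20180504-S-20000", None, "20180516-S-20009"]})

# .str[:8] slices each string and leaves the missing value as missing.
data["Order_Date"] = data["Shipment ID"].str[:8]

# The plain list comprehension fails as soon as it reaches the missing value.
try:
    [x[:8] for x in data["Shipment ID"]]
    comprehension_ok = True
except TypeError:
    comprehension_ok = False
```

So the list comprehension is only a safe speed-up when the column is known to contain strings in every row.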