How to Grab a Value of a Column That Is Set as a String

Spark dataframe get column value into a string variable

col("name") gives you a column expression, not the data itself. To extract the data from column "name", select the column by its name instead of wrapping it in col("name"), then collect the result:

val names = test.filter(test("id").equalTo("200"))
  .select("name")
  .collectAsList() // returns a java.util.List[Row]

Then, for each row in that list, you can get the name as a String with:

val name = row.getString(0)

Extract column value based on another column in Pandas

You can use loc to get the Series of values that satisfy your condition, and then iloc to get the first element:

In [2]: df
Out[2]:
    A  B
0  p1  1
1  p1  2
2  p3  3
3  p2  4

In [3]: df.loc[df['B'] == 3, 'A']
Out[3]:
2    p3
Name: A, dtype: object

In [4]: df.loc[df['B'] == 3, 'A'].iloc[0]
Out[4]: 'p3'
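
Note that .iloc[0] raises an IndexError when no row satisfies the condition. A minimal sketch of a guarded lookup, assuming the same df as above; the helper name first_match and its default parameter are hypothetical and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({"A": ["p1", "p1", "p3", "p2"], "B": [1, 2, 3, 4]})

def first_match(frame, key, default=None):
    """Return the first value of column A where column B equals key."""
    matches = frame.loc[frame["B"] == key, "A"]
    # An empty selection would make .iloc[0] raise IndexError, so check first.
    return matches.iloc[0] if not matches.empty else default

found = first_match(df, 3)     # a row with B == 3 exists
missing = first_match(df, 99)  # no row has B == 99, so the default is returned
```

Whether a missing match should return a default or raise depends on the caller; the default here is only one possible choice.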

How to efficiently grab data based on string value of a row

What you are looking for is groupby.

Suppose that you have the following DataFrame:

julia> df = DataFrame(Country=rand([:A,:B,:C],7), year=rand(2000:2020,7), tax=rand(7))
7×3 DataFrame
│ Row │ Country │ year  │ tax      │
│     │ Symbol  │ Int64 │ Float64  │
├─────┼─────────┼───────┼──────────┤
│ 1   │ A       │ 2014  │ 0.913118 │
│ 2   │ C       │ 2003  │ 0.894182 │
│ 3   │ A       │ 2018  │ 0.917585 │
│ 4   │ C       │ 2011  │ 0.869531 │
│ 5   │ A       │ 2011  │ 0.45841  │
│ 6   │ B       │ 2001  │ 0.808954 │
│ 7   │ B       │ 2008  │ 0.969813 │

You can collect information by each country:

dfg = groupby(df, :Country);

and now:

julia> dfg[1]
3×3 SubDataFrame
│ Row │ Country │ year  │ tax      │
│     │ Symbol  │ Int64 │ Float64  │
├─────┼─────────┼───────┼──────────┤
│ 1   │ A       │ 2014  │ 0.913118 │
│ 2   │ A       │ 2018  │ 0.917585 │
│ 3   │ A       │ 2011  │ 0.45841  │

julia> dfg[2]
2×3 SubDataFrame
│ Row │ Country │ year  │ tax      │
│     │ Symbol  │ Int64 │ Float64  │
├─────┼─────────┼───────┼──────────┤
│ 1   │ C       │ 2003  │ 0.894182 │
│ 2   │ C       │ 2011  │ 0.869531 │

julia> dfg[3]
2×3 SubDataFrame
│ Row │ Country │ year  │ tax      │
│     │ Symbol  │ Int64 │ Float64  │
├─────┼─────────┼───────┼──────────┤
│ 1   │ B       │ 2001  │ 0.808954 │
│ 2   │ B       │ 2008  │ 0.969813 │

Note that for faster search it is better to use Symbols than Strings. You can always use the vectorized Symbol.() constructor to convert any column of Strings.

How to extract part of a string in Pandas column and make a new column

Use str.extract with a regex and str.replace to rename values:

dff['Version_short'] = dff['Name'].str.extract(r'_(V\d+)$', expand=False).fillna('')
dff['Version_long'] = dff['Version_short'].str.replace('V', 'Version ', regex=False)

Output:

>>> dff
    col1  col3            Name        Date  Version_short  Version_long
0      1     1  2a df a1asd_V1  2021-06-13             V1     Version 1
1      2    22    xcd a2asd_V3  2021-06-13             V3     Version 3
2      3    33   23vg aabsd_V1  2021-06-13             V1     Version 1
3      4    44  dfgdf_aabsd_V0  2021-06-14             V0     Version 0
4      5    55       a3as d_V1  2021-06-15             V1     Version 1
5     60    60       aa bsd_V3  2021-06-15             V3     Version 3
6      0     1         aasd_V4  2021-06-13             V4     Version 4
7      0     5        aabsd_V4  2021-06-16             V4     Version 4
8      6     6   aa_adn sd_V15  2021-06-13            V15    Version 15
9      3     3             NaN  2021-06-13
10     2     2        aasd_V12  2021-06-13            V12    Version 12
11     4     4      aasd120Abs  2021-06-16
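
To see what the pattern captures on its own, here is a small sketch that applies the same regex with the standard re module instead of pandas. The helper name version_suffix is hypothetical, and the sample names are taken from the output above:

```python
import re

# Same pattern as in the str.extract call: an underscore, then "V" plus digits,
# anchored at the end of the string. Only the "V<digits>" part is captured.
pattern = re.compile(r"_(V\d+)$")

def version_suffix(name):
    """Return the captured version tag, or '' when the name has no such suffix."""
    match = pattern.search(name)
    return match.group(1) if match else ""

short_tags = [version_suffix(n) for n in ["aasd_V12", "dfgdf_aabsd_V0", "aasd120Abs"]]
long_tags = [tag.replace("V", "Version ") for tag in short_tags]
```

A name without the suffix, such as "aasd120Abs", yields an empty string, which mirrors the fillna('') step in the pandas version.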

How would I get everything before a : in a string Python

Just use the split function. It returns a list, so you can keep the first element:

>>> s1 = 'Username: How are you today?'
>>> s1.split(':')
['Username', ' How are you today?']
>>> s1.split(':')[0]
'Username'
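
If the text after the first colon can itself contain colons, or the colon may be absent, it can help to limit the split or use str.partition. A brief sketch; the strings s2 and no_colon are made-up examples, not from the question:

```python
s2 = "Username: the time is 10:30"
no_colon = "no separator here"

# maxsplit=1 stops splitting after the first colon.
before_split = s2.split(":", 1)[0]

# partition always returns a 3-tuple: (head, separator, tail).
before, sep, after = s2.partition(":")

# With no colon present, partition returns the whole string and an empty
# separator, which lets the caller detect the missing delimiter.
head, found_sep, _ = no_colon.partition(":")
```

Taking element [0] works the same with or without maxsplit; the limit mainly matters when the remainder of the string is also needed intact.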

Create list based on column value and use that list to extract words from string column in df without overwriting row value with for loop

Adding another answer to show a shorter/simpler way to do what you wanted. (The first one was just to fix what was not working in your code.)

Using .apply(), you can call a modified version of your function on each row of df and then check the text against the street names in df2.

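import re

def extract_street(row):
    # Street names known for this row's municipality
    street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
    # Escape each name so regex metacharacters in a street name are matched literally
    streets_regex = r'\b(' + '|'.join(map(re.escape, street_list_mun)) + r')\b'
    streets_found = set(re.findall(streets_regex, row['text']))
    # Sort so the joined result has a stable order
    return ', '.join(sorted(streets_found))
    ## or if you want this to return a list of streets
    # return list(streets_found)

df['street_match'] = df.apply(extract_street, axis=1)
df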

Output:

   municipality                                                text      street_match
0           Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1       Utrecht               Tomorrow I'm going to Hoog Catharijne
2        Almere                     I'm not going to the Balijelaan
3       Utrecht                  I'm not going to Socrateshof today
4        Huizen              Next week I'll be going to Socrateshof       Socrateshof

Note:

  1. There's an issue with your regex: the join part of the expression generates strings like Plantage\b|Pollux. That gives a match if (a) the last street name is at the beginning of another word, or (b) any street name except the last is at the end of another word. For example, "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep" matches both streets, but it shouldn't. Instead, the word boundary \b should sit at both ends of the list of options, with parentheses grouping them, so that it generates strings like \b(Plantage|Pollux)\b. This won't match "Polluxsss" or "NotPlantage". I've made that change in the code above.

  2. I'm using set to get a unique list of street matches. Without it, the line "I'm going to Pollux, Pollux, Pollux" would have given the result 3 times instead of just once.
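
The two notes can be checked with a short standalone sketch. The loose pattern below is built to reproduce the Plantage\b|Pollux shape described in note 1; how the original code assembled it is not shown, so that join is an assumption:

```python
import re

streets = ["Plantage", "Pollux"]
text = "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep"

# Boundary only between the alternatives: Plantage\b|Pollux
loose = re.compile(r"\b|".join(map(re.escape, streets)))
# Boundaries around the whole group: \b(Plantage|Pollux)\b
strict = re.compile(r"\b(" + "|".join(map(re.escape, streets)) + r")\b")

loose_hits = loose.findall(text)    # matches inside NotPlantage and Polluxsss
strict_hits = strict.findall(text)  # no whole-word street name in this text

# findall reports every occurrence, so repeated names appear repeatedly
# until they are collapsed with set().
repeated_hits = strict.findall("I'm going to Pollux, Pollux, Pollux")
unique_hits = set(repeated_hits)
```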

How to extract first 8 characters from a string in pandas

You are close; you need indexing with str, which is applied to each value of the Series:

data['Order_Date'] = data['Shipment ID'].str[:8]

For better performance, if there are no NaN values:

data['Order_Date'] = [x[:8] for x in data['Shipment ID']]

print (data)
        Shipment ID  Order_Date
0  20180504-S-20000    20180504
1  20180514-S-20537    20180514
2  20180514-S-20541    20180514
3  20180514-S-20644    20180514
4  20180514-S-20644    20180514
5  20180516-S-20009    20180516
6  20180516-S-20009    20180516
7  20180516-S-20009    20180516
8  20180516-S-20009    20180516

If you omit str, the code slices the column by position instead and returns the first N values, like:

print (data['Shipment ID'][:2])
0    20180504-S-20000
1    20180514-S-20537
Name: Shipment ID, dtype: object
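
The "no NaN values" condition matters because a plain x[:8] in a list comprehension raises a TypeError on a missing value, while the .str accessor leaves it missing. A minimal sketch, assuming a small made-up frame with one missing Shipment ID; the guarded comprehension is one possible workaround, not the only one:

```python
import pandas as pd

data = pd.DataFrame({"Shipment ID": ["20180504-S-20000", None, "20180516-S-20009"]})

# The .str accessor skips missing values and leaves them missing.
via_str = data["Shipment ID"].str[:8]

# A guarded comprehension: slice only real strings, pass anything else through.
via_list = [x[:8] if isinstance(x, str) else x for x in data["Shipment ID"]]
```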

