How to Grab a Value of a Column That Is Set as a String

Spark dataframe get column value into a string variable

col("name") gives you a column expression, not the data itself. To extract the data from column "name", filter the DataFrame, select the column by name, and collect the rows:

val names = test.filter(test("id").equalTo("200"))
                .select("name")
                .collectAsList() // returns a java.util.List[Row]

Then, for a Row in that list, you can get the name as a String:

val row = names.get(0) // first matching row; throws if the list is empty
val name = row.getString(0)

Extract column value based on another column in Pandas

You can use loc to get the Series of values that satisfy your condition, and then iloc to get the first element:

In [2]: df
Out[2]:
    A  B
0  p1  1
1  p1  2
2  p3  3
3  p2  4

In [3]: df.loc[df['B'] == 3, 'A']
Out[3]:
2    p3
Name: A, dtype: object

In [4]: df.loc[df['B'] == 3, 'A'].iloc[0]
Out[4]: 'p3'
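
One thing to watch with this approach: .iloc[0] raises an IndexError when no row satisfies the condition. A minimal sketch of a guarded lookup, using the same data as above (first_match is a hypothetical helper name, not part of pandas):

```python
import pandas as pd

# Same data as the example above.
df = pd.DataFrame({'A': ['p1', 'p1', 'p3', 'p2'], 'B': [1, 2, 3, 4]})

def first_match(frame, value, default=None):
    """Return the first 'A' whose 'B' equals value, or default if none match.

    Hypothetical helper for illustration; not a pandas function.
    """
    matches = frame.loc[frame['B'] == value, 'A']
    # .iloc[0] raises IndexError on an empty Series, so check first.
    return matches.iloc[0] if not matches.empty else default

found = first_match(df, 3)     # a row with B == 3 exists
missing = first_match(df, 99)  # no row has B == 99
```

Here found holds the matching value and missing falls back to the default instead of raising.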

How to efficiently grab data based on string value of a row

What you are looking for is groupby.

Suppose that you have the following DataFrame:

julia> df = DataFrame(Country=rand([:A,:B,:C],7), year=rand(2000:2020,7), tax=rand(7))
7×3 DataFrame
│ Row │ Country │ year  │ tax      │
│     │ Symbol  │ Int64 │ Float64  │
├─────┼─────────┼───────┼──────────┤
│ 1   │ A       │ 2014  │ 0.913118 │
│ 2   │ C       │ 2003  │ 0.894182 │
│ 3   │ A       │ 2018  │ 0.917585 │
│ 4   │ C       │ 2011  │ 0.869531 │
│ 5   │ A       │ 2011  │ 0.45841  │
│ 6   │ B       │ 2001  │ 0.808954 │
│ 7   │ B       │ 2008  │ 0.969813 │

You can collect information by each country:

dfg = groupby(df, :Country);

and now:

julia> dfg[1]
3×3 SubDataFrame
│ Row │ Country │ year  │ tax      │
│     │ Symbol  │ Int64 │ Float64  │
├─────┼─────────┼───────┼──────────┤
│ 1   │ A       │ 2014  │ 0.913118 │
│ 2   │ A       │ 2018  │ 0.917585 │
│ 3   │ A       │ 2011  │ 0.45841  │

julia> dfg[2]
2×3 SubDataFrame
│ Row │ Country │ year  │ tax      │
│     │ Symbol  │ Int64 │ Float64  │
├─────┼─────────┼───────┼──────────┤
│ 1   │ C       │ 2003  │ 0.894182 │
│ 2   │ C       │ 2011  │ 0.869531 │

julia> dfg[3]
2×3 SubDataFrame
│ Row │ Country │ year  │ tax      │
│     │ Symbol  │ Int64 │ Float64  │
├─────┼─────────┼───────┼──────────┤
│ 1   │ B       │ 2001  │ 0.808954 │
│ 2   │ B       │ 2008  │ 0.969813 │

Note that for faster search it is better to use Symbols than Strings. You can always use the vectorized Symbol.() constructor to convert any column of Strings.
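
For readers working in pandas rather than Julia, the same group-then-look-up idea can be sketched as follows. This is an illustrative translation, not part of the original answer; the values are copied from the table above, and Country is stored as plain strings.

```python
import pandas as pd

# Values copied from the Julia example above.
df = pd.DataFrame({
    'Country': ['A', 'C', 'A', 'C', 'A', 'B', 'B'],
    'year': [2014, 2003, 2018, 2011, 2011, 2001, 2008],
    'tax': [0.913118, 0.894182, 0.917585, 0.869531, 0.45841, 0.808954, 0.969813],
})

# Group once, then fetch any country's rows by key.
groups = df.groupby('Country')
country_a = groups.get_group('A')
years_a = country_a['year'].tolist()
rows_b = len(groups.get_group('B'))
```

Grouping once and reusing the result avoids re-scanning the whole frame for every country.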

How to extract part of a string in Pandas column and make a new column

Use str.extract with a regex and str.replace to rename values:

dff['Version_short'] = dff['Name'].str.extract(r'_(V\d+)$', expand=False).fillna('')
dff['Version_long'] = dff['Version_short'].str.replace('V', 'Version ')

Output:

>>> dff
    col1  col3            Name        Date Version_short Version_long
0      1     1  2a df a1asd_V1  2021-06-13            V1    Version 1
1      2    22    xcd a2asd_V3  2021-06-13            V3    Version 3
2      3    33   23vg aabsd_V1  2021-06-13            V1    Version 1
3      4    44  dfgdf_aabsd_V0  2021-06-14            V0    Version 0
4      5    55      a3as  d_V1  2021-06-15            V1    Version 1
5     60    60       aa bsd_V3  2021-06-15            V3    Version 3
6      0     1         aasd_V4  2021-06-13            V4    Version 4
7      0     5        aabsd_V4  2021-06-16            V4    Version 4
8      6     6   aa_adn sd_V15  2021-06-13           V15   Version 15
9      3     3             NaN  2021-06-13                           
10     2     2        aasd_V12  2021-06-13           V12   Version 12
11     4     4      aasd120Abs  2021-06-16
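
To see what the regex captures in isolation, here is a small sketch using Python's re module instead of pandas. It mirrors the two steps above element by element; short_version and long_version are hypothetical helper names, and the sample values are taken from the Name column shown in the output.

```python
import re

VERSION_RE = re.compile(r'_(V\d+)$')  # same pattern as in the answer

def short_version(name):
    """Return the trailing 'V<digits>' token, or '' when there is none."""
    if not isinstance(name, str):  # e.g. NaN in the pandas column
        return ''
    match = VERSION_RE.search(name)
    return match.group(1) if match else ''

def long_version(short):
    """Expand 'V1' to 'Version 1'; an empty string stays empty."""
    return short.replace('V', 'Version ')

samples = ['2a df a1asd_V1', 'aa_adn sd_V15', 'aasd120Abs', float('nan')]
shorts = [short_version(s) for s in samples]
longs = [long_version(s) for s in shorts]
```

The $ anchor is what keeps a name like 'aasd120Abs' from producing a match: the version token has to sit at the very end of the string, right after an underscore.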

How would I get everything before a : in a string in Python

Just use the split function. It returns a list, so you can keep the first element:

>>> s1 = 'Username: How are you today?'
>>> s1.split(':')
['Username', ' How are you today?']
>>> s1.split(':')[0]
'Username'
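
If the text after the first colon can itself contain colons, or the string may have no colon at all, str.partition or a split limited to one cut may be more convenient. A short sketch (s2 and s3 are made-up sample strings):

```python
s1 = 'Username: How are you today?'

# partition cuts at the first ':' only and always returns three parts.
before, sep, after = s1.partition(':')

# split with maxsplit=1 also stops after the first ':'.
s2 = 'Username: the time is 10:30'
first_only = s2.split(':', 1)[0]
rest = s2.split(':', 1)[1]

# With no ':' present, partition returns the whole string and an empty separator.
s3 = 'no colon here'
head, sep3, tail = s3.partition(':')
```

Checking whether the separator came back empty is an easy way to tell "no colon found" apart from "nothing before the colon".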

Create list based on column value and use that list to extract words from string column in df without overwriting row value with for loop

Adding another answer to show a shorter/simpler way to do what you wanted. (The first one was just to fix what was not working in your code.)

Using .apply(), you can call a modified version of your function on each row of df and then check the text against the street names in df2.

import re

def extract_street(row):
    # Street names known for this row's municipality
    street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
    # Whole-word alternation; re.escape guards against regex metacharacters in names
    streets_regex = r'\b(' + '|'.join(map(re.escape, street_list_mun)) + r')\b'
    streets_found = set(re.findall(streets_regex, row['text']))
    return ', '.join(streets_found)
    ## or if you want this to return a list of streets
    # return list(streets_found)

df['street_match'] = df.apply(extract_street, axis=1)
df

Output:

  municipality                                                text      street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  
2       Almere                     I'm not going to the Balijelaan                  
3      Utrecht                  I'm not going to Socrateshof today                  
4       Huizen              Next week I'll be going to Socrateshof       Socrateshof

Note:

There's an issue with your regex: the join part of the expression generates strings like Plantage\b|Pollux. This gives a match if (a) the last street name is at the beginning of another word, or (b) any street name except the last is at the end of another word. For example, "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep" will match both streets, but it shouldn't. Instead, the word boundary \b should be at both ends of the list of options, with parentheses to group them. It should generate strings like \b(Plantage|Pollux)\b. This won't match "Polluxsss" or "NotPlantage". I've made that change in the code above.
I'm using set to get a unique list of street matches. If the line were "I'm going to Pollux, Pollux, Pollux", it would otherwise have given the result 3 times instead of just once.
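
The difference between the two pattern shapes can be checked directly with the re module. This is a small sketch; the loose pattern only reproduces the string shape quoted in the note (Plantage\b|Pollux), not necessarily the exact original code.

```python
import re

streets = ['Plantage', 'Pollux']

# Shape quoted in the note: boundaries only between the options.
loose = re.compile(r'\b|'.join(streets))  # Plantage\b|Pollux
# Corrected shape: one boundary on each side of the grouped options.
strict = re.compile(r'\b(' + '|'.join(map(re.escape, streets)) + r')\b')

bad_text = "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep"
good_text = "I'm going to Plantage, Pollux and Oostvaardersdiep"

loose_hits = loose.findall(bad_text)   # false positives inside longer words
strict_hits = strict.findall(bad_text) # nothing: no whole-word match
good_hits = strict.findall(good_text)  # both real street names
```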

How to extract first 8 characters from a string in pandas

You are close. You need indexing with str, which is applied to each value of the Series:

data['Order_Date'] = data['Shipment ID'].str[:8]

For better performance, if there are no NaN values:

data['Order_Date'] = [x[:8] for x in data['Shipment ID']]

print (data)
        Shipment ID Order_Date
0  20180504-S-20000   20180504
1  20180514-S-20537   20180514
2  20180514-S-20541   20180514
3  20180514-S-20644   20180514
4  20180514-S-20644   20180514
5  20180516-S-20009   20180516
6  20180516-S-20009   20180516
7  20180516-S-20009   20180516
8  20180516-S-20009   20180516

If you omit str, the code selects from the column by position instead, returning the first N values:

print (data['Shipment ID'][:2])
0    20180504-S-20000
1    20180514-S-20537
Name: Shipment ID, dtype: object
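
The "no NaN values" condition on the list comprehension matters because a missing value is a float, which cannot be sliced. A minimal sketch with a made-up three-row frame (the middle value is deliberately missing):

```python
import pandas as pd

# Hypothetical sample: the second Shipment ID is missing.
data = pd.DataFrame(
    {'Shipment ID': ['20180504-S-20000', float('nan'), '20180514-S-20537']}
)

# .str[:8] slices each string and passes the missing value through.
sliced = data['Shipment ID'].str[:8]

# The list comprehension slices every element directly, so the NaN fails.
try:
    [x[:8] for x in data['Shipment ID']]
    comprehension_failed = False
except TypeError:
    comprehension_failed = True
```

So the comprehension is only a safe shortcut when the column is known to be complete; otherwise .str[:8] is the more forgiving choice.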