Why Is Split(' ') Trying to Be (Too) Smart

It's consistent with Perl's split() behavior, which in turn is based on GNU awk's split(). So it's a long-standing tradition with origins in Unix.

From the perldoc on split:

As another special case, split emulates the default behavior of the
command line tool awk when the PATTERN is either omitted or a literal
string composed of a single space character (such as ' ' or "\x20" ,
but not e.g. / / ). In this case, any leading whitespace in EXPR is
removed before splitting occurs, and the PATTERN is instead treated as
if it were /\s+/ ; in particular, this means that any contiguous
whitespace (not just a single space character) is used as a separator.
However, this special treatment can be avoided by specifying the
pattern / / instead of the string " " , thereby allowing only a single
space character to be a separator.
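Python, incidentally, draws the same distinction, just with different spelling: str.split() with no argument gives the awk-style behavior (strip leading whitespace, treat any run of whitespace as one separator), while str.split(' ') splits on every single space literally:

```python
s = "  leading   and  trailing  "

# No argument: awk-style -- surrounding whitespace is ignored and
# any run of whitespace acts as a single separator.
print(s.split())      # ['leading', 'and', 'trailing']

# Literal ' ': every single space is a separator, so runs of
# spaces produce empty strings in the result.
print(s.split(' '))   # ['', '', 'leading', '', '', 'and', '', 'trailing', '', '']
```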

Using `split` on columns too slow - how can I get better performance?

This answer is outdated, as is this question. The problem identified below was fixed some time ago. The pandas str.split method should now be fast.

It turns out that the str.split in Pandas (in core/strings.py as str_split) is actually very slow; it's no more efficient than plain Python, since it still iterates in Python and offers no speedup whatsoever.

Actually, see below. Pandas performance on this is simply miserable; it's not just Python vs C iteration, as doing the same thing with Python lists is the fastest method!

Interestingly, though, there's a trick solution that's much faster: writing the Series out to text, and then reading it in again, with '.' as the separator:

df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.read_table(StringIO(df['ip'].to_csv(None, index=None)), sep='.', header=None)

To compare, I use Marius' code and generate 20,000 ips:

import pandas as pd
import random
import numpy as np
from StringIO import StringIO

def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))

df = pd.DataFrame({'ip': [make_ip() for i in range(20000)]})

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')
# 1 loops, best of 3: 3.06 s per loop

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].apply(lambda x: pd.Series(x.split('.')))
# 1 loops, best of 3: 3.1 s per loop

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.read_table(StringIO(df['ip'].to_csv(None, index=None)), sep='.', header=None)
# 10 loops, best of 3: 46.4 ms per loop

OK, so I wanted to compare all of this to just using a Python list and Python's own split, which I assumed would be slower than the more efficient Pandas:

iplist = list(df['ip'])
%timeit [ x.split('.') for x in iplist ]
100 loops, best of 3: 10 ms per loop

What!? Apparently, the best way to do a simple string operation on a large number of strings is to throw out Pandas entirely. Using Pandas makes the process 400 times slower. If you want to use Pandas, though, you may as well just convert to a Python list and back:

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.DataFrame([ x.split('.') for x in list(df['ip']) ])
# 100 loops, best of 3: 18.4 ms per loop

There's something very wrong here.
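For the record, in modern pandas (the old return_type argument is long gone) the idiomatic way to do this is str.split with expand=True, which now performs reasonably; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'ip': ['192.168.0.1', '10.0.0.255']})

# expand=True returns a DataFrame with one column per field;
# a single-character pattern like '.' is treated as a literal separator.
df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].str.split('.', expand=True)
print(df)
```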

Split Strings into words with multiple word boundary delimiters

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

split string only on first instance of specified character

Use capturing parentheses:

'good_luck_buddy'.split(/_(.*)/s)
['good', 'luck_buddy', ''] // ignore the third element

Capturing parentheses in a separator are specified like this:

If separator contains capturing parentheses, matched results are returned in the array.

So in this case we want to split at _.* (i.e. the split separator is a substring starting with _), but also let the result contain part of our separator (i.e. everything after the _).

In this example our separator (matching _(.*)) is _luck_buddy and the captured group (within the separator) is luck_buddy. Without the capturing parentheses, luck_buddy (matched by .*) would not have been included in the result array, since with a plain split the separators are not included in the result.

We use the s regex flag so that . matches newline (\n) characters as well; otherwise the split would stop at the first newline.
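For comparison, Python offers the same two routes: str.split with a maxsplit argument, or re.split with a capturing group (Python's re.split likewise keeps captured groups in the result). A quick sketch:

```python
import re

s = 'good_luck_buddy'

# Simplest route: limit the number of splits.
print(s.split('_', 1))                         # ['good', 'luck_buddy']

# Regex route: the captured group is kept in the result,
# just as in JavaScript. re.DOTALL makes . match newlines too.
print(re.split(r'_(.*)', s, flags=re.DOTALL))  # ['good', 'luck_buddy', '']
```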

How do I split a string with multiple separators in JavaScript?

Pass in a regexp as the parameter:

js> "Hello awesome, world!".split(/[\s,]+/)
Hello,awesome,world!

Edited to add:

You can get the last element by selecting the length of the array minus 1:

js> bits = "Hello awesome, world!".split(/[\s,]+/)
["Hello", "awesome", "world!"]
js> bit = bits[bits.length - 1]
"world!"

... and if the pattern doesn't match:

js> bits = "Hello awesome, world!".split(/foo/)
["Hello awesome, world!"]
js> bits[bits.length - 1]
"Hello awesome, world!"

STRING_SPLIT skyrockets execution time of sql query

The problem is that any explicit list of values in an IN operator is translated into a chain of ORs in the WHERE clause by the algebrizer, before the query is optimized. A large number of values in the IN operator will always hurt performance, no matter how you write it.

By creating a temporary table, you get a different query execution plan that will boost performance.

So try it this way:

SELECT DISTINCT CAST(value AS INTEGER)
INTO   #T
FROM   STRING_SPLIT(@inputStr, ',');

SELECT surname,
       firstname
FROM   Addresses
WHERE  id IN (SELECT value FROM #T);

You can also add a UNIQUE index to the temp table to improve performance further:

SELECT DISTINCT CAST("value" AS INTEGER) 
INTO #T
FROM STRING_SPLIT(@inputStr, ',');

CREATE UNIQUE INDEX X123456789 ON #T ("value");

SELECT surname,
firstname
FROM Addresses
WHERE id IN (SELECT value
FROM #T);

Attribute Error: 'list' object has no attribute 'split'

I think you've actually got a wider confusion here.

The initial error is that you're trying to call split on the whole list of lines, and you can't split a list of strings, only a string. So, you need to split each line, not the whole thing.

And then you're doing for points in Type, expecting each such points to give you a new x and y. But that isn't going to happen. Type is just the list of fields from a single line, so first points will be one field, then the next, and then you'll be done. So, again, you need to loop over each line and get the x and y values from each line, not loop over the fields of a single line.

So, everything has to go inside a loop over every line in the file, and do the split into x and y once for each line. Like this:

def getQuakeData():
    filename = input("Please enter the quake file: ")
    readfile = open(filename, "r")

    for line in readfile:
        Type = line.split(",")
        x = Type[1]
        y = Type[2]
        print(x, y)

getQuakeData()

As a side note, you really should close the file, ideally with a with statement, but I'll get to that at the end.


Interestingly, the problem here isn't that you're being too much of a newbie, but that you're trying to solve the problem in the same abstract way an expert would, and just don't know the details yet. This is completely doable; you just have to be explicit about mapping the functionality, rather than just doing it implicitly. Something like this:

def getQuakeData():
    filename = input("Please enter the quake file: ")
    readfile = open(filename, "r")
    readlines = readfile.readlines()
    Types = [line.split(",") for line in readlines]
    xs = [Type[1] for Type in Types]
    ys = [Type[2] for Type in Types]
    for x, y in zip(xs, ys):
        print(x, y)

getQuakeData()

Or, a better way to write that might be:

def getQuakeData():
    filename = input("Please enter the quake file: ")
    # Use with to make sure the file gets closed
    with open(filename, "r") as readfile:
        # no need for readlines; the file is already an iterable of lines
        # also, using generator expressions means no extra copies
        types = (line.split(",") for line in readfile)
        # iterate tuples, instead of two separate iterables, so no need for zip
        xys = ((type[1], type[2]) for type in types)
        for x, y in xys:
            print(x, y)

getQuakeData()

Finally, you may want to take a look at NumPy and Pandas, libraries which do give you a way to implicitly map functionality over a whole array or frame of data almost the same way you were trying to.
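As a sketch of that last point (assuming the same comma-separated layout, with x and y in the same column positions used above), pandas can do the per-line split implicitly:

```python
import io
import pandas as pd

# Stand-in for the quake file; a real file would be read with
# pd.read_csv(filename, header=None) instead.
data = io.StringIO("a,1.0,2.0\nb,3.0,4.0\n")

df = pd.read_csv(data, header=None)
# Columns 1 and 2 are the x and y fields, as in the code above.
for x, y in zip(df[1], df[2]):
    print(x, y)
```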


