Summarize the List into a Comma-Separated String

Use:

declare @t table(Number int, Grade varchar)

insert @t values(1, 'a'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c'),
(3, 'b'), (3, 'a')

select t1.Number
     , stuff((
           select ',' + Grade
           from @t t2
           where t2.Number = t1.Number
           for xml path(''), type
       ).value('.', 'varchar(max)'), 1, 1, '') [values]
from @t t1
group by t1.Number
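
For the sample data this should return one row per Number with the grades collapsed (the order of grades within each group is not guaranteed, since the inner query has no ORDER BY):

Number  values
1       a,c
2       a,b,c
3       b,a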

Capturing comma inside text of comma separated values

I would suggest you follow the advice given in the comments and not use regex.

However, if you did need to do this using regex, the following should do the trick:

(.*?)=("?)([^"]+?)\2(?:,|$)
  • (.*?)= Captures the key to the left of the = sign. It only captures one key because the ? makes it match as few characters as possible.
  • ("?) Captures whether or not the value is in quotes.
  • ([^"]+?)\2(?:,|$)
    • ([^"]+?) Captures one or more characters that are not ", but as few as possible.
    • \2(?:,|$) This ends the value: \2 matches the closing quote if the value started with one (otherwise it matches nothing), and the match then stops at the next comma or at the end of the string.

Test online
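
If you want to see the pattern in action, here is a minimal Python sketch (the sample string and key names are made up for illustration):

import re

pattern = r'(.*?)=("?)([^"]+?)\2(?:,|$)'
sample = 'a=1,b="2,3",c=4'   # hypothetical input: the quoted value contains a comma

# each match is a (key, quote-or-empty, value) tuple
print(re.findall(pattern, sample))
# [('a', '', '1'), ('b', '"', '2,3'), ('c', '', '4')]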

Sum values in a comma-separated string

This looks a bit ugly but should work, assuming column QTY is stored as character:

your_df$QTY_new <- sapply(strsplit(your_df$QTY, ", "), function(x) sum(as.numeric(x)))
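
For comparison, the same idea in pandas (an equivalent sketch, not part of the original R answer; the frame and column names are assumptions):

import pandas as pd

# hypothetical frame with comma-separated quantities stored as strings
your_df = pd.DataFrame({'QTY': ['1, 2, 3', '10, 20']})

# split on ", ", convert each piece to a number, and sum per row
your_df['QTY_new'] = your_df['QTY'].str.split(', ').apply(lambda xs: sum(map(float, xs)))
print(your_df['QTY_new'].tolist())   # [6.0, 30.0]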

Dataframe: Cell Level: Convert Comma Separated String to List

  • Use pandas.Series.str.split to split the string into a list.
# use str split on the column
df.mgrs_grids = df.mgrs_grids.str.split(',')

# display(df)
driver_code journey_code mgrs_grids
0 7211863 7211863-140 [18TWL927129, 18TWL888113, 18TWL888113, 18TWL887113, 18TWL888113, 18TWL887113, 18TWL887113, 18TWL887113, 18TWL903128]
1 7211863 7211863-105 [18TWL927129, 18TWL939112, 18TWL939112, 18TWL939113, 18TWL939113, 18TWL939113, 18TWL939113, 18TWL939113, 18TWL939113, 18TWL960111, 18TWL960112]
2 7211863 7211863-50 [18TWL927129, 18TWL889085, 18TWL889085, 18TWL888085, 18TWL888085, 18TWL888085, 18TWL888085, 18TWL888085, 18TWL890085]
3 7211863 7211863-109 [18TWL927129, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952105, 18TWL951103]

print(type(df.loc[0, 'mgrs_grids']))
[out]:
<class 'list'>
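
For reference, here is a minimal, self-contained reconstruction of the raw input frame (only two rows, with the grid strings shortened; values taken from the display above). The str.split call shown above can then be applied to it:

import pandas as pd

# abbreviated version of the frame shown above
df = pd.DataFrame({
    'driver_code': ['7211863', '7211863'],
    'journey_code': ['7211863-140', '7211863-105'],
    'mgrs_grids': ['18TWL927129,18TWL888113,18TWL887113',
                   '18TWL927129,18TWL939112,18TWL939113'],
})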

separate row per value

  • After creating a column of lists.
  • Use pandas.DataFrame.explode to create separate rows for each value in the list.
# get a separate row for each value
df = df.explode('mgrs_grids').reset_index(drop=True)

# display(df.head())
driver_code journey_code mgrs_grids
0 7211863 7211863-140 18TWL927129
1 7211863 7211863-140 18TWL888113
2 7211863 7211863-140 18TWL888113
3 7211863 7211863-140 18TWL887113
4 7211863 7211863-140 18TWL888113

Update

  • Here is another option, which combines the 'journey_code' to the front of 'mgrs_grids', and then splits the string into a list.
    • This list is assigned back to 'mgrs_grids', but can also be assigned to a new column.
# add the journey code to mgrs_grids and then split
df.mgrs_grids = (df.journey_code + ',' + df.mgrs_grids).str.split(',')

# display(df.head())
driver_code journey_code mgrs_grids
0 7211863 7211863-140 [7211863-140, 18TWL927129, 18TWL888113, 18TWL888113, 18TWL887113, 18TWL888113, 18TWL887113, 18TWL887113, 18TWL887113, 18TWL903128]
1 7211863 7211863-105 [7211863-105, 18TWL927129, 18TWL939112, 18TWL939112, 18TWL939113, 18TWL939113, 18TWL939113, 18TWL939113, 18TWL939113, 18TWL939113, 18TWL960111, 18TWL960112]
2 7211863 7211863-50 [7211863-50, 18TWL927129, 18TWL889085, 18TWL889085, 18TWL888085, 18TWL888085, 18TWL888085, 18TWL888085, 18TWL888085, 18TWL890085]
3 7211863 7211863-109 [7211863-109, 18TWL927129, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952106, 18TWL952105, 18TWL951103]

# output to nested list
df.mgrs_grids.tolist()

[out]:
[['7211863-140', '18TWL927129', '18TWL888113', '18TWL888113', '18TWL887113', '18TWL888113', '18TWL887113', '18TWL887113', '18TWL887113', '18TWL903128'],
['7211863-105', '18TWL927129', '18TWL939112', '18TWL939112', '18TWL939113', '18TWL939113', '18TWL939113', '18TWL939113', '18TWL939113', '18TWL939113', '18TWL960111', '18TWL960112'],
['7211863-50', '18TWL927129', '18TWL889085', '18TWL889085', '18TWL888085', '18TWL888085', '18TWL888085', '18TWL888085', '18TWL888085', '18TWL890085'],
['7211863-109', '18TWL927129', '18TWL952106', '18TWL952106', '18TWL952106', '18TWL952106', '18TWL952106', '18TWL952106', '18TWL952106', '18TWL952105', '18TWL951103']]

Collapse / concatenate / aggregate a column to a single comma separated string within each group

Here are some options using toString, a function that concatenates a vector of strings using comma and space to separate components. If you don't want commas, you can use paste() with the collapse argument instead.

data.table

# alternative using data.table
library(data.table)
as.data.table(data)[, toString(C), by = list(A, B)]

aggregate

This uses no packages:

# alternative using aggregate from the stats package in the core of R
aggregate(C ~., data, toString)

sqldf

And here is an alternative using the SQL function group_concat via the sqldf package:

library(sqldf)
sqldf("select A, B, group_concat(C) C from data group by A, B", method = "raw")

dplyr

A dplyr alternative:

library(dplyr)
data %>%
  group_by(A, B) %>%
  summarise(test = toString(C)) %>%
  ungroup()

plyr

# plyr
library(plyr)
ddply(data, .(A,B), summarize, C = toString(C))
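
And, for readers coming from Python, the pandas equivalent of the same collapse (not part of the original R answer; the column names mirror the ones used above):

import pandas as pd

data = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y'], 'C': ['a', 'b', 'c']})

# collapse C into one comma-separated string per (A, B) group
out = data.groupby(['A', 'B'])['C'].apply(', '.join).reset_index()
print(out)
#    A  B     C
# 0  1  x  a, b
# 1  2  y     c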

Sum values using lookup for comma-separated list in one of the columns in Excel

This formula splits the comma-separated cell into its parts, uses SUMIFS to look up the number for each part, and lets SUMPRODUCT add up the results:

=SUMPRODUCT(SUMIFS(B:B,A:A,TRIM(MID(SUBSTITUTE(E2,",",REPT(" ",999)),(ROW($ZZ$1:INDEX($ZZ:$ZZ,LEN(E2)-LEN(SUBSTITUTE(E2,",",""))+1))-1)*999+1,999))))

No VBA or named-range workarounds are needed.
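
The logic of the formula, expressed as a short Python sketch (the lookup table and the cell contents are made-up illustrations):

# hypothetical lookup: column A keys mapped to column B numbers
lookup = {'apple': 3, 'pear': 5, 'plum': 2}

cell = 'apple,plum'   # comma-separated keys, as in E2

# split the cell, look each part up, and add the results -
# the same thing SUMPRODUCT(SUMIFS(...)) does above
total = sum(lookup[part.strip()] for part in cell.split(','))
print(total)   # 5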


comma separated field and comma separated search data

You are currently comparing two arrays using ANY, which is incorrect here.

If you want to see if the two arrays are identical in their contents, you can simply remove ANY:

string_to_array(data,',') = (string_to_array('25,2,3,15', ',')::character varying[])

If you want to see if there is overlap between the two arrays, you use &&:

string_to_array(data,',') && (string_to_array('25,2,3,15', ',')::character varying[])

Additional operators are available here: https://www.postgresql.org/docs/current/static/functions-array.html
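
If it helps to see the difference in semantics, here is a rough Python analogy (array = is order-sensitive, so list equality is the closer comparison; && corresponds to set overlap):

left = '25,2,3'.split(',')           # e.g. the value stored in data
right = '25,2,3,15'.split(',')       # the search list

# "=" asks whether the two arrays are identical (same elements, same order)
print(left == right)                 # False

# "&&" asks whether the arrays share at least one element
print(bool(set(left) & set(right)))  # True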

Comma separated results in SQL

Update (As suggested by @Aaron in the comment)

STRING_AGG is the preferred way of doing this in the modern versions of SQL Server (2017 or later). It also supports easy ordering.

SELECT
      STUDENTNUMBER
    , STRING_AGG(INSTITUTIONNAME, ', ') AS StringAggList
    , STRING_AGG(INSTITUTIONNAME, ', ') WITHIN GROUP (ORDER BY INSTITUTIONNAME DESC) AS StringAggListDesc
FROM Education E
GROUP BY E.STUDENTNUMBER;

Original Answer:

Use FOR XML PATH('') to convert the entries to a comma separated string, and STUFF() to trim off the leading comma, as follows. This gives you the same comma separated result:

SELECT
    STUFF((SELECT ',' + INSTITUTIONNAME
           FROM EDUCATION EE
           WHERE EE.STUDENTNUMBER = E.STUDENTNUMBER
           ORDER BY sortOrder
           FOR XML PATH(''), TYPE).value('text()[1]', 'nvarchar(max)')
          , 1, LEN(','), '') AS XmlPathList
FROM EDUCATION E
GROUP BY E.STUDENTNUMBER

Here is the FIDDLE showing results for both STRING_AGG and FOR XML PATH('').

Reverse summary to expand comma separated strings in dataframe

data.frame(
  group = c("cat", "dog", "horse"),
  value = c("1", "2", "3"),
  list = c("siamese,burmese,balinese", "corgi,sheltie,collie", "arabian,friesian,andalusian"),
  stringsAsFactors = FALSE
) -> xdf

tidyverse:

tidyr::separate_rows(xdf, list, sep=",")
##   group value       list
## 1   cat     1    siamese
## 2   cat     1    burmese
## 3   cat     1   balinese
## 4   dog     2      corgi
## 5   dog     2    sheltie
## 6   dog     2     collie
## 7 horse     3    arabian
## 8 horse     3   friesian
## 9 horse     3 andalusian

Base R:

do.call(
  rbind.data.frame,
  lapply(1:nrow(xdf), function(idx) {
    data.frame(
      group = xdf[idx, "group"],
      value = xdf[idx, "value"],
      list = strsplit(xdf[idx, "list"], ",")[[1]],
      stringsAsFactors = FALSE
    )
  })
)
##   group value       list
## 1   cat     1    siamese
## 2   cat     1    burmese
## 3   cat     1   balinese
## 4   dog     2      corgi
## 5   dog     2    sheltie
## 6   dog     2     collie
## 7 horse     3    arabian
## 8 horse     3   friesian
## 9 horse     3 andalusian

The shootout:

microbenchmark::microbenchmark(

  unnest = transform(xdf, list = strsplit(list, ",")) %>%
    tidyr::unnest(list),

  separate_rows = tidyr::separate_rows(xdf, list, sep = ","),

  base = do.call(
    rbind.data.frame,
    lapply(1:nrow(xdf), function(idx) {
      data.frame(
        group = xdf[idx, "group"],
        value = xdf[idx, "value"],
        list = strsplit(xdf[idx, "list"], ",")[[1]],
        stringsAsFactors = FALSE
      )
    })
  )
)
## Unit: microseconds
##          expr      min        lq     mean   median        uq       max neval
##        unnest 3689.890 4280.7045 6326.231 4881.160  6428.508 16670.715   100
## separate_rows 5093.618 5602.2510 8479.712 6289.193 10352.847 24447.528   100
##          base  872.343  975.1615 1589.915 1099.391  1660.324  6663.132   100

I'm constantly surprised at the horrible performance of tidyr ops.


