What's the asymptotic complexity of GroupBy operation?
Grouping can be done in one pass (O(n)) over sorted rows, and sorting the rows costs O(n log n), so the overall complexity of GROUP BY is O(n log n), where n is the number of rows. If there is an index on each column used in the GROUP BY clause, the sort is unnecessary and the complexity drops to O(n).
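As a sketch of the sort-then-scan idea (hypothetical data, with Python's itertools.groupby standing in for the engine's grouping pass):

```python
from itertools import groupby

# Illustrative rows: (key, value) pairs to be grouped and summed.
rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

# Step 1: sort by the grouping key -- O(n log n).
rows.sort(key=lambda r: r[0])

# Step 2: one linear pass over the sorted rows -- O(n).
# Equal keys are now adjacent, so each group is a contiguous run.
totals = {key: sum(v for _, v in group)
          for key, group in groupby(rows, key=lambda r: r[0])}
# totals == {"a": 5, "b": 5}
```

With an index on the grouping column, the rows can be read in key order directly, which skips step 1 and leaves only the O(n) scan.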
python - pandas - Big O complexity of grouping and summing a data frame
Everything depends on the implementation of the sum (is it naive? does it cache things? do lazy evaluation?). In general your loop's complexity is O(N * comp(sum)), or more strictly O(SUM_i comp(sum_i)). With a naive implementation, comp(sum_i) = comp(sum) = O(K), where K is the number of elements in the container. Consequently the whole loop is O(NK).
However, if the sum is always the same between calls (nothing changes in the structure) and you cache between sum calls, you get comp(sum_1) = O(K) and comp(sum_i) = O(1) for i > 1, thus the whole loop is O(N + K). But since you are "refreshing data every iteration" this is not the case. Yet you could still have a data structure which does incremental updates to the sum (since if you modify a single row in the structure, the sum changes in a simple way). Then comp(sum_i) = O(elements modified in the i-th iteration), and if you assume that you modify at most M elements in each iteration and have a .sum operation which is aware of the updates, you get O(NM).
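The incremental-update idea can be sketched as follows (the class and its names are illustrative, not part of pandas):

```python
# Sketch of an incrementally maintained sum. Instead of re-summing
# all K elements after each change (O(K)), we adjust a cached total
# by the delta of each modified element: O(M) per iteration when at
# most M elements change.

class IncrementalSum:
    def __init__(self, values):
        self.values = list(values)
        self.total = sum(self.values)  # one O(K) pass up front

    def update(self, index, new_value):
        # O(1) per modified element: apply the delta to the total.
        self.total += new_value - self.values[index]
        self.values[index] = new_value

    def sum(self):
        # O(1): the running total is already current.
        return self.total

s = IncrementalSum([1, 2, 3, 4])
s.update(0, 10)   # modify a single row
# s.sum() == 19, without re-scanning the container
```

This is the structure that would turn the loop into O(NM) rather than O(NK).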
As far as I can tell, pandas' .sum is the naive method, thus it will have O(NK) complexity (assuming that your container has at most K elements). However, if your container grows, for example you add D elements in every iteration, then comp(sum_i) = O(K + i*D) and the whole loop becomes O(SUM_i comp(sum_i)) = O(N(K + D(N+1)/2)), which is quadratic in N.
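The growing-container case can be checked numerically; the counter below tallies how many elements each naive sum call touches (an illustration of the formula, not pandas itself):

```python
# Count the work done by a naive sum over a container that starts
# with K elements and gains D elements every iteration, over N
# iterations of the loop.
K, D, N = 5, 3, 10

work = 0
for i in range(1, N + 1):
    size = K + i * D   # container has grown by i*D elements
    work += size       # naive sum touches every element: O(K + i*D)

# Closed form of SUM_i (K + i*D): N*K + D*N*(N+1)/2 -- quadratic in N.
assert work == N * K + D * N * (N + 1) // 2
```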
Asymptotic behaviour of Scala methods
Performance characteristics of the other methods are really difficult to assert. Consider the following:
- These methods are all implemented in terms of foreach or iterator, and usually at very high levels in the hierarchy. Vector's map is implemented on collection.TraversableLike, for example. To add insult to injury, which method implementation is used depends on the linearization of the class inheritance. This also applies to any method called as a helper. It has happened before that changes here caused unforeseen performance problems. Since foreach and iterator are both O(n), any improved performance depends on specialization at other methods, such as size and slice.
- For many of them, there's a further dependency on the performance characteristics of the builder that was provided, which depends on the call site instead of the definition site.
So the result is that the place where a method is defined -- and documented -- does not have nearly enough information to state its performance characteristics, which may depend not only on how other methods are implemented by the inheriting collection, but even on the performance characteristics of an object, a Builder obtained from CanBuildFrom, that is passed at the call site.
At best, any such documentation would have to be described in terms of other methods. Which doesn't mean it isn't worthwhile, but it isn't easily done -- and hard tasks on open source projects depend on volunteers, who usually work on what they like, not what is needed.
What guarantees are there on the run-time complexity (Big-O) of LINQ methods?
There are very, very few guarantees, but there are a few optimizations:
- Extension methods that use indexed access, such as ElementAt, Skip, Last or LastOrDefault, will check to see whether or not the underlying type implements IList&lt;T&gt;, so that you get O(1) access instead of O(N).
- The Count method checks for an ICollection implementation, so that this operation is O(1) instead of O(N).
- Distinct, GroupBy, Join, and I believe also the set-aggregation methods (Union, Intersect and Except) use hashing, so they should be close to O(N) instead of O(N²).
- Contains checks for an ICollection implementation, so it may be O(1) if the underlying collection is also O(1), such as a HashSet&lt;T&gt;, but this depends on the actual data structure and is not guaranteed. Hash sets override the Contains method, which is why they are O(1).
- OrderBy methods use a stable quicksort, so they're O(N log N) average case.
I think that covers most if not all of the built-in extension methods. There really are very few performance guarantees; LINQ itself will try to take advantage of efficient data structures, but it isn't a free pass to write potentially inefficient code.
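The hashing idea behind Distinct and GroupBy can be sketched in Python (a rough analogue of the technique, not the LINQ implementation):

```python
from collections import defaultdict

# Hash-based grouping: one pass, expected O(N), versus the O(N^2)
# alternative of scanning all existing groups for every element.
def group_by(items, key):
    groups = defaultdict(list)
    for item in items:                   # N iterations
        groups[key(item)].append(item)   # expected O(1) hash lookup
    return dict(groups)

# Hash-based deduplication: expected O(1) membership test per element.
def distinct(items):
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

evens_odds = group_by([1, 2, 3, 4, 5, 6], key=lambda x: x % 2)
# evens_odds == {1: [1, 3, 5], 0: [2, 4, 6]}
unique = distinct([3, 1, 3, 2, 1])
# unique == [3, 1, 2]
```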
The Time Complexity of Built-In SQL Functions Such as sum, count, avg
In SQL the math-function complexity of aggregates is totally irrelevant. The only thing that really matters is the data access complexity: what access path is chosen (table scan, index range scan, index seek etc.) and how many pages are read. There may be slight differences in the internals of each aggregate, but they all work pretty much the same way (keep running state and compute the running aggregation for each input value), and there is absolutely NO aggregate that looks at its input twice, so they are all O(n) in their internal implementation, where n is the number of records fed to the aggregate (not necessarily the number of records in the table!).
Some aggregates have internal shortcuts, e.g. COUNT(*) may return the count from metadata on some systems, if possible.
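The "keep running state" pattern can be illustrated with a single-pass aggregate in Python (a sketch of the general mechanism, not any engine's internals):

```python
# Each aggregate keeps a small running state and folds every input
# value into it exactly once -- O(n) in the number of rows fed to it.
def aggregate_pass(values):
    count = 0
    total = 0
    for v in values:   # every row is looked at exactly once
        count += 1
        total += v
    avg = total / count if count else None
    return count, total, avg

# COUNT, SUM and AVG all fall out of the same single scan.
result = aggregate_pass([10, 20, 30])
# result == (3, 60, 20.0)
```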
What is the time complexity of .NET List.Sort()?
http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx
This method uses Array.Sort, which uses the QuickSort algorithm. This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal.
On average, this method is an O(n log n) operation, where n is Count; in the worst case it is an O(n ^ 2) operation.