What's the Asymptotic Complexity of the GroupBy Operation?

What's the asymptotic complexity of the GroupBy operation?

Grouping can be done in one pass (O(n)) over sorted rows, and sorting costs O(n log n), so the complexity of GROUP BY is O(n log n), where n is the number of rows. If there is an index on each column used in the GROUP BY clause, the sorting is not necessary and the complexity is O(n).
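To make the two strategies concrete, here is a minimal Python sketch (group_by_sort and group_by_hash are illustrative names, not any engine's API); the hash-based variant stands in for the "no sort needed" case:

    from collections import defaultdict

    def group_by_sort(rows, key):
        """Sort-based grouping: O(n log n) sort plus one O(n) pass."""
        groups = []
        for row in sorted(rows, key=key):              # O(n log n)
            if groups and key(groups[-1][0]) == key(row):
                groups[-1].append(row)                 # same key: extend current group
            else:
                groups.append([row])                   # new key: open a new group
        return groups

    def group_by_hash(rows, key):
        """Hash-based grouping: O(n) expected, no sorting required."""
        groups = defaultdict(list)
        for row in rows:                               # single pass
            groups[key(row)].append(row)
        return groups

    rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
    print(group_by_sort(rows, key=lambda r: r[0]))
    print(dict(group_by_hash(rows, key=lambda r: r[0])))

Both produce the same groups; the difference is whether the O(n log n) sort is paid up front.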

python - pandas - Big O complexity of grouping and summing a data frame

Everything depends on the implementation of sum (is it naive? does it cache things? does it do lazy evaluation?). But in general, the complexity of your loop is:

O(N * comp(sum))

or, more strictly,

O(SUM_i comp(sum_i))

Now, with a naive implementation,

comp(sum_i) = comp(sum) = O(K)

where K is the number of elements in the container. Consequently, the whole loop is O(NK).

However, if the sum is always the same between calls (nothing changes in the structure) and you cache it between calls, you get

comp(sum_1) = O(K)
comp(sum_i) = O(1) for i > 1

thus the whole loop is O(N + K). But since you are "refreshing data every iteration", this is not the case. Still, you could use a data structure that does incremental updates to the sum (since if you modify a single row in the structure, the sum changes in a simple way). Then you could have

comp(sum_i) = O(elements_modified_in_ith_iteration)

and then, if you assume that you modify at most M elements in each iteration and have a .sum operation that is aware of the updates, you get O(NM).
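pandas has no such structure built in; the following is a hypothetical Python sketch of what an update-aware container could look like (RunningSum is an invented name):

    class RunningSum:
        """Hypothetical container whose .sum() is aware of updates.

        Building it costs O(K); each single-row update is O(1), so a
        call to .sum() after M modifications has paid O(M), not O(K).
        """
        def __init__(self, values):
            self._values = list(values)
            self._total = sum(self._values)    # paid once: O(K)

        def update(self, i, new_value):
            # Adjust the cached total incrementally: O(1) per modified row.
            self._total += new_value - self._values[i]
            self._values[i] = new_value

        def sum(self):
            return self._total                 # O(1), always current

With it, a loop of N iterations that each modify at most M rows costs O(K + NM) rather than O(NK).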

As far as I can tell, pandas' .sum is the naive method, so it will have O(NK) complexity (assuming that your container has at most K elements). However, if your container grows, for example you add D elements in every iteration, then you get

comp(sum_i) = O(K + i*D)

and whole loop becomes

O(SUM_i comp(sum_i)) = O(N(K + D(N+1)/2))

which is quadratic in N.
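A rough way to observe that quadratic growth empirically (a plain Python list stands in for the container; absolute timings will of course vary by machine):

    import time

    def naive_loop(n, d):
        """Grow the container by d elements per iteration and re-sum it
        naively each time; total work is O(d * n^2 / 2), quadratic in n."""
        data = []
        t0 = time.perf_counter()
        for _ in range(n):
            data.extend([1.0] * d)    # container grows by D each iteration
            _ = sum(data)             # naive sum re-reads everything: O(K + i*D)
        return time.perf_counter() - t0

    for n in (1000, 2000, 4000):
        print(n, round(naive_loop(n, d=10), 3))

Doubling n should roughly quadruple the elapsed time, consistent with the dominant N² term.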

Asymptotic behaviour of Scala methods

The performance characteristics of the other methods are really difficult to assert. Consider the following:

  • These methods are all implemented based on foreach or iterator, usually at very high levels in the hierarchy. Vector's map is implemented on collection.TraversableLike, for example.
    To add insult to injury, which method implementation is used depends on the linearization of the class inheritance. This also applies to any method called as a helper. It has happened before that changes here caused unforeseen performance problems.
    Since foreach and iterator are both O(n), any improved performance depends on specialization in other methods, such as size and slice.
  • For many of them, there's a further dependency on the performance characteristics of the builder that was provided, which depends on the call site rather than the definition site.

So the result is that the place where the method is defined -- and documented -- does not have nearly enough information to state its performance characteristics. These may depend not only on how other methods are implemented by the inheriting collection, but even on the performance characteristics of an object, Builder, obtained from CanBuildFrom, that is passed at the call site.
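The layering is easy to see in miniature; here is an illustrative Python transliteration (the class names only echo Scala's; this is not the actual library code):

    class TraversableLike:
        """Analogue of a high-level trait: map is written once against
        iteration, so its baseline cost is O(n) for every subclass."""
        def __iter__(self):
            raise NotImplementedError

        def map(self, f, builder=list):
            # Cost = O(n) iteration plus whatever the builder charges per
            # append, which is decided by the caller, not here.
            out = builder()
            for x in self:
                out.append(f(x))
            return out

    class MyVector(TraversableLike):
        """A concrete collection that inherits map instead of specializing it."""
        def __init__(self, items):
            self._items = list(items)
        def __iter__(self):
            return iter(self._items)

    print(MyVector([1, 2, 3]).map(lambda x: x * 2))    # [2, 4, 6]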

At best, any such documentation would be described in terms of other methods, which doesn't mean it isn't worthwhile, but it isn't easily done. And hard tasks on open source projects depend on volunteers, who usually work on what they like, not on what is needed.

What guarantees are there on the run-time complexity (Big-O) of LINQ methods?

There are very, very few guarantees, but there are a few optimizations:

  • Extension methods that use indexed access, such as ElementAt, Skip, Last or LastOrDefault, will check whether the underlying type implements IList<T>, so that you get O(1) access instead of O(N).

  • The Count method checks for an ICollection implementation, so that this operation is O(1) instead of O(N).

  • Distinct, GroupBy, Join, and I believe also the set-aggregation methods (Union, Intersect and Except) use hashing, so they should be close to O(N) instead of O(N²).

  • Contains checks for an ICollection implementation, so it may be O(1) if the underlying collection is also O(1), such as a HashSet<T>, but this depends on the actual data structure and is not guaranteed. Hash sets override the Contains method; that's why they are O(1).

  • OrderBy methods use a stable quicksort, so they're O(N log N) average case.

I think that covers most, if not all, of the built-in extension methods. There really are very few performance guarantees; LINQ itself will try to take advantage of efficient data structures, but it isn't a free pass to write potentially inefficient code.
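The IList<T> fast-path check is easy to sketch outside C#; here is an illustrative Python analogue (element_at is an invented name, not a real LINQ or Python API):

    from collections.abc import Sequence

    def element_at(source, index):
        # Fast path: indexed access is O(1) (the analogue of checking IList<T>).
        if isinstance(source, Sequence):
            return source[index]
        # Slow path: enumerate up to the requested position, O(N).
        for i, item in enumerate(source):
            if i == index:
                return item
        raise IndexError(index)

    print(element_at([10, 20, 30], 2))               # O(1) path
    print(element_at((x * x for x in range(5)), 2))  # O(N) path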

The Time Complexity of Built-in SQL Functions such as sum, count, avg

In SQL, the math-function complexity of aggregates is totally irrelevant. The only thing that really matters is the data access complexity: what access path is chosen (table scan, index range scan, index seek, etc.) and how many pages are read. There may be slight differences in the internals of each aggregate, but they all work pretty much the same way (keep running state and compute the running aggregation for each input value), and there is absolutely NO aggregate that looks at its input twice. So they are all O(n) in their internal implementation, where n is the number of records fed to the aggregate (not necessarily the number of records in the table!).

Some aggregates have internal shortcuts, e.g. COUNT(*) may return the count from metadata on some systems, if possible.
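A minimal Python sketch of that "running state, single pass" pattern (illustrative only; real engines do this inside their execution operators):

    def streaming_aggregates(values):
        """Each aggregate keeps running state and never looks at the
        input twice, so the whole thing is O(n) in the rows fed to it."""
        count, total, minimum = 0, 0, None
        for v in values:    # single scan: data access dominates the cost
            count += 1
            total += v
            if minimum is None or v < minimum:
                minimum = v
        avg = total / count if count else None
        return {"count": count, "sum": total, "avg": avg, "min": minimum}

    print(streaming_aggregates([3, 1, 4, 1, 5]))
    # {'count': 5, 'sum': 14, 'avg': 2.8, 'min': 1}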

What is the time complexity of .NET List.Sort()?

http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx

This method uses Array.Sort, which uses the QuickSort algorithm. This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal.

On average, this method is an O(n log n) operation, where n is Count; in the worst case it is an O(n ^ 2) operation.
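To see where that worst case comes from, here is a minimal, non-optimized quicksort in Python (illustrative only, not the Array.Sort implementation): with the first element as pivot, already-sorted input yields maximally unbalanced partitions.

    def quicksort(xs):
        """Naive quicksort, first element as pivot: O(n log n) on average,
        but already-sorted input degrades it to O(n^2) because one side
        of every partition is empty."""
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return quicksort(left) + [pivot] + quicksort(right)

    print(quicksort([5, 3, 8, 1, 9, 2]))    # [1, 2, 3, 5, 8, 9]

Production implementations mitigate this with better pivot selection, but the quoted worst-case bound still stands.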


