Interview - Detect/Remove Duplicate Entries

delete f
from
(
    select ROW_NUMBER() over (
        partition by
            YourFirstPossibleDuplicateField,
            YourSecondPossibleDuplicateField
        order by WhateverFieldYouWantSortedBy
    ) as DelId
    from YourTable
) as f
where DelId > 1

Cracking the Coding Interview 2.1: remove duplicate values from a singly linked list (JavaScript)

Looks like you end up defining a node that references itself within your else condition.

You may be looking for something like this:

class LinkedListNode {
    constructor(value) {
        this.value = value;
        this.next = null;
    }
}

let head = new LinkedListNode("head");
let x = [1, 1, 1, 1, 4, 10, 10, 3, 10, 9, 5, 5, 5, 8];

// Build the list by appending each value after the dummy head node
for (let ele of x) {
    let y = new LinkedListNode(ele);
    let pointer = head;
    while (pointer.next != null) {
        pointer = pointer.next;
    }
    pointer.next = y;
}

// Single pass: remember values already seen and splice out repeats
function removeDup(currentNode = head) {
    const seen = {};
    let lastUnique;
    while (currentNode) {
        if (seen[currentNode.value]) {
            lastUnique.next = currentNode.next;
        } else {
            seen[currentNode.value] = true;
            lastUnique = currentNode;
        }
        currentNode = currentNode.next;
    }
}

removeDup(head);

// Print the values, skipping the dummy head node
let outputNode = head;
while (outputNode) {
    outputNode = outputNode.next;
    if (outputNode) {
        console.log(outputNode.value);
    }
}

Find duplicates in a large file

The key is that your data will not fit into memory. You can use external merge sort for this:

Partition your file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates, which are now neighboring elements.

Merge the chunks and again eliminate the duplicates while merging. Since you will have an n-way merge here, you can keep the next k elements from each chunk in memory; once the items for a chunk are depleted (they have been merged already), grab more from disk.
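
A minimal sketch of that approach in Python, assuming the input is a text file with one value per line and that chunk_size lines fit comfortably in memory (the file paths, function name, and chunk_size are placeholders, not from the original question):

import heapq
import os
import tempfile
from itertools import islice

def dedupe_large_file(in_path, out_path, chunk_size=1_000_000):
    chunk_paths = []
    # Phase 1: read, sort, and de-duplicate each chunk, then spill it to disk.
    with open(in_path) as src:
        while True:
            lines = list(islice(src, chunk_size))
            if not lines:
                break
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as chunk:
                chunk.writelines(sorted(set(lines)))
            chunk_paths.append(path)
    # Phase 2: n-way merge of the sorted chunks; duplicates across chunks
    # now arrive as neighbors in the merged stream, so keep only the first.
    chunks = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as dst:
            previous = None
            for line in heapq.merge(*chunks):
                if line != previous:
                    dst.write(line)
                    previous = line
    finally:
        for f in chunks:
            f.close()
        for p in chunk_paths:
            os.remove(p)

heapq.merge pulls lazily from each chunk file, which corresponds to keeping only the next few elements of each chunk in memory at any time.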

How to remove duplicate entries from a list in Python

Although your hash-table implementation could be made more concise, readable, and idiomatic, with a small boost in speed, I suspect that isn't what your interviewer was disappointed in.

More likely, he pushed you for a better solution in hopes that you would come up with an argument why your solution was effectively optimal, but instead, you searched for a while and gave up.
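
For reference (the interview code itself isn't shown in the question), a hash-table based de-duplication along the lines being discussed might look like this order-preserving sketch:

def dedupe(values):
    seen = set()           # hash table of values encountered so far
    result = []
    for v in values:
        if v not in seen:  # amortized O(1) membership test
            seen.add(v)
            result.append(v)
    return result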

So, there are multiple things to prove here:

  1. Any solution to this problem has to be O(N) time.
  2. Your solution is (amortized, average-and-almost-always) O(N) time.
  3. The multiplier in your solution's time complexity is reasonable.
  4. Any solution to this problem that's O(N) time has to be O(M) space (where M is the number of distinct values).
  5. Your solution is O(M) space.
  6. The multiplier in your solution's space complexity is reasonable.

Even for the easy ones, you're not going to come up with an actual rigorous proof in an interview. And for some of them, you may not even be able to make a compelling case; raising possible exceptions and acknowledging where you're waving your hands may be sufficient. For example:

  • Python's dict (and set) has O(N) worst-case time; it's only O(1) amortized average-case. Is there anything about your data that would make it likely to be worse than O(1)? Probably not, but… what if this were part of a server that someone wanted to DoS, and they could send any input data they wanted?
  • All of the values he gave you were small integers. Is that guaranteed to be true? If so, don't use a dict for your counts; just use a flat list of length P, where P is the max value (see the sketch after this list). Then it's O(P) space, which sounds worse than O(M), except that the multiplier will be a lot smaller: a list entry takes about a third of the space (just values, rather than hashes, keys, and values), so if P << M/3 it's a win. It's also probably a win on speed, because there's no need to keep hashing values. And you can do even better with array.array.
  • Python's general-purpose hash table is overkill for a set, or for a dict of small counts. Could a custom hash table cut things significantly, or not enough to be worth it?
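
As an illustration of the small-integer idea above (the max_value bound and function name are assumptions for this sketch, not something from the original question), counting with array.array instead of a dict might look like:

from array import array

def duplicates_small_ints(values, max_value):
    # O(P) space with P = max_value + 1, but each slot is a bare machine
    # integer rather than a hashed key/value pair.
    counts = array("L", [0]) * (max_value + 1)
    for v in values:
        counts[v] += 1
    return [v for v, c in enumerate(counts) if c > 1]

# e.g. duplicates_small_ints([1, 1, 4, 10, 10, 3], 10) -> [1, 10]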

Remove duplicate rows on a SQL query

SELECT DISTINCT item FROM myTable WHERE something = 3

Remove duplicates from an unsorted linked list, Cracking the coding interview

How would your code cope with a list of 1,000,000 unique items?

Assuming no compiler or runtime optimisation, your code would overflow the stack.

This is why a tail-recursive solution like this is better rewritten as a loop. If you did that then, I think, it would look like the answer in the book.
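
To illustrate the rewrite (in Python for brevity; the node class and function names are made up here, not taken from the question or the book), the same traversal expressed tail-recursively and as a loop:

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

# Tail-recursive: one stack frame per node, so a 1,000,000-item list
# exhausts the call stack in runtimes without tail-call optimisation.
def remove_dups_recursive(prev, node, seen):
    if node is None:
        return
    if node.value in seen:
        prev.next = node.next
    else:
        seen.add(node.value)
        prev = node
    remove_dups_recursive(prev, node.next, seen)

# The same logic as a loop: constant stack depth, same O(n) time.
def remove_dups_iterative(head):
    seen = set()
    prev, node = None, head
    while node:
        if node.value in seen:
            prev.next = node.next
        else:
            seen.add(node.value)
            prev = node
        node = node.next

# usage: remove_dups_recursive(None, head, set())  or  remove_dups_iterative(head)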

Interview question: remove duplicates from an unsorted linked list

If you give a person a fish, they eat for a day. If you teach a person to fish...

My measures for the quality of an implementation are:

  • Correctness: If you aren't getting the right answer in all cases, then it isn't ready
  • Readability/maintainability: Look at code repetition, understandable names, the number of lines of code per block/method (and the number of things each block does), and how difficult it is to trace the flow of your code. Look at any number of books focused on refactoring, programming best-practices, coding standards, etc, if you want more information on this.
  • Theoretical performance (worst-case and amortized): Big-O is a metric you can use. CPU and memory consumption should both be measured
  • Complexity: Estimate how long it would take an average professional programmer to implement (if they already know the algorithm). See if that is in line with how difficult the problem actually is

As for your implementation:

  • Correctness: I suggest writing unit tests to determine this for yourself and/or debugging it (on paper) from start to finish with interesting sample/edge cases. Null, one item, two items, various numbers of duplicates, etc
  • Readability/maintainability: It looks mostly fine, though your last two comments don't add anything. It is a bit more obvious what your code does than the code in the book
  • Performance: I believe both are N-squared. Whether the amortized cost is lower on one or the other I'll let you figure out :)
  • Time to implement: An average professional should be able to code this algorithm in their sleep, so looking good

Algo to find duplicates in a very large array

One thing to keep in mind is that O-notation doesn't necessarily tell you what algorithm is fastest. If one algorithm is O(n log n) and another algorithm is O(n²), then there is some value M such that the first algorithm is faster for all n > M. But M could be much larger than the amount of data you'll ever have to deal with.

The reason I'm bringing this up is that I think a HashSet is probably still the best answer, although I'd have to profile it to find out for sure. Assuming that you aren't allowed to set up a hash table with 10 million buckets, you may still be able to set up a reasonable-sized table. Say you can create a HashSet with table size 100,000. The buckets will then be sets of objects. If n is the size of the array, the average bucket size will be n / 100000. So to see if an element is already in the HashSet, and add it if not, will take a fixed amount of time to compute the hash value, plus O(n) to search the elements in the bucket if they're stored in a linear list(*).

Technically, this means that the algorithm to find all duplicates is O(n²). But since one of the n's in n² is for a linear list that is so much smaller than the array size (by a factor of 100000), it seems likely to me that it will still take much less time than an O(n log n) sort for 10 million items. The value of M, the point at which the O(n log n) sort becomes faster, is likely to be much, much larger than that. (I'm just guessing, though; to find out for certain would require some profiling.)

I'd tend to lean against using a sort anyway, because if all you need to do is find duplicates, a sort is doing more work than you need. You shouldn't need to put the elements in order, just to find duplicates. That to me suggests that a sort is not likely to be the best answer.

(*) Note that in Java 8, the elements in each bucket will be in some kind of search tree, probably a red-black tree, instead of in a linear list. So the algorithm will still be O(n log n), and still probably lots faster than a sort.
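
A rough sketch of that bucketed idea, written in Python rather than Java purely for illustration (the 100,000-bucket figure comes from the discussion above; everything else here is assumed):

def find_duplicates(values, num_buckets=100_000):
    # Fixed number of buckets; each bucket is a short linear list, so a
    # membership check scans only about len(values) / num_buckets items.
    buckets = [[] for _ in range(num_buckets)]
    duplicates = set()
    for v in values:
        bucket = buckets[hash(v) % num_buckets]
        if v in bucket:
            duplicates.add(v)
        else:
            bucket.append(v)
    return duplicates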


