Sqlalchemy, Prevent Duplicate Rows

How to avoid inserting duplicate entries when adding values via a sqlalchemy relationship?

The SQLAlchemy wiki has a collection of examples, one of which is how you might check uniqueness of instances.

The examples are a bit convoluted, though. The idea is to add a classmethod, get_unique, as an alternate constructor: it first checks a per-session cache, then queries the database for an existing instance, and only creates a new instance if neither lookup finds one. You then call Language.get_unique(id, name) instead of Language(id, name).
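As a rough illustration of that pattern, here is a minimal sketch rather than the wiki's actual recipe. It assumes a Language model identified by a unique name, takes the session as an explicit argument, and keeps the cache in session.info; those choices, and the "unique_cache" key, are assumptions made for this example.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Language(Base):
    __tablename__ = "language"

    id = Column(Integer, primary_key=True)
    name = Column(String(50), unique=True, nullable=False)

    @classmethod
    def get_unique(cls, session, name):
        # 1. Check the per-session cache (hypothetical key in session.info).
        cache = session.info.setdefault("unique_cache", {})
        key = (cls, name)
        obj = cache.get(key)
        if obj is None:
            # 2. Query the database for an existing row.
            obj = session.query(cls).filter_by(name=name).first()
            if obj is None:
                # 3. Nothing found: create and register a new instance.
                obj = cls(name=name)
                session.add(obj)
            cache[key] = obj
        return obj


engine = create_engine("sqlite://")  # in-memory database, for the demo only
Base.metadata.create_all(engine)
session = Session(bind=engine)

first = Language.get_unique(session, "Python")
second = Language.get_unique(session, "Python")
session.commit()

language_count = session.query(Language).count()
```

Both calls return the same object, so only one row is inserted. The unique constraint on name remains the real safeguard, since the cache only covers a single session.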

I've written a more detailed answer in response to OP's bounty on another question.

Preventing duplicate entries with sqlalchemy in preexisting sqlite table

The code snippet below works on my side with Python 2.7, SQLAlchemy 1.0.9, and SQLite 3.15.2.

from sqlalchemy import create_engine, MetaData, Column, Integer, Table, Text
from sqlalchemy.exc import IntegrityError


class DynamicSQLlitePipeline(object):

    def __init__(self, table_name):
        db_path = "sqlite:///data.db"
        _engine = create_engine(db_path)
        _connection = _engine.connect()
        _metadata = MetaData()
        _stack_items = Table(table_name, _metadata,
                             Column("id", Integer, primary_key=True),
                             Column("case", Text, unique=True),)
        _metadata.create_all(_engine)
        self.connection = _connection
        self.stack_items = _stack_items

    def process_item(self, item):

        try:
            ins_query = self.stack_items.insert().values(case=item['case'])
            self.connection.execute(ins_query)
        except IntegrityError:
            print('THIS IS A DUP')
        return item

if __name__ == '__main__':

    d = DynamicSQLlitePipeline("pipeline")
    item = {
        'case': 'sdjwaichjkneirjpewjcmelkdfpoewrjlkxncdsd'
    }
    print(d.process_item(item))

On the second run, the output looks like this:

THIS IS A DUP
{'case': 'sdjwaichjkneirjpewjcmelkdfpoewrjlkxncdsd'}

I don't see much difference between your code's logic and mine, so my guess is that the only difference is the versions in use.

SQLalchemy Avoid duplicate in session() before committing

It'd seem that you might be better off "deduplicating" in your application:

seen = set()

# Reversed so that the last row wins.
for row in reversed(database):
    c_hash = row['c_hash']
    if c_hash not in seen:
        session.merge(Mytable(hash=c_hash,
                              date=row['date'],
                              text=row['text']))
        seen.add(c_hash)

In theory you could let SQLAlchemy handle the deduplication as well:

for row in database:
    session.merge(Mytable(hash=row['c_hash'],
                          date=row['date'],
                          text=row['text']))
    session.flush()

The trick is to flush after each merge, so that later merges consult the DB and find the existing row. This performs more queries than the first solution.
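For a runnable version of the first approach, the sketch below assumes a Mytable model whose primary key is the hash column, and it uses an in-memory SQLite database with made-up rows. The column types and the sample data are assumptions, not taken from the question.

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Mytable(Base):
    __tablename__ = "mytable"

    hash = Column(String(64), primary_key=True)
    date = Column(String(10))
    text = Column(String(200))


# Hypothetical input: the hash "a1" appears twice.
database = [
    {"c_hash": "a1", "date": "2017-01-01", "text": "first"},
    {"c_hash": "b2", "date": "2017-01-02", "text": "other"},
    {"c_hash": "a1", "date": "2017-01-03", "text": "latest"},
]

engine = create_engine("sqlite://")  # in-memory database, for the demo only
Base.metadata.create_all(engine)
session = Session(bind=engine)

seen = set()
# Reversed so that the last row for each hash wins.
for row in reversed(database):
    c_hash = row["c_hash"]
    if c_hash not in seen:
        session.merge(Mytable(hash=c_hash,
                              date=row["date"],
                              text=row["text"]))
        seen.add(c_hash)
session.commit()

# Map each stored hash to its text to inspect the result.
stored = {m.hash: m.text for m in session.query(Mytable)}
```

Here stored ends up with one entry per hash, and the entry for "a1" carries the text of its last occurrence in the input.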

Stop Inserting into Table when Duplicate Value Detected Flask SQLAlchemy

Your model should have a unique index on whatever criteria define a duplicate. Columns are not unique by default, which you seem to assume (judging by unique=False in a column and the comments). Either replace the auto-incrementing surrogate key with a "natural" key, such as the id provided by Twitter, or make the text column tweet unique.
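A minimal sketch of the second option follows. It uses plain SQLAlchemy declarative mapping in place of Flask-SQLAlchemy's db.Model so that it runs on its own. The column names mirror those used with TrainingTweets in this answer; the types, lengths, table name, and sample values are assumptions.

```python
from sqlalchemy import Column, Integer, String, Text, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class TrainingTweets(Base):
    __tablename__ = "training_tweets"

    id = Column(Integer, primary_key=True)
    label_id = Column(Integer, nullable=False)
    tweet_username = Column(String(64), nullable=False)
    # unique=True makes the database reject a second row with the same text.
    tweet = Column(Text, unique=True, nullable=False)


engine = create_engine("sqlite://")  # in-memory database, for the demo only
Base.metadata.create_all(engine)
session = Session(bind=engine)

session.add(TrainingTweets(label_id=1, tweet_username="alice", tweet="hello"))
session.commit()

# Same tweet text again: the unique constraint should refuse it.
session.add(TrainingTweets(label_id=1, tweet_username="bob", tweet="hello"))
try:
    session.commit()
    duplicate_rejected = False
except IntegrityError:
    session.rollback()
    duplicate_rejected = True

row_count = session.query(TrainingTweets).count()
```

With the constraint in place, the database itself refuses the duplicate, and the application only has to decide how to react to the IntegrityError.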

Once you've fixed the uniqueness requirements, if you wish to ignore IntegrityErrors and keep going, wrap your inserts in transactions (or rely on the implicit behaviour) and commit or roll back accordingly:

import json

from sqlalchemy.exc import IntegrityError

class listener(StreamListener):

    def on_data(self, data):
        all_data = json.loads(data)
        tweet_id = all_data["id_str"]
        tweet_text = all_data["text"]
        tweet_username = all_data["user"]["screen_name"]
        label = 1
        ttweets = TrainingTweets(label_id=label,
                                 tweet_username=tweet_username,
                                 tweet=tweet_text)

        try:
            db.session.add(ttweets)
            db.session.commit()
            print((tweet_username, tweet_text))
            # Increment the counter here, as we've truly successfully
            # stored a tweet.
            self.n += 1

        except IntegrityError:
            db.session.rollback()
            # Don't stop the stream, just ignore the duplicate.
            print("Duplicate entry detected!")

        if self.n >= self.m:
            print("Successfully stored", self.m, "tweets into database")
            # Cross the... stop the stream.
            return False
        else:
            # Keep the stream going.
            return True
