How to Extract Table Names and Column Names from SQL Query

How to extract table names and column names from sql query?

Really, this is no easy task. You could use a lexer (ply in this example) and define several rules to get several tokens out of a string. The following code defines these rules for the different parts of your SQL string and puts them back together as there could be aliases in the input string. As a result, you get a dictionary (result) with the different tablenames as key.

import ply.lex as lex, re

tokens = (
"TABLE",
"JOIN",
"COLUMN",
"TRASH"
)

tables = {"tables": {}, "alias": {}}
columns = []

t_TRASH = r"Select|on|=|;|\s+|,|\t|\r"

def t_TABLE(t):
r"from\s(\w+)\sas\s(\w+)"

regex = re.compile(t_TABLE.__doc__)
m = regex.search(t.value)
if m is not None:
tbl = m.group(1)
alias = m.group(2)
tables["tables"][tbl] = ""
tables["alias"][alias] = tbl

return t

def t_JOIN(t):
r"inner\s+join\s+(\w+)\s+as\s+(\w+)"

regex = re.compile(t_JOIN.__doc__)
m = regex.search(t.value)
if m is not None:
tbl = m.group(1)
alias = m.group(2)
tables["tables"][tbl] = ""
tables["alias"][alias] = tbl
return t

def t_COLUMN(t):
r"(\w+\.\w+)"

regex = re.compile(t_COLUMN.__doc__)
m = regex.search(t.value)
if m is not None:
t.value = m.group(1)
columns.append(t.value)
return t

def t_error(t):
raise TypeError("Unknown text '%s'" % (t.value,))
t.lexer.skip(len(t.value))

# here is where the magic starts
def mylex(inp):
lexer = lex.lex()
lexer.input(inp)

for token in lexer:
pass

result = {}
for col in columns:
tbl, c = col.split('.')
if tbl in tables["alias"].keys():
key = tables["alias"][tbl]
else:
key = tbl

if key in result:
result[key].append(c)
else:
result[key] = list()
result[key].append(c)

print result
# {'tb1': ['col1', 'col7'], 'tb2': ['col2', 'col8']}

string = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
mylex(string)

How to extract tables names in a SQL script?

If you would like to use core python:

txt = """
select *
from (
select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b
on a.col1 = b.col2
left join
test.test_c c
on b.col2 = c.col3
left jon
(select
col4
from
test.test_d) d
on c.col3 = d.col4"""

replace_list = ['\n', '(', ')', '*', '=']
for i in replace_list:
txt = txt.replace(i, ' ')
txt = txt.split()
res = []
for i in range(1, len(txt)):
if txt[i-1] in ['from', 'join'] and txt[i] != 'select':
res.append(txt[i])
print(res)

Extract table name using SQL from a column that stores queries

Seems like this might be a good use case for REGEXP_SUBSTR:

SELECT REGEXP_SUBSTR(querytext, '(<?FROM database\.)(.+?\b)', 1, 1, 'i')

The pattern looks for the string literal FROM database. that immediately precedes a tablename matching the pattern (.+?\b).

Further - the pattern (.+?\b) matches any characters (.) occurring at least one time (+) non-greedily (? token) up to a word boundary character (\b), which judging by your sample data, should match on the whitespace immediately proceeding the table name.

You can see how this pattern is interpreted in even more detail using a utility like Regex101.


Further (albeit slightly tangential) reading regarding the PCRE-style RegExp support in Teradata: Regex syntax in teradata

Extract table name and columns from SQL schema

  1. Extract a table name:
CREATE\s+TABLE\s+([\w_]+)

  1. Get column names:
\s+([\w_]+)[\s\w]+,

  1. Get a primary key field:
\s*PRIMARY\s+KEY\s+\(([\w_]+)\)

  1. Get foreign keys data:
\s*FOREIGN\s+KEY\s+\(([\w_]+)\)\s+REFERENCES\s+([\w_]+)\s+\(([\w_]+)\)

You can test it here (respectively):

  1. https://regexr.com/59251
  2. https://regexr.com/59254
  3. https://regexr.com/5925a
  4. https://regexr.com/594eb


Related Topics



Leave a reply



Submit