How to Compute TF/IDF with SQL (BigQuery)

How can I compute TF/IDF with SQL (BigQuery)

This version might be easier to understand. It takes a dataset that already has the word counts per TV station and day:

# in this query the combination of date+station represents a "document"

WITH data AS (
SELECT *
FROM `gdelt-bq.gdeltv2.iatv_1grams`
WHERE DATE BETWEEN 20190601 AND 20190629
AND station NOT IN ('KSTS', 'KDTV')
)
, word_day_station AS (
# how many times a word is mentioned in each "document"
SELECT word, SUM(count) counts, date, station
FROM data
GROUP BY 1, 3, 4
)
, day_station AS (
# total # of words in each "document"
SELECT SUM(count) counts, date, station
FROM data
GROUP BY 2,3
)
, tf AS (
# TF for a word in a "document"
SELECT word, date, station, a.counts/b.counts tf
FROM word_day_station a
JOIN day_station b
USING(date, station)
)
, word_in_docs AS (
# how many "documents" have a word
SELECT word, COUNT(DISTINCT FORMAT('%i %s', date, station)) indocs
FROM word_day_station
GROUP BY 1
)
, total_docs AS (
# total # of docs
SELECT COUNT(DISTINCT FORMAT('%i %s', date, station)) total_docs
FROM data
)
, idf AS (
# IDF for a word
SELECT word, LOG(total_docs.total_docs/indocs) idf
FROM word_in_docs
CROSS JOIN total_docs
)

SELECT date,
ARRAY_AGG(STRUCT(station, ARRAY_TO_STRING(words, ', ')) ORDER BY station) top_words
FROM (
SELECT date, station, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 5) words
FROM (
SELECT word, date, station, tf.tf * idf.idf tfidf
FROM tf
JOIN idf
USING(word)
)
GROUP BY date, station
)
GROUP BY date
ORDER BY date DESC
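
To see what each CTE computes, here is a minimal Python sketch of the same steps on a handful of made-up rows. The rows, the station names, and the helper names tf_idf and top_words are illustrative assumptions, not part of the GDELT dataset or the query above. The natural log is used because BigQuery's one-argument LOG is the natural logarithm.

```python
import math
from collections import defaultdict

# Hypothetical toy rows: (date, station, word, count). Not real GDELT data.
rows = [
    (20190601, "CNN", "election", 3),
    (20190601, "CNN", "the", 7),
    (20190601, "FOX", "border", 2),
    (20190601, "FOX", "the", 8),
    (20190602, "CNN", "weather", 1),
    (20190602, "CNN", "the", 9),
]

def tf_idf(rows):
    """Return {((date, station), word): tfidf}, one 'document' per date+station."""
    word_counts = defaultdict(int)  # plays the role of word_day_station
    doc_totals = defaultdict(int)   # plays the role of day_station
    for date, station, word, count in rows:
        doc = (date, station)
        word_counts[(doc, word)] += count
        doc_totals[doc] += count

    docs_with_word = defaultdict(set)  # plays the role of word_in_docs
    for doc, word in word_counts:
        docs_with_word[word].add(doc)

    total_docs = len(doc_totals)       # plays the role of total_docs
    scores = {}
    for (doc, word), n in word_counts.items():
        tf = n / doc_totals[doc]
        idf = math.log(total_docs / len(docs_with_word[word]))
        scores[(doc, word)] = tf * idf
    return scores

def top_words(scores, n=5):
    """Top-n words per document by TF/IDF, like the final ARRAY_AGG ... LIMIT 5."""
    by_doc = defaultdict(list)
    for (doc, word), score in scores.items():
        by_doc[doc].append((score, word))
    return {
        doc: [word for _, word in sorted(items, key=lambda item: -item[0])[:n]]
        for doc, items in by_doc.items()
    }

scores = tf_idf(rows)
best = top_words(scores, n=1)
```

A word such as "the" that appears in every document gets an IDF of LOG(1) = 0, so its TF/IDF is zero no matter how often it is used.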


TF/IDF Measurement Using MySQL

I have done it based on this wiki: .
Here you go:
1) t1 gets the sum of word counts per topic (topic_sum).
2) t2 gets the idf: the log10 of the total number of topics divided by the number of topics that contain the word.
3) Since you already have the word count, divide it by topic_sum to get tf.

select w.Topic_Name, 
w.Word,
w.WordCount/t1.topic_sum as tf,
t2.idf,
(w.WordCount/t1.topic_sum)*(t2.idf) as tf_idf
from weightallofwordsintopic w
join (
select Topic_Name, sum(WordCount) as topic_sum
from weightallofwordsintopic
group by Topic_Name
) t1
on w.Topic_Name=t1.Topic_Name
join (
select w.Word, log10(t_cnts.cnts/count(*)) as idf
from weightallofwordsintopic w,
(select count(distinct Topic_Name) as cnts from weightallofwordsintopic) t_cnts
group by w.Word
) t2
on w.Word=t2.Word
order by tf_idf desc,
w.Word
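
As a rough way to try this logic without a MySQL server, the sketch below runs an adapted version of the query on SQLite through Python's sqlite3 module. Several things here are assumptions for illustration only: the four sample rows are made up, log10 is registered by hand because not every SQLite build ships it, and the `1.0 *` factors are added because SQLite performs integer division on integers, whereas MySQL's `/` does not.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register log10 explicitly; some SQLite builds lack the math functions.
conn.create_function("log10", 1, math.log10)

conn.execute(
    "CREATE TABLE weightallofwordsintopic "
    "(Topic_Name TEXT, Word TEXT, WordCount INTEGER)"
)
# Hypothetical sample data: one row per (topic, word).
conn.executemany(
    "INSERT INTO weightallofwordsintopic VALUES (?, ?, ?)",
    [
        ("sports", "goal", 4),
        ("sports", "the", 6),
        ("politics", "vote", 2),
        ("politics", "the", 8),
    ],
)

query = """
SELECT w.Topic_Name,
       w.Word,
       1.0 * w.WordCount / t1.topic_sum AS tf,
       t2.idf,
       (1.0 * w.WordCount / t1.topic_sum) * t2.idf AS tf_idf
FROM weightallofwordsintopic w
JOIN (
    SELECT Topic_Name, SUM(WordCount) AS topic_sum
    FROM weightallofwordsintopic
    GROUP BY Topic_Name
) t1 ON w.Topic_Name = t1.Topic_Name
JOIN (
    SELECT w2.Word, log10(1.0 * t_cnts.cnts / COUNT(*)) AS idf
    FROM weightallofwordsintopic w2
    CROSS JOIN (
        SELECT COUNT(DISTINCT Topic_Name) AS cnts
        FROM weightallofwordsintopic
    ) t_cnts
    GROUP BY w2.Word
) t2 ON w.Word = t2.Word
ORDER BY tf_idf DESC, w.Word
"""

results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

Note that COUNT(*) per word equals the number of topics containing the word only if the table holds a single row per (topic, word) pair, which the original query also relies on.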

Query to calculate term frequency * inverse document frequency

You need to join your TF and DF tables and then insert into the destination TFIDF table.
Try this:

insert into TFIDF (documentID, terms, tf_idf)
select abstractID, df.term, (log(10, 132225)-log(10, doccount)+1)*(tf.freq)
from tf, df
where tf.term = df.term;
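
The weight in this INSERT is (log10(N) - log10(doccount) + 1) * freq, where 132225 is presumably the total number of documents N (the answer does not say so explicitly). A small Python sketch of the same arithmetic, with the function name tfidf_weight chosen only for illustration:

```python
import math

def tfidf_weight(freq, doccount, total_docs=132225):
    """(log10(N) - log10(df) + 1) * tf, mirroring the expression in the INSERT.

    total_docs defaults to 132225, assumed here to be the corpus size.
    """
    return (math.log10(total_docs) - math.log10(doccount) + 1) * freq

# A term present in every document keeps a weight equal to its raw frequency,
# because the "+ 1" stops the IDF factor from dropping to zero.
common = tfidf_weight(freq=3, doccount=132225)

# A rarer term is boosted: with N = 1000 and df = 100 the factor is 1 + 1 = 2.
rare = tfidf_weight(freq=2, doccount=100, total_docs=1000)
```

Unlike the plain LOG(N/df) used in the earlier queries, this variant never assigns a zero weight to a term that occurs in every document.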

How to create UDF in BigQuery? Routine name missing dataset

  • Just use the dataset.function_name or project.dataset.function_name notation.
  • If you want to use BigQuery scripting, use a procedure instead; see the CREATE PROCEDURE statement. Scripting is not supported in BigQuery functions, so if your logic involves scripting, a procedure is the way to go.

