Load JSON files in Google Cloud Storage into BigQuery table
You have two solutions:
- Either you update the format before the BigQuery integration
- Or you update the format after the BigQuery integration
Before
Before means updating your JSON (manually or by script), or having it updated by the process that loads the JSON into BigQuery (such as Dataflow).
I personally don't like this; file handling is never fun or efficient.
After
In this case, you let BigQuery load your JSON file into a temporary table, with your UNIX timestamp ingested as a NUMBER or a STRING. Then you run a query against this temporary table, convert the field to the correct timestamp format, and insert the data into the final table.
This way is smoother and easier (just a simple SQL query to write). However, it implies the cost of reading all the loaded data (and then writing it back).
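As a minimal sketch of that "after" approach with the Python client, assuming the UNIX seconds were loaded into an integer column event_ts of a temporary table tmp_events (the project, dataset, table, and field names here are placeholders):

import uuid
from google.cloud import bigquery

bigquery_client = bigquery.Client('your-project')
final_table = bigquery_client.dataset('your_dataset').table('events')

# Legacy SQL: convert the UNIX seconds to a TIMESTAMP while copying the rows
your_query = (
    'SELECT id, SEC_TO_TIMESTAMP(event_ts) AS event_ts '
    'FROM [your-project:your_dataset.tmp_events]')

query_job = bigquery_client.run_async_query(str(uuid.uuid4()), your_query)
query_job.destination = final_table
query_job.write_disposition = 'WRITE_APPEND'
query_job.begin()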
How to import a JSON from a file on Cloud Storage to BigQuery
You're getting hit by the same thing that a lot of people (including me) have gotten hit by: you are importing a JSON file but not specifying an import format, so it defaults to CSV.
If you set configuration.load.sourceFormat to NEWLINE_DELIMITED_JSON you should be good to go.
We've got a bug filed to make this harder to do, or at least to be able to detect when the file is the wrong type, but I'll bump the priority.
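For instance, with the Python client the REST field configuration.load.sourceFormat corresponds to the load job's source_format (a minimal sketch; the project, dataset, table, and bucket names are placeholders):

import uuid
from google.cloud import bigquery

bigquery_client = bigquery.Client('your-project')
table = bigquery_client.dataset('your_dataset').table('your_table')

job = bigquery_client.load_table_from_storage(
    str(uuid.uuid4()), table, 'gs://your-bucket/data.json')
job.source_format = 'NEWLINE_DELIMITED_JSON'  # instead of the default CSV
job.begin()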
Loading JSON Array Into BigQuery
No, BigQuery can only ingest valid newline-delimited JSON, and in that format a line doesn't start with an array.
You have to transform it slightly:
- Either transform it into valid JSON: add {"object": at the beginning of the line and finish it with }. Ingest that JSON into a temporary table, then run a query that scans the new table and inserts the correct values into the target tables.
- Or remove the array brackets [] and replace every },{ with }\n{ so you get newline-delimited JSON, one object per line (see the sketch below).
Alternatively, you can ingest your JSON as a CSV file (you will have only one column with your raw JSON text in it) and then use the BigQuery string functions to transform the data and insert it into the target table.
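As an illustration of the second option, here is a small sketch (the file names are hypothetical) that turns a file containing a single JSON array into newline-delimited JSON before loading:

import json

with open('array.json') as source:
    records = json.load(source)  # the whole file is one JSON array

with open('ndjson.json', 'w') as target:
    for record in records:
        target.write(json.dumps(record) + '\n')  # one object per line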
Load file from Cloud Storage to BigQuery to single string column
My option in that case is to ingest the CSV with a dummy separator, for instance # or |. I know that I will never have those characters, and that's why I chose them.
That way, the schema auto-detection detects only one column and creates a table with a single STRING column.
If you can pick a character like that, it's the easiest solution, but without any guarantee of course (with a corrupted file, it's hard to know in advance which characters will be unused).
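A minimal sketch of that idea with the current Python client, declaring the single STRING column explicitly rather than relying on auto-detection (the bucket, table, and column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter='|',   # a character that never appears in the data
    quote_character='',    # disable quoting so embedded quotes don't break rows
    schema=[bigquery.SchemaField('raw', 'STRING')],  # one STRING column
)
load_job = client.load_table_from_uri(
    'gs://your-bucket/raw_file.json',
    'your-project.your_dataset.raw_rows',
    job_config=job_config,
)
load_job.result()  # wait for the load to complete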
Data type issues when loading JSON file from GCS to BigQuery Table
As mentioned in the comments by @TamirKlein, since you have mixed values for some columns, auto-detect will use the first row to determine the appropriate data type for each column.
If you want to set the schema so that all columns are STRING, but don't want to hard-code each bigquery.SchemaField() line, you can use the Google Cloud Storage client to fetch the source file, iterate through each field of the first JSON object while appending a schema field to a list, and then use that list as the schema configuration of the table.
You could use something like the example below:
import json
from google.cloud import bigquery, storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('[BUCKET_NAME]')
blob = storage.Blob('source.json', bucket)
json_content = blob.download_as_string().decode("utf-8")
json_blobs = json_content.split('\n')
first_object = json.loads(json_blobs[0])

# Declare every top-level field of the first object as a STRING column
schema = []
for key in first_object:
    schema.append(bigquery.SchemaField(key, "STRING"))

job_config = bigquery.LoadJobConfig()
job_config.schema = schema
Keep in mind that you might need to adapt this logic in case your source contains nested and repeated fields.
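Building on the example above, a hypothetical recursive helper along these lines could map nested objects to RECORD fields and lists to REPEATED fields, still typing every leaf as STRING:

def schema_field(key, value):
    # Nested object -> RECORD with sub-fields
    if isinstance(value, dict):
        return bigquery.SchemaField(
            key, "RECORD",
            fields=[schema_field(k, v) for k, v in value.items()])
    # List of objects -> REPEATED RECORD (assumes a non-empty, uniform list)
    if isinstance(value, list) and value and isinstance(value[0], dict):
        return bigquery.SchemaField(
            key, "RECORD", mode="REPEATED",
            fields=[schema_field(k, v) for k, v in value[0].items()])
    # List of scalars -> REPEATED STRING
    if isinstance(value, list):
        return bigquery.SchemaField(key, "STRING", mode="REPEATED")
    # Everything else -> STRING
    return bigquery.SchemaField(key, "STRING")

job_config.schema = [schema_field(k, v) for k, v in first_object.items()]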
Confusion when uploading a JSON from Google Cloud Storage to BigQuery
Not sure what problem you are having, but loading data from a file in GCS to BigQuery is exactly what you are already doing.
If you have a table with this schema:
[{"name": "id", "type": "INT64"}, {"name": "name", "type": "STRING"}]
And if you have this file in GCS (located for instance at "gs://bucket/json_data.json"):
{"id": 1, "name": "test1"}
{"id": 2, "name": "test2"}
You'd now just need to set the job object to process a JSON file as input, like so:
import uuid
from google.cloud import bigquery

def load_data_from_gcs(project, table_name, source_uri):
    bigquery_client = bigquery.Client(project)
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())
    job = bigquery_client.load_table_from_storage(
        job_name, table, source_uri)
    # Tell BigQuery the input is newline-delimited JSON rather than CSV
    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()
And that's it.
(If you have a CSV file, then you have to set your job object accordingly.)
As for the second question, it's really a matter of trying out different approaches and seeing which works best for you.
To delete a table, you'd just need to run:
table.delete()
To remove duplicated data from a table one possibility would be to write a query that removes the duplication and saves the results to the same table. Something like:
query_job = bigquery_client.run_async_query(query=your_query, job_name=job_name)
query_job.destination = table  # the destination Table object
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.begin()
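As a sketch of what your_query could look like (assuming the two-column schema from the example above, a placeholder table name, and legacy SQL, which is the default for run_async_query), grouping by every column drops exact duplicates:

your_query = (
    'SELECT id, name '
    'FROM [dataworks-356fa:FirebaseArchive.your_table] '
    'GROUP BY id, name')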