How to Import a JSON from a File on Cloud Storage to BigQuery

Load JSON files in Google Cloud Storage into a BigQuery table

You have two solutions:

  • Either you update the format before the BigQuery integration
  • Or you update the format after the BigQuery integration

Before

Before means updating your JSON yourself (manually or by script), or having it updated by the process that loads the JSON into BigQuery (such as Dataflow).

I personally don't like this; file handling is never fun or efficient.

After

In this case, you let BigQuery load your JSON file into a temporary table, keeping your UNIX timestamp as a number or a string. Then you run a query against this temporary table, convert the field into the correct timestamp format, and insert the data into the final table.

This way is smoother and easier (a simple SQL query to write). However, it implies the cost of reading all the loaded data (and then writing it again).
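As a minimal sketch of this "after" approach with the BigQuery Python client, assuming a temporary table with an INT64 created_at UNIX timestamp and a final table with a TIMESTAMP column (all project, dataset, table, and column names below are illustrative):

from google.cloud import bigquery

client = bigquery.Client()

# Convert the UNIX timestamp kept as an INT64 in the temporary table,
# then append the converted rows to the final table.
query = """
    SELECT * REPLACE (TIMESTAMP_SECONDS(created_at) AS created_at)
    FROM `my-project.staging_dataset.events_raw`
"""
job_config = bigquery.QueryJobConfig(
    destination="my-project.final_dataset.events",
    write_disposition="WRITE_APPEND",
)
client.query(query, job_config=job_config).result()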

How to import a JSON from a file on Cloud Storage to BigQuery

You're getting hit by the same thing that a lot of people (including me) have gotten hit by: you are importing a JSON file but not specifying an import format, so it defaults to CSV.

If you set configuration.load.sourceFormat to NEWLINE_DELIMITED_JSON you should be good to go.

We've got a bug filed to make this harder to do, or at least to detect when the file is the wrong type, but I'll bump the priority.
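The configuration.load.sourceFormat field above is the raw API; with the Python client the equivalent would look roughly like the sketch below (bucket, file, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# NEWLINE_DELIMITED_JSON tells BigQuery the file is JSON, not the CSV default.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.json",          # placeholder URI
    "my-project.my_dataset.my_table",    # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the load to complete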

Loading JSON Array Into BigQuery

No, BigQuery can only ingest newline-delimited JSON, and that format doesn't start with an array.

You have to transform it slightly:

  • Either transform it into valid JSON (add a {"object": at the beginning and finish the line with a }). Ingest that JSON into a temporary table, then run a query that scans the new table and inserts the correct values into the target tables
  • Or remove the array definition [] and replace each },{ with }\n{ to get newline-delimited JSON (a sketch of this follows below).

Alternatively, you can ingest your JSON as a CSV file (you will have only one column containing your raw JSON text) and then use the BigQuery string functions to transform the data and insert it into the target tables.
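Here is a minimal sketch of the second option in Python, parsing the array and re-serialising each element rather than doing the literal string replacement (file names are illustrative, and it assumes the whole file fits in memory):

import json

# Read the file that contains one top-level JSON array.
with open("source_array.json") as source:
    records = json.load(source)

# Write one JSON object per line: newline-delimited JSON that BigQuery accepts.
with open("source_ndjson.json", "w") as target:
    for record in records:
        target.write(json.dumps(record) + "\n")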

Load file from Cloud Storage to BigQuery to single string column

My option in that case is to ingest the CSV with a dummy separator, for instance # or |. I know that those characters will never appear in my data, and that's why I chose them.

That way, schema auto-detection detects only one column and creates a table with a single string column.

If you can pick a character like that, it's the easiest solution, but without any guarantee of course (with a corrupted file, it's hard to know in advance which characters will be unused).
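A minimal sketch of that idea with the Python client, assuming | never appears in the file (bucket, file, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Use a delimiter that never appears in the data so each line lands in a
# single STRING column; disable CSV quoting so the raw text is kept intact.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",        # assumed to be unused in the file
    quote_character="",
    schema=[bigquery.SchemaField("raw", "STRING")],
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/corrupted.json",      # placeholder URI
    "my-project.my_dataset.raw_lines",    # placeholder table ID
    job_config=job_config,
)
load_job.result()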

Data type issues when loading a JSON file from GCS to a BigQuery table

As mentioned in the comments by @TamirKlein, since you have mixed values for some columns, auto-detection will use the first row to determine the appropriate data type for each column.

If you want to set the schema so all columns are strings, but don't want to hard-code each bigquery.SchemaField() line, you can use the Google Cloud Storage client to fetch the source file, iterate through each field of the first JSON object while appending a schema field to a list, and then use that list as the schema configuration of the table.

You could use something like the example below:

import json

from google.cloud import bigquery, storage

storage_client = storage.Client()

# Download the source file and read its first newline-delimited JSON object.
bucket = storage_client.get_bucket('[BUCKET_NAME]')
blob = storage.Blob('source.json', bucket)

json_content = blob.download_as_string().decode("utf-8")
json_blobs = json_content.split('\n')
first_object = json.loads(json_blobs[0])

# Every top-level field of the first object becomes a STRING column.
schema = []
for key in first_object:
    schema.append(bigquery.SchemaField(key, "STRING"))

job_config = bigquery.LoadJobConfig()
job_config.schema = schema

Keep in mind that you might need to adapt this logic in case your source contains nested and repeated fields.

Confusion when uploading a JSON from Google Cloud Storage to BigQuery

Not sure what problem you are having, but loading data from a file in GCS to BigQuery works exactly the way you are already doing it.

If you have a table with this schema:

[{"name": "id", "type": "INT64"}, {"name": "name", "type": "STRING"}]

And if you have this file in GCS (located for instance at "gs://bucket/json_data.json"):

{"id": 1, "name": "test1"}
{"id": 2, "name": "test2"}

You'd now just need to set the job object to process a JSON file as input, like so:

import uuid

from google.cloud import bigquery


def load_data_from_gcs(project, table_name, source_uri):
    # Uses the older google-cloud-bigquery client API this answer was
    # written against (load_table_from_storage, run_async_query).
    bigquery_client = bigquery.Client(project)
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source_uri)  # e.g. "gs://bucket/json_data.json"

    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()  # starts the load job asynchronously
    return job

And that's it.

(If you have a CSV file then you have to set your job object accordingly).

As for the second question, it's really a matter of trying out different approaches and seeing which works best for you.

To delete a table, you'd just need to run:

table.delete()

To remove duplicated data from a table, one possibility would be to write a query that removes the duplication and saves the results to the same table. Something like:

job_name = str(uuid.uuid4())
query_job = bigquery_client.run_async_query(job_name=job_name, query=your_query)
query_job.destination = table  # a Table object, e.g. the table being rewritten
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.begin()
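As one hedged example of what your_query could look like, assuming "duplicated" means fully identical rows (the table name below is a placeholder; this query uses standard SQL, so if your client defaults to legacy SQL you would also set query_job.use_legacy_sql = False before starting the job):

# Keeps one copy of each fully identical row; adapt this if duplicates
# are instead identified by a key column.
your_query = """
SELECT DISTINCT *
FROM `dataworks-356fa.FirebaseArchive.your_table`
"""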

