Spark Structured Streaming Automatically Converts Timestamp to Local Time

Spark Structured Streaming automatically converts timestamp to local time

For me it worked to use:

spark.conf.set("spark.sql.session.timeZone", "UTC")

It tells Spark SQL to use UTC as the default time zone for timestamps. I used it in Spark SQL, for example:

select *, cast('2017-01-01 10:10:10' as timestamp) from someTable

I know it does not work in Spark 2.0.1 but it does work in Spark 2.2. I used it in SQLTransformer as well and it worked.
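A quick way to see the effect in spark-shell (a minimal sketch; the literal and epoch values are illustrative, with America/New_York at UTC-5 on that date):

spark.conf.set("spark.sql.session.timeZone", "UTC")
// The string literal is interpreted in the session time zone (UTC here).
spark.sql("select unix_timestamp(cast('2017-01-01 10:10:10' as timestamp)) as epoch").show()
// epoch = 1483265410  (10:10:10 UTC)

spark.conf.set("spark.sql.session.timeZone", "America/New_York")
// The same literal is now interpreted as New York local time, five hours behind UTC.
spark.sql("select unix_timestamp(cast('2017-01-01 10:10:10' as timestamp)) as epoch").show()
// epoch = 1483283410  (15:10:10 UTC)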

I am not sure about streaming though.

Maintaining timestamp sequence of incoming streaming data

Kafka guarantees message order within a partition only. If you want strict order within Kafka, keep one partition per topic and guarantee ordering when producing to it. You might want to implement an "ordering service" which reads from the incoming queue and writes the messages to another queue with only one partition (a sketch of the producer side follows). Good explanations and examples can be found in various blog posts.
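A minimal sketch of the producer side of such an ordering service (the broker address and the ordered-events topic name are illustrative; the topic is assumed to have exactly one partition):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// Idempotence prevents retries from reordering messages within the partition.
props.put("enable.idempotence", "true")

val producer = new KafkaProducer[String, String](props)
// With a single partition, Kafka's per-partition guarantee gives a total
// order over all messages in the topic.
producer.send(new ProducerRecord[String, String]("ordered-events", null, "message-body"))
producer.close()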

Ordering the messages in Spark itself is the easiest option, as in the sketch below. You should consider saving or caching the ordered results to storage for reuse.
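For Structured Streaming, a minimal sketch using foreachBatch (Spark 2.4+); the events stream, its eventTime column, and the output path are assumptions for illustration:

import org.apache.spark.sql.DataFrame

events.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Sort each micro-batch by event time before persisting, so the stored
    // output preserves timestamp order within every batch.
    batch.orderBy("eventTime")
      .write
      .mode("append")
      .parquet("/tmp/ordered-events")
  }
  .start()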

Timestamp Timezone Wrong/Missing in Spark/Databricks SQL Output

Spark does not support the TIMESTAMP WITH TIME ZONE data type defined by ANSI SQL. Even though there are functions that convert timestamps across time zones, this information is never stored. The Databricks documentation on timestamps explains:

Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE, which is a combination of the fields (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, SESSION TZ) where the YEAR through SECOND field identify a time instant in the UTC time zone, and where SESSION TZ is taken from the SQL config spark.sql.session.timeZone.

In your case spark.sql.session.timeZone is UTC, so the Z symbol in a datetime pattern will always render UTC. Therefore you will never get correct behavior from date_format if you deal with multiple time zones in a single query.

The only thing you can do is explicitly store the time zone information in a column and manually append it for display:

concat(
  date_format(from_utc_timestamp(createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:ss.SSS "),
  v.timezone
) AS createTimestampLocal

This will display 2022-03-01T16:47:22.000 America/New_York. If you need an offset instead (-05:00), you will need to write a UDF that uses Python or Scala native datetime libraries to do the conversion.
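A minimal sketch of such a UDF in Scala, assuming the same createTimestampUTC and timezone columns (the DataFrame name df is illustrative):

import java.time.ZoneId
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{col, udf}

// Formats a UTC instant in the given zone, with a numeric offset suffix (XXX).
val formatWithOffset = udf { (ts: java.sql.Timestamp, zone: String) =>
  ts.toInstant
    .atZone(ZoneId.of(zone))
    .format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
}

// Produces e.g. 2022-03-01T16:47:22.000-05:00 for America/New_York.
df.withColumn("createTimestampLocal",
  formatWithOffset(col("createTimestampUTC"), col("timezone")))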

Prevent Spark from shifting timestamps in spark-shell

spark-shell \
  --conf spark.sql.session.timeZone=UTC \
  --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" \
  --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"

This seems to give me the desired result.

But it does not explain why this only applies to my UDF and not to Spark's internal to_timestamp function. A likely explanation: built-in functions such as to_timestamp honor spark.sql.session.timeZone, while code inside a UDF that relies on JVM date/time defaults falls back to the JVM time zone, which is what -Duser.timezone=UTC overrides. The sketch below illustrates the difference.
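A minimal sketch (hypothetical; run in spark-shell, where the implicits needed for toDF are already imported):

import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{col, to_timestamp, udf}

// The built-in to_timestamp honors spark.sql.session.timeZone, while
// SimpleDateFormat inside a UDF parses with the JVM default (user.timezone).
val parseWithJvmTz = udf { (s: String) =>
  new java.sql.Timestamp(
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(s).getTime)
}

val df = Seq("2017-01-01 10:10:10").toDF("raw")
df.select(
  to_timestamp(col("raw")).as("builtin"),   // session time zone
  parseWithJvmTz(col("raw")).as("via_udf")  // JVM default time zone
).show(false)
// The two columns differ unless the session and JVM time zones agree,
// which is why both --conf settings above are needed.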

Timezone mismatch between Spark SQL and Cassandra

Cassandra uses the UTC time zone by default; a timezone specified in cqlshrc only changes how timestamps are displayed in the console, not what is stored.
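For reference, that display-only setting lives in the [ui] section of cqlshrc (a sketch; the zone name is just an example):

[ui]
; Affects only how cqlsh renders timestamps, not how Cassandra stores them.
timezone = America/New_York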

In my case I changed the local time zone from EDT to UTC to get things working, but as noted above, the timestamp can also be cast to the required time zone.

Thank you @Uttam Kasundara for pointing out the key detail.


