Spark Structured Streaming automatically converts timestamp to local time
For me it worked to use:
spark.conf.set("spark.sql.session.timeZone", "UTC")
This tells Spark SQL to use UTC as the default time zone for timestamps. I used it in Spark SQL, for example:
select *, cast('2017-01-01 10:10:10' as timestamp) from someTable
I know it does not work in 2.0.1 but works in Spark 2.2. I also used it in SQLTransformer and it worked.
I am not sure about streaming, though.
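The effect of the session time zone can be illustrated outside Spark with a small standard-library sketch (the zone names here are illustrative, not part of the original answer): the same naive timestamp string maps to different instants depending on which zone it is interpreted in.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

raw = "2017-01-01 10:10:10"
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Interpret the same wall-clock string under two different "session" zones
as_utc = naive.replace(tzinfo=ZoneInfo("UTC"))
as_la = naive.replace(tzinfo=ZoneInfo("America/Los_Angeles"))

# The underlying instants differ by the zone offset (8 hours in January)
delta_hours = (as_la.timestamp() - as_utc.timestamp()) / 3600
print(delta_hours)  # 8.0
```

Setting spark.sql.session.timeZone to UTC pins Spark to the first interpretation regardless of the cluster's local zone.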
Maintaining timestamp sequence of incoming streaming data
Kafka guarantees message order within a partition only. If you want strict order across a topic, keep a single partition per topic and guarantee ordering when delivering to it. You might want to implement an "ordering service" that reads from the incoming topic and writes messages to another topic with only one partition. Good explanations and examples can be found in various blog posts.
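The "ordering service" idea can be sketched with in-memory queues standing in for Kafka topics (all names here are hypothetical, and a real implementation would need to bound how long it buffers before emitting):

```python
import heapq
import queue

def reorder(incoming: "queue.Queue", outgoing: "queue.Queue") -> None:
    """Drain (timestamp, payload) messages that may arrive out of order
    and re-emit them to a single ordered queue (one-partition stand-in)."""
    buffer = []
    while True:
        try:
            heapq.heappush(buffer, incoming.get_nowait())
        except queue.Empty:
            break
    while buffer:
        outgoing.put(heapq.heappop(buffer))

inq, outq = queue.Queue(), queue.Queue()
for msg in [(3, "c"), (1, "a"), (2, "b")]:
    inq.put(msg)
reorder(inq, outq)
print([outq.get() for _ in range(3)])  # [(1, 'a'), (2, 'b'), (3, 'c')]
```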
Ordering the messages in Spark itself, by sorting on a timestamp column, is the easiest option. You should consider saving or caching the ordered results to storage so they can be reused without re-sorting.
Timestamp Timezone Wrong/Missing in Spark/Databricks SQL Output
Spark does not support the TIMESTAMP WITH TIME ZONE datatype defined by ANSI SQL. Even though some functions convert timestamps across time zones, that information is never stored. The Databricks documentation on timestamps explains:
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE, which is a combination of the fields (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, SESSION TZ) where the YEAR through SECOND field identify a time instant in the UTC time zone, and where SESSION TZ is taken from the SQL config spark.sql.session.timeZone.
In your case spark.sql.session.timeZone is UTC, so the Z symbol in a datetime pattern will always render as UTC. Therefore you will never get correct behavior with date_format if you deal with multiple time zones in a single query.
The only thing you can do is to explicitly store the time zone information in a column and manually append it for display:
concat(
  date_format(from_utc_timestamp(createTimestampUTC, v.timezone), "yyyy-MM-dd'T'HH:mm:ss.SSS "),
  v.timezone
) createTimestampLocal
This will display 2022-03-01T16:47:22.000 America/New_York. If you need an offset instead (-05:00), you will need to write a UDF that does the conversion using native Python or Scala datetime libraries.
Prevent Spark from shifting timestamps in spark-shell
spark-shell --conf spark.sql.session.timeZone=UTC --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"
seems to give me the desired result.
But it does not explain why this only applies to my UDF and not to Spark's internal to_timestamp function. A likely explanation: built-in functions such as to_timestamp honor spark.sql.session.timeZone, while a UDF runs ordinary JVM code that falls back to the JVM default time zone, which is what -Duser.timezone=UTC overrides.
Timezone mismatch between SPARK SQL and Cassandra
Cassandra uses the UTC time zone by default; a time zone specified in cqlshrc only changes how timestamps are displayed in the console.
In my case I changed the local time zone from EDT to UTC to get things done, but as noted, the value can also be cast to the required time zone for display.
Thank you @Uttam Kasundara for pointing out the key detail.