Preserving Large Numbers

Preserving large numbers

It's not in a "1.67E+12 format"; it just doesn't print in full with the default settings. R is reading it in just fine and the whole number is there.

> x <- 1665535004661
> x
[1] 1.665535e+12
> print(x, digits = 16)
[1] 1665535004661

See, the digits were there all along. R stores numerics as doubles, which represent every integer exactly up to 2^53 (about 9e15), so nothing is lost unless your numbers have more digits than that. Sorting on what you brought in will work fine, and to see all the digits of your data.frame you can call print() explicitly with the digits option instead of printing it implicitly by typing its name.

How do I read large numbers precisely in R and perform arithmetic on them?

That's not large. It is merely a representation problem. Try this:

options(digits=22)

options('digits') defaults to 7, which is why you are seeing what you do. All twelve digits are being read and stored, but not printed by default.

Problem with storing and retrieving very large numbers in parquet format

The problem isn't related to Parquet, but to your initial conversion of the row_list to a pandas DataFrame:

row_list = get_row_list()
col_list = ['tree_id']
df = pd.DataFrame(row_list, columns=col_list)

>>> df
        tree_id
0           NaN
1  2.353130e+17
2           NaN
3  1.353130e+17
4  9.353130e+17
5  8.353130e+17
6           NaN
7           NaN

Because there are missing values, pandas creates a float64 column. And it is this int -> float conversion that loses precision for such large integers.

Converting the float back to an integer later (when creating the pyarrow Table with a schema that forces an integer column) then results in a slightly different value, as you can see by doing the conversion manually in Python:

>>> row_list[1]
235313013750949476
>>> df.loc[1, "tree_id"]
2.3531301375094947e+17
>>> int(df.loc[1, "tree_id"])
235313013750949472
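
The same rounding can be reproduced without pandas, because a float64 has a 53-bit significand and cannot represent every integer above 2**53. A minimal sketch using only the tree_id value shown above (the variable names are just for illustration):

```python
# The tree_id from the example above; it is larger than 2**53.
value = 235313013750949476

# Round trip through a float64, which is what the DataFrame column does.
as_float = float(value)
round_trip = int(as_float)

print(value > 2**53)        # True: outside the range where every integer is exact
print(round_trip)           # 235313013750949472, as in the session above
print(round_trip == value)  # False: the last digits changed

# Integers up to 2**53 survive the same round trip unchanged.
print(int(float(2**53)) == 2**53)  # True
```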

One possible solution is to avoid the temporary DataFrame. This will depend on your exact (real) use case of course, but if you start from a Python list as in the reproducible example above, you can also create a pyarrow.Table directly from this list of values (pa.table({"tree_id": row_list}, schema=..)) and this will preserve the exact values in the Parquet file.


