2,836 questions
Tooling
1 vote, 0 replies, 74 views
Optimizing dev-to-prod workflow for processing 38M+ rows with DuckDB and Polars on AWS Graviton3
I am developing a Python data pipeline to process and enrich the Brazilian Federal Revenue (CNPJ) dataset, which consists of monthly sharded CSV files totaling over 38 million rows (approx. 100GB+ raw)...
Best practices
0 votes, 1 reply, 50 views
How to tell which answers are missing from a questionnaire (polars)?
I have a polars df that contains answer codes to questions. I'd like to know if all answers are present in the df (and if not, which ones are missing). I came up with this code (which is working fine)...
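One way to frame the check described above, sketched in plain Python rather than with any particular Polars API: compare the set of expected answer codes against the codes actually observed. The names `expected_codes` and `observed_codes` and the sample values are hypothetical; in practice the observed values would come from the relevant DataFrame column.

```python
def missing_answers(expected_codes, observed_codes):
    """Return the expected codes that never appear among the observed ones, sorted."""
    return sorted(set(expected_codes) - set(observed_codes))


# Hypothetical sample data, not taken from the question.
expected_codes = [1, 2, 3, 4, 5]
observed_codes = [1, 1, 3, 5, 3]

missing = missing_answers(expected_codes, observed_codes)
print(missing)  # an empty list would mean every expected answer is present
```

A set difference ignores duplicates and ordering, which matches the question "is every code present at least once?".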
1 vote, 0 answers, 108 views
Understanding Polars' query optimization: optimizing optimization
I am a bit confused about what precisely is happening in Polars' query optimization and whether (or how) it is possible to better control optimization.
I am running a large lazy query (hundreds of ...
4 votes, 3 answers, 173 views
Python polars script leaks memory and crashes (scans CSV to lazyframes - writes to database)
I wrote a small script to scan a CSV in Python polars, select specific columns/filter specific rows in lazyframes and upload the result to a Postgres DB.
The script works with a smaller test CSV but, ...
Advice
1 vote, 6 replies, 85 views
In Polars, is there a way to replace or modify the lazyframe at the origin of a computation graph?
I am really enjoying the capabilities that the Polars LazyFrame brings to the table. Recently though, I've been trying to develop a method of defining a lazy plan so that it can be serialized and ...
1 vote, 1 answer, 86 views
Polars expr: rolling_mean with chained .over() yields inconsistent results compared to step-by-step execution, while sum works fine
I found that in Polars, when using a single expression with multiple chained .over() calls, rolling_mean behaves differently from sum.
The following expression returns all null (expected non-null ...
1 vote, 1 answer, 207 views
BigQuery Storage API `to_arrow_iterable` returns only 8 rows at a time
I have this code to retrieve millions of rows from my BigQuery query results:
query_job = client.query(
    query,
)
storage_client = bigquery_storage....
0 votes, 2 answers, 113 views
How to format an int column into HH:MM:SS.0 in polars
I have a column of numbers, and I want to add a column that converts each value to HH:MM:SS
df = pl.DataFrame({"seconds": [1.0, 4562.2, 2.44, 123.567]})
I have tried
df.with_columns(hhmmss=pl....
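As a reference for the values such a column should contain, here is a minimal plain-Python sketch of the arithmetic, independent of any Polars API. It assumes "HH:MM:SS.0" means one decimal place of seconds and that inputs are non-negative; the function name `format_hhmmss` is hypothetical. Working in integer tenths of a second avoids a value such as 59.96 being rendered with "60.0" in the seconds field.

```python
def format_hhmmss(seconds: float) -> str:
    """Format a non-negative number of seconds as HH:MM:SS.t (one decimal place)."""
    tenths = round(seconds * 10)           # total tenths of a second, as an int
    hours, rest = divmod(tenths, 36_000)   # 3600 s * 10 tenths per hour
    minutes, rest = divmod(rest, 600)      # 60 s * 10 tenths per minute
    secs, tenth = divmod(rest, 10)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{tenth}"


# The sample values from the question.
for value in [1.0, 4562.2, 2.44, 123.567]:
    print(value, "->", format_hhmmss(value))
```

The same divide-and-remainder steps can be checked against whatever expression-based solution is used on the DataFrame.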
3 votes, 1 answer, 122 views
Lazily scanning from shared memory
I'm trying to create a system where I read/write Polars from shared memory using Arrow IPC. I have eager code working (read_ipc works) but the scan_ipc function isn't working for me. The following ...
Best practices
0 votes, 1 reply, 92 views
Best practices for generating `LazyFrame` query metadata
I have a big and complex query using Polars' new streaming engine at every physical plan node:
import polars as pl
def big_complex_query(
    data: pl.LazyFrame,
) -> pl.LazyFrame:
    data = data....
0 votes, 2 answers, 170 views
Faster way to apply a function across rows?
If we wanted to apply a function across rows where there currently is no built-in method, like rank_horizontal, what is the fastest way?
data = {0: [0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
1: [0, 0, 0, 0, 0, ...
2 votes, 4 answers, 165 views
How to apply a mask row by row in polars?
Let's say we want to find the longest run of consecutive 1s in each row of the df below.
import polars as pl
df = pl.from_repr("""
┌─────┬─────┬─────┬─────┬─────┐
│ 0 ┆ 1 ┆ 2 ...
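To pin down what "most consecutive 1s per row" means, here is a small plain-Python sketch of the per-row computation. It is a reference for expected values only, not a Polars solution; the function name `longest_run_of_ones` and the sample rows are hypothetical and not taken from the truncated frame above.

```python
from itertools import groupby


def longest_run_of_ones(row):
    """Length of the longest block of consecutive 1s in a sequence (0 if none)."""
    run_lengths = (
        sum(1 for _ in block)             # size of each block of equal values
        for value, block in groupby(row)  # groupby splits the row into such blocks
        if value == 1
    )
    return max(run_lengths, default=0)


# Hypothetical rows for illustration.
rows = [
    [0, 1, 1, 0, 1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]
print([longest_run_of_ones(r) for r in rows])
```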
1 vote, 0 answers, 35 views
Customized aggregation in group_by with polars [duplicate]
I need to calculate the amount-weighted average of a percentage after grouping in polars. Since I need a priori the total amount as a denominator of the average, I think it's necessary to create an ad-...
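For reference, the amount-weighted average of a group is sum(amount × pct) / sum(amount), so the total amount does not have to be known before the pass starts: numerator and denominator can be accumulated side by side. The plain-Python sketch below shows that arithmetic; the field names `group`, `amount`, and `pct`, the function name, and the sample rows are all assumptions. In Polars the same ratio is commonly written as a single aggregation, `(pl.col("amount") * pl.col("pct")).sum() / pl.col("amount").sum()`, inside `group_by(...).agg(...)`, again with assumed column names.

```python
from collections import defaultdict


def weighted_average_by_group(rows):
    """Map each group to sum(amount * pct) / sum(amount).

    `rows` is an iterable of (group, amount, pct) tuples. Groups whose total
    amount is zero are not handled here and would raise ZeroDivisionError.
    """
    weighted_sum = defaultdict(float)
    total_amount = defaultdict(float)
    for group, amount, pct in rows:
        weighted_sum[group] += amount * pct
        total_amount[group] += amount
    return {g: weighted_sum[g] / total_amount[g] for g in total_amount}


# Hypothetical sample data.
sample = [("A", 100.0, 10.0), ("A", 300.0, 20.0), ("B", 50.0, 4.0)]
result = weighted_average_by_group(sample)
print(result)
```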
3 votes, 1 answer, 148 views
Why does the cumulative sum over each group result in all 1's?
With Pandas I am using .groupby().cumsum() to generate a count column:
import pandas as pd
df = pd.DataFrame({'ID':['A','B','A','A','B','B','C','D','D','C']})
df['count'] = df['ID'].ne(df['ID']....
Advice
0 votes, 3 replies, 62 views
Does map_groups receive the data frame in order?
When doing something like
rolled = (
    joined.sort("thing", "date")
    .rolling("date", period="20d", group_by="thing")
    .map_groups(func, None)
...