Newest 'python-polars' Questions

Tooling

1 vote

0 replies

74 views

Optimizing dev-to-prod workflow for processing 38M+ rows with DuckDB and Polars on AWS Graviton3

I am developing a Python data pipeline to process and enrich the Brazilian Federal Revenue (CNPJ) dataset, which consists of monthly sharded CSV files totaling over 38 million rows (approx. 100GB+ raw)...

Vinícius Massagardi

1

asked May 4 at 16:52

Best practices

0 votes

1 reply

50 views

How to tell which answers are missing from a questionnaire (polars)?

I have a polars df that contains answer codes to questions. I'd like to know if all answers are present in the df (and if not, which ones are missing). I came up with this code (which is working fine)...

lmocsi

1,181

asked Apr 23 at 13:29

1 vote

0 answers

108 views

Understanding Polars' query optimization: optimizing optimization

I am a bit confused about what precisely is happening in Polars' query optimization and whether (or how) it is possible to better control optimization. I am running a large lazy query (hundreds of ...

user30525703

35

asked Apr 22 at 17:50

4 votes

3 answers

173 views

Python polars script leaks memory and crashes (scans CSV to lazyframes - writes to database)

I wrote a small script to scan a CSV in Python polars, select specific columns/filter specific rows in lazyframes and upload the result to a Postgres DB. The script works with a smaller test CSV but, ...

queen_macaroni

41

asked Apr 22 at 15:06

Advice

1 vote

6 replies

85 views

In Polars, is there a way to replace or modify the lazyframe at the origin of a computation graph?

I am really enjoying the capabilities that the Polars LazyFrame brings to the table. Recently though, I've been trying to develop a method of defining a lazy plan so that it can be serialized and ...

wterry

1

asked Apr 14 at 20:40

1 vote

1 answer

86 views

Polars expr: rolling_mean with chained .over() yields inconsistent results compared to step-by-step execution, while sum works fine

I found that in Polars, when using a single expression with multiple chained .over() calls, rolling_mean behaves differently from sum. The following expression returns all null (expected non-null ...

noob_191

13

asked Apr 11 at 5:01

1 vote

1 answer

207 views

Bigquery storage API `to_arrow_iterable` returns only 8 rows at a time

I have this code to retrieve millions of rows from my BigQuery query results: query_job = client.query( query, ) storage_client = bigquery_storage....

unitrium

72

asked Mar 25 at 10:16

0 votes

2 answers

113 views

How to format an int column into HH:MM:SS.0 in polars

I have a column of numbers, and I want to add a column that changes this to HH:MM:SS df = pl.DataFrame({"seconds": [1.0, 4562.2, 2.44,123.567]}) I have tried df.with_columns(hhmmss=pl....

frank

3,824

asked Mar 10 at 13:30

3 votes

1 answer

122 views

Lazily scanning from shared memory

I'm trying to create a system where I read/write Polars from shared memory using Arrow IPC. I have eager code working (read_ipc works) but the scan_ipc function isn't working for me. The following ...

akgcodes

33

asked Mar 9 at 15:33

Best practices

0 votes

1 reply

92 views

Best practices for generating `LazyFrame` query metadata

I have a big and complex query using Polars' new streaming engine at every physical plan node: import polars as pl def big_complex_query( data: pl.LazyFrame, ) -> pl.LazyFrame: data = data....

Kevin Li

659

asked Mar 9 at 0:40

0 votes

2 answers

170 views

Faster way to apply a function across rows?

If we wanted to apply a function across rows where there currently is no built-in method, like rank_horizontal, what is the fastest way? data = {0: [0, 1, 0, 1, 1, 0, 1, 0, 1, 1], 1: [0, 0, 0, 0, 0, ...

rhug123

9,034

asked Mar 1 at 14:47

2 votes

4 answers

165 views

How to apply a mask row by row in polars?

Let's say we want to find the longest run of consecutive 1's row by row in the df below. import polars as pl df = pl.from_repr(""" ┌─────┬─────┬─────┬─────┬─────┐ │ 0 ┆ 1 ┆ 2 ...

rhug123

9,034

asked Mar 1 at 14:22

1 vote

0 answers

35 views

Customized aggregation in group_by with polars [duplicate]

I need to calculate the amount-weighted average of a percentage after grouping in polars. Since I need a priori the total amount as a denominator of the average, I think it's necessary to create an ad-...

Xywa

11

asked Feb 27 at 16:56

3 votes

1 answer

148 views

Why does the cumulative sum over each group result in all 1's?

With Pandas I am using .groupby().cumsum() to generate a count column: import pandas as pd df = pd.DataFrame({'ID':['A','B','A','A','B','B','C','D','D','C']}) df['count'] = df['ID'].ne(df['ID']....

rhug123

9,034

asked Feb 27 at 16:48

Advice

0 votes

3 replies

62 views

Does map_groups receive the data frame in order?

When doing something like rolled = ( joined.sort("thing", "date") .rolling("date", period="20d", group_by="thing") .map_groups(func, None) ...

burk

365

asked Feb 20 at 8:35
