Tooling
1 vote
0 replies
74 views

I am developing a Python data pipeline to process and enrich the Brazilian Federal Revenue (CNPJ) dataset, which consists of monthly sharded CSV files totaling over 38 million rows (approx. 100GB+ raw)...
Vinícius Massagardi
Best practices
0 votes
1 reply
50 views

I have a Polars df that contains answer codes to questions. I'd like to know whether all answers are present in the df (and if not, which ones are missing). I came up with this code (which works fine)...
lmocsi
  • 1,181
1 vote
0 answers
108 views

I am a bit confused about what precisely is happening in Polars' query optimization and whether (or how) it is possible to better control optimization. I am running a large lazy query (hundreds of ...
user30525703
4 votes
3 answers
173 views

I wrote a small script to scan a CSV in Python Polars, select specific columns/filter specific rows in LazyFrames and upload the result to a Postgres DB. The script works with a smaller test CSV but, ...
queen_macaroni
Advice
1 vote
6 replies
85 views

I am really enjoying the capabilities that the Polars LazyFrame brings to the table. Recently though, I've been trying to develop a method of defining a lazy plan so that it can be serialized and ...
wterry
  • 1
1 vote
1 answer
86 views

I found that in Polars, when using a single expression with multiple chained .over() calls, rolling_mean behaves differently from sum. The following expression returns all null (expected non-null ...
noob_191
1 vote
1 answer
207 views

I have this code to retrieve millions of rows from my BigQuery query results: query_job = client.query( query, ) storage_client = bigquery_storage....
unitrium
0 votes
2 answers
113 views

I have a column of numbers, and I want to add a column that converts them to HH:MM:SS: df = pl.DataFrame({"seconds": [1.0, 4562.2, 2.44, 123.567]}) I have tried df.with_columns(hhmmss=pl....
frank
  • 3,824
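One way to read the conversion asked about above is plain integer arithmetic on the number of seconds. The sketch below uses only the Python standard library rather than a Polars expression (the expression the asker tried is truncated in the excerpt). The helper name `seconds_to_hhmmss` and the choice to truncate fractional seconds are assumptions made for illustration.

```python
def seconds_to_hhmmss(seconds: float) -> str:
    """Format a non-negative number of seconds as HH:MM:SS (fraction dropped)."""
    total = int(seconds)  # assumption: truncate the fraction, do not round
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


values = [1.0, 4562.2, 2.44, 123.567]  # sample values from the question
formatted = [seconds_to_hhmmss(v) for v in values]
```

The same hours/minutes/seconds split would presumably carry over to column arithmetic, but that translation is not shown here.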
3 votes
1 answer
122 views

I'm trying to create a system where I read/write Polars from shared memory using Arrow IPC. I have eager code working (read_ipc works) but the scan_ipc function isn't working for me. The following ...
akgcodes
Best practices
0 votes
1 reply
92 views

I have a big and complex query using Polars' new streaming engine at every physical plan node: import polars as pl def big_complex_query( data: pl.LazyFrame, ) -> pl.LazyFrame: data = data....
Kevin Li
  • 659
0 votes
2 answers
170 views

If we wanted to apply a function across rows where there currently is no built-in method, like rank_horizontal, what is the fastest way? data = {0: [0, 1, 0, 1, 1, 0, 1, 0, 1, 1], 1: [0, 0, 0, 0, 0, ...
rhug123
  • 9,034
2 votes
4 answers
165 views

Let's say we want to find the longest run of consecutive 1s in each row of the df below. import polars as pl df = pl.from_repr(""" ┌─────┬─────┬─────┬─────┬─────┐ │ 0 ┆ 1 ┆ 2 ...
rhug123
  • 9,034
1 vote
0 answers
35 views

I need to calculate the amount-weighted average of a percentage after grouping in Polars. Since I need a priori the total amount as a denominator of the average, I think it's necessary to create an ad-...
Xywa
  • 11
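The amount-weighted average described above is, within each group, the sum of amount times percentage divided by the sum of amount. A minimal standard-library sketch follows. The field names `group`, `amount`, and `pct` and the sample rows are hypothetical (none of them come from the question), and this is not a Polars solution.

```python
from collections import defaultdict

# Hypothetical rows: (group, amount, pct). Not data from the question.
rows = [("a", 100.0, 0.10), ("a", 300.0, 0.20), ("b", 50.0, 0.40)]

weighted_sum = defaultdict(float)  # sum of amount * pct per group
total_amount = defaultdict(float)  # sum of amount per group (the denominator)
for group, amount, pct in rows:
    weighted_sum[group] += amount * pct
    total_amount[group] += amount

# Weighted average per group: sum(amount * pct) / sum(amount)
weighted_avg = {g: weighted_sum[g] / total_amount[g] for g in weighted_sum}
```

Both sums accumulate in the same pass, so the denominator does not have to be computed ahead of time.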
3 votes
1 answer
148 views

With Pandas I am using .groupby().cumsum() to generate a count column: import pandas as pd df = pd.DataFrame({'ID':['A','B','A','A','B','B','C','D','D','C']}) df['count'] = df['ID'].ne(df['ID']....
rhug123
  • 9,034
Advice
0 votes
3 replies
62 views

When doing something like rolled = ( joined.sort("thing", "date") .rolling("date", period="20d", group_by="thing") .map_groups(func, None) ...
burk
  • 365
