ENH: Check whether image is displayed on a given page by andreasntr · Pull Request #3738 · py-pdf/pypdf

andreasntr · 2026-04-20T21:40:05Z

Addresses #3737

Code brainstormed with Qwen3.5 9B via OpenCode

What was changed

ImageFile now has an is_displayed_on_page(page) method that:

checks the page's content stream to determine whether the image is actually displayed on the given page
looks for INLINE IMAGE operators for inline images and Do operators for XObject images
~~uses caching with pages + bools lists, so repeated checks are faster~~ removed because it did not provide enough benefit

Both inline images and XObject references are supported:

inline images: the image name is looked up in the page content
image references (XObjects): the image reference is checked against the list of references in the page
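
A minimal sketch of the XObject check described above. It assumes the page's operations are available as `(operands, operator)` pairs, similar to what pypdf's `ContentStream.operations` yields; the function name `is_displayed` and the sample operations are hypothetical, and images painted indirectly through form XObjects are not covered.

```python
from typing import Any, Iterable, List, Tuple

# One content stream operation: (operands, operator), with the operator as bytes.
Operation = Tuple[List[Any], bytes]


def is_displayed(image_name: str, operations: Iterable[Operation]) -> bool:
    """Return True if any ``Do`` operator paints the XObject called image_name."""
    return any(
        operator == b"Do" and bool(operands) and str(operands[0]) == image_name
        for operands, operator in operations
    )


# Hypothetical page: /Im0 is painted, /Im1 only sits in the page resources.
page_operations: List[Operation] = [
    ([], b"q"),
    ([100, 0, 0, 100, 0, 0], b"cm"),
    (["/Im0"], b"Do"),
    ([], b"Q"),
]

im0_displayed = is_displayed("/Im0", page_operations)
im1_displayed = is_displayed("/Im1", page_operations)
```

Calling this once per image walks the operations again each time, which is the overhead discussed in the review that follows.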

Backward compatibility

Checks are performed lazily, so that if a user is not interested in the feature, there is no overhead while reading the PDF or interacting with it.

codecov · 2026-04-20T21:55:14Z

Codecov Report

❌ Patch coverage is 87.83784% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.62%. Comparing base (e044789) to head (bb11c8c).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
pypdf/_page.py	87.83%	5 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3738      +/-   ##
==========================================
- Coverage   97.66%   97.62%   -0.05%     
==========================================
  Files          55       55              
  Lines       10291    10391     +100     
  Branches     1890     1920      +30     
==========================================
+ Hits        10051    10144      +93     
- Misses        135      138       +3     
- Partials      105      109       +4

andreasntr · 2026-04-20T21:57:14Z

@stefan6419846 also added a new test, waiting for #37 to be merged to restart CI jobs (tested locally with pytest though)

stefan6419846

As mentioned in the issue already, I think we should determine this value directly and save it into the ImageFile instead of doing the heavy lifting of possibly parsing all content streams for each image.

This means: When I get the ImageFile through PageObject.images, just accessing ImageFile.is_displayed should expose the desired value.

If I remember correctly, retrieving the inline images already parses the content stream of the page, thus hooking into it and populating the display values here should be possible without too much overhead.

Do you think this is possible?

andreasntr · 2026-04-22T09:14:47Z

This means shifting this logic to the pdf/page constructor, I think, which implies that even users not interested in this feature would be affected by the possible overhead (?)

stefan6419846 · 2026-04-22T09:17:44Z

I would hope that a proper approach would not introduce way more overhead than we already require for parsing inline images from the content stream.

andreasntr · 2026-04-22T20:15:07Z

@stefan6419846 switched to a bool field in the ImageFile class. The flag is set at object creation time:

inline images: is_displayed is always True since they don't have references and can not be duplicated
reference images (via the images property of each page): the page content is checked to determine whether the reference is actually used

Failing tests have been fixed as well

stefan6419846 · 2026-04-27T12:47:32Z

switched to a bool field in the ImageFile class. The flag is set at object creation time

Besides now saving it inside a dedicated attribute, we still parse the content stream for each image, thus it produces overhead.

Instead, I would propose to generalize the logic inside _get_inline_images to populate regular image data as well.

andreasntr · 2026-04-27T12:50:06Z

I would propose to generalize the logic inside _get_inline_images to populate regular image data as well.

How does this remove overhead? We still need to parse each image to compute whether it is displayed or not

stefan6419846 · 2026-04-27T12:51:46Z

We do not need to create a ContentStream and let it generate the operations for each image, but can do this once at the beginning if I am not mistaken.

andreasntr · 2026-04-27T13:40:41Z

We do not need to create a ContentStream and let it generate the operations for each image, but can do this once at the beginning if I am not mistaken.

Do you mean at document level then? I would need a bit of high-level guidance on this if possible

stefan6419846 · 2026-04-27T13:48:27Z

Do you mean at document level then?

ImageFile instances are generated on the page-level, thus pypdf can have a simple property on it to indicate whether it is displayed on the current page or not.

I would need a bit of high-level guidance on this if possible

_get_inline_images should be renamed to make it obvious that it parses the content stream of the page for image-related data. Currently, it only populates inline images. The goal is to extend its operator handling in

pypdf/pypdf/_page.py

Lines 735 to 744 in 64a793b

    
    for param, ope in content.operations:
        if ope == b"INLINE IMAGE":
            imgs_data.append(
                {"settings": param["settings"], "__streamdata__": param["data"]}
            )
        elif ope in (b"BI", b"EI", b"ID"):  # pragma: no cover
            raise PdfReadError(
                f"{ope!r} operator met whereas not expected, "
                "please share use case with pypdf dev team"
            )

to look for Do operators and record the corresponding identifiers (similar to your current approach). These have to be mapped to their respective ImageFile instances afterwards by a suitable approach.
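
The proposed extension could look roughly like the sketch below: one pass over the operations that collects inline image data and the names used with `Do`. The function name `scan_image_operations` and the sample operations are hypothetical, and plain Python values stand in for pypdf's objects.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

# One content stream operation: (operands, operator), with the operator as bytes.
Operation = Tuple[Any, bytes]


def scan_image_operations(
    operations: Iterable[Operation],
) -> Tuple[List[Dict[str, Any]], Set[str]]:
    """Collect inline image data and ``Do`` operand names in a single pass."""
    inline_data: List[Dict[str, Any]] = []
    do_names: Set[str] = set()
    for param, ope in operations:
        if ope == b"INLINE IMAGE":
            inline_data.append(
                {"settings": param["settings"], "__streamdata__": param["data"]}
            )
        elif ope == b"Do" and param:
            # The single operand of Do is the resource name, for example /Im0.
            do_names.add(str(param[0]))
    return inline_data, do_names


# Hypothetical operations of one page.
operations = [
    ({"settings": {"/W": 1, "/H": 1}, "data": b"\x00"}, b"INLINE IMAGE"),
    (["/Im0"], b"Do"),
    (["/Im0"], b"Do"),  # the same XObject may be painted more than once
    (["/F1", 12], b"Tf"),
]
inline_data, do_names = scan_image_operations(operations)
```

`Do` paints any XObject, so the recorded names can also refer to form XObjects; the later mapping to ImageFile instances would still need to filter by subtype.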

andreasntr · 2026-04-27T21:18:23Z

This is what I came up with, please confirm before I commit to avoid useless commits.

is_displayed is a boolean property set at ImageFile creation time (not per-call method).

Content stream parsing (_parse_content_stream() -> former _get_inline_images()):

Scans page /Contents for BI/EI operators (inline) and Do operators (Do-referenced)
Extracts image data and creates ImageFile instances
Sets is_displayed=True for both inline and Do-referenced (as they're in the stream)
Stores cached ImageFile dict in self.inline_images for later reuse by _get_image() and _get_ids_image()

stefan6419846 · 2026-04-30T11:29:15Z

_parse_content_stream does not indicate that we are dealing with images, which we indeed are. Thus including the image part in the name would help with future maintenance.

Additionally, please note that _get_inline_images currently only returns ImageFile instances for inline images. For "regular" images, the value needs to be propagated to the constructors for "regular" ImageFile instances accordingly.

andreasntr · 2026-05-07T08:02:33Z

A property or attribute displayed_images should not be required, as this can be determined from the new ImageFile instances.

This is not true because inline_images only holds inline images (I'm referring to the current main now), while displayed_images also manages xobject/do references.

I understand this. Thus I propose to either have

a list of ALL images maintained internally or

a list of all inline images maintained internally as well as a list of displayed image names to be able to create ImageFile instances accordingly.

Option 1 is what I implemented here, the issue lies in the name because:

we cannot deprecate inline_images
we can't have inline_images hold non-inline images because it would disrupt tests and functionalities

So we are in a dead spot

stefan6419846 · 2026-05-07T08:09:46Z

We do not have to rely on the inline_images attribute internally - this is just some arbitrary design decision chosen during the initial implementation.

andreasntr · 2026-05-07T09:08:29Z

We do not have to rely on the inline_images attribute internally - this is just some arbitrary design decision chosen during the initial implementation.

In theory we could simply do this:

restore inline_images
keep displayed_images (works as a cache just like inline_images) and let this be returned by images

In practice there is the caveat that images currently returns ALL images even if not displayed or not inline (through _get_ids_image) or just inline images based on whether inline_images is null, giving us the following alternatives:

have images rely on displayed_images directly, which breaks the current logic (returning only inline images if inline_images is not null, or also returning referenced-but-not-displayed images) but allows easier filtering (by the user) when needed
images can still be allowed to return only inline images in certain cases, but this invalidates the whole point of having Image.is_displayed for filtering on need

stefan6419846 · 2026-05-07T09:18:16Z

In practice there is the caveat that images currently returns ALL images even if not displayed or not inline (through _get_ids_image) or just inline images based on whether inline_images is null

inline_images is None by default and populated on the first usage. The fact that it has been publicly writable the whole time is a design flaw in my opinion.

My goal would be:

Have images always return all images referenced by the page.
Populate ImageFile.is_inline and ImageFile.is_displayed properly.
Make attributes required for caching purposes (due to having to read the whole content stream) internal.
Introduce a deprecation period for inline_images:
- Make it a property and emulate the current behavior with the new approach.
- Make the property setter emulate the current behavior which indirectly allowed to control the caching mechanism.

This ensures that there is exactly one public API for retrieving the images and with the object-oriented approach we are able to provide the user the most control without polluting the page API unnecessarily.
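
A rough sketch of such a deprecation period, using only the standard `warnings` module. The class `PageSketch`, the helper `_load_inline_images`, and the placeholder key are hypothetical; pypdf's actual deprecation helpers and the exact emulated behavior may differ.

```python
import warnings
from typing import Any, Dict, Optional


class PageSketch:
    """Hypothetical stand-in for a page; not pypdf's PageObject."""

    def __init__(self) -> None:
        # Internal cache, no longer part of the public API.
        self._inline_images: Optional[Dict[str, Any]] = None

    def _load_inline_images(self) -> Dict[str, Any]:
        # Placeholder for parsing the content stream of the page.
        return {"~0~": "inline image 0"}

    @property
    def inline_images(self) -> Optional[Dict[str, Any]]:
        warnings.warn(
            "inline_images is deprecated; filter images by is_inline instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if self._inline_images is None:
            self._inline_images = self._load_inline_images()
        return self._inline_images

    @inline_images.setter
    def inline_images(self, value: Optional[Dict[str, Any]]) -> None:
        warnings.warn(
            "Setting inline_images is deprecated.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assigning None resets the cache, as the writable attribute allowed.
        self._inline_images = value
```

Old code that reads or assigns `page.inline_images` keeps working during the deprecation period, while the cache itself lives in an internal attribute.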

andreasntr · 2026-05-07T09:22:01Z

In practice there is the caveat that images currently returns ALL images even if not displayed or not inline (through _get_ids_image) or just inline images based on whether inline_images is null

inline_images is None by default and populated on the first usage. The fact that it has been publicly writable the whole time is a design flaw in my opinion.

My goal would be:

Have images always return all images referenced by the page.

Populate ImageFile.is_inline and ImageFile.is_displayed properly.

Make attributes required for caching purposes (due to having to read the whole content stream) internal.

Introduce a deprecation period for inline_images:

Make it a property and emulate the current behavior with the new approach.

Make the property setter emulate the current behavior which indirectly allowed to control the caching mechanism.

This ensures that there is exactly one public API for retrieving the images and with the object-oriented approach we are able to provide the user the most control without polluting the page API unnecessarily.

Makes sense, I'll work on this next week. Thank you!

andreasntr · 2026-05-16T18:17:27Z

@stefan6419846 Updated as required. There are a couple of tests failing locally:

a file in tests/test_reader.py::test_read_form_416 cannot be downloaded (unknown error handling), causing an ExceptionGroup: multiple unraisable exception warnings (3 sub-exceptions)
a file in tests/test_filters.py::test_index_lookup gives an OSError: broken data stream when reading image file
tests/test_filters.py::test_jpx_no_spacecode: I don't understand how colorspace management can be influenced by the current edits. Probably this should be rewritten/deprecated?
tests/test_images.py::test_separation_1byte_to_rgb_inverted: I tried looking at the mentioned PR but I couldn't understand why this should be raising a ValueError, can you help?

Please note that I also opened a PR for sample-files to add a test file for _displayed_images (actions emit a style error but it doesn't seem related to my addition)

stefan6419846 · 2026-05-18T09:57:47Z

-        lst.extend(list(self.inline_images.keys()))
-        return lst
+
+        # Removes duplicates and preserves order


Why do we need this? How has this been handled before?

~~previously this method returned only inline images.~~ Some tests relying on this method (for example when accessing images) require inline images at the beginning of the list

EDIT: investigating more, I forgot the reason for the reordering

IMHO relying on this behavior is false and we might need to improve the tests in this regard.

Updated response
Ordering is used to return images in the same order as they appear in the page.
As for deduplication, xobjects can potentially be referenced by do-references multiple times: in this case we only return one reference, as the cache is implemented as a map between image name and image value
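
For illustration, order-preserving deduplication of image names can be written with `dict.fromkeys`. This is a generic Python idiom, not necessarily the code used in this PR, and the names below are made up.

```python
from typing import List


def unique_in_order(names: List[str]) -> List[str]:
    """Drop repeated names while keeping the first occurrence of each."""
    # Dictionary keys keep insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(names))


# Hypothetical names in the order they appear on the page; /Im0 is used twice.
names = ["~0~", "/Im0", "/Im1", "/Im0"]
deduplicated = unique_in_order(names)
```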

Do you have any other question on this?

I guess this part becomes obsolete anyway, as our goal is to not load regular images here any more and for inline images, the ordering should not change with the old and new implementation?

Yes, I'll report back on this once implemented

stefan6419846 · 2026-05-18T10:04:14Z

                )
+        # Process Do-referenced images first
        files = {}
+        xobjs: Optional[DictionaryObject] = None


Why do we need this processing here? Couldn't we just record the names and retrieve this data when accessing the image file only?

this way we would be populating the cache only for a subset of image types, excluding xobjects

If we want to process images lazily, this is what I would do:

set the type of _content_stream_images to Optional[dict[str, Optional[ImageFile]]]

when reading the page:

set all the keys immediately to understand which images are displayed and which are inline

set inline images value immediately as they are not expensive to process

keep any other image value to None

when accessing an image:

if cached (either inline or previously accessed), return the cached value

if value is None (xobjects), process the page content, retrieve the image content and cache it
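
The lazy scheme outlined above might be sketched as follows. `LazyImageCache`, `load_xobject`, and the byte strings standing in for ImageFile instances are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional


class LazyImageCache:
    """Hypothetical cache: all keys are known upfront, values load on demand."""

    def __init__(
        self,
        inline: Dict[str, bytes],
        xobject_names: Iterable[str],
        loader: Callable[[str], bytes],
    ) -> None:
        self._loader = loader
        # Inline images are cheap, so their values are stored immediately.
        self._images: Dict[str, Optional[bytes]] = dict(inline)
        # XObject images start as None and are filled in on first access.
        for name in xobject_names:
            self._images.setdefault(name, None)

    def get(self, name: str) -> bytes:
        value = self._images[name]  # raises KeyError for unknown names
        if value is None:
            value = self._loader(name)
            self._images[name] = value
        return value


calls: List[str] = []


def load_xobject(name: str) -> bytes:
    calls.append(name)  # record how often the expensive path runs
    return b"data for " + name.encode()


cache = LazyImageCache({"~0~": b"inline"}, ["/Im0"], load_xobject)
first = cache.get("/Im0")
second = cache.get("/Im0")
```

This keeps every accessed image in memory, which may be costly for large images.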

This sounds like the better approach, although I am unsure if we really should introduce caching for regular image files which could be very large.

I probably would have used two internal attributes (_inline_images and _displayed_name) or relied on functools.cached_property instead of implementing our own caching directly.

How would those be used by the user? I mean, currently the user can rely on the displayed and inline flags. As for the cache, I can investigate

My preference would still be to deprecate inline_images completely if we are able to and instead let the user use ImageFile.is_inline and ImageFile.is_displayed - IMHO this is the cleanest API. (One improvement would be to maybe make ImageFile a frozen dataclass to make the values final.)

If you think that your other branch is the way to go, please consider marking this PR as draft and open a new PR for the new changes to allow for a clean review process.

Honestly, I'm not a software expert (I'm a computer scientist; I can code, but I work in the ML field, so much of the SWE stuff is obscure to me), so I didn't know about frozen dataclasses etc.: your guidance is precious here. Also, this is your repo after all, so I will adapt to what you feel is more polished for production. My other branch is based on keeping inline_images and removing the extra ImageFile attributes, so it wouldn't suit your current requirement.

As I said earlier, I'd prefer you picking a side here and I'll follow with the implementation

Sorry for not being specific here. In terms of clean design, having both images and inline_images available on the page sounds unnecessary if we can have a cleaner class-based approach by making is_inline a property of the ImageFile and only providing images as the API endpoint for retrieving page images.

As I said, my preference is the following:

Have the two new boolean attributes/properties on the ImageFile.

Parse the content stream of the page only once.

Store state data in an internal variable.

Only cache inline images.

Deprecate inline_images in favor of ImageFile.is_inline.

I will leave it up to you which of your current branches serves as the best starting point for resolving this.

Please indicate if some aspects remain unclear for you, so we can have a look at it. Otherwise, please request a review from my side as soon as you think that it makes sense to get this merged in this state.

I didn't know about frozen dataclasses etc

No worries. I have just seen that making it frozen will not work anyway due to how replace_image is implemented. Just ignore this part here.

Ok, thanks for the clarification.

Parse the content stream of the page only once.

Only cache inline images.

These two seem conflicting. I understand that inline images can be cached in the current internal property but what about the content stream? Do you mean that it must be parsed once just to get the names of the actually displayed images? We settled on not caching other images since they may be very large, so for those we would still process the content stream every time a user wants to get the image content, am I getting it right or do we want to migrate back to full caching (lazy however)?

Yes, just get the names of actually displayed images from the content stream and cache these values. All images which are not inline have their own stream object and do not need actual parsing of the stream internals, thus we can avoid caching their actual contents.
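
One way to picture this, as a sketch only: cache the set of displayed names with `functools.cached_property` so the operations are scanned once, and keep image contents out of the cache. `PageSketch`, `parse_count`, and the sample operations are hypothetical.

```python
from functools import cached_property
from typing import Any, FrozenSet, List, Tuple

# One content stream operation: (operands, operator), with the operator as bytes.
Operation = Tuple[List[Any], bytes]


class PageSketch:
    """Hypothetical page that caches only the names of displayed images."""

    def __init__(self, operations: List[Operation]) -> None:
        self._operations = operations
        self.parse_count = 0  # only here to show how often the scan runs

    @cached_property
    def _displayed_names(self) -> FrozenSet[str]:
        self.parse_count += 1
        return frozenset(
            str(operands[0])
            for operands, operator in self._operations
            if operator == b"Do" and operands
        )

    def is_displayed(self, name: str) -> bool:
        return name in self._displayed_names


page = PageSketch([([], b"q"), (["/Im0"], b"Do"), ([], b"Q")])
im0_displayed = page.is_displayed("/Im0")
im1_displayed = page.is_displayed("/Im1")
```

Both lookups share one scan of the operations, and no image data is held by the cache.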

stefan6419846 · 2026-05-18T10:11:57Z

Updated as required. There are a couple of tests failing locally:

a file in tests/test_reader.py::test_read_form_416 cannot be downloaded (unknown error handling), causing an ExceptionGroup: multiple unraisable exception warnings (3 sub-exceptions)

I guess this is just a flaky test.

a file in tests/test_filters.py::test_index_lookup gives an OSError: broken data stream when reading image file

tests/test_filters.py::test_jpx_no_spacecode: I don't understand how colorspace management can be influenced by the current edits. Probably this should be rewritten/deprecated?

I guess these are failing due to your rewrite, although I have not yet looked further into why. They should keep working without changing the tests themselves.

tests/test_images.py::test_separation_1byte_to_rgb_inverted: I tried looking at the mentioned PR but I couldn't understand why this should be raising a ValueError, can you help?

Running this on a system with the main code gives ValueError: not enough image data. The new failure is most likely related to your changes as in the previous cases.

Please note that I also opened a PR for sample-files to add a test file for _displayed_images (actions emit a style error but it doesn't seem related to my addition)

We should update our mypy configuration to exclude the sample files repository.

andreasntr · 2026-05-18T18:12:41Z

a file in tests/test_filters.py::test_index_lookup gives an OSError: broken data stream when reading image file

tests/test_filters.py::test_jpx_no_spacecode: I don't understand how colorspace management can be influenced by the current edits. Probably this should be rewritten/deprecated?

I guess these are failing due to your rewrite, although I have not yet looked further into why. They should keep working without changing the tests themselves.

tests/test_images.py::test_separation_1byte_to_rgb_inverted: I tried looking at the mentioned PR but I couldn't understand why this should be raising a ValueError, can you help?

Running this on a system with the main code gives ValueError: not enough image data. The new failure is most likely related to your changes as in the previous cases.

fixed both, the cache wasn't being invalidated after the images are manipulated
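
A small sketch of the kind of bug described here, with hypothetical names (`PageSketch`, `decoded_images`); the actual fix in the PR may look different. The point is that any method that manipulates images has to reset the derived cache.

```python
from typing import Dict, Optional


class PageSketch:
    """Hypothetical page with a derived-value cache; not pypdf's PageObject."""

    def __init__(self, raw_images: Dict[str, bytes]) -> None:
        self._raw_images = dict(raw_images)
        self._decoded: Optional[Dict[str, str]] = None  # derived-value cache

    def decoded_images(self) -> Dict[str, str]:
        if self._decoded is None:
            # Stand-in for real image decoding.
            self._decoded = {
                name: data.decode("ascii")
                for name, data in self._raw_images.items()
            }
        return self._decoded

    def replace_image(self, name: str, data: bytes) -> None:
        self._raw_images[name] = data
        # Without this reset, decoded_images() would keep returning stale data.
        self._decoded = None


page = PageSketch({"/Im0": b"old"})
before = page.decoded_images()["/Im0"]
page.replace_image("/Im0", b"new")
after = page.decoded_images()["/Im0"]
```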

andreasntr · 2026-05-18T20:57:40Z

a file in tests/test_filters.py::test_index_lookup gives an OSError: broken data stream when reading image file

tests/test_filters.py::test_jpx_no_spacecode: I don't understand how colorspace management can be influenced by the current edits. Probably this should be rewritten/deprecated?

I guess these are failing due to your rewrite, although I have not yet looked further into why. They should keep working without changing the tests themselves.

fixed by logging a warning instead of raising an exception as done in generic._image_xobject._xobj_to_image (it was /Im0 that had errors, it was not spotted previously because of the lazy loading)
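
The general pattern, sketched with the standard `logging` module: a failing image is reported and skipped instead of aborting the whole listing. `load_images` and `fake_decode` are hypothetical, and whether the PR skips the broken image or handles it differently is an assumption here.

```python
import logging
from typing import Callable, Dict, Iterable

logger = logging.getLogger(__name__)


def load_images(
    names: Iterable[str], decode: Callable[[str], bytes]
) -> Dict[str, bytes]:
    """Decode every image, logging a warning for each one that fails."""
    images: Dict[str, bytes] = {}
    for name in names:
        try:
            images[name] = decode(name)
        except (OSError, ValueError) as exc:
            # Report the problem but keep processing the remaining images.
            logger.warning("Could not read image %s: %s", name, exc)
    return images


def fake_decode(name: str) -> bytes:
    # Hypothetical decoder: /Im0 is broken, everything else decodes fine.
    if name == "/Im0":
        raise OSError("broken data stream when reading image file")
    return b"ok"


images = load_images(["/Im0", "/Im1"], fake_decode)
```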

andreasntr added 2 commits April 20, 2026 22:53

add is_displayed_on_page function with caching

57390d1

remove unneeded castings

ff16713

andreasntr changed the title ~~Check whether image is displayed on a given page~~ DEV: Check whether image is displayed on a given page Apr 20, 2026

andreasntr changed the title ~~DEV: Check whether image is displayed on a given page~~ ENH: Check whether image is displayed on a given page Apr 20, 2026

andreasntr added 3 commits April 20, 2026 23:43

comply with linter

cf232f9

comply with linter

4d1ca4e

remove example

672ad47

add minimal test

069d5c5

andreasntr added 2 commits April 20, 2026 23:58

comply with linter

ff80e2e

fix docstring and pdf path in test_is_xobject_image_displayed, add py…

dd2ac2e

…test decorator

andreasntr mentioned this pull request Apr 21, 2026

add sample pdf for test about ImageFile.is_displayed_on_page py-pdf/sample-files#37

Closed

andreasntr added 2 commits April 22, 2026 00:04

switch from page_number to page as is_displayed_on_page input

27fc2bb

temporarily remove is_displayed_on_page caching

5f59487

stefan6419846 requested changes Apr 22, 2026

View reviewed changes

andreasntr added 3 commits April 22, 2026 18:54

Merge branch 'main' into main

a09a6bb

switch display check to image constructor

58c75a6

fix tests to use the new is_displayed property

2966ee2

Merge branch 'main' into main

3bcf9a5

andreasntr added 3 commits May 16, 2026 19:12

Merge branch 'main' into main

69cb462

update sample files

54d6dd2

add _displayed_images test file

6db1389

andreasntr mentioned this pull request May 16, 2026

Add _displayed_images test image py-pdf/sample-files#40

Open

andreasntr added 4 commits May 16, 2026 19:45

make _displayed_images private, deprecate inline_images and derive it…

f0c7a72

… from _displayed_images

update _displayed_images references

18ebf94

update inline_images references

983022f

update some image paths

d6b7ff4

stefan6419846 requested changes May 18, 2026

View reviewed changes

andreasntr and others added 9 commits May 18, 2026 17:54

Merge branch 'main' into main

973f345

Update tests/test_images.py

6f0aa8b

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

rename _displayed_images to _content_stream_images

183e10f

remove wrong docstring

ccf4a9d

add deprecation notice to inline_images setter

42c1f81

remove unneeded cache setter

364ccbf

use regular mock instead of type

683d5d4

remove unneeded cache setter

70963f6

fix key error message in test_get_inline_image_without_xobject_resour…

439fab3

…ces_raises_when_missing

andreasntr added 2 commits May 18, 2026 20:13

invalidate cache after manipulating images

e4ea241

emit warnings for image read errors instead of crashing

38eebdb

remove abbreviations

bb11c8c
