ENH: Check whether image is displayed on a given page#3738
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3738 +/- ##
==========================================
- Coverage 97.66% 97.62% -0.05%
==========================================
Files 55 55
Lines 10291 10391 +100
Branches 1890 1920 +30
==========================================
+ Hits 10051 10144 +93
- Misses 135 138 +3
- Partials 105 109 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@stefan6419846 also added a new test, waiting for #37 to be merged to restart CI jobs (tested locally with pytest though) |
stefan6419846
left a comment
There was a problem hiding this comment.
As mentioned in the issue already, I think we should determine this value directly and save it into the ImageFile instead of doing the heavy lifting of possibly parsing all content streams for each image.
This means: When I get the ImageFile through PageObject.images, just accessing ImageFile.is_displayed should expose the desired value.
If I remember correctly, retrieving the inline images already parses the content stream of the page, thus hooking into it and populating the display values here should be possible without too much overhead.
Do you think this is possible?
|
This means shifting this logic to the pdf/page constructor I think, which implies that even user not interested in this feature would be impacted by the possible overhead (?) |
|
I would hope that a proper approach would not introduce way more overhead than we already require for parsing inline images from the content stream. |
|
@stefan6419846 switched to a bool field in the
Failing tests have been fixed as well |
Besides now saving it inside a dedicated attribute, we still parse the content stream for each image, thus it produces overhead. Instead, I would propose to generalize the logic inside |
How does this remove overhead? We still need to parse each image to compute whether it is displayed or not |
|
We do not need to create a |
Do you mean at document level then? I would need a bit of high-level guidance on this if possible |
Lines 735 to 744 in 64a793b Do operators and record the corresponding identifiers (similar to your current approach). These have to be mapped to their respective ImageFile instances afterwards by a suitable approach.
|
|
This is what I came up with, please confirm before I commit to avoid useless commits.
Content stream parsing (
|
|
Additionally, please note that |
Option 1 is what I implemented here, the issue lies in the name because:
So we are in a dead spot |
|
We do not have to rely on the |
In theory we could simply do this:
In practice there is the caveat that
|
My goal would be:
This ensures that there is exactly one public API for retrieving the images and with the object-oriented approach we are able to provide the user the most control without polluting the page API unnecessarily. |
Makes sense, I'll work on this next week. Thank you! |
|
@stefan6419846 Updated as required. There are a couple of test failing locally:
Please note that I also opened a PR for |
| lst.extend(list(self.inline_images.keys())) | ||
| return lst | ||
|
|
||
| # Removes duplicates and preserves order |
There was a problem hiding this comment.
Why do we need this? How has this been handled before?
There was a problem hiding this comment.
previously this method returned only inline images. Some tests relying on this method (for example when accessing images) require require inline images at the beginning of the list
EDIT: investigating more, i forgot the reason for the reordering
There was a problem hiding this comment.
IMHO relying on this behavior is false and we might need to improve the tests in this regard.
There was a problem hiding this comment.
Updated response
Ordering is used to return images in the same order as they appear in the page.
As for deduplication, xobjects can potentially be referenced by do-references multiple times: in this case we only return one reference, as the cache is implemented as a map between image name and image value
There was a problem hiding this comment.
Do you have any other question on this?
There was a problem hiding this comment.
I guess this part becomes obsolete anyway, as our goal is to not load regular images here any more and for inline images, the ordering should not change with the old and new implementation?
There was a problem hiding this comment.
Yes, I'll report back on this once implemented
| ) | ||
| # Process Do-referenced images first | ||
| files = {} | ||
| xobjs: Optional[DictionaryObject] = None |
There was a problem hiding this comment.
Why do we need this processing here? Couldn't we just record the names and retrieve this data when accessing the image file only?
There was a problem hiding this comment.
this way we would be populating the cache only for a subset of image types, excluding xobjects
There was a problem hiding this comment.
If we want to process images lazily, this is what I would do:
- set the type of
_content_stream_imagestoOptional[dict[str, Optional[ImageFile]]] - when reading the page:
- set all the keys immediately to understand which images are displayed and which are inline
- set inline images value immediately as they are not expensive to process
- keep any other image value to
None
- when accessing an image:
- if cached (either inline or previously accessed), return the cached value
- if value is
None(xobjects), process the page content, retrieve the image content and cache it
There was a problem hiding this comment.
This sounds like the better approach, although I am unsure if we really should introduce caching for regular image files which could be very large.
I probably would have used two internal attributes (_inline_images and _displayed_name) or relied on functools.cached_property instead of implementing our own caching directly.
There was a problem hiding this comment.
How would those be used by the user? I mean currently the user can rely on the displayed and inline flags. As for the cache, i can investigate
There was a problem hiding this comment.
My preference would still be to deprecate inline_images completely if we are able to and instead let the user use ImageFile.is_inline and ImageFile.is_displayed - IMHO this is the cleanest API. (One improvement would be to maybe make ImageFile a frozen dataclass to make the values final.)
If you think that your other branch is the way to go, please consider marking this PR as draft and open a new PR for the new changes to allow for a clean review process.
There was a problem hiding this comment.
Honesly I'm not a software expert (i'm a computer scientist, i can code but i work in the ML field so much of the swe stuff is obscure to me) so I didn't know about frozen dataclasses etc: your guidance is precious here. Also, this is your repo after all so I will adapt to what you feel more polished for production. My other branch is based on keeping inline_images and removing extra ImageFile attributes so it wouldn't suit your current requirement.
As I said earlier, I'd prefer you picking a side here and I'll follow with the implementation
There was a problem hiding this comment.
Sorry for not being specific here. In terms of clean design, having both images and inline_images available on the page sounds unnecessary if we can have a cleaner class-based approach by making is_inline a property of the ImageFile and only providing images as the API endpoint for retrieving page images.
As I said, my preference is the following:
- Have the two new boolean attributes/properties on the
ImageFile. - Parse the content stream of the page only once.
- Store state data in an internal variable.
- Only cache inline images.
- Deprecate
inline_imagesin favor ofImageFile.is_inline.
I will leave it up to you which of your current branches serves as the best starting point for resolving this.
Please indicate if some aspects remain unclear for you, so we can have a look at it. Otherwise, please request a review from my side as soon as you think that it makes sense to get this merged in this state.
I didn't know about frozen dataclasses etc
No worries. I have just seen the making it frozen will not work anyway due to how replace_image is implemented. Just ignore this part here.
There was a problem hiding this comment.
Ok, thanks for the clarification.
- Parse the content stream of the page only once.
- Only cache inline images.
These two seem conflicting. I understand that inline images can be cached in the current internal property but what about the content stream? Do you mean that it must be parsed once just to get the names of the actually displayed images? We settled on not caching other images since they may be very large, so for those we would still process the content stream every time a user wants to get the image content, am I getting it right or do we want to migrate back to full caching (lazy however)?
There was a problem hiding this comment.
Yes, just get the names of actually displayed images from the content stream and cache these values. All images which are not inline have their own stream object and do not need actual parsing of the stream internals, thus we can avoid caching their actual contents.
I guess this is just a flaky test.
I guess these are failing due to your rewrite, although I have not yet looked further into why. They should keep working without changing the tests itself.
Running this on a system with the main code gives
We should update our mypy configuration to exclude the sample files repository. |
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
…ces_raises_when_missing
fixed both, the cache wasn't being invalidated after the images are manipulated |
fixed by logging a warning instead of raising an exception as done in |
Addresses #3737
Code brainstormed with Qwen3.5 9B via OpenCode
What was changed
ImageFilenow has anis_displayed_on_page(page)method that:INLINE IMAGEoperators for inline images andDooperators for XObject imagesuses caching with pages + bools lists, so repeated checks are fasterremoved as not giving enough advantageBoth inline images and XObject references are supported:
Backward compatibility
Checks are performed lazily, so that if a user is not interested in the feature, there is no overhead while reading the PDF or interacting with it.