You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Any serializable object inside Request.cb_kwargs when JOBDIR is set, is not passed to request's callable. Instead, its deep copy is passed. This is the result of the serialization and deserialization process. This behavior should be noted in the documentation.
(after the paragraph about debug setting)
Keep in mind, that deserialization of requests creates deep copies of the objects passed to
dictionary cb_kwargs. Meaning, that any mutable object inside cb_kwargs (directly or as a part
of other object) does not make its way to the callback; the callback receives its exact copy. Any
changes made to the copy does not affect the original object.
Additionally, note can be added to cb_kwargs as well:
A dict with arbitrary data that will be passed as keyword arguments to the Request’s callback. Caution
One should not pass mutable objects to this dictionary if setting JOBDIR is set. Common mutable objects are lists, dicts, sets and bytearrays. Check pausing and resuming crawls documentation for more information.
Motivation
Making deep copies of cb_kwargs content when JOBDIR is set, is not obvious at a glance. Enabling JOBDIR is easy, and one can set it without realization that previously working code will stop working.
Additional context
In #6119 I also suggested that when JOBDIR is set, a warning can be emitted if there is a common mutable object (such as lists, dicts, sets, bytearrays and array from Python array module) inside cb_kwargs values. However, now I think this is unnecessary and can lead to redundant warnings. Especially in cases when mutable object is passed, but there is no intent in mutating it inside request's callback; for example, dictionaries does not have immutable analogue and types.MappingProxyType is not pickleable.
The text was updated successfully, but these errors were encountered:
In addition, I would be good to note, that to any custom object which cannot or is not meant to be serializable one should add pickle-specific method that immediately throws an error an pickle attempt.
For example:
class MyClass:
def __init__(self, /):
self._a = set() # mutable object
self._b = (10 ** i for i in range(10)) # unserializable object
def __reduce__(self, /):
raise TypeError(f'instances of {self.__class__} are unpickleable')
W/o __reduce__ pickle will attempt to pickle MyClass instances: it will start with _a and then fail and _b. If _a is increasing over time, then every attempt will take more time. Over time, the slow down will be noticeable, as I described in #6119.
Defining __reduce__ or any other pickle-specific method which is called upon serialization, with error raise forces pickle to fail immediately.
Summary
Any serializable object inside
Request.cb_kwargs
whenJOBDIR
is set, is not passed to request's callable. Instead, its deep copy is passed. This is the result of the serialization and deserialization process. This behavior should be noted in the documentation.Suggestion
As I suggested in #6119, the next note (or its variation) can added to jobs documentation:
Additionally, note can be added to
cb_kwargs
as well:Motivation
Making deep copies of
cb_kwargs
content whenJOBDIR
is set, is not obvious at a glance. EnablingJOBDIR
is easy, and one can set it without realization that previously working code will stop working.Additional context
In #6119 I also suggested that when
JOBDIR
is set, a warning can be emitted if there is a common mutable object (such as lists, dicts, sets, bytearrays and array from Pythonarray
module) insidecb_kwargs
values. However, now I think this is unnecessary and can lead to redundant warnings. Especially in cases when mutable object is passed, but there is no intent in mutating it inside request's callback; for example, dictionaries does not have immutable analogue andtypes.MappingProxyType
is not pickleable.The text was updated successfully, but these errors were encountered: