Loading Pydantic models from JSON without running out of memory

134 points by itamarst 22 May 2025 | 45 comments

Comments

scolvin 23 May 2025

Pydantic author here. We have plans for an improvement to pydantic where JSON is parsed iteratively, which will make way for reading a file as we parse it. Details in https://github.com/pydantic/pydantic/issues/10032.

Our JSON parser, jiter (https://github.com/pydantic/jiter) already supports iterative parsing, so it's "just" a matter of solving the lifetimes in pydantic-core to validate as we parse.

This should make pydantic around 3x faster at parsing JSON and significantly reduce the memory overhead.

fidotron 22 May 2025

Having only recently encountered this, does anyone have any insight as to why it takes 2GB to handle a 100MB file?

This looks highly reminiscent (though not exactly the same, pedants) of why people used to get excited about using SAX instead of DOM for XML parsing.
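
One likely contributor to the blow-up is per-object overhead: every dict, string, int, and float that the parser creates is a full Python object with its own header. The sketch below is a rough illustration rather than a measurement of the article's workload; the sample records and the `deep_sizeof` helper are made up for the example, and `sys.getsizeof` figures vary by Python version and platform.

```python
import json
import sys


def deep_sizeof(obj, seen=None):
    """Roughly total the bytes held by obj and the containers it references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size


# Hypothetical sample data: 10,000 small records.
records = [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(10_000)]
text = json.dumps(records)
parsed = json.loads(text)

text_bytes = len(text.encode("utf-8"))
object_bytes = deep_sizeof(parsed)
print(f"JSON text:      {text_bytes:,} bytes")
print(f"Parsed objects: {object_bytes:,} bytes ({object_bytes / text_bytes:.1f}x)")
```

The parsed structure comes out several times larger than the text it was built from, and that is before any model layer builds its own objects on top.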

jmugan 22 May 2025

My problem isn't running out of memory; it's loading a complex model where the fields are BaseModels and unions of BaseModels multiple levels deep. It doesn't load it all the way and leaves some of the deeper parts as dictionaries. I almost need a parser that searches the space of possible loads. Anyone have any ideas for software that does that?

deepsquirrelnet 23 May 2025

Alternatively, if you had to go with JSON, you could consider using JSONL. I think I’d start by evaluating whether this is a good application for JSON. I tend to only want to use it for small files. Binary formats are usually much better in this scenario.
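
To make the JSONL suggestion concrete: with one record per line, only the current line has to be parsed at any moment. This is a minimal sketch using only the standard library; `iter_jsonl` and the sample records are hypothetical names and data for illustration.

```python
import io
import json
from typing import IO, Iterator


def iter_jsonl(stream: IO[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical data; in practice the stream would be an opened .jsonl file.
sample = io.StringIO(
    '{"id": 1, "name": "a"}\n'
    '{"id": 2, "name": "b"}\n'
    "\n"
    '{"id": 3, "name": "c"}\n'
)
total = sum(record["id"] for record in iter_jsonl(sample))
print(total)
```

Because the generator hands out records one at a time, memory use tracks the largest single record rather than the whole file, as long as the caller does not collect everything into a list.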

dgan 22 May 2025

I gave up on Python dataclasses and JSON. I now use protobuf objects within the application itself. I also have a "...Mixin" class for almost every wire model, with extra methods.

Automatic, statically typed deserialization is worth the trouble, in my opinion.

fjasdfas 22 May 2025

So are there downsides to just always setting slots=True on all of my Python data types?

thisguy47 22 May 2025

I'd like to see a comparison of ijson vs just `json.load(f)`. `ujson` would also be interesting to see.
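
One way such a comparison could be set up is with the standard library's `tracemalloc`. The harness below only measures `json.loads`; the `peak_bytes` helper and the payload are hypothetical, and other parsers would be wrapped in the same call. Note that `tracemalloc` reports allocations made through Python's allocators, which is not the same number as process RSS.

```python
import json
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args) and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Hypothetical payload standing in for a real file.
text = json.dumps([{"id": i, "tags": ["a", "b"]} for i in range(5_000)])
data, peak = peak_bytes(json.loads, text)
print(f"{len(text):,} bytes of JSON -> peak {peak:,} bytes while parsing")
```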

zxilly 22 May 2025

Maybe using mmap would also save some memory, though I'm not sure whether this can be implemented in Python.
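
Python does ship an `mmap` module in the standard library. The sketch below maps a small temporary JSONL file (written here just for the demo) and parses it line by line. It is an illustration of the API, not a claim about how much memory it would save: `json.loads` accepts `str`, `bytes`, or `bytearray`, so a single large JSON document would still have to be copied out of the mapping, and the parsed Python objects take the same space either way.

```python
import json
import mmap
import os
import tempfile

# Hypothetical JSONL file written just for this demo.
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    for i in range(3):
        tmp.write(json.dumps({"id": i}).encode("utf-8") + b"\n")
    path = tmp.name

ids = []
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # readline() copies one line at a time out of the mapping;
        # it returns b"" once the end of the file is reached.
        for line in iter(mm.readline, b""):
            ids.append(json.loads(line)["id"])

os.remove(path)
print(ids)
```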

kayson 23 May 2025

How does the speed of the dataclass version compare?

m_ke 22 May 2025

Or just dump pydantic and use msgspec instead: https://jcristharif.com/msgspec/