Recently, I’ve been working on a GIS-style application whose main purpose is to automate a cumbersome, repetitive task. I have been developing scenery for a flight simulation platform, where buildings, landmarks, houses and similar objects should be placed at their real geographic locations. My Python-based program auto-generates this scenery from given GeoJSON data. The GeoJSON data is a JSON file containing millions of records for a specific region on Earth, so I have to process millions of records.
In one run, we analyzed and processed 1.5M records. That is just one single tile on the Earth.
Python has its own built-in json package that parses JSON files. The problem with this package is that it loads the whole JSON data into memory at once and only then lets you iterate over it. Imagine millions of records being loaded into memory. This is a big problem, and we get a “MemoryError” during parsing.
I found a brilliant third-party open-source JSON package: ijson.
This library uses a streaming technique to iterate over the data under a given key. Instead of loading all the data into memory at once, it gives you a generator that you can iterate over in a streaming fashion. Because it never needs to hold all the data at once, it avoids the MemoryError.
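To illustrate the behavior described above, here is a minimal sketch using the standard json module. The tiny inline document is a made-up stand-in for a real GeoJSON file; the point is only that json.load hands back the complete list before you can touch a single item.

```python
import io
import json

# Hypothetical miniature document standing in for a multi-gigabyte GeoJSON file.
doc = io.StringIO('{"features": [{"id": 1}, {"id": 2}, {"id": 3}]}')

# json.load parses the entire document and builds every object in memory
# before returning, so "features" is already a fully materialized list.
data = json.load(doc)
features = data["features"]

print(type(features).__name__, len(features))
```

With three items this is harmless, but the same call on millions of items has to hold all of them at once.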
If your JSON data has a structure like this:
"features": [ millions of data here.. ]
You can iterate over it using ijson like this:
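import ijson

# ijson expects the file to be opened in binary mode.
with open(self.geojson_path, "rb") as geo_data_file:
    # Stream the elements of the top-level "features" array one at a time.
    items = ijson.items(geo_data_file, "features.item")
    for item in items:
        pass  # more processing here..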
As you can see, iterating is simple. If you have a key, say “features”, in a JSON structure that contains an array of millions of items, you can target it using “features.item”, and it returns a generator that lets you iterate over the data in a streaming fashion.
DOWNSIDE: because it is a generator and iterates in a streaming fashion, the len() function does not work on it. But there are always workarounds. What is important is that we are able to parse our large JSON file.
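One possible workaround, sketched here under the assumption that you only need the total count, is to consume the generator once and count as you go. The helper name count_items is hypothetical, and the generator expression below merely stands in for what ijson.items would return.

```python
from typing import Iterable


def count_items(items: Iterable) -> int:
    """Count the elements of any iterable by consuming it once."""
    return sum(1 for _ in items)


# Stand-in for the generator returned by ijson.items(...) (assumption for the demo).
fake_features = ({"id": i} for i in range(5))

total = count_items(fake_features)
print(total)

# The generator is now exhausted, so a second pass finds nothing left.
remaining = count_items(fake_features)
print(remaining)
```

Keep in mind that counting uses up the generator. With a real file you would have to reopen it (or seek back to the start) and call ijson.items again for the actual processing pass, or simply keep a running counter inside your processing loop.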