Tutorial: Retrieve large data sets with more than 2500 results

We all know Big Data is sexy...and Kimono is a great tool to get your hands on that beautiful, sexy big data. By default, though, Kimono will only return 2500 rows at a time. But you want that BIG data! Don't worry, we've got you covered: this tutorial will walk you through how to retrieve all of it.

What you'll need:

  • A Kimono API with 2500+ rows of results
  • A programming language of your choice
  • Room to store a file on your hard drive (it won't be too big)

The basic mechanism at play here is a special parameter we add to our API calls called kimoffset. kimoffset tells our API to offset our results by a certain number of rows. So, for instance, calling kimonolabs.com/api/MYAPIID/....&kimoffset=2500 will (by default) return results 2501-5000. The basic pattern for getting all of our data then becomes:

  • Call our API starting with kimoffset=0 (the first 2500 results)
  • Increment kimoffset by 2500 and call the API again
  • Add the results together
  • Repeat until there are no more results!
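
For example, the first few calls would look something like this (MYAPIID and MYAPIKEY stand in for your own API ID and API key):

https://www.kimonolabs.com/api/MYAPIID?apikey=MYAPIKEY&kimoffset=0      (rows 1-2500)
https://www.kimonolabs.com/api/MYAPIID?apikey=MYAPIKEY&kimoffset=2500   (rows 2501-5000)
https://www.kimonolabs.com/api/MYAPIID?apikey=MYAPIKEY&kimoffset=5000   (rows 5001-7500)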

Great. Let's take a look at this in code. We'll be using Python here, but you can use any programming language you please.

 

The Code:

 

import json
import urllib

offset = 0
all_items = []

while True:
    # loop until we run out of data (when we hit an offset that returns no results)
    results = json.load(urllib.urlopen("https://www.kimonolabs.com/api/3wfsq53s?apikey=sHlv8VajTMCK5ePLUPL1Yg2nbf9iOs9H&kimoffset=%d" % offset))
    if results['results']:
        print "incrementing offset"
        print "current length of all_items ", len(all_items)
        offset += 2500
        all_items.extend(results['results']['collection1'])
    else:
        # break when we run out of results...if you have a lot of results this will take A LONG TIME to run
        break

print "end loop"

# write to a file named data.txt
with open('data.txt', 'w') as outfile:
    json.dump(all_items, outfile)

Let's break it down:

import json
import urllib

We start by importing the json and urllib packages (you could use requests instead). Pretty standard; if you're using another language you'll need its equivalents. The json package helps you work with JSON data, and urllib allows you to make HTTP requests (for calling the API).
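
If you'd rather use requests (mentioned above), the same call could look something like this. This is just a sketch, with MYAPIID and MYAPIKEY standing in for your own API ID and API key:

import requests

# fetch one page of results, passing kimoffset as a query parameter
response = requests.get("https://www.kimonolabs.com/api/MYAPIID",
                        params={"apikey": "MYAPIKEY", "kimoffset": 0})
results = response.json()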

offset = 0
all_items = []

Here we initialize our offset value at 0 and create an empty array (called a list in Python) that will store all of our data.


Next, we're going to run a loop that does the heavy lifting for this script. I'm adding comments in the code itself to make it a little clearer what's going on.

while True:
    # loop until we run out of data (when we hit an offset that returns no results)
    # call the API with the current offset value (we'll increment offset below)
    results = json.load(urllib.urlopen("https://www.kimonolabs.com/api/3wfsq53s?apikey=sHlv8VajTMCK5ePLUPL1Yg2nbf9iOs9H&kimoffset=%d" % offset))
    # if there are any results (i.e. we got back a nonzero amount of data),
    # add the data to our all_items array and increment the offset by 2500
    if results['results']:
        print "incrementing offset"
        print "current length of all_items ", len(all_items)
        offset += 2500
        all_items.extend(results['results']['collection1'])
    # if there are no more results, we break the loop
    else:
        # break when we run out of results...if you have a lot of results this will take A LONG TIME to run
        break


So that's actually pretty straightforward. We're looping through our results, incrementing the offset each time. Once we run out of results, we break the loop and exit.
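
As an aside, the reason we index into results['results']['collection1'] is that the API wraps each collection's rows inside the 'results' object. A single response looks roughly like this; the property names are just placeholders, and the exact shape will depend on how your API is set up:

{
  "results": {
    "collection1": [
      {"property1": "...", "property2": "..."},
      {"property1": "...", "property2": "..."}
    ]
  },
  ...
}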

Finally, let's write the data to a file. This is probably more Python-specific than the rest of the code, but the gist remains the same: all we're doing is writing our all_items array to a file called data.txt (watch out for overwriting an existing file!).

#write to file named data.txt..
with open('data.txt', 'w') as outfile:
    json.dump(all_items, outfile)
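
If you later want to load the data back into Python, the same json module will do it. A quick sketch, assuming the data.txt written above:

import json

with open('data.txt') as infile:
    all_items = json.load(infile)

print len(all_items)  # total number of rows we retrieved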

 
