How to ingest a REST API response into a table in a Lakehouse
Since I don't have much experience with API requests, I don't know how to proceed. What I am trying to do is use PySpark notebooks to first get the response and then convert it correctly into a dataframe. This needs to work well at scale, because I need to ingest approximately 3 million rows. I can perform simple requests with the requests Python library, but I don't know how to translate that into a solution for big data. Therefore I need to perform paging/looping. The API I am using only supports paging through offset and limit parameters, so I need to loop until all items are retrieved, while making sure it doesn't cause too much overhead and ideally runs in parallel (a rough sketch of the loop I have in mind follows below).
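This is roughly the offset/limit loop I have so far. It's a minimal sketch: the endpoint URL and page size are placeholders for my real API, and I'm assuming the response body is a plain JSON array of items.

import requests

BASE_URL = "https://api.example.com/v1/companies"   # placeholder endpoint
PAGE_SIZE = 1000                                     # assumed maximum page size

def fetch_all_pages():
    """Loop over offset/limit pages until a page comes back empty or short."""
    all_items = []
    offset = 0
    while True:
        resp = requests.get(
            BASE_URL,
            params={"offset": offset, "limit": PAGE_SIZE},
            timeout=60,
        )
        resp.raise_for_status()
        items = resp.json()          # assuming the body is a JSON array of items
        if not items:
            break
        all_items.extend(items)
        if len(items) < PAGE_SIZE:   # last (partial) page reached
            break
        offset += PAGE_SIZE
    return all_items

This runs sequentially on the driver, which is my worry with 3 million rows, so I'm unsure whether I should instead distribute ranges of offsets over the cluster.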
However, the output is nested, which doesn't make it easier for me. I keep losing data while converting it into a dataframe, because objects hold other objects or arrays of objects, and somewhere in between the schema isn't inferred properly. Below is an example of all the levels in my JSON output, followed by a rough sketch of how I'm currently trying to flatten it.
"company": {
"companyId": "932xxx5stest",
"companyCode": "TEST",
"_links": [
{
"rel": "self",
}
]
}
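This is how I'm currently trying to get it flat in the notebook. It's only a sketch: the field names come from the example above (the real payload has more levels), I'm assuming the built-in spark session of the Fabric notebook, and the dummy record stands in for the output of the paging loop.

import json
from pyspark.sql import functions as F

# items = fetch_all_pages() from the sketch above; one dummy record here
items = [
    {
        "company": {
            "companyId": "932xxx5stest",
            "companyCode": "TEST",
            "_links": [{"rel": "self"}],
        }
    }
]

# Let Spark infer the schema from raw JSON strings instead of from Python dicts,
# which is where I seem to lose fields.
raw_df = spark.read.json(
    spark.sparkContext.parallelize([json.dumps(i) for i in items])
)

# Flatten: pull nested struct fields up and explode the _links array.
flat_df = (
    raw_df
    .select(
        F.col("company.companyId").alias("companyId"),
        F.col("company.companyCode").alias("companyCode"),
        F.explode_outer("company._links").alias("link"),
    )
    .withColumn("rel", F.col("link.rel"))
    .drop("link")
)

Is selecting/exploding every level by hand like this the right approach, or is there a better way to define the schema up front so nothing gets dropped?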
The Copy Data activity in a pipeline doesn't really work for me, because my API doesn't reliably expose the total item count. I can't drive the pagination from a pipeline with that, which means I have to handle it manually; that's why I prefer the notebook.
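For completeness, once the dataframe is flat I was planning to just write it to a Lakehouse table from the notebook, something like the following (the table name is a placeholder):

# Write the flattened DataFrame to a Lakehouse (Delta) table
(
    flat_df.write
    .format("delta")
    .mode("overwrite")   # or "append" for incremental loads
    .saveAsTable("companies")
)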
Any ideas, useful resources or (your) best practices are welcome! Thanks in advance. If you need more information, please ask and I'll provide some more context.