Since I don't have much experience with API requests, I'm not sure how to proceed. What I'm trying to do is use a PySpark notebook to first get the API response and then convert it correctly into a DataFrame. This needs to work well at scale, because I need to ingest approximately 3 million rows. I can perform simple requests with the Python requests library, but I don't know how to translate that into a big-data solution. That means I need to do paging/looping: the API only supports paging through offset and limit parameters, so I have to loop until all items are retrieved, while making sure it doesn't cause too much overhead and can run in parallel.
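Here is roughly the paging pattern I have so far (the URL and the "items" key are placeholders for my real API, and the page size of 1000 is just a guess); since there is no total count, the loop simply stops once a page comes back shorter than the limit:

```python
import requests

# Placeholders -- the real URL, auth and payload key come from my API.
BASE_URL = "https://api.example.com/v1/items"
LIMIT = 1000

def fetch_page(offset, limit=LIMIT, session=None):
    """Fetch one page of items; returns a (possibly empty) list."""
    http = session or requests
    resp = http.get(BASE_URL, params={"offset": offset, "limit": limit}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("items", [])  # "items" is a guess at the payload key

def fetch_all():
    """Keep paging until a page comes back shorter than LIMIT."""
    records, offset = [], 0
    with requests.Session() as session:
        while True:
            page = fetch_page(offset, session=session)
            records.extend(page)
            if len(page) < LIMIT:  # last page reached, no total count needed
                break
            offset += LIMIT
    return records

raw_records = fetch_all()
```

This works for small tests, but I doubt a purely sequential loop like this is the right way to pull 3 million rows.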
However, the output is nested, which doesn't make it easier. I keep losing data when converting it into a DataFrame, because objects hold other objects or arrays of objects, and somewhere along the way the schema isn't inferred correctly. Below is an example of the levels in my JSON output, followed by the schema/flattening approach I'm currently trying.
"company": {
"companyId": "932xxx5stest",
"companyCode": "TEST",
"_links": [
{
"rel": "self",
}
]
}
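What I'm attempting in the notebook is roughly the following: define the schema explicitly (instead of letting Spark infer it) so the nested struct and the `_links` array aren't dropped, then flatten it with select/explode. The field names below are only the ones from the example (my real schema has more fields), and `spark` is the notebook's built-in SparkSession:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Explicit schema for the nested "company" object, so nothing is silently
# dropped by schema inference; the real payload has more fields than this.
schema = StructType([
    StructField("company", StructType([
        StructField("companyId", StringType()),
        StructField("companyCode", StringType()),
        StructField("_links", ArrayType(StructType([
            StructField("rel", StringType()),
        ]))),
    ])),
])

df = spark.createDataFrame(raw_records, schema=schema)

# Flatten: pull the nested fields up and explode the _links array into rows.
flat = (
    df.select(
        F.col("company.companyId").alias("companyId"),
        F.col("company.companyCode").alias("companyCode"),
        F.explode_outer("company._links").alias("link"),
    )
    .select("companyId", "companyCode", F.col("link.rel").alias("rel"))
)
flat.show(truncate=False)
```

I'm not sure whether hand-writing the full schema like this is the recommended approach, or whether there is a better way to keep the nested levels intact.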
The Copy Data activity in a pipeline doesn't really work for me, because my API doesn't reliably provide the total item count. So I can't drive the pagination from a pipeline, which means I have to handle it manually; that's why I prefer the notebook.
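To address the "running in parallel" part without knowing the total count, my current idea is to fetch pages in small concurrent waves on the driver and stop as soon as a wave contains a short page. This reuses `fetch_page`, `LIMIT` and `schema` from the sketches above, and `WORKERS` is just a guess:

```python
import json
from concurrent.futures import ThreadPoolExecutor

WORKERS = 8  # pages fetched concurrently per wave -- just a guess

def fetch_all_parallel():
    """Fetch pages in concurrent waves of WORKERS offsets; stop as soon as
    a wave contains a page shorter than LIMIT (so no total count is needed)."""
    records, wave_start = [], 0
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        while True:
            offsets = [wave_start + i * LIMIT for i in range(WORKERS)]
            pages = list(pool.map(fetch_page, offsets))
            for page in pages:
                records.extend(page)
            if any(len(page) < LIMIT for page in pages):
                break
            wave_start += WORKERS * LIMIT
    return records

raw_records = fetch_all_parallel()

# Hand the collected JSON to Spark with the explicit schema from above,
# so the nested structure survives the conversion.
df = spark.read.schema(schema).json(
    spark.sparkContext.parallelize([json.dumps(r) for r in raw_records])
)
```

I realise that collecting 3 million records in a driver-side list might be heavy, so maybe each wave should be written out incrementally, or the offsets distributed to the executors instead; I'm not sure what the right trade-off is regarding overhead and API rate limits, which is exactly where I'd appreciate advice.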
Any ideas, useful resources, or (your) best practices are welcome! Thanks in advance. If you need more information, please ask and I'll provide more context.