I'm working on a Microsoft Fabric Data Pipeline that needs to pick up the latest files from a specific folder and process them incrementally. I have implemented the following logic:
1. Get Metadata Activity (activity name: Get Meta data_PF) – retrieves the list of files inside the parent folder along with these properties:
- Child Items
- Exists
- Item Name
- Item Type
- Last Modified
2. ForEach Activity – I use a ForEach loop to iterate over the child items returned by Get Meta data_PF, applying the "Filter by Last Modified" option with:
- Start Time (UTC): @variables('start_date')
- End Time (UTC): @variables('end_date')
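For reference, the two window variables could be populated by Set Variable activities at the start of the run. A minimal sketch, assuming a fixed one-day lookback (addDays() and utcnow() are standard pipeline expression functions; the variable names are from my pipeline):

```
start_date (Set Variable): @addDays(utcnow(), -1)
end_date   (Set Variable): @utcnow()
```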
This pipeline runs daily and pulls files incrementally based on the date range. The files are fetched from an FTP server and copied to the Lakehouse for further processing. Currently, I process two files per day (e.g., 2024-10-21_Find.csv and 2024-10-21_Auto.csv), but I’m concerned about scalability. If the folder eventually contains thousands or even 100k files, iterating through all of them to find the required ones could be a performance bottleneck.
My Current Idea:
Since each file follows a date-based naming convention (e.g., 2024-10-21_Find.csv and 2024-10-21_Auto.csv), my idea is to:
1. Fetch the last successful run date from an audit table (e.g., 2024-10-19).
2. Find the difference between the last run date and the current date (e.g., 2024-10-22 - 2024-10-19 = 3 days).
3. Generate an array of the missing dates, one entry per expected file (two file types per date; see the expression sketch after this list):
["2024-10-20","2024-10-20","2024-10-21","2024-10-21","2024-10-22","2024-10-22"]
Then transform it into:
["2024-10-20_Find.csv","2024-10-20_Auto.csv","2024-10-21_Find.csv","2024-10-21_Auto.csv","2024-10-22_Find.csv","2024-10-22_Auto.csv"]
4. Use this array in the Copy Activity to fetch only the relevant files without iterating through all files in the folder.
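For step 3, the filename array can be generated with pipeline expression functions alone, so no activity ever has to list the folder. A sketch under my assumptions: a string variable last_run_date holds the audit-table value (e.g., 2024-10-19); range(), ticks(), sub(), div(), addDays(), formatDateTime(), and concat() are standard expression functions; 864000000000 is the number of 100-nanosecond ticks in one day:

```
day_offsets (Set Variable, Array type):
  @range(1, div(sub(ticks(utcnow()), ticks(variables('last_run_date'))), 864000000000))

Expected filenames per date, inside a ForEach over day_offsets:
  @concat(formatDateTime(addDays(variables('last_run_date'), item()), 'yyyy-MM-dd'), '_Find.csv')
  @concat(formatDateTime(addDays(variables('last_run_date'), item()), 'yyyy-MM-dd'), '_Auto.csv')
```

With last_run_date = 2024-10-19 and a run on 2024-10-22, range(1, 3) yields [1, 2, 3], i.e., 2024-10-20 through 2024-10-22, which matches the six filenames above. The loop cost depends only on the number of missing days, not on the number of files in the folder.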
Senior’s Suggestion:
A senior colleague suggested that instead of using ForEach to iterate over all files, I could:
1. Use a Lookup activity to fetch the file list, ordered by the "Last Modified" timestamp in descending order.
2. Pass this ordered list directly to the loop so only the latest files are processed, reducing unnecessary iterations (a possible sketch follows).
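If I understand the suggestion correctly, the Lookup needs a queryable source, since Get Metadata on its own cannot sort a folder listing. A hypothetical sketch, assuming each arriving file is also logged to an audit table in a Lakehouse or Warehouse that the Lookup can query (the names file_inventory, file_name, last_modified, and the activity name Lookup_LatestFiles are mine):

```
Lookup query ("First row only" unchecked):
  SELECT TOP 2 file_name
  FROM file_inventory
  ORDER BY last_modified DESC

ForEach items:
  @activity('Lookup_LatestFiles').output.value

File name in the Copy source, inside the ForEach:
  @item().file_name
```

One caveat I'm aware of from Azure Data Factory (and presumably inherited by Fabric pipelines): the Lookup activity caps output at 5,000 rows / 4 MB, so ordering and limiting in the query itself, as above, matters.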
My Question:
1. Is my current approach (generating an array of filenames based on the date difference) a good idea? Will it scale well as the number of files grows?
2. Is there a better way to handle this scenario? Specifically, how can I avoid iterating through all files when I only need the latest ones?
3. How can I efficiently use the Lookup activity with an ordered file list to minimize the number of iterations and improve performance? Any examples or best practices would be appreciated.
Looking forward to any recommendations or improvements from the community! Thank you!