Seeking Efficient Method to Retrieve Latest Incremental Files in Microsoft Fabric Data Pipeline
I'm working on a Microsoft Fabric Data Pipeline where my use case is to get the latest files from a specific folder to process them incrementally. I have implemented the following logic: 1. Get Metadata Activity (activity name: Get Meta data_PF) – This activity retrieves the list of files inside the parent folder along with their properties: - Child Items - Exists - Item Name - Item Type - Last Modified 2. ForEach Activity – I use a ForEach loop to iterate over the items returned by the Get Meta data_PF activity and apply the "Filter by Last Modified" option. The filtering uses: - Start Time (UTC): @variables('start_date') - End Time (UTC): @variables('end_date') This pipeline runs daily and pulls files incrementally based on the date range. The files are fetched from an FTP server and copied to the Lakehouse for further processing. Currently, I process two files per day (e.g., 2024-10-21_Find.csv and 2024-10-21_Auto.csv), but I’m concerned about scalability. If the folder eventually contains thousands or even 100k files, iterating through all of them to find the required ones could be a performance bottleneck. My Current Idea: Since each file follows a date-based naming convention (e.g., 2024-10-21_Find.csv and 2024-10-21_Auto.csv), my idea is to: 1. Fetch the last successful run date from an audit table (e.g., 2024-10-19). 2. Find the difference between the last run date and the current date (e.g., 2024-10-22 - 2024-10-19 = 3 days). 3. Generate an array containing the expected filenames for these days: ["2024-10-20","2024-10-20","2024-10-21","2024-10-21"] Then transform it into: ["2024-10-20_Find.csv","2024-10-20_Auto.csv","2024-10-21_Find.csv","2024-10-21_Auto.csv"] 4. Use this array in the Copy Activity to fetch only the relevant files without iterating through all files in the folder. Senior’s Suggestion: A senior colleague suggested that instead of using ForEach to iterate over all files, I could: 1. Use Lookup activity to fetch the file list and order it by the "Last Modified" timestamp in descending order.