The cloudFiles.useManagedFileEvents option with Auto Loader enables efficient file discovery.
How does Auto Loader with file events work?
Auto Loader with file events uses the file event notification functionality provided by cloud vendors. You can configure cloud storage containers to publish notifications on file events such as file creation and modification. For example, with Amazon S3 event notifications, a new file arrival can trigger a notification to an Amazon SNS topic (see the Amazon S3 notification content structure for details). You can then subscribe an Amazon SQS queue to the SNS topic to process the event asynchronously.

Azure Databricks file events is a service that sets up cloud resources to listen for file events. Alternatively, you can set up the cloud resources yourself and provide your own storage queue.
After you configure the cloud resources, the service processes file event notifications and caches file metadata. Auto Loader uses this cache to discover files when it is run with cloudFiles.useManagedFileEvents set to true.
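As a minimal sketch of how a stream might opt in to this cache, the following PySpark helpers build the Auto Loader options and start a streaming read. The function names, the default file format, and the paths are hypothetical placeholders. `spark` is assumed to be an existing SparkSession on a workspace where file events are already enabled for the external location, and `cloudFiles.schemaLocation` is included on the assumption that you want Auto Loader to infer the schema.

```python
def file_events_options(file_format: str, schema_location: str) -> dict:
    """Build Auto Loader options that turn on managed file events.

    The helper itself is illustrative; the option keys are Auto Loader options.
    """
    return {
        "cloudFiles.format": file_format,
        # Lets Auto Loader infer and track the schema (assumed to be wanted here).
        "cloudFiles.schemaLocation": schema_location,
        # Discover files from the file events cache instead of listing directories.
        "cloudFiles.useManagedFileEvents": "true",
    }


def read_with_file_events(spark, load_path: str, schema_location: str,
                          file_format: str = "json"):
    """Return a streaming DataFrame that uses Auto Loader with file events.

    `spark` is an existing SparkSession; nothing runs until a sink is started.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**file_events_options(file_format, schema_location))
        .load(load_path)
    )
```

Keeping the options in a plain dictionary makes it possible to check the configuration without a running cluster.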

When a stream runs for the first time with cloudFiles.useManagedFileEvents set to true, Auto Loader performs a full directory listing of the load path to discover all files and to synchronize with the file events cache. During this run, it secures a valid read position in the cache and stores that position in the stream's checkpoint. Subsequent runs of Auto Loader discover new files by reading directly from the file events cache using the stored read position and do not require directory listing.
Databricks recommends running your Auto Loader streams at least once every seven days to take advantage of incremental file discovery from the cache. If you don’t run Auto Loader at least this often, the stored read position becomes invalid and Auto Loader must perform a full directory listing to synchronize with the file events cache.
File events mode vs. classic file notification mode
This diagram compares file events mode and classic file notification mode.

In file events mode, a single managed file events service connects to customer cloud storage. It creates one shared SNS topic, SQS queue, and SNS-to-SQS subscription that serves multiple consumers, including Auto Loader and Triggers. In classic file notification mode, each consumer requires its own event subscription and queue, resulting in multiple separate notification pipelines per bucket.
File events mode has several advantages compared to classic file notification mode. Primarily, it requires only one queue for all Auto Loader streams on a bucket, helping you avoid the per-bucket notifications limit. For more information, see File notification mode with and without file events enabled on external locations.
When does Auto Loader with file events use directory listing?
Auto Loader performs a full directory listing when:
- You start a new stream.
- You migrate a stream from directory listing or classic file notifications.
- You do not run Auto Loader with file events for more than seven days.
- You make updates to the external location that invalidate Auto Loader's read position. Examples include when you turn file events off and on again, when you change the external location's path, or when you provide a different queue for the external location.
Auto Loader always performs a full listing on the first run, even when includeExistingFiles is set to false. With this setting, Auto Loader ingests only the files that were created after the stream's start time. To do so, it still lists the entire directory to discover those files, establishes a read position in the file events cache, and stores it in the checkpoint. Subsequent runs read directly from the file events cache and do not require a directory listing.
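The conditions above can be summarized as a small decision function. This is only an illustrative model of the documented behavior, not Auto Loader's implementation: `StreamState`, its fields, and `needs_full_listing` are hypothetical names, and a missing `last_run` stands in for both a new stream and a stream migrated from another mode.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Documented threshold: streams idle for more than seven days must relist.
MAX_IDLE = timedelta(days=7)


@dataclass
class StreamState:
    # None models a stream that has never run with file events
    # (a new stream, or one migrated from another discovery mode).
    last_run: Optional[datetime]
    # False models an external location update that invalidated the read
    # position (file events toggled, path changed, or queue replaced).
    read_position_valid: bool = True


def needs_full_listing(state: StreamState, now: datetime) -> bool:
    """Return True when this model expects a full directory listing."""
    if state.last_run is None:
        return True
    if not state.read_position_valid:
        return True
    return now - state.last_run > MAX_IDLE
```

For example, under this model a stream that last ran five days ago with a valid read position reads from the cache, while one that last ran nine days ago falls back to a full listing.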
The Azure Databricks file events service also performs full directory listings on the external location to verify that it has not missed any files (for example, if the provided queue is misconfigured). The first full directory listing begins as soon as file events are enabled on the external location. Each subsequent listing occurs 24 hours after the last full scan as long as there is at least one Auto Loader stream using file events to ingest data.
Best practices for Auto Loader with file events
Follow these best practices to optimize performance and reliability when using Auto Loader with file events.
Use volumes for optimal file discovery
For enhanced performance, Databricks recommends creating an external volume for each path or subdirectory that Auto Loader loads data from and supplying volume paths (for example, /Volumes/someCatalog/someSchema/someVolume) to Auto Loader instead of cloud paths (for example, s3://bucket/path/to/volume). This optimizes file discovery because Auto Loader can list the volume using an optimized data access pattern.
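As a small illustration of the path convention, the helpers below assemble a volume path from its parts and check whether a load path already uses the volume form. Both function names are hypothetical, and the catalog, schema, and volume names are the placeholder values from the example above.

```python
def volume_path(catalog: str, schema: str, volume: str, *subdirs: str) -> str:
    """Join the parts of a volume path: /Volumes/<catalog>/<schema>/<volume>/..."""
    return "/".join(["/Volumes", catalog, schema, volume, *subdirs])


def is_volume_path(path: str) -> bool:
    """Return True for volume paths and False for cloud URIs such as s3://..."""
    return path.startswith("/Volumes/")


# Placeholder names taken from the example in the text.
load_path = volume_path("someCatalog", "someSchema", "someVolume")
```

A check like `is_volume_path` could be used in pipeline configuration code to flag streams that still pass cloud paths to Auto Loader.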
Consider file arrival triggers for event-driven pipelines
For event-driven data processing, consider using a file arrival trigger instead of a continuous pipeline. File arrival triggers automatically start your pipeline when new files arrive, providing better resource utilization and cost efficiency because your cluster only runs when there are new files to process.
Configure appropriate intervals with continuous triggers
Databricks recommends using file arrival triggers to process files as soon as they arrive. However, if your use case requires lower latency and uses continuous triggers such as Trigger.ProcessingTime, Databricks recommends setting the trigger interval to 1 minute or higher. In Lakeflow Spark Declarative Pipelines, set this value using pipelines.trigger.interval. A longer interval lowers how often Auto Loader polls for new files and allows more streams to run concurrently in your workspace.
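A minimal PySpark sketch of the interval recommendation follows. The helper names, the checkpoint path, and the target table are hypothetical, and `df` is assumed to be a streaming DataFrame created with Auto Loader. Rejecting intervals under one minute is a choice made for this example; the recommendation itself is guidance, not a limit enforced by Spark.

```python
MIN_INTERVAL_SECONDS = 60  # the recommended lower bound of 1 minute


def processing_time_trigger(seconds: int = MIN_INTERVAL_SECONDS) -> dict:
    """Build keyword arguments for DataStreamWriter.trigger().

    Raises ValueError for intervals below the recommended minimum
    (an illustrative policy, not Spark behavior).
    """
    if seconds < MIN_INTERVAL_SECONDS:
        raise ValueError(
            f"interval {seconds}s is below the recommended {MIN_INTERVAL_SECONDS}s"
        )
    return {"processingTime": f"{seconds} seconds"}


def start_stream(df, checkpoint_path: str, target_table: str, seconds: int = 60):
    """Start a streaming write that polls for new files every `seconds` seconds."""
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(**processing_time_trigger(seconds))
        .toTable(target_table)
    )
```

Centralizing the interval in one helper makes it easier to keep every stream in a workspace at or above the recommended value.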
For very low-latency requirements, consider classic file notification mode instead. File events introduces an additional caching hop between cloud storage and Auto Loader, which can add latency compared to reading directly from the cloud queue.
Limitations of Auto Loader with file events
Auto Loader does not support path rewrites. Path rewrites apply when multiple buckets or containers are mounted under DBFS, which is a deprecated usage pattern.
For a general list of file events limitations, see File events limitations.