At Influential.co, we have a humongous Qdrant vector database with two relevant collections.
- The first collection (network_accounts) contains one point for each influencer we have across all platforms (Instagram, Facebook, Twitter, Snapchat, and TikTok). We have almost 1 million influencers tracked, since we have criteria for which influencers to include in our database.
- The second collection (network_posts) contains one point for each post we have for each influencer in network_accounts. We have nearly 1 billion posts for all of our tracked influencers.
Each point in each collection has a unique set of attributes called a payload, which is what we use to filter searches.
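To make that concrete, here is a rough sketch of a filtered search against network_posts using the qdrant-client Python library. The payload field name (platform), the connection details, and the query vector are all hypothetical stand-ins, not our actual schema:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

# Hypothetical connection details, not our real endpoint.
client = QdrantClient(url="https://qdrant.example.com", api_key="...")

# Search network_posts for posts similar to a query vector, restricted
# by a payload filter. The "platform" field is illustrative only.
hits = client.search(
    collection_name="network_posts",
    query_vector=[0.0] * 768,  # stand-in for a real query embedding
    query_filter=Filter(
        must=[FieldCondition(key="platform", match=MatchValue(value="instagram"))]
    ),
    limit=10,
)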
Databricks Apps
Databricks Apps is a feature that reached general availability (GA) on May 13, 2025.
Databricks Apps is now generally available (GA). This feature lets you build and run interactive full-stack applications directly in the Databricks workspace. Apps run on managed infrastructure and integrate with Delta Lake, notebooks, ML models, and Unity Catalog. - May 2025 Release Notes
You can build apps that run directly in the Databricks environment, or develop them using external tools and IDEs like PyCharm and VS Code.
It supports common industry-standard frameworks like Plotly Dash, Gradio, and Streamlit, and there are prebuilt Python templates to start from.
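For example, a minimal Streamlit app.py is all it takes to have something running (a generic sketch, not tied to any particular template):

import streamlit as st

# Minimal Streamlit app: a title and a simple interactive widget.
st.title("Hello from Databricks Apps")
name = st.text_input("Your name")
if name:
    st.write(f"Hello, {name}!")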
A Production-Ready Experience
Databricks Apps don’t require additional custom infrastructure layers to be built and maintained. By default, they run on automatically provisioned serverless compute resources, allowing for seamless deployment.
Additionally, they can be developed from within the Databricks workspace or from your favorite IDE.
Built-in Governance
Granular access controls come out of the box, along with automatically managed service principals for secure application-to-application communication and automatic user authentication using OIDC, OAuth 2.0, and SSO.
Integration with Unity Catalog speaks for itself, and apps inherit the networking protections of the workspace.
Use Cases
You can build interactive dashboards, data exploration tools, customized reporting interfaces, and much more.
“Common Use Cases:
- Interactive data visualizations and embedded Business Intelligence (BI) dashboards
- Retrieval-Augmented Generation (RAG) chat apps powered by Genie
- Custom configuration interfaces for Lakeflow
- Data entry forms backed by Databricks SQL
- Business process automation combining multiple Databricks services
- Custom ops tools for alert triage and response”
Limitations
- A Databricks workspace can host up to 50 apps.
- App files can’t exceed 10 MB. If any file in the app directory exceeds this limit, deployment fails with an error.
- Databricks Apps isn’t compliant with HIPAA, PCI, or FedRAMP standards.
- Databricks deletes app logs when the compute resource running the app is terminated. See View logs for your Databricks app.
- If you grant consent to an app through user authorization, you can’t revoke that consent later.
Databricks Apps System Environment
- Operating System: Ubuntu 22.04 LTS
- Python environment: Python 3.11.0 in a dedicated virtual environment. All dependencies are isolated in that environment, including ones from requirements.txt and pre-installed libraries.
- System resources: 2 virtual CPUs and 6 GB of memory. If those limits are exceeded, Databricks might restart the app.
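If you want to confirm these details from inside a running app, a few lines of standard-library Python will do (a quick sanity check, nothing Databricks-specific):

import os
import platform
import sys

print(platform.freedesktop_os_release().get("PRETTY_NAME"))  # expect Ubuntu 22.04
print(sys.version)     # expect 3.11.0
print(os.cpu_count())  # expect 2 virtual CPUs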
Resources and Experience Building a Databricks App for Text-to-Qdrant
Develop Databricks Apps
The Databricks Apps environment automatically sets several environment variables, such as the URL of the Databricks workspace running the app and values required for authentication. Many apps also need custom configuration, such as a specific command to run the app or parameters for accessing a SQL warehouse. Use the app.yaml file to define these settings.
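From app code, these settings are just environment variables. DATABRICKS_HOST is one of the automatically set variables (to the best of my knowledge; check the docs for the full list), while DATABRICKS_WAREHOUSE_ID below is a custom variable you would define yourself in app.yaml:

import os

# Set automatically by the Databricks Apps runtime (per my understanding).
host = os.getenv("DATABRICKS_HOST")

# Custom variable defined in app.yaml (see the examples below).
warehouse_id = os.getenv("DATABRICKS_WAREHOUSE_ID")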
Workflow:
- Build and test your app in your preferred IDE
- Run the app locally at the command line and preview it in your browser
- When complete and tested, move the code and required files to your Databricks workspace
NOTE
I’ve discovered a few quirks with developing Databricks Apps:
- The sync utility is not reliable with major refactors
- You have to manually add requirements.txt
- They seem to use Python 3.11.0 instead of 3.11.11 for some reason
Configure Databricks app execution with app.yaml
By default, Databricks runs the app by executing app.py. If the application needs a different command-line command or entrypoint, it needs to be defined in an app.yaml file, which must be located in the root of the repository.
There are some supported settings that can be configured via the app.yaml. Here are some examples for apps built with different frameworks.
Streamlit:
command: ["streamlit", "run", "app.py"]
env:
- name: "DATABRICKS_WAREHOUSE_ID"
value: "quoz2bvjy8bl7skl"
- name: "STREAMLIT_GATHER_USAGE_STATS"
value: "false"
Flask:
command:
  - gunicorn
  - app:app
  - -w
  - 4
env:
  - name: "VOLUME_URI"
    value: "/Volumes/catalog-name/schema-name/dir-name"
These examples show how to define environment variables in a Databricks app. There are default ones (for example, for Streamlit), and you can define custom ones.
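To show one way a custom variable gets used, here is a hedged sketch of app code reading DATABRICKS_WAREHOUSE_ID and querying the warehouse with the databricks-sql-connector package. The hostname and token handling are placeholders; a deployed app would normally authenticate through its managed service principal:

import os
from databricks import sql  # databricks-sql-connector; add it to requirements.txt

warehouse_id = os.getenv("DATABRICKS_WAREHOUSE_ID")

# Placeholder credentials for local testing only.
with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # hypothetical
    http_path=f"/sql/1.0/warehouses/{warehouse_id}",
    access_token=os.getenv("DATABRICKS_TOKEN"),
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())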
Manage dependencies for a Databricks app
Dependencies are handled via a requirements.txt file defined in the root of the repository.
NOTE
If any listed packages match pre-installed ones, the versions in your file override the defaults.
Here’s an example:
# Override default version of dash
dash==2.10.0
# Add additional libraries not pre-installed
requests==2.31.0
numpy==1.24.3
# Specify a compatible version range
scikit-learn>=1.2.0,<1.3.0
Here is a list of pre-installed Python libraries.
WARNING
Keep the following in mind when you define dependencies:
- Overriding pre-installed packages may cause compatibility issues if your specified version differs significantly from the pre-installed one.
- Always test your app to ensure that package version changes don’t introduce errors.
- Pinning explicit versions in requirements.txt helps maintain consistent app behavior across deployments (best practice).
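One cheap way to test is to assert the versions you expect at startup. This snippet, using only the standard library, checks the overrides from the requirements.txt example above:

import importlib.metadata

# Confirm that the overrides in requirements.txt actually took effect.
print(importlib.metadata.version("dash"))      # expect 2.10.0 per the example above
print(importlib.metadata.version("requests"))  # expect 2.31.0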
Add resources to a Databricks app
As part of the consistent Databricks developer experience, Databricks Apps can integrate with various other platform features, such as Databricks SQL for querying data, Jobs, Mosaic AI Model Serving, and Databricks secrets. These are referred to as resources.
“To keep apps portable and secure, avoid hardcoding resource IDs. For example, instead of embedding a fixed SQL warehouse ID in your app.yaml file, configure the SQL warehouse as a resource through the Databricks Apps UI or in databricks.yaml.”
The Databricks UI for configuring resources is pretty straightforward. Here’s an example of databricks.yaml:
resources:
  sql_warehouses:
    sql_warehouse: # resource key
      name: my-warehouse
  secrets:
    secret: # resource key
      scope: my-scope
      key: my-key
These resources can be used in the app configuration (app.yaml) via the valueFrom field. Example app.yaml snippet:
env:
  - name: WAREHOUSE_ID
    valueFrom: sql_warehouse
  - name: SECRET_KEY
    valueFrom: secret
Use them as usual from the app code, as environment variables:
import os

# Resource values are injected into the app's environment at runtime.
warehouse_id = os.getenv("WAREHOUSE_ID")
secret_value = os.getenv("SECRET_KEY")
Pricing
Refer to this calculator: Compute for Apps Pricing Calculator
At Influential, we’re on the Enterprise plan in US West (California), on AWS. At that tier, the price is $0.50 per “App Capacity Hour”.
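As a rough back-of-the-envelope at that rate, and assuming a single app consuming capacity around the clock: $0.50 × 24 hours × 30 days ≈ $360 per month per continuously running app.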