Speedrun: Data Science Fundamentals
A quick-start guide to core data science concepts in Palantir Foundry. Full course details are available on the Palantir Learn platform:
https://learn.palantir.com/speedrun-data-science-fundamentals
Executive Overview
This executive brief summarizes an end-to-end enterprise data science workflow implemented within a unified data platform environment. The solution demonstrates how structured and unstructured data can be integrated, enriched with AI, analyzed, visualized, and deployed as a governed, shareable application.
Business Value Delivered
• Unified integration of structured clinical datasets and unstructured PDF feedback.
• AI-powered semantic enrichment using LLM-based sentiment analysis and entity extraction.
• Rapid transformation and feature engineering via no-code pipeline authoring.
• Interactive analytics dashboard for stakeholders.
• Full traceability and governance through Data Lineage visualization.
• Version-controlled and reproducible data science workflows.
Architecture Overview
The architecture follows a layered enterprise design pattern:
• Data Ingestion Layer – Structured datasets and unstructured PDF documents.
• Transformation Layer – Batch pipelines for cleaning, joining, and AI enrichment.
• Analytics Layer – Jupyter-based exploratory and statistical analysis.
• Application Layer – Streamlit-based interactive dashboard.
• Governance Layer – End-to-end lineage, version control, and deployment management.
Course Goals
• Analyzing clinical study data to uncover insights into patient demographics and adverse events linked to drug consumption.
• Performing data analysis.
• Creating a report that your colleagues can use for future data science initiatives.
A key challenge is working with data from various sources, including structured tables and unstructured PDFs, which necessitates data preprocessing and reconciliation before analysis.
Using Pipeline Builder, you'll enhance your data preprocessing efficiency with optimized operations, including PDF text extraction using AIP.
For data analysis, you can choose to use RStudio to conduct visualizations and analyses, generating a comprehensive report. This part is optional, depending on your access to an RStudio Workbench license. Alternatively, you can utilize Jupyter to perform analyses and publish an interactive dashboard with Streamlit.
This course builds skills in using the Foundry and AIP toolkit for data science. You will need access to the following platform components:
- Pipeline Builder
- Data Lineage
- AIP features
- Jupyter Workspace
- [optional] RStudio Workspace
| Permissions Required | Workaround if permissions are not available |
| --- | --- |
| Create Learning Project | Platform Admin team will need to create the Learning Project, in which you may work in your 'Working Folder'. |
| Upload Files | Platform Admin will need to upload the files and provide instructions to copy them to the Learning Course Folder. |
| Edit Learning Project | None. Permission needs to be granted. |
| Create Foundry Resources (Pipeline Builder, Data Lineage, Jupyter Workspaces) | None. Permission needs to be granted. |
| Create Foundry Resources (RStudio Workspaces) | None. Permission needs to be granted, including the RStudio license. However, since an RStudio Workbench license is not available in all environments, and may depend on your organization's setup, the RStudio part of this course is optional and you may skip it. |
| Use AIP Developer Capabilities (e.g. the LLM block in Pipeline Builder) | None. Permission needs to be granted. |
The Use Case: analyze clinical study data to gain insights into patient demographics and adverse events that may have occurred as a result of drug consumption during the clinical study.
Your colleagues are mainly interested in what types of adverse events occurred in this particular clinical trial and how they varied by patient age. They are also curious whether patients had any insights of their own, or feedback about the experience of participating in the clinical trial.
Data Processing
- Data Transformation in Pipeline Builder - Supercharge your data pre-processing speed by combining the hundreds of optimized, common data operations, including PDF text extraction using AIP.
Data Analysis
- Optional Data Analysis Using RStudio
- Perform data visualization and analysis within RStudio Workspace.
- Generate and publish a report based on the analysis conducted in RStudio.
- Data Analysis Using Jupyter
- Perform data visualization and analysis in a Jupyter environment.
- Publish an interactive dashboard using a Streamlit application.
Setting Up Project & Folder
All of the resources you create in Foundry need to live inside a Project. For production use cases, these resources likely need to be shared across your organization and it is recommended you create a Project for each stage of the workflow, as noted in our online documentation.
Create a Foundry Learning Project
Step 1: Create a new project
- Click on New project in the top right
Step 2: Set your project's details
Create a Course-Specific Training Folder
Note: I was not able to access and create the folder, so I created an access request (the instructions in the training course are outdated).
Uploading the Data
In this section you will set up a Project, where you'll develop your workstream. You will also populate it with your tabular datasets, as well as unstructured input data, stored in a Media Set.
Upload Tabular Data
Step 1: Start file upload
- Click on New in your course folder, inside your personal Learning Project
- Click on Upload files...
Step 2: Select Files
- Select the files downloaded earlier (two .csv files)
- Make sure that Upload as a structured dataset (recommended) is selected
- Click on Upload
Upload PDFs
Step 1: Start file upload
- Click on New in your course folder, inside your personal Learning Project
- Click on Upload files...
Step 2: Select Files
- Select the files downloaded earlier (5 PDFs)
- Make sure that Upload to a new media set is selected
- Media sets allow us to interact with files in media format and, in this use case, are preferable to rows in a dataset or raw data
Step 3: Finish upload
- Select Transactionless as the write mode
- Click Upload
Step 4: Rename your Media Set
- Right-click on your Media Set
- Rename it to "Patient Feedback"
[Optional] Deploy Datasets via Marketplace
Only complete these steps if you are unable to download datasets from this site or are unable to upload datasets to your Foundry instance due to lack of permissions.
Understand the Data
Step 1: Understand the Tabular Dataset DM
- Open the DM_Demographics dataset
- One row in this dataset represents one patient participating in a clinical study, including their demographic characteristics
- Click on the arrow next to the ARM column
Step 2: Understand the Tabular Dataset AE
- Open the AE_Adverse Events dataset
- One row in this dataset represents one adverse event that happened to a patient, including information about the type of the adverse event
- Click on the arrow next to the AEDECOD column, which represents the 'decoded value of the adverse event'
Step 3: Understand the Mediaset with PDFs
1. Open the Patient Feedback mediaset
Summary
You now have a Project set up to develop your workflow. You've also successfully uploaded your initial data asset to Foundry. In the following sections, you will process this data and make it available to both human and AIP agents to use.
Although we added data manually this time, Foundry includes a fully fledged Data Connection toolkit to manage recurring batch and streaming ingestions. It comes with 200+ bespoke connectors for the most common systems and a flexible plugin architecture to cover the rest. Both tabular and unstructured data assets are natively handled by the platform. These can be operated on using the broad Foundry ecosystem, as you'll see later in this course.
To learn more about data ingestion, you might follow this course up with the Data Connection Deep Dive, or by taking a look at the documentation.
Foundry has data security measures baked into every aspect of the platform. Projects, like the one you created earlier, are the atomic units of "discretionary" access controls. People with the Owner role on the Project can share the contents with other users, or groups of users, by setting them as Editors or Viewers.
Data Processing in Pipeline Builder
Introduction
In this section you will clean and prepare the data by formatting it into a version you can use downstream for visualizations. This includes cleaning the tabular structured data, as well as processing the unstructured data in the PDFs using AIP.
Create a New Pipeline
Step 1: Start Pipeline Creation
Step 2: Finish Pipeline Creation
- Rename the pipeline to “Data Processing”
- Select the pipeline type as a Batch pipeline
- Click on Create pipeline
Work on a New Branch
Step 1: Create a New Branch
- Click on the Main button
- Click on Create new branch
Step 2: Name your Branch
- If prompted, select Pipeline Builder branch
- Name your branch data-processing
- Click Create
Going forward you can see that you are working on your branch
Takeaways
Pipeline Builder offers out-of-the box version control backed by git, which allows you to seamlessly collaborate with your colleagues on the same pipeline, proposing and merging changes.
Add Data
Step 1: Open the "Add data" Window
- Click on Add Foundry data
- You can find this Add Data dialog in the top left of the Pipeline Builder screen if you want to add more data later. It might be particularly useful for setting up mock datasets (or configuration datasets) using the Manually Enter Data option.
Step 2: Select and Add Data
- Add the Patient Feedback media set to your selection using the + button, as well as our two tabular datasets
- Click Add data in the bottom right to confirm selection
Process PDFs
Step 1: Start New Transformation
- Back on the main Pipeline Builder graph, select the new Patient Feedback transform node
- Click the Transform button
- This will create a transform Node, where transformation steps can be chained and a resulting dataset can be visualized
Step 2: Rename Transform
- Rename the new transform node to "Process PDFs" in the top left corner
You can see a preview of the output of your current transformation at the bottom of the page. It gets updated each time you add a new transform to your node. You can chain transforms by selecting them from the search bar at the bottom of the current chain of operations (empty at the moment). Please keep adding transform blocks, one below the other, until instructed otherwise.
Step 3: Add First Transform Block
- Search for and add an Extract text from PDF transform block
- Select the mediaReference column in the Media reference dropdown
- Set the output column name to "pdf_text_extraction"
- Ensure that Skip recomputing rows is toggled on
- Click Apply
The "Extract text from PDF" transform block extracts text elements from the PDF documents in the Media Set you uploaded earlier. In some cases, the text is stored as images (for example, if the document was photo-scanned). In that situation, you should change the "Extract Method" to OCR (Optical Character Recognition).
Step 4: Add Second Transform Block
- Search for and add a Join array transform block
- For Separator, configure one space by pressing the spacebar in the textbox, followed by pressing Enter.
- Once you do so, you should see 1 space appear as the value, as shown in the picture below.
- For Array to join configure column pdf_text_extraction
- Rename the output to also be pdf_text_extraction
- Click Apply
Step 5: Finish Transformation
1. Click Close in the upper right hand corner.
Takeaways
The text you extracted from the PDF files is simpler to interact with than the raw PDF files you've uploaded. This makes it possible to easily run data transformations and AIP interactions on it.
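Conceptually, the two blocks you just chained behave like extracting a list of per-page text elements and then joining them with a single space. A minimal Python sketch of the "Join array" step (the page texts below are mock data, not output from the course PDFs):

```python
# The "Extract text from PDF" block yields one text element per PDF page;
# the "Join array" block concatenates them with a one-space separator.
# Mock page texts stand in for real extraction output.
pages = ["Patient 1001 reported", "mild headaches after dosing."]

pdf_text_extraction = " ".join(pages)  # the Join array step, single-space separator
print(pdf_text_extraction)
```

This is only a sketch of the semantics; inside Pipeline Builder the same logic runs as a managed, distributed transform.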
If you take a look at the Selection preview tab at the bottom of your screen, you should see something like this:
Use LLM
Step 1: Start an LLM Call
- Select the Process PDFs transform node
- Click the Use LLM button
Note: If you do not see the 'Use LLM' option in your Foundry instance, you may not have access to AIP features. Contact your organization's Foundry administrator or your Palantir representative for assistance.
Step 2: Choose Prompt Type
1. Click on Empty prompt
Step 3: Configure LLM Call
- Keep the default name "Use LLM"
- Add the following prompt to the instructions field:
In the context of patient feedback on their participation in a clinical trial, your job is to rate the following user message on a scale of 1 to 5, 1 being the most negative and 5 being the most positive.
In addition, you also need to extract the patient ID which is mentioned in the beginning of the document.
Finally, provide one sentence summary of what the user said.
Provide no further explanation, only these three things (patient_id, sentiment_score, summary).
- Type forward-slash "/" in the input data field and pick the pdf_text_extraction column
- Set the output type to be a struct with three strings: “patient_id”, "sentiment_score", "summary"
- Change the type to Integer for "patient_id" and "sentiment_score"
- Rename the output to llm_sentiment_analysis
If the output for llm_sentiment_analysis appears as 'null' for you, try changing the model that you are using. GPT-5 nano has been known to have issues here; Gemini works as expected.
Step 4: Finish Configuring LLM Call
- Click Apply
- Click Close
Takeaways
Pipeline Builder takes your prompt template, resolves the placeholders (adds the actual text from the pdf_text_extraction column in this case), and sends it to AIP for each row of input. Under the hood, it takes care of optimizing performance by issuing parallel calls, while respecting the various models' rate limits.
Pipeline Builder will choose a model by default to produce a response. In this example it was Gemini 2.0 Flash. If you wish, you can choose alternative models such as Grok or Mixtral 7B, by clicking on the Show configurations button in the Model section of the UI while setting up the prompt.
If you take a look at the Selection preview tab at the bottom of your screen, you should see something like this:
Note: the first time, GPT-5 nano was not working; switching to Gemini 2.5 worked. Switching back to GPT then worked on the second try.
Organize LLM Response
Step 1: Start New Transformation
- Select the Use LLM transform node
- Click the Transform button
- Rename the transform node to "Organize LLM result" in the top left
Step 2: Add First Transform Block
- Search for and add an Extract many struct fields transform block
- Set the llm_sentiment_analysis column as the struct to process
- Set patient_id, sentiment_score, summary as the properties to extract into new columns with corresponding names
- Click Apply
Step 3: Add Second Transform Block
- Search for and add the Select columns transform block
- Select patient_id, sentiment_score, summary as the columns to keep
- Click Apply in the upper right hand corner, then Close
Takeaways
By the end of this step, you'll have a dataset where each row corresponds to a PDF, and each column provides specific information about it, such as sentiment analysis, a summary of the feedback, and the ID of the patient who provided the feedback.
We utilized Pipeline Builder as a user-friendly, no-code tool to efficiently extract this information. In the subsequent steps, you'll process additional tabular datasets and ultimately integrate this information with the data from this step.
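Outside Pipeline Builder, the "Extract many struct fields" and "Select columns" steps can be sketched in pandas (the struct values below are mock LLM outputs, not real course data):

```python
import pandas as pd

# Mock LLM outputs: one struct (dict) per PDF, as produced by the Use LLM block.
df = pd.DataFrame({
    "llm_sentiment_analysis": [
        {"patient_id": 1001, "sentiment_score": 4, "summary": "Positive overall."},
        {"patient_id": 1002, "sentiment_score": 2, "summary": "Reported side effects."},
    ]
})

# "Extract many struct fields": pull each struct key into its own column.
fields = pd.json_normalize(df["llm_sentiment_analysis"].tolist())

# "Select columns": keep only the extracted columns.
result = fields[["patient_id", "sentiment_score", "summary"]]
print(result)
```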
Join Datasets
Step 1: Start the First Join
- Select the DM_Demographics node
- Click the Join button
- This will prompt you with a dialogue to initiate a join between two datasets. Click on AE_Adverse_events
- Click Start
Step 2: Configure the First Join
- Rename the join node to "Join DM and AE" in the top left
- Select the following options: 'Left join' as the Join Type, and 'USUBJID is equal to USUBJID' as the Match Condition. Leave everything else as preselected.
- Click Apply in the upper right hand corner, then Close
Step 3: Start the Second Join
- Select the “Join DM and AE” node
- Click the Join button
- This will prompt you with a dialogue to initiate a join between two datasets. Click on “Organize LLM result”
- Click Start
Step 4: Configure the Second Join
- Rename the join node to "Join with PDF extraction" in the top left
- Select the following option: ‘Left join’ as the Join Type, and ‘SUBJID is equal to patient_id’ as Match Condition. Leave everything else as already preselected.
- Click Apply in the upper right hand corner, then Close
Takeaways
Pipeline Builder enables seamless table merging with a no-code approach. In this exercise, we used a Left Join, but you can also choose from various join options, including KNN join and geometry-based joins.
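For readers more comfortable in code, the two left joins above can be sketched in pandas (column names follow the course datasets; the values are mock data):

```python
import pandas as pd

# Mock stand-ins for the course datasets; column names follow the exercise.
dm = pd.DataFrame({"USUBJID": ["S-1001", "S-1002"], "SUBJID": [1001, 1002], "AGE": [54, 61]})
ae = pd.DataFrame({"USUBJID": ["S-1001"], "AEDECOD": ["HEADACHE"]})
llm = pd.DataFrame({"patient_id": [1002], "sentiment_score": [4]})

# "Join DM and AE": left join on USUBJID.
join_dm_ae = dm.merge(ae, on="USUBJID", how="left")

# "Join with PDF extraction": left join on SUBJID = patient_id.
joined = join_dm_ae.merge(llm, left_on="SUBJID", right_on="patient_id", how="left")
print(joined)
```

A left join keeps every patient from the demographics table, filling in nulls where no adverse event or feedback matched, which is exactly the behavior used in the exercise.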
Visualize Adverse Event Occurrence
Step 1: Start New Transformation
- Select the “Join with PDF extraction” transform node
- Click the Transform button
- Rename the transform node to "Adverse Event Occurrence" in the top left
Step 2: Add First Transform Block
- Search for and add a Case transform block
- In ‘When’, set the AEDECOD column as the column to process
- Specify that the output value should be an integer of 0 if this column is null, and 1 otherwise
- Rename the output column to adverse_event_happened
- Click Apply
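The Case block above is equivalent to a simple null check. A pandas sketch with mock rows standing in for the joined dataset:

```python
import numpy as np
import pandas as pd

# Mock rows standing in for the joined dataset.
df = pd.DataFrame({"AEDECOD": ["HEADACHE", None, "NAUSEA"]})

# Case block: output 0 if AEDECOD is null, 1 otherwise.
df["adverse_event_happened"] = np.where(df["AEDECOD"].isna(), 0, 1)
print(df["adverse_event_happened"].tolist())  # → [1, 0, 1]
```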
Step 3: Add First Visualization Block
- Search for and add the Chart visualization block
Step 4: Configure the Visualization Block
- Make sure the Configure tab is selected
- Set the AGE column for X-Axis
- Select Sum as the aggregation for the Y-Axis
- In the aggregation field, select column adverse_event_happened
- Click on Segment by, and select ARM column
- Click Apply
Step 5: Format the Visualization Block
- Make sure the Format tab is selected
- Set "Adverse Event Occurrence vs. Age" as the Title
- Set "Age" as the X-Axis label
- Set "Adverse Event Occurrence" as the Y-Axis label
- Click Apply
Step 6: Preview the Chart Results
- Click on the Legend
- Preview the results of your chart.
- Hint: You may move right on the x-axis, by clicking the right arrow next to Show results
Note: You may notice that the graphic looks slightly different from the image above. This is because the input data may be sampled differently in Pipeline Builder when previewing, resulting in variations from the example shown.
Takeaways
Pipeline Builder offers a quick and simple way of validating your data using histogram visualizations, in a no-code fashion.
Foundry also provides Contour, a no-code, point-and-click user interface for performing data analysis and visualizations. Contour allows you to perform data analysis on tables at scale. These analyses can be used to visualize data using various charts and create interactive dashboards.
To learn more about Contour, take a look at the documentation, or another course which offers deep-dive on this tool.
Add Output of your Data Processing
Step 1: Create New Output Dataset
- Select the Adverse Event Occurrence transform node
- Click the Add output button
- Click the New dataset button
Step 2: Rename Output Dataset
Merge your Branch
Step 1: Create a New Merge Proposal
- Click Save to save all your changes (unless you saved it already)
- Click Propose button
- Give your proposal a name and a description
- Click Create proposal
Step 2: Review Changes
- Select Changes tab in the newly created proposal
- Review the changes compared to the current state of the Main branch
Step 3: Merge the Proposal
- Select Overview tab in the newly created proposal
- Click Merge proposal
Step 4: Observe the Changes
- Select the Graph tab
- Observe that the Main branch now includes merged changes from the branch that you were working on.
Deploy Pipeline
- Click Save in the top right corner, if possible
- Click Deploy in the top right corner
In case you get a warning about resources
- Click Import resources in the warning pop-up
- Click Add references in the import pop-up
- You may need to click Deploy pipeline once more
Summary
You have now processed both tabular and unstructured data from PDFs, producing a dataset that combines all provided information, ready for data analysis and insights generation.
Pipeline Builder is Foundry's no-code solution for authoring data transformations. It comes with hundreds of optimized data operations out of the box, as well as support for core software engineering concepts such as version control, scheduling, and incremental execution. It is well suited for the majority of data-pipeline building tasks, and provides a boilerplate-free entry point when rapid development is a priority over complex technical configuration.
To learn more about Pipeline Builder, you may take the Pipeline Builder Deep Dive course, or read more about it in the documentation.
Foundry also provides a high-code transform authoring environment called Code Repositories. It allows you to leverage the fine controls of Python and Java to create performant transforms, and rely on popular libraries for specialized functionality.
To learn more about Transform authoring in Code Repositories, you may take the Code Repository Deep Dive course, or read more about it in the documentation.
Insight Generation → JupyterLab Workspace
In this section you will learn how to utilize JupyterLab Workspaces for data analysis, generate insights based on your data, version reports, and share them with your colleagues.
Create a New Jupyter Workspace
Step 1: Start Jupyter Workspace Creation
- Navigate to your course folder, inside your training project
- Right-click inside the folder and pick New
- Alternatively, you can click New in the top right corner
- Select Jupyter workspace from the drop-down
Step 2: Finish Jupyter Workspace Creation
1. Name your Workspace Jupyter - Data Analysis
2. Click Continue
Takeaways
Foundry offers out-of-the-box integration with JupyterLab, where data scientists can work and collaborate in a secure and reproducible environment writing notebooks and python scripts.
Load the Data to your Workspace
Step 1: Create New Jupyter Notebook
- Under Notebook, click on Python [user default]
- Right-click on top of the newly created notebook
- Click Rename Notebook...
Observe the newly created notebook in your Jupyter Workspace
Step 2: Start Loading the Dataset
- Click on the Data tab on the left-hand side
- Click on Add data
- Select Read data
- Find the output dataset adverse_events_processed
- Click on the dataset
- Click Select
Step 3: Finish Loading the Dataset
- Click Add dataset
The dataset is now only referenced inside the Jupyter Workspace. To load the full content of the dataset you need to run code that loads it.
- Click on Copy to clipboard
- Go to your analysis.ipynb notebook and paste the content into the cell by pressing CTRL+V
- Click the "Run" button
Step 4: Confirm the Dataset is Loaded
- Reference the variable that contains your data, by writing adverse_events_processed.head(), in a new cell
Takeaways
JupyterLab runs in a container within Foundry. To load the data into this container, and thus use it in Jupyter Workspace, you needed to first reference it and then load it using the code snippet provided to you in the Data tab.
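Once the copied snippet has run, the dataset is available as a regular pandas DataFrame. A sketch of the sanity checks, with a mocked DataFrame standing in for the actual loaded data:

```python
import pandas as pd

# Mocked stand-in: assume the snippet copied from the Data tab produced a
# pandas DataFrame named adverse_events_processed.
adverse_events_processed = pd.DataFrame(
    {"AGE": [54, 61], "ARM": ["Miracle", "Placebo"], "adverse_event_happened": [1, 0]}
)

# Quick sanity checks after loading.
print(adverse_events_processed.head())   # first rows
print(adverse_events_processed.shape)    # (rows, columns)
```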
Install Libraries
Step 1: Confirm you are on the Managed Conda Environment
- Click on the Settings tab at the top of the screen
- Click on Advanced Features
- If the managed Conda environment is Enabled, you are good to go!
- If it is not enabled, click on Enable managed Conda environments. This will prompt your Workspace to restart, and after a few seconds you can continue working on it.
Step 2: Install matplotlib Library
- Go back to the Code page and click on the Libraries tab
- Within the Libraries tab, select the Conda tab
Takeaways
For this exercise you installed one Python library: matplotlib, which you will need for the remainder of this exercise.
Foundry offers built-in support for reproducible data science. To ensure reproducibility, Foundry provides a managed Conda environment solution. This solution guarantees that environments, including all installed packages, are consistently reproducible. For further details, please refer to this documentation page.
Write Initial Python Code for Data Analysis
Step 1: Import Libraries
- Navigate back to your analysis.ipynb notebook
- Import pandas and matplotlib
- You can use the code below.
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Create First Histogram
- Plot the first histogram, showing age distribution of patients, based on whether they took the Miracle drug or were in the Placebo group.
Step 3: Create Second Histogram
- Plot the second histogram, showing age distribution of patients, based on the category of adverse events that happened to them.
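A hedged matplotlib sketch of the two histograms (the column names AGE, ARM, and AEDECOD, and the arm labels, are assumptions based on the datasets described earlier; the rows are mock data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Mock rows; the real notebook uses the loaded adverse_events_processed DataFrame.
df = pd.DataFrame({
    "AGE": [34, 41, 55, 62, 47, 58],
    "ARM": ["Miracle", "Placebo", "Miracle", "Placebo", "Miracle", "Placebo"],
    "AEDECOD": ["HEADACHE", None, "NAUSEA", "HEADACHE", None, "NAUSEA"],
})

# Histogram 1: age distribution by treatment arm.
fig1, ax1 = plt.subplots()
for arm, grp in df.groupby("ARM"):
    ax1.hist(grp["AGE"], bins=5, alpha=0.5, label=arm)
ax1.set_xlabel("Age")
ax1.set_ylabel("Patients")
ax1.legend(title="ARM")

# Histogram 2: age distribution by adverse event category.
fig2, ax2 = plt.subplots()
for event, grp in df.dropna(subset=["AEDECOD"]).groupby("AEDECOD"):
    ax2.hist(grp["AGE"], bins=5, alpha=0.5, label=event)
ax2.set_xlabel("Age")
ax2.set_ylabel("Patients")
ax2.legend(title="Adverse event")
```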
Step 4: Save your Notebook
- Click the Save icon
Takeaways
The JupyterLab integration with Foundry allows you to write and run Python notebooks in the same fashion you would interact with Jupyter outside of the platform. On top of Jupyter, Foundry offers a safe collaboration space to access and analyze data on the platform and guarantees reproducibility.
Sync Changes to the Backing Repository
Step 1: Navigate to the Backing Repo
- Click on the Code Repositories button
- Click Open without syncing
- This will open a new tab in your browser where you can see the backing repository. Currently your analysis.ipynb notebook is not there.
Step 3: Observe Synced Changes
- Navigate back to the tab with the Code Repository open
- You should now see a blue banner at the top of the page. If you don't, click the refresh button in your browser.
- Click Update to most recent version
- Observe your analysis.ipynb notebook is now present in the repository
Takeaways
All Code Workspaces are backed by a Code Repository. This enables Code Workspaces to have industry-standard version control features like branching, merging, and commit history, allowing other users to view the code and safely operate in the same Workspace.
By syncing changes, you made sure that the code you wrote in your Workspace is saved to the backing repository. For further details, please refer to this documentation page.
Create a Streamlit App
Step 1: Prepare the Environment for the Streamlit App (Note: return to the first browser tab, the Code Workspace from the last step; the previous step opened the backing code repository in a new tab)
- Click on Application tab on the left-hand side of the Workspace
- Select Streamlit from the dropdown menu
- You will be prompted with a notification that creating a Streamlit app requires additional libraries to be installed
- Click on run command
- This will start the installation of the required libraries, and might take a few seconds. Wait until you observe in the Jupyter Console that the library is installed, and the banner for additional libraries required is removed.
Step 2: Create a New Streamlit Application
- Click Publish Application
- Name the application StreamlitApp
- Click Publish and Sync
After a few seconds, your application will be created.
- A streamlitapp.py file is created.
- You can observe your initial app on the right side of your Workspace.
Step 3: Import Libraries and Change the Title
- Import pandas and matplotlib libraries.
- You can use the code below.
import pandas as pd
import matplotlib.pyplot as plt
- Give the StreamlitApp the title "Adverse Event Age Distribution Explorer"
- You can use the code below
st.title("Adverse Event Age Distribution Explorer")
Step 4: Load the Data
- Navigate to the Data tab
- Click Expand on the dataset reference
- Paste the code into the streamlitapp.py file by pressing CTRL+V
Step 5: Create an Interactive Histogram Plotting
- Create a dropdown menu for selecting an adverse event
- Plot the age distribution based on the selected value in the dropdown menu
- Provide a feedback summary of patients for the given selection of the adverse event
- Click File >> Save All
Step 6: Observe your App & Sync Changes
- Click Refresh
- Observe and test your Streamlit app
Takeaways
Through Jupyter workspace integration you are able to create interactive dashboards.
Apart from Streamlit — which you’ve seen in this course — you can also create Tensorboard and Dash applications in a similar fashion. For more information you can refer to these documentation pages, about Tensorboard integration, and about Dash integration.
Note: a ModuleNotFoundError: No module named 'altair.vegalite.v4' error can occur in Streamlit apps running on Palantir Foundry. See the note Streamlit_Altair_Fix_JobAid for a step-by-step solution, how it was diagnosed and fixed, and key differences from debugging in local or standard environments.
Inspect and Share Published Streamlit Application
Step 1: Open the Published Streamlit Application
- Navigate to the Applications tab
- You will see that your Streamlit application is automatically published.
- Click on the link to the application to open it in new tab
- View the application and interact with it in the new tab.
- The application will likely need a few seconds to load.
Step 2: Navigate Branches and Versions of the Report
- Click on the Latest dropdown.
- Select the previous version to view the previous version of the report
- Click on master dropdown.
- Currently you will not see any other branch in the dropdown, but if you did work on a different branch, you would be able to effortlessly navigate between these versions.
Step 3: Share the Application with your Colleagues
- Copy the URL to your Streamlit application
- [Optional] You can share this URL with colleagues, and they will be able to see and interact with the application, provided that they have access to the Foundry environment and Project where you are working.
Takeaways
You have created and published a Streamlit application that can be easily shared with anyone in your organization who has access to the Foundry environment by simply sharing the application’s URL.
Additionally, you can effortlessly navigate between different versions of the application. For example, we were able to view both the initial "Hello world" version of the application and the updated full version.
Code Workspaces also provides version control, so you can work and develop your application on a different branch, without affecting the one on the main branch. For further details, please refer to this documentation page.
Summary
You have now produced your first data insights using JupyterLab integration to Foundry. You have then created an interactive Streamlit application for a crisp overview of these insights. Finally, you observed this application being published, which now exists on Foundry, and is therefore accessible by your colleagues who have access to the environment, project and data you are working with.
Code Workspaces brings the JupyterLab, RStudio Workbench, and VS Code third-party IDEs to Palantir Foundry, enabling users to boost their productivity and accelerate their data science and statistics workflows by using their preferred tools.
To learn more about Code Workspaces, including the JupyterLab Workspaces, take a look at the documentation.
Connecting it All Together with Data Lineage
Now that you're finished with both data processing and analysis, you will learn how to visualize all your work using Data Lineage. Data Lineage allows you to observe your data pipeline and how data flows through the platform.
Create Data Lineage
Step 1: Navigate Back to your Working Folder
- Navigate to your working folder
- By now, you should have a few resources created (e.g. Pipeline Builder, Streamlit application, etc.)
- Open the Streamlit app, the last resource you created, by clicking on it
- Once the Streamlit app is open: click on Actions >> Explore data lineage
- This will open a new application, called Data Lineage
Step 2: Expand the Lineage
- Data Lineage is an interactive tool that facilitates a holistic view of how data flows through the Foundry platform.
- One rectangle represents one resource
- Arrange the graph so it looks like on the image below
- Expand the lineage by clicking on right / left arrows on the rectangles
- To delete a node in the graph, simply select it and press the Backspace key on your keyboard
- To re-order nodes, simply select them and drag them with your mouse
Explore Data Lineage
Step 1: Preview Nodes
- Select adverse_events_processed node
- Click Preview on bottom of the screen
- You can now preview the dataset
- Click the Code tab
- This allows you to inspect the pseudo-code of the Pipeline Builder transformation you created
Step 2: Save the Lineage
- Click Save
- Provide name “Full Data Lineage”
- Click Save
Takeaways
Data Lineage is an interactive tool that facilitates a holistic view of how data flows through the Foundry platform.
Apart from the resources you visualized on Data Lineage in this exercise, you are able to visualize many other Foundry resources as well (e.g. datasets produced via Code Repositories).
With Data Lineage, you can:
- Easily find and discover datasets
- Explore pipelines through a powerful interface
- Expand or hide ancestors and descendants of datasets
- Visualize your graph through coloring, incl. making custom legends
- Collaborate with teammates
- Create pipeline snapshots to share with colleagues
For more information, refer to this documentation page.
Next Steps: Machine Learning
In this course, you explored the fundamentals of data analysis and how these processes can be carried out on Foundry.
As a natural progression, you may want to learn how to develop and integrate machine learning (ML) models within the Foundry platform.
- To get started with training ML models on Foundry, refer to this step-by-step tutorial: Train a Machine Learning Model in a Jupyter Notebook.
- Foundry also offers out-of-the-box Modeling tooling that streamlines the process of building, deploying, and managing ML models. You can learn more about these capabilities here: Foundry Modeling Overview.
- In summary, Foundry allows you to train models directly within the platform, deploy them anywhere else in Foundry, or even bring in models trained externally.
- For a practical example, review this "Build with AIP" use case, which demonstrates model training in a Jupyter environment on Foundry: Build with AIP Example.
By following these resources, you can continue to build on your data analysis skills and begin leveraging machine learning workflows on Foundry.
Conclusion
Well done! You're finished with your first data analysis on Foundry!
In this course:
- You learned how to synthesize information from multiple data assets related to a clinical trial. You combined tabular data—such as patient demographics and records of adverse events—with unstructured data from PDFs, using AIP for natural language processing, to gather valuable insights from patient feedback and opinions.
- You then performed initial data analysis in Jupyter Notebook (and optionally in RStudio), including plotting the age distribution of patients while controlling for adverse events.
- To communicate your findings, you developed an interactive Streamlit application (and, optionally, an R Markdown report), which you published for easy sharing with colleagues.
- Finally, you visualized the entire data flow of your analysis, clearly mapping out how data was collected, processed, analyzed, and presented.