Speedrun: Data Science Fundamentals
A quick-start guide to core data science concepts in Palantir Foundry. Full course details are available on the Palantir Learn platform:
https://learn.palantir.com/speedrun-data-science-fundamentals
Executive Overview
This executive brief summarizes an end-to-end enterprise data science workflow implemented within a unified data platform environment. The solution demonstrates how structured and unstructured data can be integrated, enriched with AI, analyzed, visualized, and deployed as a governed, shareable application.
Business Value Delivered
• Unified integration of structured clinical datasets and unstructured PDF feedback.
• AI-powered semantic enrichment using LLM-based sentiment analysis and entity extraction.
• Rapid transformation and feature engineering via no-code pipeline authoring.
• Interactive analytics dashboard for stakeholders.
• Full traceability and governance through Data Lineage visualization.
• Version-controlled and reproducible data science workflows.
Architecture Overview
The architecture follows a layered enterprise design pattern:
• Data Ingestion Layer – Structured datasets and unstructured PDF documents.
• Transformation Layer – Batch pipelines for cleaning, joining, and AI enrichment.
• Analytics Layer – Jupyter-based exploratory and statistical analysis.
• Application Layer – Streamlit-based interactive dashboard.
• Governance Layer – End-to-end lineage, version control, and deployment management.
Course Goals
• Analyzing clinical study data to uncover insights into patient demographics and adverse events linked to drug consumption.
• Performing data analysis.
• Creating a report that your colleagues can use for future data science initiatives.
A key challenge is working with data from various sources, including structured tables and unstructured PDFs, which necessitates data preprocessing and reconciliation before analysis.
Using Pipeline Builder, you'll enhance your data preprocessing efficiency with optimized operations, including PDF text extraction using AIP.
For data analysis, you can choose to use RStudio to conduct visualizations and analyses, generating a comprehensive report. This part is optional, depending on your access to an RStudio Workbench license. Alternatively, you can utilize Jupyter to perform analyses and publish an interactive dashboard with Streamlit.
This course builds skills in using the Foundry and AIP toolkit for data science. You will need access to the following platform components:
- Pipeline Builder
- Data Lineage
- AIP features
- Jupyter Workspace
- [optional] RStudio Workspace
| Permissions Required | Workaround if permissions are not available |
| --- | --- |
| Create Learning Project | Platform Admin team will need to create the Learning Project, in which you may work in your 'Working Folder'. |
| Upload Files | Platform Admin will need to upload the files and provide instructions to copy them to the Learning Course Folder. |
| Edit Learning Project | None. Permission needs to be granted. |
| Create Foundry Resources (Pipeline Builder, Data Lineage, Jupyter Workspaces) | None. Permission needs to be granted. |
| Create Foundry Resources (RStudio Workspaces) | None. Permission needs to be granted, including the RStudio license. However, since an RStudio Workbench license is not available in all environments, and may depend on your organization's setup, the RStudio part of this course is optional and you may skip it. |
| Use AIP Developer Capabilities (e.g. the LLM block in Pipeline Builder) | None. Permission needs to be granted. |
The Use Case: analyze clinical study data to gain insights into patient demographics and adverse events that may have occurred as a result of drug consumption during the clinical study.
Your colleagues are mainly interested in what types of adverse events occurred in this particular clinical trial and how they varied by patient age. They are also curious whether patients had any insights of their own, or feedback about the experience of participating in the clinical trial.
Data Processing
- Data Transformation in Pipeline Builder - Supercharge your data pre-processing speed by combining the hundreds of optimized, common data operations, including PDF text extraction using AIP.
Data Analysis
- Optional Data Analysis Using RStudio
- Perform data visualization and analysis within RStudio Workspace.
- Generate and publish a report based on the analysis conducted in RStudio.
- Data Analysis Using Jupyter
- Perform data visualization and analysis in a Jupyter environment.
- Publish an interactive dashboard using a Streamlit application.
Setting Up Project & Folder
All of the resources you create in Foundry need to live inside a Project. For production use cases, these resources likely need to be shared across your organization and it is recommended you create a Project for each stage of the workflow, as noted in our online documentation.
Create a Foundry Learning Project
Step 1: Create a new project
- Click on New project in the top right
Step 2: Set your project's details
Create a Course-Specific Training Folder
Note: I was not able to access and create the folder, so I created an access request (the instructions in the training course are outdated).
Uploading the Data
In this section you will set up a Project, where you'll develop your workstream. You will also populate it with your tabular datasets, as well as unstructured input data, stored in a Media Set.
Upload Tabular Data
Step 1: Start file upload
- Click on New in your course folder, inside your personal Learning Project
- Click on Upload files...
Step 2: Select Files
- Select the files downloaded earlier (two .csv files)
- Make sure that Upload as a structured dataset (recommended) is selected
- Click on Upload
Upload PDFs
Step 1: Start file upload
- Click on New in your course folder, inside your personal Learning Project
- Click on Upload files...
Step 2: Select Files
- Select the files downloaded earlier (5 PDFs)
- Make sure that Upload to a new media set is selected
- Media sets allow us to interact with files in media format and, in this use case, are preferable to rows in a dataset or raw data
Step 3: Finish upload
- Select Transactionless as the write mode
- Click Upload
Step 4: Rename your Media Set
- Right-click on your Media Set
- Rename it to "Patient Feedback"
[Optional] Deploy Datasets via Marketplace
Only complete these steps if you are unable to download datasets from this site or are unable to upload datasets to your Foundry instance due to lack of permissions.
Understand the Data
Step 1: Understand the Tabular Dataset DM
- Open the DM_Demographics dataset
- One row in this dataset represents one patient participating in a clinical study, including their demographic characteristics
- Click on the arrow next to the ARM column
Step 2: Understand the Tabular Dataset AE
- Open the AE_Adverse Events dataset
- One row in this dataset represents one adverse event that happened to a patient, including information about the type of the adverse event
- Click on the arrow next to the AEDECOD column, which represents the 'decoded value of the adverse event'
Step 3: Understand the Mediaset with PDFs
1. Open the Patient Feedback mediaset
Summary
You now have a Project set up to develop your workflow. You've also successfully uploaded your initial data asset to Foundry. In the following sections, you will process this data and make it available to both human and AIP agents to use.
Although we added data manually this time, Foundry includes a fully fledged Data Connection toolkit to manage recurring batch and streaming ingestions. It comes with 200+ bespoke connectors for the most common systems and a flexible plugin architecture to cover the rest. Both tabular and unstructured data assets are natively handled by the platform. These can be operated on using the broad Foundry ecosystem, as you'll see later in this course.
To learn more about data ingestion, you might follow this course up with the Data Connection Deep Dive, or by taking a look at the documentation.
Foundry has data security measures baked into every aspect of the platform. Projects, like the one you created earlier, are the atomic units of "discretionary" access controls. People with the Owner role on the Project can share the contents with other users, or groups of users, by setting them as Editors or Viewers.
Data Processing in Pipeline Builder
Introduction
In this section you will clean and prepare the data by formatting it into a version you can use downstream for visualizations. This includes cleaning the tabular structured data, as well as processing the unstructured data in the PDFs using AIP.
Create a New Pipeline
Step 1: Start Pipeline Creation
Step 2: Finish Pipeline Creation
- Rename the pipeline to “Data Processing”
- Select the pipeline type as a Batch pipeline
- Click on Create pipeline
Work on a New Branch
Step 1: Create a New Branch
- Click on the Main button
- Click on Create new branch
Step 2: Name your Branch
- If prompted, select Pipeline Builder branch
- Name your branch data-processing
- Click Create
Going forward you can see that you are working on your branch
Takeaways
Pipeline Builder offers out-of-the box version control backed by git, which allows you to seamlessly collaborate with your colleagues on the same pipeline, proposing and merging changes.
Add Data
Step 1: Open the "Add data" Window
- Click on Add Foundry data
- You can find this Add Data dialog in the top left of the Pipeline Builder screen if you want to add more data later. It might be particularly useful for setting up mock datasets (or configuration datasets) using the Manually Enter Data option.
Step 2: Select and Add Data
- Add the Patient Feedback media set to your selection using the + button, as well as our two tabular datasets
- Click Add data in the bottom right to confirm selection
Process PDFs
Step 1: Start New Transformation
- Back on the main Pipeline Builder graph, select the new Patient Feedback transform node
- Click the Transform button
- This will create a transform Node, where transformation steps can be chained and a resulting dataset can be visualized
Step 2: Rename Transform
- Rename the new transform node to "Process PDFs" in the top left corner
You can see a preview of the output of your current transformation at the bottom of the page. It gets updated each time you add a new transform to your node. You can chain transforms by selecting them from the search bar at the bottom of the current chain of operations (empty at the moment). Please keep adding transform blocks, one below the other, until instructed otherwise.
Step 3: Add First Transform Block
- Search for and add an Extract text from PDF transform block
- Select the mediaReference column in the Media reference dropdown
- Set the output column name to "pdf_text_extraction"
- Ensure that Skip recomputing rows is toggled on
- Click Apply
The "Extract text from PDF" transform block extracts text elements from the PDF documents in the Media Set you uploaded earlier. In some cases, the text is stored as images (for example, if the document was photo-scanned). In that situation, you should change the "Extract Method" to OCR (Optical Character Recognition).
Step 4: Add Second Transform Block
- Search for and add a Join array transform block
- For Separator, configure one space by pressing the spacebar in the textbox, followed by pressing Enter.
- Once you do so, you should see 1 space appear as the value, as shown in the picture below.
- For Array to join configure column pdf_text_extraction
- Rename the output to also be pdf_text_extraction
- Click Apply
Step 5: Finish Transformation
1. Click Close in the upper right hand corner.
Takeaways
The text you extracted from the PDF files is simpler to interact with than the raw PDF files you've uploaded. This makes it possible to easily run data transformations and AIP interactions on it.
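Conceptually, the two blocks you just chained behave like extracting a list of per-page text elements and then joining them with a single space. A minimal Python sketch of the "Join array" step (the page texts below are mock data, not output from the course PDFs):

```python
# The "Extract text from PDF" block yields one text element per PDF page;
# the "Join array" block concatenates them with a one-space separator.
# Mock page texts stand in for real extraction output.
pages = ["Patient 1001 reported", "mild headaches after dosing."]

pdf_text_extraction = " ".join(pages)  # the Join array step, single-space separator
print(pdf_text_extraction)
```

This is only a sketch of the semantics; inside Pipeline Builder the same logic runs as a managed, distributed transform.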
If you take a look at the Selection preview tab at the bottom of your screen, you should see something like this:
Use LLM
Step 1: Start an LLM Call
- Select the Process PDFs transform node
- Click the Use LLM button
Note: If you do not see the 'Use LLM' option in your Foundry instance, you may not have access to AIP features. Contact your organization's Foundry administrator or your Palantir representative for assistance.
Step 2: Choose Prompt Type
1. Click on Empty prompt
Step 3: Configure LLM Call
- Keep the default name "Use LLM"
- Add the following prompt to the instructions field:
In the context of patient feedback on their participation in a clinical trial, your job is to rate the following user message on a scale of 1 to 5, 1 being the most negative and 5 being the most positive.
In addition, you also need to extract the patient ID which is mentioned in the beginning of the document.
Finally, provide one sentence summary of what the user said.
Provide no further explanation, only these three things (patient_id, sentiment_score, summary).
- Type forward-slash "/" in the input data field and pick the pdf_text_extraction column
- Set the output type to be a struct with three strings: “patient_id”, "sentiment_score", "summary"
- Change the type to Integer for "patient_id" and "sentiment_score"
- Rename the output to llm_sentiment_analysis
If the output for llm_sentiment_analysis appears as 'null' for you, try changing the model that you are using. GPT-5 nano has been known to have issues here; Gemini works as expected.
Step 4: Finish Configuring LLM Call
- Click Apply
- Click Close
Takeaways
Pipeline Builder takes your prompt template, resolves the placeholders (adds the actual text from the pdf_text_extraction column in this case), and sends it to AIP for each row of input. Under the hood, it takes care of optimizing performance by issuing parallel calls, while respecting the various models' rate limits.
Pipeline Builder will choose a model by default to produce a response. In this example it was Gemini 2.0 Flash. If you wish, you can choose alternative models such as Grok or Mixtral 7B, by clicking on the Show configurations button in the Model section of the UI while setting up the prompt.
If you take a look at the Selection preview tab at the bottom of your screen, you should see something like this:
Note: the first time, GPT-5 nano was not working; switching to Gemini 2.5 worked. Switching back to GPT then worked on the second try.
Organize LLM Response
Step 1: Start New Transformation
- Select the Use LLM transform node
- Click the Transform button
- Rename the transform node to "Organize LLM result" in the top left
Step 2: Add First Transform Block
- Search for and add an Extract many struct fields transform block
- Set the llm_sentiment_analysis column as the struct to process
- Set patient_id, sentiment_score, summary as the properties to extract into new columns with corresponding names
- Click Apply
Step 3: Add Second Transform Block
- Search for and add the Select columns transform block
- Select patient_id, sentiment_score, summary as the columns to keep
- Click Apply in the upper right hand corner, then Close
Takeaways
By the end of this step, you'll have a dataset where each row corresponds to a PDF, and each column provides specific information about it, such as sentiment analysis, a summary of the feedback, and the ID of the patient who provided the feedback.
We utilized Pipeline Builder as a user-friendly, no-code tool to efficiently extract this information. In the subsequent steps, you'll process additional tabular datasets and ultimately integrate this information with the data from this step.
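Outside Pipeline Builder, the "Extract many struct fields" and "Select columns" steps can be sketched in pandas (the struct values below are mock LLM outputs, not real course data):

```python
import pandas as pd

# Mock LLM outputs: one struct (dict) per PDF, as produced by the Use LLM block.
df = pd.DataFrame({
    "llm_sentiment_analysis": [
        {"patient_id": 1001, "sentiment_score": 4, "summary": "Positive overall."},
        {"patient_id": 1002, "sentiment_score": 2, "summary": "Reported side effects."},
    ]
})

# "Extract many struct fields": pull each struct key into its own column.
fields = pd.json_normalize(df["llm_sentiment_analysis"].tolist())

# "Select columns": keep only the extracted columns.
result = fields[["patient_id", "sentiment_score", "summary"]]
print(result)
```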
Join Datasets
Step 1: Start the First Join
- Select the DM_Demographics node
- Click the Join button
- This will prompt you with a dialogue to initiate a join between two datasets. Click on AE_Adverse_events
- Click Start
Step 2: Configure the First Join
- Rename the join node to "Join DM and AE" in the top left
- Select the following options: 'Left join' as the Join Type, and 'USUBJID is equal to USUBJID' as the Match Condition. Leave everything else as preselected.
- Click Apply in the upper right hand corner, then Close
Step 3: Start the Second Join
- Select the “Join DM and AE” node
- Click the Join button
- This will prompt you with a dialogue to initiate a join between two datasets. Click on “Organize LLM result”
- Click Start
Step 4: Configure the Second Join
- Rename the join node to "Join with PDF extraction" in the top left
- Select the following option: ‘Left join’ as the Join Type, and ‘SUBJID is equal to patient_id’ as Match Condition. Leave everything else as already preselected.
- Click Apply in the upper right hand corner, then Close
Takeaways
Pipeline Builder enables seamless table merging with a no-code approach. In this exercise, we used a Left Join, but you can also choose from various join options, including KNN join and geometry-based joins.
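For readers more comfortable in code, the two left joins above can be sketched in pandas (column names follow the course datasets; the values are mock data):

```python
import pandas as pd

# Mock stand-ins for the course datasets; column names follow the exercise.
dm = pd.DataFrame({"USUBJID": ["S-1001", "S-1002"], "SUBJID": [1001, 1002], "AGE": [54, 61]})
ae = pd.DataFrame({"USUBJID": ["S-1001"], "AEDECOD": ["HEADACHE"]})
llm = pd.DataFrame({"patient_id": [1002], "sentiment_score": [4]})

# "Join DM and AE": left join on USUBJID.
join_dm_ae = dm.merge(ae, on="USUBJID", how="left")

# "Join with PDF extraction": left join on SUBJID = patient_id.
joined = join_dm_ae.merge(llm, left_on="SUBJID", right_on="patient_id", how="left")
print(joined)
```

A left join keeps every patient from the demographics table, filling in nulls where no adverse event or feedback matched, which is exactly the behavior used in the exercise.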
Visualize Adverse Event Occurrence
Step 1: Start New Transformation
- Select the “Join with PDF extraction” transform node
- Click the Transform button
- Rename the transform node to "Adverse Event Occurrence" in the top left
Step 2: Add First Transform Block
- Search for and add a Case transform block
- In ‘When’, set the AEDECOD column as the column to process
- Specify that the output value should be an integer of 0 if this column is null, and 1 otherwise
- Rename the output column to adverse_event_happened
- Click Apply
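The Case block above is equivalent to a simple null check. A pandas sketch with mock rows standing in for the joined dataset:

```python
import numpy as np
import pandas as pd

# Mock rows standing in for the joined dataset.
df = pd.DataFrame({"AEDECOD": ["HEADACHE", None, "NAUSEA"]})

# Case block: output 0 if AEDECOD is null, 1 otherwise.
df["adverse_event_happened"] = np.where(df["AEDECOD"].isna(), 0, 1)
print(df["adverse_event_happened"].tolist())  # → [1, 0, 1]
```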
Step 3: Add First Visualization Block
- Search for and add the Chart visualization block
Step 4: Configure the Visualization Block
- Make sure the Configure tab is selected
- Set the AGE column for X-Axis
- Select Sum as the aggregation for the Y-Axis
- In the aggregation field, select column adverse_event_happened
- Click on Segment by, and select ARM column
- Click Apply
Step 5: Format the Visualization Block
- Make sure the Format tab is selected
- Set "Adverse Event Occurrence vs. Age" as the Title
- Set "Age" as the X-Axis label
- Set "Adverse Event Occurrence" as the Y-Axis label
- Click Apply
Step 6: Preview the Chart Results
- Click on the Legend
- Preview the results of your chart.
- Hint: You may move right on the x-axis, by clicking the right arrow next to Show results
Note: You may notice that the graphic looks slightly different from the image above. This is because the input data may be sampled differently in Pipeline Builder when previewing, resulting in variations from the example shown.
Takeaways
Pipeline Builder offers a quick and simple way of validating your data using histogram visualizations, in a no-code fashion.
Foundry also provides Contour, a no-code, point-and-click user interface for performing data analysis and visualizations. Contour allows you to perform data analysis on tables at scale. These analyses can be used to visualize data using various charts and create interactive dashboards.
To learn more about Contour, take a look at the documentation, or another course which offers deep-dive on this tool.
Add Output of your Data Processing
Step 1: Create New Output Dataset
- Select the Adverse Event Occurrence transform node
- Click the Add output button
- Click the New dataset button
Step 2: Rename Output Dataset
Merge your Branch
Step 1: Create a New Merge Proposal
- Click Save to save all your changes (unless you saved it already)
- Click Propose button
- Give your proposal a name and a description
- Click Create proposal
Step 2: Review Changes
- Select Changes tab in the newly created proposal
- Review the changes compared to the current state of the Main branch
Step 3: Merge the Proposal
- Select Overview tab in the newly created proposal
- Click Merge proposal
Step 4: Observe the Changes
- Select the Graph tab
- Observe that the Main branch now includes merged changes from the branch that you were working on.
Deploy Pipeline
- Click Save in the top right corner, if possible
- Click Deploy in the top right corner
In case you get a warning about resources
- Click Import resources in the warning pop-up
- Click Add references in the import pop-up
- You may need to click Deploy pipeline once more
Summary
You have now processed both tabular and unstructured data from PDFs, producing a dataset that combines all provided information, ready for data analysis and insights generation.
Pipeline Builder is Foundry's no-code solution for authoring data transformations. It comes with hundreds of optimized data operations out of the box, as well as support for core software engineering concepts such as version control, scheduling, and incremental execution. It is well suited for the majority of data-pipeline building tasks, and provides a boilerplate-free entry point when rapid development is a priority over complex technical configuration.
To learn more about Pipeline Builder, you may take the Pipeline Builder Deep Dive course, or read more about it in the documentation.
Foundry also provides a high-code transform authoring environment called Code Repositories. It allows you to leverage the fine controls of Python and Java to create performant transforms, and rely on popular libraries for specialized functionality.
To learn more about Transform authoring in Code Repositories, you may take the Code Repository Deep Dive course, or read more about it in the documentation.
Insight Generation → JupyterLab Workspace
In this section you will learn how to utilize JupyterLab Workspaces for data analysis, generate insights based on your data, version reports, and share them with your colleagues.
Create a New Jupyter Workspace
Step 1: Start Jupyter Workspace Creation
- Navigate to your course folder, inside your training project
- Right-click inside the folder and pick New
- Alternatively, you can click New in the top right corner
- Select Jupyter workspace from the drop-down
Step 2: Finish Jupyter Workspace Creation
1. Name your Workspace Jupyter - Data Analysis
2. Click Continue
Takeaways
Foundry offers out-of-the-box integration with JupyterLab, where data scientists can work and collaborate in a secure and reproducible environment writing notebooks and python scripts.
Load the Data to your Workspace
Step 1: Create New Jupyter Notebook
- Under Notebook, click on Python [user default]
- Right-click on top of the newly created notebook
- Click Rename Notebook...
Observe the newly created notebook in your Jupyter Workspace
Step 2: Start Loading the Dataset
- Click on the Data tab on the left-hand side
- Click on Add data
- Select Read data
- Find the output dataset adverse_events_processed
- Click on the dataset
- Click Select
Step 3: Finish Loading the Dataset
- Click Add dataset
The dataset is now only referenced inside the Jupyter Workspace. To load the full content of the dataset you need to run code that loads it.
- Click on Copy to clipboard
- Go to your analysis.ipynb notebook and paste the content into the cell by pressing CTRL+V
- Click the "Run" button
Step 4: Confirm the Dataset is Loaded
- Reference the variable that contains your data, by writing adverse_events_processed.head(), in a new cell
Takeaways
JupyterLab runs in a container within Foundry. To load the data into this container, and thus use it in Jupyter Workspace, you needed to first reference it and then load it using the code snippet provided to you in the Data tab.
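Once the copied snippet has run, the dataset is available as a regular pandas DataFrame. A sketch of the sanity checks, with a mocked DataFrame standing in for the actual loaded data:

```python
import pandas as pd

# Mocked stand-in: assume the snippet copied from the Data tab produced a
# pandas DataFrame named adverse_events_processed.
adverse_events_processed = pd.DataFrame(
    {"AGE": [54, 61], "ARM": ["Miracle", "Placebo"], "adverse_event_happened": [1, 0]}
)

# Quick sanity checks after loading.
print(adverse_events_processed.head())   # first rows
print(adverse_events_processed.shape)    # (rows, columns)
```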
Install Libraries
Step 1: Confirm you are on the Managed Conda Environment
- Click on the Settings tab at the top of the screen
- Click on Advanced Features
- If the managed Conda environment is Enabled, you are good to go!
- If it is not enabled, click on Enable managed Conda environments. This will prompt your Workspace to restart, and after a few seconds you can continue working on it.
Step 2: Install matplotlib Library
- Go back to the Code page and click on the Libraries tab
- Within the Libraries tab, select the Conda tab
Takeaways
For this exercise you installed one Python library: matplotlib, which you will need for the remainder of this exercise.
Foundry offers built-in support for reproducible data science. To ensure reproducibility, Foundry provides a managed Conda environment solution. This solution guarantees that environments, including all installed packages, are consistently reproducible. For further details, please refer to this documentation page.
Write Initial Python Code for Data Analysis
Step 1: Import Libraries
- Navigate back to your analysis.ipynb notebook
- Import pandas and matplotlib
- You can use the code below.
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Create First Histogram
- Plot the first histogram, showing age distribution of patients, based on whether they took the Miracle drug or were in the Placebo group.
Step 3: Create Second Histogram
- Plot the second histogram, showing age distribution of patients, based on the category of adverse events that happened to them.
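A hedged matplotlib sketch of the two histograms (the column names AGE, ARM, and AEDECOD, and the arm labels, are assumptions based on the datasets described earlier; the rows are mock data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Mock rows; the real notebook uses the loaded adverse_events_processed DataFrame.
df = pd.DataFrame({
    "AGE": [34, 41, 55, 62, 47, 58],
    "ARM": ["Miracle", "Placebo", "Miracle", "Placebo", "Miracle", "Placebo"],
    "AEDECOD": ["HEADACHE", None, "NAUSEA", "HEADACHE", None, "NAUSEA"],
})

# Histogram 1: age distribution by treatment arm.
fig1, ax1 = plt.subplots()
for arm, grp in df.groupby("ARM"):
    ax1.hist(grp["AGE"], bins=5, alpha=0.5, label=arm)
ax1.set_xlabel("Age")
ax1.set_ylabel("Patients")
ax1.legend(title="ARM")

# Histogram 2: age distribution by adverse event category.
fig2, ax2 = plt.subplots()
for event, grp in df.dropna(subset=["AEDECOD"]).groupby("AEDECOD"):
    ax2.hist(grp["AGE"], bins=5, alpha=0.5, label=event)
ax2.set_xlabel("Age")
ax2.set_ylabel("Patients")
ax2.legend(title="Adverse event")
```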
Step 4: Save your Notebook
- Click the Save icon
Takeaways
The JupyterLab integration with Foundry allows you to write and run Python notebooks in the same fashion you would interact with Jupyter outside of the platform. On top of Jupyter, Foundry offers a safe collaboration space to access and analyze data on the platform and guarantees reproducibility.
Sync Changes to the Backing Repository
Step 1: Navigate to the Backing Repo
- Click on the Code Repositories button
- Click Open without syncing
- This will open a new tab in your browser where you can see the backing repository. Currently your analysis.ipynb notebook is not there.
Step 3: Observe Synced Changes
- Navigate back to the tab with the Code Repository open
- You should now see a blue banner at the top of the page. If you don't, click the refresh button in your browser.
- Click Update to most recent version
- Observe your analysis.ipynb notebook is now present in the repository
Takeaways
All Code Workspaces are backed by a Code Repository. This enables Code Workspaces to have industry-standard version control features like branching, merging, and commit history, allowing other users to view the code and safely operate in the same Workspace.
By syncing changes, you made sure that the code you wrote in your Workspace is saved to the backing repository. For further details, please refer to this documentation page.
Create a Streamlit App
Step 1: Prepare the Environment for the Streamlit App (Note: return to the first browser tab, the Code Workspace from the last step; the previous step opened the backing code repository in a new tab)
- Click on Application tab on the left-hand side of the Workspace
- Select Streamlit from the dropdown menu
- You will be prompted with a notification that creating a Streamlit app requires additional libraries to be installed
- Click on run command
- This will start the installation of the required libraries, and might take a few seconds. Wait until you observe in the Jupyter Console that the library is installed, and the banner for additional libraries required is removed.
Step 2: Create a New Streamlit Application
- Click Publish Application
- Name the application StreamlitApp
- Click Publish and Sync
After a few seconds, your application will be created.
- A streamlitapp.py file is created.
- You can observe your initial app on the right side of your Workspace.
Step 3: Import Libraries and Change the Title
- Import pandas and matplotlib libraries.
- You can use the code below.
import pandas as pd
import matplotlib.pyplot as plt
- Give the StreamlitApp the title "Adverse Event Age Distribution Explorer"
- You can use the code below
st.title("Adverse Event Age Distribution Explorer")
Step 4: Load the Data
- Navigate to the Data tab
- Click Expand on the dataset reference
- Paste the code into the streamlitapp.py file by pressing CTRL+V
Step 5: Create an Interactive Histogram Plotting
- Create a dropdown menu for selecting an adverse event
- Plot the age distribution based on the selected value in the dropdown menu
- Provide a feedback summary of patients for the given selection of the adverse event
- Click File >> Save All
Step 6: Observe your App & Sync Changes
- Click Refresh
- Observe and test your Streamlit app
Takeaways
Through Jupyter workspace integration you are able to create interactive dashboards.
Apart from Streamlit — which you’ve seen in this course — you can also create Tensorboard and Dash applications in a similar fashion. For more information you can refer to these documentation pages, about Tensorboard integration, and about Dash integration.
Note: a ModuleNotFoundError: No module named 'altair.vegalite.v4' error can occur in Streamlit apps running on Palantir Foundry. See the note Streamlit_Altair_Fix_JobAid for a step-by-step solution, how it was diagnosed and fixed, and key differences from debugging in local or standard environments.
Inspect and Share Published Streamlit Application
Step 1: Open the Published Streamlit Application
- Navigate to the Applications tab
- You will see that your Streamlit application is automatically published.
- Click on the link to the application to open it in new tab
- View the application and interact with it in the new tab.
- The application will likely need a few seconds to load.
Step 2: Navigate Branches and Versions of the Report
- Click on the Latest dropdown.
- Select the previous version to view the previous version of the report
- Click on master dropdown.
- Currently you will not see any other branch in the dropdown, but if you did work on a different branch, you would be able to effortlessly navigate between these versions.
Step 3: Share the Application with your Colleagues
- Copy the URL to your Streamlit application
- [Optional] You can share this URL with colleagues, and they will be able to see and interact with the application, provided that they have access to the Foundry environment and Project where you are working.
Takeaways
You have created and published a Streamlit application that can be easily shared with anyone in your organization who has access to the Foundry environment by simply sharing the application’s URL.
Additionally, you can effortlessly navigate between different versions of the application. For example, we were able to view both the initial "Hello world" version of the application and the updated full version.
Code Workspaces also provides version control, so you can work and develop your application on a different branch, without affecting the one on the main branch. For further details, please refer to this documentation page.
Summary
You have now produced your first data insights using JupyterLab integration to Foundry. You have then created an interactive Streamlit application for a crisp overview of these insights. Finally, you observed this application being published, which now exists on Foundry, and is therefore accessible by your colleagues who have access to the environment, project and data you are working with.
Code Workspaces brings the JupyterLab, RStudio Workbench, and VS Code third-party IDEs to Palantir Foundry, enabling users to boost their productivity and accelerate their data science and statistics workflows by using their preferred tools.
To learn more about Code Workspaces, including the JupyterLab Workspaces, take a look at the documentation.
Connecting it All Together with Data Lineage
Now that you're finished with both data processing and analysis, you will learn how to visualize all your work using Data Lineage. Data Lineage allows you to observe your data pipeline and how data flows through the platform.
Create Data Lineage
Step 1: Navigate Back to your Working Folder
- Navigate to your working folder
- By now, you should have a few resources created (e.g. Pipeline Builder, Streamlit application, etc.)
- Open the Streamlit app, the last resource you created, by clicking on it
- Once the Streamlit app is open: click on Actions >> Explore data lineage
- This will open a new application, called Data Lineage
Step 2: Expand the Lineage
- Data Lineage is an interactive tool that facilitates a holistic view of how data flows through the Foundry platform.
- One rectangle represents one resource
- Arrange the graph so it looks like on the image below
- Expand the lineage by clicking on right / left arrows on the rectangles
- To delete a node in the graph, simply select it and press the Backspace key on your keyboard
- To re-order nodes, simply select them and drag them with your mouse
Explore Data Lineage
Step 1: Preview Nodes
- Select adverse_events_processed node
- Click Preview on bottom of the screen
- You can now preview the dataset
- Click the Code tab
- This allows you to inspect the pseudo-code of the Pipeline Builder transformation you created
Step 2: Save the Lineage
- Click Save
- Provide name “Full Data Lineage”
- Click Save
Takeaways
Data Lineage is an interactive tool that facilitates a holistic view of how data flows through the Foundry platform.
Apart from the resources you visualized on Data Lineage in this exercise, you are able to visualize many other Foundry resources as well (e.g. datasets produced via Code Repositories).
With Data Lineage, you can:
- Easily find and discover datasets
- Explore pipelines through a powerful interface
- Expand or hide ancestors and descendants of datasets
- Visualize your graph through coloring, incl. making custom legends
- Collaborate with teammates
- Create pipeline snapshots to share with colleagues
For more information, refer to this documentation page.
Next Steps: Machine Learning
In this course, you explored the fundamentals of data analysis and how these processes can be carried out on Foundry.
As a natural progression, you may want to learn how to develop and integrate machine learning (ML) models within the Foundry platform.
- To get started with training ML models on Foundry, refer to this step-by-step tutorial: Train a Machine Learning Model in a Jupyter Notebook.
- Foundry also offers out-of-the-box Modeling tooling that streamlines the process of building, deploying, and managing ML models. You can learn more about these capabilities here: Foundry Modeling Overview.
- In summary, Foundry allows you to train models directly within the platform, deploy them anywhere else in Foundry, or even bring in models trained externally.
- For a practical example, review this "Build with AIP" use case, which demonstrates model training in a Jupyter environment on Foundry: Build with AIP Example.
By following these resources, you can continue to build on your data analysis skills and begin leveraging machine learning workflows on Foundry.
Conclusion
Well done! You're finished with your first data analysis on Foundry!
In this course:
- You learned how to synthesize information from multiple data assets related to a clinical trial. You combined tabular data—such as patient demographics and records of adverse events—with unstructured data from PDFs, using AIP for natural language processing, to gather valuable insights from patient feedback and opinions.
- You then performed initial data analysis in Jupyter Notebook (and optionally in RStudio), including plotting the age distribution of patients while controlling for adverse events.
- To communicate your findings, you developed an interactive Streamlit application (and, optionally, an R Markdown report), which you published for easy sharing with colleagues.
- Finally, you visualized the entire data flow of your analysis, clearly mapping out how data was collected, processed, analyzed, and presented.