Use LLM in Pipeline Builder to parse text
This document explains how to leverage Large Language Models (LLMs) within Palantir Foundry's Pipeline Builder. It demonstrates a practical application of LLMs for parsing unstructured text data directly within the data transformation environment.
Step 1: Extract text from PDF
Step 2: Join Array
Step 3: LLM
- Click Back to Graph in the top-right corner, and then double-click on either the Summarize node or Extract Cost and Categories node. Take a look at the prompts we are using for the LLM.
- Swap over to the Input table tab in the bottom pane. Try highlighting one row and clicking Use rows for trial run. What happens?
Prompt example:
* Ingest the PDFs as a media set, recording when each media item was added and deduplicating by file path and name.
* Extract the raw text from each PDF, passing in the media reference column from the previous step and the page range you would like to extract.
* Finally, join the resulting text array using spaces as a separator.
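The extract-and-join steps above can be sketched in plain Python. This is only an illustration of what the boards do (the page strings are made up; in Pipeline Builder the extraction and join are built-in boards, not code you write):

```python
# Hypothetical output of the PDF text-extraction step: one string per page.
pages = ["OFFICE SUPPLY RECEIPT", "Paper Products", "Total: $201.00"]

# The "Join Array" step concatenates the per-page strings with a space
# separator, producing a single text column the LLM board can consume.
receipt_text = " ".join(pages)
```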
The next step shows how the LLM board can summarize the text extracted from these office supply receipts. The model also corrects any misspellings in the text while capturing the important information from the original text.
Example 2:
In the context of office supply receipts, your job is to summarize the text in the following user message in 3 sentences.
There might be mistakes in the extracted text, so use your judgement to correct words and spelling. For example, use "Palantir" instead of "Palan%r" or "Palan&r"
Adhere to these limitations strongly while still retaining the main and overarching points of the text.
Example 3:
Given the following receipt text, identify the category of office supplies purchased and the total cost of the receipt in USD. Please provide the information in the format:
{ "Category": [category],
"Cost": [total cost]}
Use one of the following categories:
- Writing Instruments, Paper Products, Desk Accessories, Filing & Storage, Office Electronics, Mailing & Shipping Supplies, Presentation Materials, Office Furniture, Cleaning Supplies, and Breakroom Supplies.
For example:
{"Category": "Paper Products",
"Cost": "201"}
Output only a single category to which the prompt most closely fits. No explanation.
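Because the prompt asks for structured JSON, it can help to validate the model's reply before parsing it downstream. A minimal stdlib sketch, assuming a reply shaped like the example above (the `validate_llm_reply` helper is hypothetical, not a Pipeline Builder feature; the category set mirrors the prompt):

```python
import json

# The categories the prompt allows, copied from the instruction above.
ALLOWED_CATEGORIES = {
    "Writing Instruments", "Paper Products", "Desk Accessories",
    "Filing & Storage", "Office Electronics", "Mailing & Shipping Supplies",
    "Presentation Materials", "Office Furniture", "Cleaning Supplies",
    "Breakroom Supplies",
}

def validate_llm_reply(raw: str) -> dict:
    """Parse the LLM's JSON reply and check the category is one we asked for."""
    parsed = json.loads(raw)
    if parsed.get("Category") not in ALLOWED_CATEGORIES:
        raise ValueError(f"Unexpected category: {parsed.get('Category')!r}")
    return parsed

reply = validate_llm_reply('{"Category": "Paper Products", "Cost": "201"}')
```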
- Double-click on the Transform node to understand how to manipulate the LLM output into a structured dataset.
What is happening in this block?
The last step used the LLM to categorize our data and return it with a particular structure. Since that step extracted the categories and costs from the PDFs, this board transforms the LLM data into a structured dataset that can be used in the Ontology. To do this:
First, rename columns that need clearer names (here we just renamed the extracted text column)
Next, parse the JSON output from the LLM into a schema with a struct
Then, parse that struct to create a separate column for each property (here we create a new column for each of "Category" and "Cost")
Check the types of each property! The "Cost" column needs to be cast from a String type to a Double type.
Finally, select only the columns that should appear in your Ontology object. **Note:** if any columns are not selected, a yellow warning will appear at the top of the output configuration in the next step. Update the output schema in the right-hand sidebar to include only the selected columns.
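The rename → parse → extract → cast sequence above can be approximated in plain Python. This is only a sketch of the logic (the rows are made up; in Pipeline Builder each of these is a separate board):

```python
import json

# Hypothetical rows as they leave the LLM board: raw JSON in a string column.
rows = [
    {"extracted_text": "(full receipt text)",
     "llm_output": '{"Category": "Paper Products", "Cost": "201"}'},
]

structured = []
for row in rows:
    parsed = json.loads(row["llm_output"])  # parse the JSON output into a struct
    structured.append({
        "Category": parsed["Category"],     # one column per property
        "Cost": float(parsed["Cost"]),      # cast "Cost" from String to Double
    })
```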
- Finally, double click on the Office Supply Receipts node to see the output configuration in the right pane. What columns are included in the output?
- Notice the yellow warning at the top of the output configuration: 1 dropped. What is this referencing?
Next Steps
This is just one of many extraction, summarization, and classification use cases you can unlock today. With this pipeline, you could build workflows such as:
- Alerting workflows: When do expenses in a certain category exceed a cost threshold?
- Spending trends across receipts: Which categories of office supplies are driving the most spend?
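Both ideas reduce to simple aggregations once the data is structured. A stdlib-only sketch (the receipt rows are made up, and the $500 alert threshold is an arbitrary example, not a product default):

```python
from collections import defaultdict

# Made-up structured receipts, as produced by the pipeline above.
receipts = [
    {"Category": "Paper Products", "Cost": 201.0},
    {"Category": "Paper Products", "Cost": 350.0},
    {"Category": "Office Electronics", "Cost": 120.0},
]

# Spending trend: total cost per category.
totals = defaultdict(float)
for r in receipts:
    totals[r["Category"]] += r["Cost"]

# Alerting: flag categories whose spend exceeds an arbitrary $500 threshold.
THRESHOLD = 500.0
alerts = [cat for cat, total in totals.items() if total > THRESHOLD]
```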
Treat this pipeline as a jumping-off point. Some things to try next:
- Try editing the Instructions field of the prompt. What additional business context might help the LLM make better decisions?
- Try replacing Receipts with different data. What other data might be useful to classify into problems and solutions?
- Try changing the Provide input data field (e.g., if you substitute the receipts with email data, process email_body instead of subject). Does the LLM give the same results?