This post is a follow-up to that one. The highlights of this one are:
Also, since my last post, I’ve refactored the code base in its current state as both a Python package and a command-line tool, which you can find in this link. This is what I used to run the experiments shown on this page. A sample notebook is provided here and the underlying code is < 200 lines long!
A few days ago, Google released Gemma, a family of lightweight models the same size as Mistral-7b, the one I was already using.
My first attempt at loading Gemma-7b was through Hugging Face. However, it needs extra configuration to fit in my memory, e.g. float16/bfloat16 quantization with PyTorch, or 8-bit/4-bit quantization with bitsandbytes. Also, the model files are huge, totalling 20 GB.
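For reference, the snippet below sketches the kind of extra configuration I mean, using Transformers with either a bfloat16 dtype or 4-bit quantization via bitsandbytes. The exact arguments here are illustrative assumptions, not the configuration I settled on:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Alternative 1: load the weights in bfloat16, halving the footprint of float32
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Alternative 2: 4-bit quantization with bitsandbytes, for an even smaller footprint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)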
For now, let’s familiarize ourselves with Ollama, which is simpler, and leave the Hugging Face ecosystem for later.
It turns out a very widespread way of running models locally is through Ollama, which is very simple to install on Linux.
curl -fsSL https://ollama.com/install.sh | sh
Ollama has a Docker-like interface (it was made by an ex-Docker employee) and models can be downloaded using their names and some basic tags. Downloaded models can be executed from Python using their official library.
With that, I decided to stop using models from multiple libraries and frameworks and run all of them in Ollama. The commands below downloaded Mistral-7b, Gemma-7b, and a text embedding model from Nomic.AI.
ollama pull mistral:7b-instruct
ollama pull gemma:7b-instruct
ollama pull nomic-embed-text
The model files downloaded by Ollama are considerably smaller (each 7b model fits in a single 5 GB file). They also loaded into my GPU memory, alongside the embedding model.
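As a quick sanity check, the pulled models can also be queried directly from Python with the official ollama package. Below is a minimal sketch, assuming the Ollama server is running and the package was installed with pip install ollama:
import ollama

# Ask a question to one of the instruct models
response = ollama.chat(
    model="mistral:7b-instruct",
    messages=[{"role": "user", "content": "What is zero-shot learning?"}],
)
print(response["message"]["content"])

# Compute an embedding vector with the Nomic model
embedded = ollama.embeddings(model="nomic-embed-text", prompt="What is zero-shot learning?")
print(len(embedded["embedding"]))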
It turns out the prompt templates, especially for Instruct models, need to use the correct tokens and that is not done automatically by LangChain or any other library.
I looked up the correct templates on the Ollama website. They are shown below. Contrast this with my last post, in which I copied an internal prompt from LangChain that was not tailored to any specific model.
While tweaking the contents of the prompt template, I realized how much variability minor tweaks introduced into the responses. For example, if I told Gemma something like “if you don’t know the answer, just say you don’t know and don’t make anything up”, the model would simply refuse to answer any questions, even when the information was present in the retrieved text chunks. To make the comparison fair, I also removed the same section from the Mistral prompt template, which I had used in the previous post.
Sources: Ollama and Hugging Face Hub.
<s>[INST] Use the following pieces of context to answer the question at the end. Present a well-formatted answer, using Markdown if possible. Don't go over three paragraphs when answering.
---
{context}
---
Question: {question} [/INST]
Sources: Ollama and Hugging Face Hub.
<start_of_turn>user
Use the following pieces of context to answer the question at the end. Present a well-formatted answer, using Markdown if possible. Don't go over three paragraphs when answering.
---
{context}
---
Question: {question}<end_of_turn>
<start_of_turn>model
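For the templates above to actually reach the model with the right tokens, they have to be passed explicitly. The sketch below shows one way of wiring the Mistral template into LangChain’s Ollama integration; it is only an illustration, the real wiring lives in the linked package:
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama

mistral_template = """<s>[INST] Use the following pieces of context to answer the question at the end. Present a well-formatted answer, using Markdown if possible. Don't go over three paragraphs when answering.
---
{context}
---
Question: {question} [/INST]"""

prompt = PromptTemplate.from_template(mistral_template)
llm = Ollama(model="mistral:7b-instruct")

# Fill in the template and send the full prompt, special tokens included, to the model
print(llm.invoke(prompt.format(context="(retrieved chunks go here)", question="What is zero-shot learning?")))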
One problem I was having was that performing maximal marginal relevance search on a small number of chunks (e.g. 3) tended to result in only a single helpful text chunk. To fix that, I retrieved a larger number of chunks.
Some final tweaking that made the output of both models better was using the following parameters:
Since my last post, I’ve refactored the code base in its current state as both a Python package and a command-line tool, which you can find in this link. This is what I used in the construction of this post.
The snippet below (which is also available as a Jupyter Notebook) showcases how to instantiate and talk to the LLM-powered Q&A bot:
from pathlib import Path
from langsidian import ChatBot, DocumentBase
bot = ChatBot(
    docs_path=Path.home() / "Documents" / "Obsidian",
    vectorstore_db_path=Path("docs/chroma"),
    document_type=DocumentBase.OBSIDIAN,
    model_type="mistral:7b-instruct",
    embeddings="nomic",
)
q = "What is zero-shot learning?"
print(f"**Question: {q}**\n\n Answer: {bot.answer(q)}", end="\n\n---\n\n")
It can also be executed as a standalone module. The document path and model name can be selected with command-line arguments:
python -m langsidian
[Prompt]: Who were the authors of the Deep Q-Network paper?
[Assistant]: The authors of the Deep Q-Network paper were Mnih, Vaswani, Graves, et al. (2015). You can find their paper here: <https://arxiv.org/abs/1312.5602>.
Below, I asked the same questions to both Mistral-7b-it and Gemma-7b-it, giving them access to my Obsidian document base, whose chunks are now embedded using Nomic Embed.
Analyzing the output qualitatively, I found that Mistral tends to give more informative and helpful answers. It also makes better use of the information in the chunks provided by the RAG procedure, creating richer answers, whereas Gemma tends to repeat itself or cut answers short when it hits a dead-end such as in the answer for this question.
Also, in the first question, Gemma explained terms that may have been present in the context chunks retrieved for RAG (such as $a’$ and $\pi_*$) but that did not appear in its final answer, which would definitely confuse users in a conversation.
The equation for the action-value function update using the n-step return is as follows:
\[Q_{t+n}(S_t, A_t) \leftarrow Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - \gamma^n Q_{t+n-1}(S_t, A_t)]\]
This equation is a form of the Bellman equation for the action-value function. Let’s break down its terms:
This update rule aims to minimize the difference between the current and new estimates of $Q(S_t,A_t)$, gradually improving the estimation as more samples are collected.
Equation for Action Value Function Update:
\[Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - \gamma^n Q_{t+n-1}(S_t, A_t)]\]
Explanation of Terms:
Note: This equation is for the action-value function update in Q-learning, not the state-value function update.
The Deep Q-Network (DQN) architecture uses a deep neural network to approximate the Q* function in reinforce learning. The input to the network are 4 frames of 84x84 pixel screenshots from Atari 2600 games, only using the luminance values of the pixels. The network then uses three convolutional layers to learn visual representations of the game states. The output of the neural network is the Q values of all possible actions for each state, ranging from 4 to 18.
This architecture builds upon the Q-learning algorithm, which aims to find the optimal policy by iteratively improving the Q values based on observed state-action pairs and their corresponding rewards. However, directly applying this algorithm to deep neural networks with large input spaces is infeasible due to the computational demands.
To overcome this challenge, DQN uses experiences from multiple frames and applies a target network for more stable learning. The target network is updated periodically to match the weights of the online network, ensuring that the Q values in the target network do not change too frequently during training. Additionally, the authors use experience replay, where they store transitions in a buffer and sample them randomly to train the network, which helps in reducing correlation between consecutive samples and improving stability.
These architectural enhancements allowed DQN to learn complex visual representations of Atari games, achieve human-level performance on several benchmark tasks, and set a new standard for deep reinforce learning research.
The text describes the architecture of Deep Q-Networks, which uses a deep neural network to approximate the Q* function. The neural network has three convolution layers to learn visual representations of the game states.
The text does not describe the architecture of the convolutional layers, therefore I cannot answer the question.
Causal and masked language modeling are two different methods used for pre-training language models.
In causal language modeling, also known as autoregressive language modeling, the language model is trained to predict the next token in a sequence given the context of the previous tokens. The model has access to the entire sequence during training and prediction, and its task is to generate the next token based on the context provided. However, since there is no ground truth available during inference, this can lead to a discrepancy between training and inference, known as Exposure Bias. This issue can be addressed through techniques like Curriculum learning.
In masked language modeling, also known as Masked Language Modeling (MLM), tokens are randomly masked at training time and the model is trained to reconstruct the original text by predicting the masked tokens. For instance, in the sentence “the teacher teaches the student”, the word “teaches” might be masked as “
Sources: DeepLearning.AI Natural Language Processing Specialization, Generative AI with Large Language Models.
Causal language modeling and masked language modeling are two methods for pre-training language models.
Causal language modeling allows the model to see the first tokens of the sequence and its task is to predict the next token. Masked language modeling, on the other hand, randomly masks tokens at training time and the model is trained to reconstruct the original text by predicting the masked tokens.
The main difference between causal language modeling and masked language modeling is that in causal language modeling, the model has access to the first tokens of the sequence and its task is to predict the next token, while in masked language modeling, the model is trained to reconstruct the original text by predicting the masked tokens.
Zero-shot learning is a concept used in machine learning, specifically in the context of Generative AI and Large Language Models (LLMs), to describe the capability of an LLM to predict classes it was not explicitly trained on without being given any explicit examples during training or at inference time. In other words, the model makes predictions based on its existing knowledge, rather than learning from new examples provided in real-time. This is in contrast to few-shot learning, where the model is given a few examples of the task it’s expected to perform before being presented with the actual task. Zero-shot learning requires a large and diverse dataset for the model to generalize effectively to unseen classes or concepts.
One limitation of zero-shot learning is that it may not be as effective as few-shot learning, especially when dealing with complex tasks or rare classes. Another limitation is that it relies heavily on the quality and representativeness of the training data for making accurate predictions. However, since zero-shot learning allows models to execute new tasks without being given any explicit examples, it can be beneficial in scenarios where acquiring labeled data is difficult, costly, or time-consuming.
Regarding your additional question: In Q-learning, importance sampling is not used because it is an off-policy reinforce learning algorithm that learns from a sequence of state-action pairs (trajectories) generated by a policy π, which may not be the optimal one. The goal is to estimate the action-value function Q(s,a), and importance sampling would introduce additional complexity without significant benefits in this context.
Zero-shot learning is the capability of a model to predict classes it was not explicitly trained on. In the context of Generative AI and LLMs, zero-shot learning refers to the capability of an LLM to execute new tasks without being given any explicit examples.
In zero-shot learning, the model is given no examples of the task it is expected to perform, and it must generalize from its previous experiences to complete the task. This contrasts with few-shot learning, which involves providing the model with a few examples of the task it is expected to perform.
For example, if you ask a model to classify a review as positive or negative, and it has never seen a review with that sentiment before, zero-shot learning would involve the model making a generalization based on its previous experiences to determine whether the review is positive or negative.
Bucketing in Recurrent Neural Networks (RNNs) is a technique used for batching sequences while minimizing padding. The main goal is to process sequences of varying lengths efficiently and maintain a consistent batch size during training. Here’s a step-by-step explanation:
Divide the data into $m$ buckets based on sequence lengths. Each bucket represents a specific range of sequence lengths.
Bucketing is an essential technique when dealing with sequence data, as it helps to minimize padding and improve training efficiency. However, it might require more computational resources due to the need to process multiple buckets for generating batches over time.
It’s important to note that this method assumes that sequences within a bucket are independent of one another. If there is any dependence between sequences in a bucket, this assumption would not hold true, and other data processing techniques might be required.
Bucketing in RNNs is a technique for batching sequences in a way that minimizes padding. The idea is to group sequences of similar length into buckets and then process the buckets separately.
The process of bucketing involves dividing the data into buckets, sampling sequences from a single bucket, and padding all sequences in a batch to the size of the largest sequence in the batch. This technique is useful for RNNs because it allows for the efficient processing of sequences of varying lengths.
The technique is illustrated in the provided text with two images. The first image shows the process of dividing the data into buckets, and the second image shows the process of padding the sequences in a batch.
A named entity is a real-world object that can be denoted with a proper name, examples include persons, locations, organizations, products, geographical entities, geopolitical entities, time indicators, artifacts, events, and natural phenomena. Named Entity Recognition (NER) systems are used to automatically locate named entities from unstructured text. This process is important for various applications such as improving search engine efficiency by scanning websites and storing named entities for later matching with user queries, and enhancing recommendation systems by extracting named entities from a user’s search history to identify their interests.
For instance, in the sentence “Sharon flew to Miami on Friday,” the named entities are “Sharon” (person), “Miami” (location), and “Friday” (time indicator). NER systems can accurately identify and tag these named entities within the text. The remaining tokens that do not represent named entities are typically tagged as filler words or `O`.
A named entity is a real-world object that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include people, locations, organizations, and products.
Named entities are extracted from text and are used to identify and classify named objects in a sentence. They are used in various applications, such as search engine efficiency, recommendation systems, and information extraction.
In the text provided, the sentence “Sharon flew to Miami on Friday” would have the named entities “Sharon,” “Miami,” and “Friday.”
In this post, I show you how I ran an LLM-powered Q&A chatbot locally, performing RAG over an Obsidian document base, using the Mistral-7b and Gemma-7b instruction fine-tuned models as well as Nomic Embed, all served by Ollama.
I showed you how I did it, and you can:
I am going to run all experiments locally, using a 7th gen i7, an NVIDIA GTX 1070 and 32 GB of RAM.
A lot of the heavy lifting will be done by the LangChain package, which I am in the process of learning to use.
On the road to building this Q&A bot, we will be introduced to many concepts:
So that no one gets lost, the following diagram explains how the whole pipeline for our final Q&A bot (and this notebook) works:
You should download a model from the GPT4All website and save it to `./models/my_little_llm.gguf`. The one below is the one I used.
wget https://gpt4all.io/models/gguf/mistral-7b-openorca.Q4_0.gguf -O models/my_little_llm.gguf
Here I used the `ObsidianLoader` document loader and pointed it to the directory that contains all my notes in Markdown format.
We can see I have ~500 text files.
from pathlib import Path
docs_path = (Path.home() / "Documents" / "Obsidian").absolute()
from langchain_community.document_loaders import ObsidianLoader
loader = ObsidianLoader(docs_path, collect_metadata=True, encoding="UTF-8")
docs = loader.load()
print(f"Loaded {len(docs)} docs")
Encountered non-yaml frontmatter
Loaded 499 docs
Let’s take a peek at one of the documents. We can see it has the textual content itself, as well as some metadata. The `ObsidianLoader` includes file properties from Obsidian documents, such as tags, dates and aliases, as part of the metadata.
docs[7]
Document(page_content='A method for pre-training [[language model]]s in which the model has access to the first tokens of the sequence and its task is to predict the next token.\n\nThe following examples depict how a single sequence can be turned into multiple training examples:\n\n1. `<START>` → `the`\n1. `<START> the` → `teacher`\n1. `<START> the teacher` → `teaches`\n1. `<START> the teacher teaches` → `the`\n1. `<START> the teacher teaches the` → `student`\n1. `<START> the teacher teaches the student` → `<END>`\n\nModels trained using this method have access to the full sequence of tokens at inference time, making them appropriate for non-generative tasks that revolve around processing a sequence of tokens as a whole, for example:\n\n- [[Sentiment Analysis]]\n- [[Named entity recognition]]\n- [[Word classification]]\n\n[[Bidirectional Encoder Representation from Transformers|BERT]] is an example of a masked language model. Example from [[Bidirectional Encoder Representation from Transformers|BERT]]: Choose 15% of the tokens at random: mask them 80% of the time, replace them with a random token 10% of the time, or keep as is 10% of the time.\n\n## Sources\n\n- [[DeepLearning.AI Natural Language Processing Specialization]]\n- [[Generative AI with Large Language Models]]', metadata={'source': 'Causal language modeling.md', 'path': '/home/dodo/Documents/Obsidian/Causal language modeling.md', 'created': 1700448369.2719378, 'last_modified': 1700448369.2719378, 'last_accessed': 1708267659.2105181, 'tags': 'area/ai/nlp/llm', 'date': '2023-11-19 23:41'})
Obsidian documents have some Markdown flavor of their own, like `[[Graph Neural Network|GNNs]]`, where `Graph Neural Network` is the name of a document and `GNNs` is what appears in the text. In cases like these, we want to keep only the second part.
Obsidian also has plain links, such as `[[grid world]]`, in which case we want to remove the double brackets.
# !pip install google-re2
import re2
docus = []
insane_pattern = r"\[\[([^\]]*?)\|([^\[]*?)\]\]"
# Replace [[Target|Alias]] links with just the alias, then strip any remaining [[ ]] brackets
for doc in docs:
    s = re2.search(insane_pattern, doc.page_content)
    if s is not None:
        new_doc = re2.sub(insane_pattern, r"\2", doc.page_content)
        docus.append(
            (
                doc.page_content,
                new_doc,
            )
        )
        doc.page_content = new_doc
    doc.page_content = doc.page_content.replace("[[", "").replace("]]", "")
sorted(docus, key=lambda x: len(x[1]))[0]
('- [[Intersection over Union|IoU]]', '- IoU')
This step splits the documents loaded in the previous step into smaller chunks.
LangChain provides its own Markdown text splitter, which we are going to use.
from langchain.text_splitter import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size=400, chunk_overlap=50)
splits = splitter.split_documents(docs)
len(splits)
1510
Let’s take a peek at a chunk. They inherit the metadata of their parent document.
splits[542]
Document(page_content='# epsilon-soft policies\n\nAn $\\epsilon$-soft policy is a stochastic policy that always assigns a non-zero $\\frac{\\epsilon}{|A|}$ probability to all actions. These policies always perform some exploration.\n\nThe uniform random policy is an $\\epsilon$-soft policy. The epsilon-greedy policy also is.', metadata={'source': 'epsilon-soft policies.md', 'path': '/home/dodo/Documents/Obsidian/epsilon-soft policies.md', 'created': 1680669506.8282943, 'last_modified': 1680669506.8282943, 'last_accessed': 1708267660.780534, 'tags': 'area/ai/rl project/rl-spec', 'aliases': 'epsilon-soft policy', 'date': '2021-05-24 18:32'})
To quickly search for text chunks, it is useful to precompute an embedding vector for each chunk and store it for future use.
An embedding vector is a numerical vector that represents the text chunk. It allows us to compare chunks in the embedding space. Chunks with similar semantic meaning tend to have similar embedding vectors. This similarity can be computed using e.g. cosine similarity.
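To make that last point concrete, here is a minimal sketch of cosine similarity between embedding vectors using NumPy. The three-dimensional vectors are toy values; real embeddings have hundreds of dimensions:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between the vectors: close to 1 means a very similar direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_cat = np.array([0.9, 0.1, 0.3])     # toy embedding of a chunk about "cats"
emb_kitten = np.array([0.8, 0.2, 0.4])  # toy embedding of a chunk about "kittens"
emb_car = np.array([0.1, 0.9, -0.2])    # toy embedding of a chunk about "cars"

print(cosine_similarity(emb_cat, emb_kitten))  # high: ~0.98
print(cosine_similarity(emb_cat, emb_car))     # low: ~0.14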
My choice for embedding generator was SentenceTransformers, provided by Hugging Face, which runs locally.
# !pip install sentence_transformers
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
embedding = HuggingFaceEmbeddings()
/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
Computed embedding vectors can be stored in vector stores. The one we will use in this project is Chroma. It is free, runs locally and is perfect for our small document base.
# !pip install chromadb
from langchain.vectorstores import Chroma
persist_directory = "docs/chroma/"
!rm -rf ./docs/chroma # remove old database files if any
vectordb = Chroma.from_documents(
documents=splits, embedding=embedding, persist_directory=persist_directory
)
vectordb._collection.count()
1510
Retrieval is the act of retrieving text chunks from our vector store, given an input prompt.
Basic retrieval is performed by comparing the prompt embedding with those of the text chunks. More complex retrieval techniques involve calls to an LLM.
Let’s first test a retrieval technique based on similarity search in the vector store. Given a prompt, the procedure should return the most similar or relevant chunks in the vector database.
The question below will be used as a test for everything else below in the notebook. It is related to reinforcement learning, an area in which I have a few hundred documents written on Obsidian. You can find more about the question here to see if our retrieval methods actually nail the answer.
question = "What is the definition of the action value function?"
The first example of retrieval is similarity search, which will convert the prompt into an embedding vector and compute the cosine similarity between the prompt embedding and the embeddings of all chunks in the vector store, returning the k most similar chunks.
retrieved_docs = vectordb.similarity_search(question, k=8)
for doc in retrieved_docs:
    print(doc.page_content, end="\n\n---\n\n")
The action-value function represents the expected return from a given state after taking a specific action and later following a specific policy.
$$q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_t|S_t=s,A_t=a]$$
where $G_t$ is the Expected sum of future rewards.
---
A value function maps states, or state-action pairs, to expected returns.
- State-value function
- Action-value function
---
The state-value function represents the expected return from a given state, possibly under a given policy.
$$v(s)=\mathbb{E}[G_t|S_t=s]$$
$$v_{\pi}(s)=\mathbb{E}_{\pi}[G_t|S_t=s]$$
where $G_t$ is the Expected sum of future rewards.
---
The same goes for the Action-value function.
$$\begin{align}
q_*(s,a) & = \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma \sum_{a'} \pi_*(a'|s') q_*(s',a')] \\
& = \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma \max_{a'} q_*(s',a')]
\end{align}$$
---
Let's say we have a policy $\pi_1$ that has a value function $v_{\pi_1}$. If we use $v_{\pi_1}$ to evaluate states but, instead of following $\pi_1$, we actually always select the actions that will take us to the future state $s'$ with highest $v_{\pi_1}(s')$, we will end up with a policy $\pi_2$ that is equal to or better than $\pi_1$.
---
$$\begin{align}
v_*(s) & = \sum_a \pi_*(a|s) & \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma v_*(s')] \\
& = \max_a & \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma v_*(s')]
\end{align}$$
where $\pi_*$ is the Optimal policy.
The same goes for the Action-value function.
---
It's a function that dictates the probability the state will find itself in an arbitrary state $s'$ and the agent will receive reward $r$, given the current state the environment finds itself in, $s$, and the action chosen by the agent in $s$, depicted as $a$. It is usually denoted as $p(s',r|s,a)$.
Some properties of this function:
---
Policy evaluation is the task of finding the state-value function $v_{\pi}$, given the policy $\pi$. ^1b9b46
---
Plain similarity search has a drawback. It tends to recover chunks which are very similar or even identical, diminishing the overall amount of information present in the retrieved chunks.
To solve this, LangChain provides a method called maximal marginal relevance search, which works by “[…] finding the examples with the embeddings that have the greatest cosine similarity with the inputs, and then iteratively adding them while penalizing them for closeness to already selected examples.” [source]
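To build some intuition for what that means, here is a rough sketch of the greedy MMR procedure in NumPy. This is my own simplified illustration, not LangChain’s actual implementation:
import numpy as np

def mmr(query_emb, chunk_embs, k=8, lambda_mult=0.5):
    # Greedily pick chunks that are similar to the query but dissimilar
    # to the chunks that have already been selected.
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    selected = []
    candidates = list(range(len(chunk_embs)))
    while candidates and len(selected) < k:
        scores = []
        for i in candidates:
            relevance = cos(query_emb, chunk_embs[i])
            redundancy = max((cos(chunk_embs[i], chunk_embs[j]) for j in selected), default=0.0)
            scores.append(lambda_mult * relevance - (1 - lambda_mult) * redundancy)
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chosen chunks, in selection order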
retrieved_docs = vectordb.max_marginal_relevance_search(question, k=8)
for doc in retrieved_docs:
    print(doc.page_content, end="\n---\n")
The action-value function represents the expected return from a given state after taking a specific action and later following a specific policy.
$$q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_t|S_t=s,A_t=a]$$
where $G_t$ is the Expected sum of future rewards.
---
A value function maps states, or state-action pairs, to expected returns.
- State-value function
- Action-value function
---
A generalization of Sarsa which employs the n-step return for the action value function,
!n-step return#^205a30 ^68659e
This estimate is then used in the following update rule for the action-value of the state-action pair at time $t$.
$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - \gamma^n Q_{t+n-1}(S_t, A_t)]$$ ^ca04db
---
- if the agent exploits without having a good estimate of the action-value function, it will most likely be locked in suboptimal behavior, not being able to gather information from unknown transitions which might bring it more return.
---
Some properties of this function:
It maps states and actions to states and rewards, so its cardinality is $$p:S \times R \times S \times A \to [0;1]$$
It is a probability, so the sum over all possible combinations of states and rewards must be one,
$$\sum_{s' \in S} \sum_{r \in R} p(s',r|s,a) = 1, \forall s \in S, a \in A(s)$$
---
# Factored value functions in cooperative multi-agent reinforcement learning
<iframe width="560" height="315" src="https://www.youtube.com/embed/W_9kcQmaWjo?start=684" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
VDN was the first one and the one I used in my Doctorate.
---
- Exploitation: select the greedy action with relation to the action-value function.
- Exploration: select a non-greedy action.
---
Given the following MDP:
!Pasted image 20210523192818.png
The Bellman equation allows the value function to be expressed and solved as a system of linear equations: ^c06dd9
!Bellman equation for the state-value function#^a65ad4
---
Some retrieval techniques require an underlying language model to be performed. The LLM may be used to, e.g. summarize or make chunks more coherent before returning them.
The LLM I chose is Mistral-7B-OpenOrca, provided by GPT4All.
# !pip install gpt4all
# !pip install lark
# !wget https://gpt4all.io/models/gguf/mistral-7b-openorca.Q4_0.gguf -O models/my_little_llm.gguf
# !wget https://gpt4all.io/models/gguf/nous-hermes-llama2-13b.Q4_0.gguf -O models/my_little_llm.gguf
from langchain_community.llms.gpt4all import GPT4All
llm = GPT4All(model="models/my_little_llm.gguf", device="gpu")
llama.cpp: using Vulkan on NVIDIA GeForce GTX 1070
Self-query is a technique in which an LLM is specifically prompted to output a structured query. It also allows it to take document/chunk metadata into consideration, as long as we describe each attribute in the metadata with a textual description.
Under the hood, self-query performs some pretty convoluted modifications to the original prompt and I advise you look at the documentation to understand what’s going on. [Source]
As we saw when inspecting our splits, our data includes metadata taken from the file properties of the Obsidian documents. We will go ahead and describe them as attributes for the self-query retriever.
from langchain.chains.query_constructor.base import AttributeInfo
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The name of the Markdown file that contained the chunk. If you ignore the .md extension, it is the name of the article the chunk came from.",
        type="string",
    ),
    AttributeInfo(
        name="aliases",
        description="Other names for the article the chunk came from, if any.",
        type="string",
    ),
    AttributeInfo(
        name="tags",
        description="A series of comma-separated tags that categorize the article the chunk came from. When a tag starts with 'area', it denotes a broad area of knowledge. When it starts with 'project', it describes a specific project with beginning and end.",
        type="string",
    ),
    AttributeInfo(
        name="authors",
        description="When the document summarizes a scientific paper, this attribute holds a comma-separated list of author names.",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="When the document summarizes a scientific paper, this attribute contains the year of the publication.",
        type="integer",
    ),
]
document_content_description = "A collection of study notes in Markdown format written by a single author, mostly about artificial intelligence topics."
The self-query retriever can also be configured to use maximal marginal relevance search, as you can see in the `base_retriever` argument below.
from langchain.retrievers.self_query.base import SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectordb,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    base_retriever=vectordb.as_retriever(search_type="mmr", k=8),
)
retriever.invoke(question)
[Document(page_content='!Pasted image 20231129031306.png', metadata={'created': 1708307272.9665868, 'date': '2023-11-29 01:34', 'last_accessed': 1708307272.9699202, 'last_modified': 1708307272.9665868, 'path': '/home/dodo/Documents/Obsidian/Single linkage.md', 'source': 'Single linkage.md', 'tags': 'area/ai/ml/clustering'}),
Document(page_content='!Pasted image 20230317051147.png', metadata={'created': 1680667926.0942817, 'date': '2023-03-17 04:33', 'last_accessed': 1708267663.323892, 'last_modified': 1680667926.0942817, 'path': '/home/dodo/Documents/Obsidian/Comparing feature vectors in NLP.md', 'source': 'Comparing feature vectors in NLP.md', 'tags': 'area/ai/nlp project/nlp-spec'}),
Document(page_content='!Pasted image 20230325081439.png', metadata={'created': 1679742881.9000912, 'last_accessed': 1708267661.4972079, 'last_modified': 1679742881.9000912, 'path': '/home/dodo/Documents/Obsidian/Text cleaning.md', 'source': 'Text cleaning.md'}),
Document(page_content='!_attachments/Pasted image 20210523185724.png', metadata={'created': 1680669713.8148472, 'date': '2023-04-05 01:41', 'last_accessed': 1708267662.710553, 'last_modified': 1680669713.8148472, 'path': '/home/dodo/Documents/Obsidian/Iterative policy evaluation.md', 'source': 'Iterative policy evaluation.md', 'tags': 'area/ai/rl project/rl-spec'})]
As we can see, without more informative metadata (or better preprocessing of the text documents), the retrieved chunks are not very useful. It only retrieved chunks related to figures.
As a final test on retrieval, we will implement a “contextual compression retriever”.
From the LangChain documentation:
The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.
In our case:
Our hope is that the small, irrelevant chunks returned by the self-query retriever will be dropped and more relevant chunks will be summarized and returned.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vectordb.as_retriever(search_type="mmr")
)
compressed_docs = compression_retriever.get_relevant_documents(question)
compressed_docs
/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.
warnings.warn(
/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.
warnings.warn(
/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.
warnings.warn(
/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.
warnings.warn(
[Document(page_content='The action-value function represents the expected return from a given state after taking a specific action and later following a specific policy.', metadata={'created': 1680669936.852964, 'date': '2023-04-05 01:45', 'last_accessed': 1708267660.0105264, 'last_modified': 1680669936.852964, 'path': '/home/dodo/Documents/Obsidian/Action-value function.md', 'source': 'Action-value function.md', 'tags': 'area/ai/rl project/rl-spec'}),
Document(page_content='Action-value function', metadata={'created': 1680669672.8131003, 'date': '2023-04-05 01:41', 'last_accessed': 1708267659.7238567, 'last_modified': 1680669672.8131003, 'path': '/home/dodo/Documents/Obsidian/Value functions.md', 'source': 'Value functions.md', 'tags': 'area/ai/rl project/rl-spec'}),
Document(page_content='*NO_OUTPUT*\n\nThe definition of the action value function is not mentioned in this context.', metadata={'created': 1633628586.5949209, 'date': '2021-03-02 23:01', 'last_accessed': 1708267661.2005382, 'last_modified': 1632030179.7187316, 'path': '/home/dodo/Documents/Obsidian/Factored value functions in cooperative multi-agent reinforcement learning.md', 'source': 'Factored value functions in cooperative multi-agent reinforcement learning.md', 'tags': 'None'}),
Document(page_content='Action-Value Function Definition: Not mentioned in the context.', metadata={'created': 1680669515.4254303, 'date': '2023-04-05 01:38', 'last_accessed': 1708267661.453874, 'last_modified': 1680669515.4254303, 'path': '/home/dodo/Documents/Obsidian/Exploration-exploitation tradeoff.md', 'source': 'Exploration-exploitation tradeoff.md', 'tags': 'area/ai/rl project/rl-spec'})]
These results seem much better than the previous ones, but they are still just a collection of chunks. When interacting with LLMs and chatbots in general, we expect a more direct response.
In this example, we will perform retrieval augmented generation for question-answering in an Obsidian document database.
To summarize what we already have for this step:
This method of Q&A uses the prompt to find relevant chunks in the vector store. These chunks are called the context of the prompt and they are concatenated to the prompt, which is then passed directly to the LLM.
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=vectordb.as_retriever(search_type="mmr")
)
We can see which arguments the chain expects by inspecting the `input_keys` list.
qa_chain.input_keys
['query']
result = qa_chain({"query": question})
/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.
warn_deprecated(
The result of prompting the overall system can be seen below. If you remember the definition of the action-value function [source], our Q&A bot has pretty much nailed it!
result
{'query': 'What is the definition of the action value function?',
'result': ' The action-value function represents the expected return from a given state after taking a specific action and later following a specific policy.'}
Under the hood, the `RetrievalQA` object uses a prompt template into which it inserts the context and the question before sending the full text prompt to the LLM. We can see it by inspecting the object’s graph.
qa_chain.get_graph().nodes
{'7eac904b44594e20852d8f0519ef0c3e': Node(id='7eac904b44594e20852d8f0519ef0c3e', data=<class 'pydantic.v1.main.ChainInput'>),
'8e824ec8c0654d0db3b83b56bd66b619': Node(id='8e824ec8c0654d0db3b83b56bd66b619', data=RetrievalQA(combine_documents_chain=StuffDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['context', 'question'], template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"), llm=GPT4All(model='models/my_little_llm.gguf', device='gpu', client=<gpt4all.gpt4all.GPT4All object at 0x776da5594320>)), document_variable_name='context'), retriever=VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x776cf1b45be0>, search_type='mmr'))),
'4b55a6602f9542fe8d583ac66c6ae722': Node(id='4b55a6602f9542fe8d583ac66c6ae722', data=<class 'pydantic.v1.main.ChainOutput'>)}
The example below shows how to edit the prompt template used by the chain, albeit, in this case, with limited success. This is due to the limited performance of the LLM being used.
from langchain.prompts import PromptTemplate
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. At the end of the response, say \"over and out\".
{context}
Question: {question}
Helpful Answer:"""
qa_chain_prompt = PromptTemplate.from_template(template)
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(search_type="mmr"),
    return_source_documents=True,
    chain_type_kwargs={"prompt": qa_chain_prompt},
)
Let’s ask a few questions to our Q&A bot and render the output as some nice Markdown.
Note that we can also output the documents that were retrieved during RAG and used to compose the answer, but that would pollute the output too much, so I left it commented out.
from IPython.display import Markdown
questions = [
    "Given me the equation for the action value function update.",
    "What is the overall architecture of the Deep Q-Networks?",
    "What is the difference between causal language modelling and masked language modelling?",
    "What is zero-shot learning?",
    "Explain to me the concept of bucketing in RNNs.",
    "What is a named entity in the concept of NLP?",
]

for q in questions:
    result = qa_chain({"query": q})
    display(Markdown(f"**Question: {result['query']}**\n\n Answer: {result['result']}"))
    # source_docs = "\n\n".join(d.page_content for d in result["source_documents"])
    # print(
    #     f"Source documents\n\n{source_docs}"
    # )
Question: Given me the equation for the action value function update.
Answer: The equation for the action-value function update is given by:
\[q_{\pi}(s,a) = R(s,a) + \<dummy32001>{ \gamma V_\pi (s') | s' \in S'}\]where $R(s,a)$ is the reward received when taking action a in state s and $\gamma$ is the discount factor.
Question: What is the overall architecture of the Deep Q-Networks?
Answer: The overall architecture of a Deep Q-Network (DQN) consists of an input layer, multiple hidden layers with nonlinear activation functions, and an output layer. It uses experience replay to store past experiences for training purposes, and employs target networks to stabilize the learning process. over and out
Question: What is the difference between causal language modelling and masked language modelling?
Answer: Causal language modeling refers to a method where the model predicts the next token in a sequence based on the previous tokens. In contrast, masked language modeling involves randomly masking some tokens during training time and then trains the model to reconstruct the original text by predicting the masked tokens.
Question: What is zero-shot learning?
Answer: Zero-shot learning refers to a model’s ability to perform new tasks without being explicitly trained on those specific tasks or examples. In the context of large language models, it means that an AI can execute new tasks without needing any explicit training data for those tasks.
Question: Explain to me the concept of bucketing in RNNs.
Answer: Bucketing in RNNs refers to grouping or organizing input sequences into fixed-sized groups, called “buckets”, before processing them with an RNN model. This technique helps improve training efficiency and reduce padding by ensuring that each bucket contains a sufficient amount of randomness and variability while preventing it from being too large so as not to introduce excessive padding.
Question: What is a named entity in the concept of NLP?
Answer: In the context of Natural Language Processing (NLP), a named entity refers to a real-world object that can be denoted with a proper name. Examples are a person, location, organization, product. It can be abstract or have a physical existence.
In some answers, the model has actually followed the instructions from the new prompt, but we need a much more powerful LLM, or the employment of techniques such as few-shot learning, to get better instruction-following results.
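As a rough illustration of the few-shot idea (this is my own sketch, not something I evaluated in this post), one could prepend a worked example to the prompt template so the model sees the expected answer format before the real context and question:
# A few-shot variant of the prompt: one worked example precedes the real context/question
few_shot_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. At the end of the response, say "over and out".

Example:
Context: The state-value function maps a state to the expected return obtained from that state.
Question: What does the state-value function represent?
Helpful Answer: It represents the expected return from a given state. over and out

{context}

Question: {question}
Helpful Answer:"""

few_shot_prompt = PromptTemplate.from_template(few_shot_template)
qa_chain_few_shot = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(search_type="mmr"),
    chain_type_kwargs={"prompt": few_shot_prompt},
)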
This notebook presented a proof-of-concept on how to create a question-answering bot powered by an LLM and with knowledge extracted from actual documents, more specifically, a collection of notes from Obsidian.
We were able to run all experiments locally, using a 7th gen i7, an NVIDIA GTX 1070 and 32 GB of RAM.
We were also introduced to many concepts on the road to building this Q&A bot, such as:
In future work, let’s build an actual chatbot that remembers previous answers and can keep up a lengthier conversation.
While I have used LaTeX Workshop for a long time, I started experiencing a few issues with it that, over time, got really annoying. The main issue is simply the Enter key not working, which was then replaced by a substantial delay between pressing the Enter key and actually inserting a new line character. Fixing one apparently introduced the other, so, after many tries, I just gave up on LaTeX Workshop.
The TeXLab VS Code extension also provides other useful features, as long as the appropriate programs are installed and in your Linux PATH
(or whatever the equivalent in Mac/Windows is). These assorted programs are usually included in TeX Live, but can be installed through your distribution’s package manager, if you are on Linux.
- Code formatting (`latexindent`)
- Building (`latexmk`)
- Linting (`chktex`)
Two plus two is four, minus one, that’s three. Quick maths.
When I started my PhD, I didn’t know at first if my research would take me in a path whose foundations I was comfortable with, such as algorithms, logic and other discrete areas, or if fate would take me towards the more mathematical path. As I started to get really interested in the applications of neural networks for reinforcement learning, I realized little by little that I would not only need to remember all the mathematics I had learned before in my life, but also learn and get comfortable with some new stuff.
In this article, I go over:
I recommend the free algebra book by James Brennan and the free precalculus book by Stitz and Zeager, which I found out about in the very helpful /r/learnmath subreddit. My recommendation would be to evaluate which topics you remember and just skim over them, while giving yourself more time on the stuff you don’t remember or have never learned.
Personally, I made a lot of progress by just following both books until I felt I was comfortable with my knowledge. I tried not to skip anything, but I also felt like I could work fast as I had already learned most of this stuff before.
To study calculus, I used the books by Stewart, 8th edition. I knew I was prepared to start calculus when I had studied enough algebra to fulfill the prerequisites at the start of the book. I also worked through the two review lists available on Stewart’s website [1] [2]. Those lists helped me prioritize what I needed to focus on in my aforementioned precalculus quest.
I decided to start studying linear algebra alongside calculus. I had an easier time with linear algebra, as I already had some familiarity with matrix operations, not only because I had been working with neural networks for quite some time, but also because I had implemented my own matrix class in C++ at one point.
I started with a Brazilian undergrad textbook from 1986, but also searched for many concepts on YouTube, since I did not get the intuition behind some of them. I ended up ditching the book after getting acquainted with Larson’s book, which presented the material in a much more contemporary fashion.
I haven’t gotten here yet, but after learning all the basic stuff about matrices, vector spaces, linear transforms and eigenthingies, I was recommended the book Linear Algebra Done Right, by Sheldon Axler, to move on to more advanced stuff.
`numpy.linalg`. You can load `.mat` files into Python variables using SciPy’s `loadmat` function.
There came a point when I started having the kind of questions that I couldn’t quite google my way to an answer for. When I started looking for a place to post my questions, I found the extremely active and welcoming /r/learnmath subreddit. The people there are really nice and welcoming of noobs, and they are a lot less strict than other sites, such as the Math Stack Exchange.
Another place I suggest people look is Discord servers, where people gather to talk about homework. An example is Homework Help. There are also study servers on Discord in which people can keep you company in muted video calls. I highly recommend the Study Lions, but there are others.
Overall, my advice with regards to communities is to both seek and provide help. Seeing people ask questions to which I knew the answer to made me realize how much I had actually learned in my journey, which really motivated me. Also, I could help a random stranger online, which is nice.
Humble time. Since I didn’t know where to start, I decided to enroll in online university courses and let the lectures guide me. I had had a great experience with Coursera and the machine learning course by Andrew Ng, so I thought I’d have an equally positive experience with Coursera’s math courses1.
However, unlike the previous courses I took, I couldn’t watch the lectures or complete the assignments before or after the correct dates, which really frustrated me.
Another problem I faced was that I just couldn’t follow the professors in some of the lectures. Sometimes, the exercises after a lesson seemed to have nothing to do with the lesson itself. Since I had no other material to follow, I felt stuck in some lessons and couldn’t meet the deadlines.
Don’t force it: Studying mathematics ended up becoming a hobby of mine, one that, once I had started, I could spend an entire day doing. There were days when I would eat up over 20 pages of a book, a pretty decent number if you ask me, since at that rate I’d be able to finish a 400-page book in 20 days (not that I did that). My main problem was, and still is, consistency. I can’t really do the whole “study every day for 30 minutes” thing. I usually studied for the whole day, twice a week.
I’ll admit I did not study any of the books from cover to cover and I am still working through them, but I never felt hindered or delayed by the method I chose to study mathematics. My only limitations are time, since I have other responsibilities, and motivation.
Find the format that is best for you: since I did very well in online AI courses [1] [2], I thought the same would be true for online math courses. However, when it came to mathematics, I felt like books were a much better medium, as they present the material in both a rigorous and linear fashion, delving into the theory and definitions and following up with lots of examples and exercises that can mostly be completed with the material presented right before. Textbooks are also good when I just want to review some old algebra, in which case I am able to quickly skip to whatever concept I am interested in, while videos would be much harder to sift through.
YouTube is your friend: I tend to forget basic stuff very often, such as how to complete the square, trigonometric identities (there are lots of them) or the binomial theorem. In these cases, looking up a YouTube video is much faster than searching for what you need in books. Math videos usually go directly to the point and you don’t waste any time.
This video by the Math Sorcerer lists books from every area of mathematics, starting from basic logic and algebra and going all the way up to the crazy stuff.
This video by Aleph 0 has some book recommendations for undergraduate level courses in pure mathematics, with accompanying video lectures on YouTube.
a `.bib` file which is always kept updated. Zotero is a great reference manager that has incorporated lots of much-needed features over the years. Two things that I feel Zotero is still missing are better management of PDF attachments and a single `.bib` file which can be referenced in all my projects. Here, I teach you how to circumvent these problems with two addons.
Zotfile is an addon that helps in managing PDF files inside Zotero. It allows for automatic and batch moving and renaming of file attachments.
In order to sync all your PDF files among multiple computers, you can point Zotfile’s “PDF directory” setting into a folder inside your cloud sync service of choice (Mega, Dropbox, Google Drive, OneDrive etc.) and configure the addon to automatically move all new attachments to this folder. Zotero will then only keep links to these files in its database and will synchronize only these links, while the files themselves will be synchronized by your cloud sync application.
JabRef introduced me to the (until then unknown to me) habit of keeping a single giant `.bib` file, which I would just link all my papers to, using the autocomplete function of my TeX editor of choice to search for entry keys.
To emulate this behavior in Zotero, there is an addon called Better Bibtex. It allows me to export my entire library into a `.bib` file, which is then kept updated as I change information in my Zotero library. I actually keep two `.bib` files, one in BibTeX format for all my papers, and the other in BibLaTeX format, for my thesis.
The excerpt below is a comment by Andrej Karpathy, from Tesla, at this link.
If it makes you feel any better, I’ve been doing this for a while and it took me last ~6 weeks to get a from-scratch policy gradients implementation to work 50% of the time on a bunch of RL problems. And I also have a GPU cluster available to me, and a number of friends I get lunch with every day who’ve been in the area for the last few years. Also, what we know about good CNN design from supervised learning land doesn’t seem to apply to reinforcement learning land, because you’re mostly bottlenecked by credit assignment / supervision bitrate, not by a lack of a powerful representation. Your ResNets, batchnorms, or very deep networks have no power here. SL wants to work. Even if you screw something up you’ll usually get something non-random back. RL must be forced to work. If you screw something up or don’t tune something well enough you’re exceedingly likely to get a policy that is even worse than random. And even if it’s all well tuned you’ll get a bad policy 30% of the time, just because. Long story short your failure is more due to the difficulty of deep RL, and much less due to the difficulty of “designing neural networks”.
Source: http://amid.fish/reproducing-deep-rl
When working with methods programmed on a computer, we frequently run into problems whose solution is not obvious. When the method requires no training, it is faster and easier to come up with several ideas for what might solve the problem and test them all in seconds than to actually think about the root cause. In the case of an RL algorithm, each new test takes one or more days, so it is necessary to think about the problem in order to select the most likely solution, or to find a better one, even if it is not obvious.
When reproducing someone else’s paper:
Some tips the author tried to hammer into readers’ heads:
Another interesting point from the account: the author thought it would take 3 months to finish reproducing the paper, but it took 8 months. Most of the time is spent getting the algorithm to work on a simple example. After that, the delays come from instrumenting tests. Below are the times the author took to:
Source: https://www.alexirpan.com/2018/02/14/rl-hard.html
This article lists several reasons why DRL methods do not work.
Some characteristics of problems for which applying DRL can be productive:
If the reward is shaped (reward shaping), try to make it rich: in the environment I worked with during my doctorate (SMAC), the agents received:
The author also points out that the shorter the interval between an action and the reward associated with that action, the easier it is for the agent to learn.
This notebook continues a series in which the Iris dataset and the procedure for training and evaluating neural networks were introduced [1], as well as PCA [2].
In the first notebook, two things were mentioned that will be tested here.
In this notebook, a considerably smaller network will be trained to classify the Iris dataset. Not only that, it will use only the first principal component returned by PCA, working with both fewer weights and less data.
In addition, we will train the original neural network, used in the previous notebooks, with L2 regularization and observe the absence of overfitting when the network is trained for long periods of time.
The functions below apply PCA to a dataset and evaluate a neural network, returning the metrics to be displayed in plots later.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import tensorflow.keras as keras
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
def apply_PCA(X, n_components):
    # Project the dataset onto its first n_components principal components
    pca = PCA(n_components=n_components)
    new_X = pca.fit_transform(X)
    print('Dataset size before: ', X.shape)
    print('Dataset size after: ', new_X.shape)
    print('Percentage of the variance explained by each column:', pca.explained_variance_ratio_)
    print('Cumulative variance in the remaining columns:', sum(pca.explained_variance_ratio_))
    return new_X
def evaluate_model(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=123)
    y_train_onehot = keras.utils.to_categorical(y_train, num_classes=3)
    y_test_onehot = keras.utils.to_categorical(y_test, num_classes=3)

    # Save the initial weights so the same model can be retrained from scratch later
    model.save_weights('weights.hdf5')

    # Long training run (2500 epochs) with a validation split
    history = model.fit(X_train, y_train_onehot, validation_split=.1, epochs=2500, verbose=0)
    y_pred = model.predict(X_test)
    y_pred_onehot = y_pred.argmax(axis=1)
    print(classification_report(y_test, y_pred_onehot))

    v1 = history.history['categorical_accuracy']
    v2 = history.history['loss']
    v3 = history.history['val_categorical_accuracy']
    v4 = history.history['val_loss']

    # Restore the initial weights and train again, this time for only 150 epochs
    model.load_weights('weights.hdf5')
    history = model.fit(X_train, y_train_onehot, epochs=150, verbose=0)

    v5 = history.history['categorical_accuracy']
    v6 = history.history['loss']

    y_pred = model.predict(X_test)
    y_pred_onehot = y_pred.argmax(axis=1)
    print(classification_report(y_test, y_pred_onehot))

    return v1, v2, v3, v4, v5, v6
We will use the scikit-learn package to load the Iris dataset and generate a version of it with the smallest possible number of principal components (1).
iris_X, iris_y = load_iris(return_X_y=True)
iris_X_tiny = apply_PCA(iris_X, 1)
Tamanho da base de dados antes: (150, 4)
Tamanho da base de dados depois: (150, 1)
Porcentagem da variância explicada por cada coluna: [0.92461872]
Variância acumulada nas colunas remanescentes: 0.9246187232017271
In loose mathematical terms, overfitting occurs when the weights of a neural network take on very large magnitudes (positive or negative), allowing the functions these networks model to become overly complex for the loss function they are trying to minimize.
In the image below, both the blue and the green functions fit the red data points, but the blue function has taken an unnecessarily complex shape. If a new red point were added, the green function would likely be closer to it than the blue one, indicating that the green function generalizes better over the observed data, despite being simpler.
The difference between the two functions is that the green one was regularized. In its simplest form, regularization adds a term computed from the network's weights to the loss function the network tries to minimize. If $L(\theta)$ is the loss function we have used so far, $R(\theta)$ is the new regularization term we introduce.
\[J(\theta)=L(\theta) + \lambda R(\theta)\]
A commonly used regularization is L2, in which we add the squares of the network's weights to the loss function. This heavily penalizes very large weights while barely affecting small ones. The coefficient $\lambda$ controls how much the regularization term contributes to the total loss; it will be used when we declare the neural network in Keras.
\[J(\theta)=L(\theta) + \lambda \sum_{w \in \theta} w^2\]
With regularization, the network must reach the lowest possible loss with the smallest possible weights, avoiding overfitting.
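As a side note (not part of the original notebook): in the model-building code below, the regularizers are passed as the string 'l2', which uses Keras's default regularization factor. A minimal sketch of how an explicit $\lambda$ could be passed instead:
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

# Equivalent to kernel_regularizer='l2', but with the coefficient spelled out;
# 0.01 is Keras's default factor and plays the role of lambda in the formula above.
Dense(10, activation='tanh',
      kernel_regularizer=regularizers.l2(0.01),
      bias_regularizer=regularizers.l2(0.01),
      input_dim=4)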
The functions below create the neural networks.
def create_large_model(input_dim):
model = Sequential()
model.add(Dense(10, activation='tanh', input_dim=input_dim))
model.add(Dense(8,activation='tanh'))
model.add(Dense(6,activation='tanh'))
model.add(Dense(3,activation='softmax'))
model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy'])
return model
def create_regularized_model(input_dim):
model = Sequential()
model.add(Dense(10, activation='tanh', kernel_regularizer='l2', bias_regularizer='l2', input_dim=input_dim))
model.add(Dense(8,activation='tanh', kernel_regularizer='l2', bias_regularizer='l2'))
model.add(Dense(6,activation='tanh', kernel_regularizer='l2', bias_regularizer='l2'))
model.add(Dense(3,activation='softmax', kernel_regularizer='l2', bias_regularizer='l2'))
model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy'])
return model
def create_tiny_model(input_dim):
model = Sequential()
model.add(Dense(3, activation='tanh', input_dim=input_dim))
model.add(Dense(2,activation='tanh'))
model.add(Dense(3,activation='softmax'))
model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy'])
return model
large_net = create_large_model(iris_X.shape[1])
reg_net = create_regularized_model(iris_X.shape[1])
tiny_net = create_tiny_model(iris_X_tiny.shape[1])
large_net.summary()
tiny_net.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 10) 50
_________________________________________________________________
dense_1 (Dense) (None, 8) 88
_________________________________________________________________
dense_2 (Dense) (None, 6) 54
_________________________________________________________________
dense_3 (Dense) (None, 3) 21
=================================================================
Total params: 213
Trainable params: 213
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_8 (Dense) (None, 3) 6
_________________________________________________________________
dense_9 (Dense) (None, 2) 8
_________________________________________________________________
dense_10 (Dense) (None, 3) 9
=================================================================
Total params: 23
Trainable params: 23
Non-trainable params: 0
_________________________________________________________________
Let's use our evaluate_model() function to evaluate the networks. The larger networks will be trained on the original dataset and the small network on the first principal component of that dataset.
acc1_large, l1_large, acc_v_large, lv_large, acc2_large, l2_large = evaluate_model(large_net, iris_X, iris_y)
acc1_reg, l1_reg, acc_v_reg, lv_reg, acc2_reg, l2_reg = evaluate_model(reg_net, iris_X, iris_y)
acc1_tiny, l1_tiny, acc_v_tiny, lv_tiny, acc2_tiny, l2_tiny = evaluate_model(tiny_net, iris_X_tiny, iris_y)
precision recall f1-score support
0 1.00 1.00 1.00 13
1 0.75 1.00 0.86 6
2 1.00 0.82 0.90 11
accuracy 0.93 30
macro avg 0.92 0.94 0.92 30
weighted avg 0.95 0.93 0.93 30
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 0.83 0.91 6
2 0.92 1.00 0.96 11
accuracy 0.97 30
macro avg 0.97 0.94 0.96 30
weighted avg 0.97 0.97 0.97 30
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 0.83 0.91 6
2 0.92 1.00 0.96 11
accuracy 0.97 30
macro avg 0.97 0.94 0.96 30
weighted avg 0.97 0.97 0.97 30
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 0.83 0.91 6
2 0.92 1.00 0.96 11
accuracy 0.97 30
macro avg 0.97 0.94 0.96 30
weighted avg 0.97 0.97 0.97 30
precision recall f1-score support
0 1.00 1.00 1.00 13
1 0.67 1.00 0.80 6
2 1.00 0.73 0.84 11
accuracy 0.90 30
macro avg 0.89 0.91 0.88 30
weighted avg 0.93 0.90 0.90 30
precision recall f1-score support
0 1.00 1.00 1.00 13
1 0.83 0.83 0.83 6
2 0.91 0.91 0.91 11
accuracy 0.93 30
macro avg 0.91 0.91 0.91 30
weighted avg 0.93 0.93 0.93 30
The values shown are for each network trained on the training set for 2500 epochs and then for 150 epochs. The networks that suffer from overfitting are expected to perform worse in the first set of printed values.
The values may vary from run to run, but the precision, recall, and F1-score values show that the performance of the networks is comparable and, often, identical.
Let's plot the 6 metrics collected from the three models.
fig, axes = plt.subplots(3,2, False,figsize=(20, 15), squeeze=True)
axes[0][0].plot(acc1_large)
axes[0][0].plot(acc1_reg)
axes[0][0].plot(acc1_tiny)
axes[0][0].set_title('Acurácia Treino')
axes[0][1].plot(l1_large)
axes[0][1].plot(l1_reg)
axes[0][1].plot(l1_tiny)
axes[0][1].set_title('Erro Treino')
axes[1][0].plot(acc_v_large)
axes[1][0].plot(acc_v_reg)
axes[1][0].plot(acc_v_tiny)
axes[1][0].set_title('Acurácia Val.')
axes[1][1].plot(lv_large)
axes[1][1].plot(lv_reg)
axes[1][1].plot(lv_tiny)
axes[1][1].set_title('Erro Val.')
axes[2][0].plot(acc2_large)
axes[2][0].plot(acc2_reg)
axes[2][0].plot(acc2_tiny)
axes[2][0].set_title('Acurácia total')
axes[2][1].plot(l2_large)
axes[2][1].plot(l2_reg)
axes[2][1].plot(l2_tiny)
axes[2][1].set_title('Erro total')
legenda = ['Grande', 'Regularizada', 'Pequena']
xlabel = 'Épocas'
axes[0][0].legend(legenda)
axes[1][0].legend(legenda)
axes[2][0].legend(legenda)
axes[0][1].legend(legenda)
axes[1][1].legend(legenda)
axes[2][1].legend(legenda)
axes[0][0].set_xlabel(xlabel)
axes[1][0].set_xlabel(xlabel)
axes[2][0].set_xlabel(xlabel)
axes[0][1].set_xlabel(xlabel)
axes[1][1].set_xlabel(xlabel)
axes[2][1].set_xlabel(xlabel)
plt.show()
The networks were trained for 2500 epochs and evaluated on the validation set to check for overfitting (first 4 plots). They were then trained again for 150 epochs on the entire training data (last 2 plots).
The training-loss plot (top right) shows that the small network takes more epochs to learn; its loss decreases slowly. The regularized network cannot drive its loss as low as the unregularized networks because of the constraint on its weights. This may look bad, but we will see a positive effect of the regularization next.
The validation-loss plot (middle right) shows that the loss of the large, unregularized network starts to grow roughly linearly after about 200 epochs; an overly long training run only hurts this network. The small network, although less affected by overfitting, still shows an increasing loss after many epochs of training, but the increase is slower than in the large network. Finally, the large regularized network shows no increase in validation loss, no matter how many epochs it is trained for.
This is a valuable property in the real world, where we do not know the exact topology of the neural network that will solve the problem at hand, nor for how many epochs the model needs to be trained.
This notebook showed that a neural network considerably smaller than the one used so far can classify the flowers in the Iris dataset, using 1/4 of the input values, 3/4 of the number of layers, and roughly 1/10 of the trainable weights.
This notebook illustrates the dimensionality reduction of a dataset and the subsequent classification of its items using a neural network. The dimensionality reduction is performed with principal component analysis, available in the scikit-learn package, and the neural network is a multilayer perceptron built with Keras.
This notebook is a continuation of another one in which the dataset used (Iris) and the procedure for training and evaluating the neural network were already described. The previous notebook is available here.
We will use scikit-learn and Keras. We will also create a few functions to simplify the training and evaluation of a neural network, both already described in the previous notebook.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import tensorflow.keras as keras
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
def create_model(input_dim):
model = Sequential()
model.add(Dense(10, activation='tanh', input_dim=input_dim))
model.add(Dense(8,activation='tanh'))
model.add(Dense(6,activation='tanh'))
model.add(Dense(3,activation='softmax'))
model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy'])
return model
def plot_metrics(history):
fig, axes = plt.subplots(2,1, True,figsize=(8,12))
# "Accuracy"
axes[0].plot(history.history['categorical_accuracy'])
axes[0].set_title('Acurácia')
# "Loss"
axes[1].plot(history.history['loss'])
axes[1].set_title('Erro')
if 'val_loss' in history.history.keys():
axes[0].plot(history.history['val_categorical_accuracy'])
axes[1].plot(history.history['val_loss'])
axes[0].legend(['Treino', 'Validação'])
axes[1].legend(['Treino', 'Validação'])
plt.xlabel('Épocas')
plt.show()
def do_everything(X, y):
    # Hold out 20% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=123)
    # One-hot encode the labels of the 3 classes
    y_train_onehot = keras.utils.to_categorical(y_train, num_classes=3)
    y_test_onehot = keras.utils.to_categorical(y_test, num_classes=3)
    # Train a first model for 500 epochs with a validation split to expose overfitting
    model = create_model(X.shape[1])
    print(model.summary())
    history = model.fit(X_train, y_train_onehot, validation_split=.1, epochs=500, verbose=0)
    plot_metrics(history)
    # Train a fresh model for only 150 epochs on the full training set
    model = create_model(X.shape[1])
    history = model.fit(X_train, y_train_onehot, epochs=150, verbose=0)
    plot_metrics(history)
    # Evaluate on the held-out test set
    y_pred = model.predict(X_test)
    y_pred_onehot = y_pred.argmax(axis=1)
    print(classification_report(y_test, y_pred_onehot))
    print(confusion_matrix(y_test, y_pred_onehot))
Principal component analysis (PCA) is a mathematical technique that detects correlations among the columns of a matrix $X$ and applies a linear transformation to $X$ so as to reduce those correlations.
Broadly speaking, highly correlated data are not very informative, including for the training of a classifier. Consider a hypothetical dataset of cities containing the following information for each city:
Notice that features 1 and 2 are redundant, since they store the same information on different scales, and the same goes for features 3 and 4. Moreover, a city's temperature is highly correlated with its distance from the Equator. A technique such as PCA would find the correlations in this 4-column matrix and transform the columns so as to reduce (or eliminate) them.
This transformation can have the side effect of concentrating the information spread across the columns of the original matrix into fewer columns of the transformed matrix. As a consequence, the least informative columns could be removed, reducing the complexity of the data while retaining the information needed to train a classifier, for example.
The dimensionality reduction provided by PCA is also useful for visualizing multidimensional data.
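To make the decorrelation concrete, here is a small illustrative sketch, not part of the original notebook, using two redundant columns analogous to the city example above (temperature stored in two different scales):
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical city data: temperature in Celsius and a noisy copy in Fahrenheit
rng = np.random.default_rng(0)
celsius = rng.uniform(0, 40, size=100)
fahrenheit = celsius * 9 / 5 + 32 + rng.normal(0, 0.5, size=100)
X = np.column_stack([celsius, fahrenheit])

print(np.corrcoef(X, rowvar=False)[0, 1])      # close to 1: the columns are redundant
X_new = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(X_new, rowvar=False)[0, 1])  # close to 0: the new columns are decorrelated
print(PCA(n_components=2).fit(X).explained_variance_ratio_)  # almost all variance in the first column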
We will use the scikit-learn package to load the Iris dataset. Note that it contains 150 flowers and 4 features per flower. In the previous notebook, we used all 4 features to train our neural network. In this notebook, we will try to perform the same classification with fewer features.
iris_X, iris_y = load_iris(return_X_y=True)
print('Qtd. itens no conj. de treinamento:', len(iris_X))
print('Classes:', iris_y)
print('\nExemplos das medidas de 5 flores:')
print(iris_X[:5])
Qtd. itens no conj. de treinamento: 150
Classes: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Exemplos das medidas de 5 flores:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
PCA is implemented in scikit-learn and can be applied as in the example below. Let's create a function that applies PCA to a dataset, prints information about the transformation, and returns the transformed data.
def apply_PCA(X, n_components):
    # Project the dataset onto its first n_components principal components
    pca = PCA(n_components=n_components)
    new_X = pca.fit_transform(X)
    print('Tamanho da base de dados antes: ', X.shape)
    print('Tamanho da base de dados depois: ', new_X.shape)
    print('Porcentagem da variância explicada por cada coluna:', pca.explained_variance_ratio_)
    print('Variância acumulada nas colunas remanescentes:', sum(pca.explained_variance_ratio_))
    return new_X
Let's apply PCA to the measurements of our flowers and observe how the correlation among the columns is removed and the information becomes concentrated in fewer columns.
iris_X_4cp = apply_PCA(iris_X, 4)
Tamanho da base de dados antes: (150, 4)
Tamanho da base de dados depois: (150, 4)
Porcentagem da variância explicada por cada coluna: [0.92461872 0.05306648 0.01710261 0.00521218]
Variância acumulada nas colunas remanescentes: 1.0
Notice that our dataset still has the same number of features (4). However, 92% of the information in the original columns (measured by the variance) is now concentrated in the first column. This is a strong indication that we can classify our data using fewer features.
PCA orders the columns of the new dataset by decreasing amount of information, so to make sure we are using the most informative columns of our dataset, we only need to ignore the last columns of the matrix, as the quick check below illustrates.
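A small sanity check of that claim, not in the original notebook, relying on scikit-learn's deterministic SVD:
import numpy as np

# The 3-component projection should coincide with the first 3 columns of the 4-component one
full = PCA(n_components=4).fit_transform(iris_X)
reduced = PCA(n_components=3).fit_transform(iris_X)
print(np.allclose(reduced, full[:, :3]))  # expected: True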
Now let's apply PCA and remove the least informative column. This has the same effect as taking the first 3 columns of the matrix generated above, but we will apply the method again anyway.
Note how the remaining columns still retain 99.4% of the information in the original dataset.
iris_X_3cp = apply_PCA(iris_X, 3)
Tamanho da base de dados antes: (150, 4)
Tamanho da base de dados depois: (150, 3)
Porcentagem da variância explicada por cada coluna: [0.92461872 0.05306648 0.01710261]
Variância acumulada nas colunas remanescentes: 0.9947878161267247
Let's use our neural network training and evaluation functions to train a model on the reduced data.
do_everything(iris_X_3cp, iris_y)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 10) 40
_________________________________________________________________
dense_1 (Dense) (None, 8) 88
_________________________________________________________________
dense_2 (Dense) (None, 6) 54
_________________________________________________________________
dense_3 (Dense) (None, 3) 21
=================================================================
Total params: 203
Trainable params: 203
Non-trainable params: 0
_________________________________________________________________
None
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
[[13 0 0]
[ 0 6 0]
[ 0 0 11]]
We can see that the neural network is trained successfully using only the 3 most informative features produced by PCA. The network still reaches high precision, recall, and F1-score values. It is also noticeable that, just as in the previous notebook, overfitting still happens if we train the network for too many epochs; PCA does not solve that.
Let's reduce our dataset even further, keeping only 2 columns.
iris_X_2cp = apply_PCA(iris_X, 2)
Tamanho da base de dados antes: (150, 4)
Tamanho da base de dados depois: (150, 2)
Porcentagem da variância explicada por cada coluna: [0.92461872 0.05306648]
Variância acumulada nas colunas remanescentes: 0.977685206318795
Even keeping only half of the original columns, we still retain 97.7% of the information about our flowers. Let's train our classifier on this new dataset.
do_everything(iris_X_2cp, iris_y)
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_8 (Dense) (None, 10) 30
_________________________________________________________________
dense_9 (Dense) (None, 8) 88
_________________________________________________________________
dense_10 (Dense) (None, 6) 54
_________________________________________________________________
dense_11 (Dense) (None, 3) 21
=================================================================
Total params: 193
Trainable params: 193
Non-trainable params: 0
_________________________________________________________________
None
precision recall f1-score support
0 1.00 1.00 1.00 13
1 0.83 0.83 0.83 6
2 0.91 0.91 0.91 11
accuracy 0.93 30
macro avg 0.91 0.91 0.91 30
weighted avg 0.93 0.93 0.93 30
[[13 0 0]
[ 0 5 1]
[ 0 1 10]]
Once again, the neural network still classifies the data successfully, but it may start to show some loss of performance, reflected in lower precision, recall, and F1-score values or in a confusion matrix with values off the main diagonal.
Another very interesting detail is the reduction in the number of trainable weights of the network, shown in the model summary at the beginning of the previous cell (Trainable params). The network trained to classify the original data (with 4 features) has to learn more weights (213) than a network that classifies data with fewer features (2 features, 193 weights). This means that, by using PCA, a smaller network that processes less data can potentially perform as well as a larger network that processes more data.
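As a quick check of where this difference comes from: a Dense layer with $n$ inputs and $u$ units has $(n+1)\,u$ parameters (weights plus biases), and only the first layer depends on the number of input features:
\[(4+1)\times 10 = 50 \quad \text{versus} \quad (2+1)\times 10 = 30,\]
a difference of 20 parameters, matching the totals of 213 and 193 reported above.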
Let's now train the network to classify our flowers using only 1 feature. The dataset keeps 92.4% of the information, the neural network has 10 fewer weights, and it still reaches a satisfactory performance.
iris_X_1cp = apply_PCA(iris_X, 1)
do_everything(iris_X_1cp, iris_y)
Tamanho da base de dados antes: (150, 4)
Tamanho da base de dados depois: (150, 1)
Porcentagem da variância explicada por cada coluna: [0.92461872]
Variância acumulada nas colunas remanescentes: 0.9246187232017271
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_16 (Dense) (None, 10) 20
_________________________________________________________________
dense_17 (Dense) (None, 8) 88
_________________________________________________________________
dense_18 (Dense) (None, 6) 54
_________________________________________________________________
dense_19 (Dense) (None, 3) 21
=================================================================
Total params: 183
Trainable params: 183
Non-trainable params: 0
_________________________________________________________________
None
precision recall f1-score support
0 1.00 1.00 1.00 13
1 0.67 1.00 0.80 6
2 1.00 0.73 0.84 11
accuracy 0.90 30
macro avg 0.89 0.91 0.88 30
weighted avg 0.93 0.90 0.90 30
[[13 0 0]
[ 0 6 0]
[ 0 3 8]]
This notebook explained PCA, a technique that transforms a dataset, represented as a matrix, into a new matrix whose columns are less correlated and whose information is concentrated in a smaller number of columns.
PCA was demonstrated on the Iris dataset, and neural networks were trained to classify the flowers using fewer and fewer features.
This notebook demonstrates the training of a multilayer perceptron to classify the Iris dataset. It uses the scikit-learn package to load the dataset and split it into training and test sets, the keras package to build and train the neural network, and matplotlib to generate plots.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
The Iris dataset contains 4 measurements of 150 individual flowers from 3 distinct species (50 flowers of each species). It was created in 1936 by Ronald Fisher [link].
The measurements are the length and width of each flower's petals and sepals, and the species are iris setosa, iris versicolor, and iris virginica, shown in the images below.
Just for a moment, we will use the seaborn package to display the flowers' features and their species in a plot. The points are individual flowers, the colors indicate the species each flower belongs to, and the rows/columns show pairs of features plotted against each other.
import seaborn as sns
iris = sns.load_dataset("iris")
g = sns.pairplot(iris, hue="species")
For the classification task, the neural network must be able to tell the 3 flower species apart, given their measurements. It is easy to see visually that this is an easy task.
For example, every flower of the species iris setosa (in blue) has a petal length and petal width (petal_length and petal_width) smaller than any flower of any other species. A neural network that learned that every iris setosa flower has petal_length < 2 or petal_width < 0.8 could classify this class with 100% precision.
Everything a neural network does during learning is to discover separation points between the classes, like the one exemplified above, using features (or combinations of features) present in the dataset; the small check below makes this concrete.
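As an illustration only, not part of the original notebook, the hand-written rule above can be checked directly against the data; a neural network has to discover this kind of boundary on its own:
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# Column 2 holds the petal length (cm); the rule "petal_length < 2" should
# single out exactly the iris setosa flowers (class 0).
is_setosa_rule = X[:, 2] < 2.0
print((is_setosa_rule == (y == 0)).all())  # expected: True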
We will use the scikit-learn package to load the dataset and inspect its contents. The flower measurements and their respective classes are loaded into separate variables.
Note that the dataset contains 150 flowers, that the classes are represented by the integers 0, 1, and 2, and that the flowers' features are stored in a single matrix with 150 rows (the flowers) and 4 columns (the measurements).
iris_X, iris_y = load_iris(return_X_y=True)
print('Qtd. itens no conj. de treinamento:', len(iris_X))
print('Classes:', iris_y)
print('\nExemplos de medida de 5 flores:')
print(iris_X[:5])
Qtd. itens no conj. de treinamento: 150
Classes: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Exemplos de medida de 5 flores:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
Let's split our dataset into training and test sets. The training set will be used to train the model, and the test set will be used to evaluate the model after training.
The neural network's performance on the training set indicates the model's learning capacity, while its performance on the test set indicates the network's ability to generalize, that is, to correctly classify new (never seen) data.
X_train, X_test, y_train, y_test = train_test_split(iris_X,
iris_y,
test_size = 0.2,
random_state=123)
print('Qtd. itens no conj. de treinamento:', len(y_train))
print('Qtd. itens no conj. de teste:', len(y_test))
Qtd. itens no conj. de treinamento: 120
Qtd. itens no conj. de teste: 30
The way the classes are currently represented (0, 1, and 2) is not the best one for the network to learn the classification task. If the network has a single output (a number between 0 and 2), then while learning to output 2 it may gradually "unlearn" how to output 0, since these are opposite class values.
One way to mitigate this problem is to make the learning of each class independent of the learning of all the others. If the network has 3 outputs and learns to maximize the value of each output independently of the others, we can pick the largest of the three values as the predicted class.
To do that, we will change the representation of our classes using one-hot encoding, a kind of categorical encoding that makes the learning of each class linearly independent for the neural network.
Notice how the numbers 0, 1, and 2 become the sequences 100, 010, and 001, respectively.
y_train_onehot = keras.utils.to_categorical(y_train, num_classes = 3)
y_test_onehot = keras.utils.to_categorical(y_test, num_classes = 3)
# print the first 5 values as an example
print("### Antes ###")
print(y_train[:5])
print(y_test[:5])
print("\n### Depois ###")
print(y_train_onehot[:5])
print(y_test_onehot[:5])
### Antes ###
[2 2 0 0 1]
[1 2 2 1 0]
### Depois ###
[[0. 0. 1.]
[0. 0. 1.]
[1. 0. 0.]
[1. 0. 0.]
[0. 1. 0.]]
[[0. 1. 0.]
[0. 0. 1.]
[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]]
In this notebook, a neural network will be used to determine to which of the 3 species each flower belongs, given its 4 measurements. In other words, the neural network will be a classifier trained to predict to which class (out of 3) a vector of 4 values belongs.
Let's build a neural network with Keras. Here we configure the layers, the neurons per layer, the activation functions, the optimizer, and the loss function.
After building the network, we use a Keras utility function to display the model's topology. In this case, the network takes vectors of size 4 as input (corresponding to the number of measurements of our flowers) and uses 4 dense layers to perform the classification.
Note that the input size of each layer matches the output size of the previous layer, allowing layers to be chained to increase the expressiveness of the network.
The last layer of the network outputs vectors of size 3, corresponding to the number of flower species we want to classify.
Let's wrap this in a function so we can recreate the same model later.
def create_model():
model = Sequential()
model.add(Dense(10, activation='tanh', input_dim=4))
model.add(Dense(8,activation='tanh'))
model.add(Dense(6,activation='tanh'))
model.add(Dense(3,activation='softmax'))
model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy'])
return model
model=create_model()
keras.utils.plot_model(
model,
show_shapes=True,
show_layer_names=True,
rankdir="LR",
expand_nested=True,
dpi=96
)
As a high-level neural network library, Keras abstracts away some of the complexity of configuring the training procedure. There are different optimizers and loss functions that can be used with the library, and they can affect how long the network takes to train. Common optimizer choices include sgd, rmsprop, and adam. The categorical_crossentropy loss is the most appropriate one for multi-class classification tasks in which the classes are one-hot encoded; a more explicit configuration is sketched below.
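For reference, and not part of the original notebook, the string shortcuts map to configurable objects; a minimal sketch with the optimizer instantiated explicitly, assuming the learning rate is what we want to tweak:
from tensorflow.keras.optimizers import Adam

# Same as model.compile('adam', 'categorical_crossentropy', ...), but with the
# optimizer created explicitly so hyperparameters such as the learning rate
# can be changed (0.001 is the default).
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])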
Let's train our neural network. It will learn to classify the training data (X_train) into the categorical classes (y_train_onehot) for a predetermined number of epochs.
During training, Keras stores the loss values and other metrics and makes them accessible through a variable, here called history.
The cell below can be run repeatedly so that the same network is trained for more epochs.
It is also possible to set aside a portion of the training set for validation, the real-time evaluation of the neural network on a set of data that is not used for training.
history = model.fit(X_train, y_train_onehot, validation_split=.1, epochs=500, verbose=2)
Epoch 1/500
4/4 - 0s - loss: 1.3739 - categorical_accuracy: 0.0556 - val_loss: 1.5228 - val_categorical_accuracy: 0.0000e+00
Epoch 2/500
4/4 - 0s - loss: 1.3304 - categorical_accuracy: 0.0093 - val_loss: 1.4575 - val_categorical_accuracy: 0.0000e+00
[...]
Epoch 499/500
4/4 - 0s - loss: 0.0104 - categorical_accuracy: 1.0000 - val_loss: 0.3896 - val_categorical_accuracy: 0.9167
Epoch 500/500
4/4 - 0s - loss: 0.0099 - categorical_accuracy: 1.0000 - val_loss: 0.3817 - val_categorical_accuracy: 0.9167
Let's plot the network's loss and accuracy over the training epochs. If a portion of the training set was used for validation, additional lines in the plot will show the network's performance on that separate set of data.
def plot_metrics(history):
fig, axes = plt.subplots(2,1, True,figsize=(8,12))
# "Accuracy"
axes[0].plot(history.history['categorical_accuracy'])
axes[0].set_title('Acurácia')
# "Loss"
axes[1].plot(history.history['loss'])
axes[1].set_title('Erro')
if 'val_loss' in history.history.keys():
axes[0].plot(history.history['val_categorical_accuracy'])
axes[1].plot(history.history['val_loss'])
axes[0].legend(['Treino', 'Validação'])
axes[1].legend(['Treino', 'Validação'])
plt.xlabel('Épocas')
plt.show()
plot_metrics(history)
Training the neural network for too many epochs on the training set can make the model specialize in that particular set of data, a phenomenon called overfitting.
This specialization hurts the network's performance because, by specializing, it fails to generalize its classification ability to unobserved data.
Overfitting can be seen when the classification loss on the training set keeps decreasing while the loss on the validation set starts to increase. That point can be considered a good moment to stop training the network.
When a neural network has more weights than its task requires, it can suffer from overfitting if trained for too many epochs, fitting itself too closely to the training data.
Some ways to avoid overfitting include training for fewer epochs, using a smaller network with fewer weights, and adding a regularization term to the loss function; automatic early stopping is sketched below.
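A convenient way to stop at roughly the right point, sketched here and not part of the original notebook, is Keras's EarlyStopping callback, which watches the validation loss and halts training when it stops improving:
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 20 consecutive epochs
# and roll the weights back to the best epoch seen.
early_stop = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)

model = create_model()
history = model.fit(X_train, y_train_onehot,
                    validation_split=.1,
                    epochs=500,
                    verbose=0,
                    callbacks=[early_stop])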
Considering what we observed in the first training run, let's train the network again, this time stopping the training earlier and avoiding overfitting. Additionally, we will use the training set in its entirety in this round, without validation.
model = create_model()
history = model.fit(X_train, y_train_onehot, epochs=150, verbose=0)
plot_metrics(history)
Let's use the trained network to classify the test data we set aside at the beginning and look at the classes predicted by the model. The neural network classifies the whole test set at once.
y_pred = model.predict(X_test)
print(y_pred)
[[0.01593729 0.60561794 0.37844476]
[0.00296563 0.07939946 0.9176349 ]
[0.00180538 0.04658297 0.95161164]
[0.03136854 0.9468003 0.0218312 ]
[0.962343 0.03639156 0.00126548]
[0.01097725 0.33443207 0.65459067]
[0.02357773 0.9411847 0.03523754]
[0.96158886 0.0371284 0.00128271]
[0.95949054 0.0391942 0.00131528]
[0.02218457 0.8818209 0.09599455]
[0.00703657 0.19524635 0.7977171 ]
[0.9629769 0.03576834 0.00125479]
[0.0274128 0.92334604 0.04924115]
[0.00359261 0.09982508 0.8965823 ]
[0.0021769 0.05739778 0.94042534]
[0.00176867 0.04559813 0.9526332 ]
[0.9619265 0.0368052 0.00126826]
[0.9620646 0.03666212 0.00127322]
[0.02267583 0.88068783 0.0966363 ]
[0.9593102 0.0393738 0.00131598]
[0.9614102 0.03731174 0.00127805]
[0.01361755 0.4841447 0.50223774]
[0.9575784 0.04107764 0.0013439 ]
[0.00956561 0.2892796 0.70115477]
[0.96190244 0.03682484 0.00127265]
[0.9592192 0.03946826 0.00131252]
[0.96002924 0.03866893 0.0013018 ]
[0.00265293 0.06909027 0.92825687]
[0.00196587 0.05123645 0.9467976 ]
[0.96210897 0.03662174 0.00126931]]
The network's output is a set of 3 values for each input vector. Each of the 3 values indicates how relevant the corresponding class is for that input vector: the higher the relevance, the higher the chance that the input belongs to that class.
The activation function in the last layer of the network dictates the nature of the output values:
tanh: hyperbolic tangent (range $[-1; 1]$).
sigmoid: sigmoid (range $[0; 1]$).
softmax: softmax (range $[0; 1]$; the values predicted for each input sum to 1).
To turn the values shown above into predicted classes, we pick the largest output along the class axis; a quick check of the softmax property is shown below.
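A small illustrative check, not in the original notebook, that each row of predictions sums to 1, as expected from the softmax:
import numpy as np

# Each row of y_pred is a probability distribution over the 3 classes.
print(np.allclose(y_pred.sum(axis=1), 1.0))  # expected: True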
Compare the true values with the values predicted by the network.
y_pred_onehot = y_pred.argmax(axis=1)
print('Classes reais:\t\t', y_test)
print('Classes preditas:\t', y_pred_onehot)
Classes reais: [1 2 2 1 0 2 1 0 0 1 2 0 1 2 2 2 0 0 1 0 0 2 0 2 0 0 0 2 2 0]
Classes preditas: [1 2 2 1 0 2 1 0 0 1 2 0 1 2 2 2 0 0 1 0 0 2 0 2 0 0 0 2 2 0]
We use scikit-learn to produce classification reports from the true and predicted values.
The network's performance metrics are precision, recall, and F1 [link].
Precision indicates how many of the positive classifications made by the network are actually positive. Recall indicates the fraction of all truly positive items that the network managed to identify.
In our context, we can imagine that these metrics answer the following questions:
Precision: of all the flowers my classifier labeled as iris setosa, how many really belong to that class?
Recall: of all the iris setosa flowers present in my dataset, how many did my classifier manage to identify?
print(classification_report(y_test, y_pred_onehot))
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 1.00 1.00 6
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Precision and recall are relevant measures of a model, but their definitions can be a bit confusing. A somewhat more informative way of thinking about them is to use them to analyze the effectiveness of a hypothetical medical test (taking the place of our classifier) for a disease, and ask which question each measure answers:
Precision: of all the positive test results, how many patients actually have the disease?
Recall: of all the patients who actually have the disease, what percentage did my test manage to identify?
If the medical test has high precision, we can be sure that every patient we report as having the disease really does have it. In other words, the test is not wrong in the positive calls it makes.
If the test has high recall, we can be sure that every carrier of the disease in our population was identified.
However, some anomalous situations can arise. For example, if there are 1,000 sick people in our sample and our classifier makes a single positive call, and that patient really is sick, the model's precision is 100%, since it made no wrong positive classifications.
Likewise, if there are 1,000 sick people in a population of 1 million and our classifier makes 1 million positive calls, it will have 100% recall, since it detected every sick person.
Because of these anomalies, one must always weigh which measure is more important for a classifier in a given task or, ideally, train a classifier that maximizes both.
The F-score (or F-measure, or F1), also shown above, is a combination of precision and recall and can help in evaluating the classifier.
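For reference, the F1 score is the harmonic mean of precision ($P$) and recall ($R$):
\[F_1 = 2\cdot\frac{P \cdot R}{P + R}\]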
The confusion matrix compares, in absolute numbers, the true and predicted classes. Values on the main diagonal indicate correct classifications; any other value indicates classification errors on the test set.
print(confusion_matrix(y_test, y_pred_onehot))
[[13 0 0]
[ 0 6 0]
[ 0 0 11]]
This Colab presented a classic, basic example of classifying a dataset with neural networks. The following concepts were presented throughout the Colab: