Douglas Meneghetti

Running a Gemma-powered question-answering chatbot locally with LangChain + Ollama

2024-03-01T00:00:00+00:00

Last week, I started my quest to build my first LLM-powered chatbot that runs locally and executes RAG on my Obsidian document base. You can see it here.

This post is a follow-up to that one. The highlights of this one are:

I will test the brand new Google Gemma family of models and qualitatively compare it to a Mistral model of same size.
The chatbot will run 100% locally in my computer with a GTX 1070, 7h gen i7 and 32 GB of RAM, using Chroma for the vector store and Ollama for the models.
Unlike in the previous post, we’ll configure the correct prompt templates for our instruction-tuned models.

Also, since my last post, I’ve refactored the code base in its current state as both a Python package and a command-line tool, which you can find in this link. This is what I used to run the experiments shown in this page. A sample notebook is provided here and the underlying code is < 200 lines long!

Test-driving Gemma-7b

A few days ago, Google released Gemma, a family of lightweight models the same size as Mistral-7b, the one I was already using.

My first attempt at loading Gemma-7b was through Hugging Face. However, it needs extra configuration to fit in my memory, i.e. float16/bfloat16 quantization with PyTorch, or 8bit/4bit quantization with bitsandbytes. Also, the model files are huge, totalling 20 GB.

For now, let’s familiarize ourselves with Ollama, which is simpler, and leave the Hugging Face ecosystem for later.

Enter Ollama

It turns out a very widespread way of running models locally is through Ollama, which is very simple to install on Linux.

curl -fsSL https://ollama.com/install.sh | sh

Ollama has a Docker-like interface (it was made by an ex-Docker employee) and models can be downloaded using their names and some basic tags. Downloaded models can be executed via python using their official library.

With that, I decided to stop using models from multiple libraries and frameworks and run all of them in Ollama. The commands below downloaded Mistral-7b, Gemma-7b, and a text embedding model from Nomic.AI.

ollama pull mistral:7b-instruct
ollama pull gemma:7b-instruct
ollama pull nomic-embed-text

The model files downloaded by Ollama are considerably smaller (each 7b model fit in a single 5GB file). They also loaded into my GPU memory, alongside the embedding model.

Configuring the correct prompt templates

It turns out the prompt templates, especially for Instruct models, need to use the correct tokens and that is not done automatically by LangChain or any other library.

I went after the correct templates on the Ollama website. They are shown below. Contrast with my last post, in which I copied an internal prompt from LangChain, which was not suitable for any specific model.

While tweaking the contents of the prompt template, I realized how much variability in the responses I got from only minor tweaks. For example, if I told Gemma something like “if you don’t know the answer, just say you don’t know and not make anything up”, the model would just refuse to answer any questions, even if the information was present in the retrieved text chunks. To make the comparison fair, I also removed the same section from the mistral prompt template, which I used in the previous post.

Mistral prompt template

Sources: Ollama and Hugging Face Hub.

[INST] Use the following pieces of context to answer the question at the end. Present a well-formatted answer, using Markdown if possible. Don't go over three paragraphs when answering.
---
{context}
---
Question: {question} [/INST]

Gemma prompt template

Sources: Ollama and Hugging Face Hub.

user Use the following pieces of context to answer the question at the end. Present a well-formatted answer, using Markdown if possible. Don't go over three paragraphs when answering. --- {context} --- Question: {question} model

Tweaking RAG parameters

One problem I was having was that performing maximal marginal similarity on a small number of chunks (e.g. 3) tended to result only a single helpful text chunk. To fix that I

reduced the chunk sizes

retrieved more chunks

Some final tweaking that made the output of both models better was using the following parameters:

Chunk size: 400 characters

Chunk overlap: 50 characters

Chunks returned by similarity for MMR search: 10

Final number of chunks returned by MMR: 6

MMR $\lambda$ (0 = maximum diversity among chunks; 1 = no diversity): 0.7

Running the models

Since my last post, I’ve refactored the code base in its current state as both a Python package and a command-line tool, which you can find in this link. This is what I used in the construction of this post.

The snippet below (which is also available as a Jupyter Notebook) showcases how to instantiate and talk to the LLM-powered Q&A bot:

from pathlib import Path from langsidian import ChatBot, DocumentBase bot = ChatBot( docs_path=(Path.home("Documents", "Obsidian")), vectorstore_db_path=Path("docs/chroma"), document_type=DocumentBase.OBSIDIAN, model_type="mistral:7b-instruct", embeddings="nomic", ) q = "What is zero-shot learning?" print(f"**Question: {q}**\n\n Answer: {bot.answer(q)}", end="\n\n---\n\n")

It can also be executed as a standalone module. The document path and model name can be selected with command-line arguments:

python -m langsidian

[Prompt]: Who were the authors of the Deep Q-Network paper? [Assistant]: The authors of the Deep Q-Network paper were Mnih, Vaswani, Graves, et al. (2015). You can find their paper here: .

Results

Below, I asked the same questions to both Mistral-7b-it and Gemma-7b-it, giving them access to my Obsidian document base, whose chunks are now embedded using Nomic Embed.

Analyzing the output qualitatively, I found that Mistral tends to give more informative and helpful answers. It also makes better use of the information in the chunks provided by the RAG procedure, creating richer answers, whereas Gemma tends to repeat itself or cut answers short when it hits a dead-end such as in the answer for this question.

Also, in the first question, Gemma explained terms that may have been in the context chunks recovered for RAG (such as $a’$ and $\pi_*$), but they did not appear in its final answer, which would definitely confuse users in a conversation.

Question: Give me the equation for the action value function update and explain its terms.

Mistral-7b

The equation for the action-value function update using the n-step return is as follows:
\[Q_{t+n}(S_t, A_t) \leftarrow Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - \gamma^n Q_{t+n-1}(S_t, A_t)]\]
This equation is a form of the Bellman equation for the action-value function. Let’s break down its terms:

$Q_{t+n}$: The action-value function at time step $t+n$ for state-action pair $(S_t, A_t)$.

$\alpha$: The learning rate or step size, which determines how much new information is incorporated in each update.

$G_{t:t+n}$: The n-step return from time step t to time step $t+n$. It’s the total reward obtained by following policy $\pi$ from state $S_t$ and taking action $A_t$, plus all the discounted future rewards, i.e., $G_{t:t+n} = R_{t+1:t+n}+ \gamma V_{\pi}(S_{t+n})$

$\gamma$: The discount factor, which is used to discount future rewards by a certain percentage to make them present values.

$Q_{t+n-1}$: The old estimate of the action-value function at time step $t+n-1$ for state-action pair $(S_t, A_t)$.

This update rule aims to minimize the difference between the current and new estimates of $Q(S_t,A_t)$, gradually improving the estimation as more samples are collected.

Gemma-7b

Equation for Action Value Function Update:
\[Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - \gamma^n Q_{t+n-1}(S_t, A_t)]\]
Explanation of Terms:

$Q_{t+n}(S_t, A_t)$: The action-value function at time $t+n$ for state $S_t$ and action $A_t$.

$Q_{t+n-1}(S_t, A_t)$: The action-value function at time $t+n-1$ for state $S_t$ and action $A_t$.

$\alpha$: The learning rate, which controls the rate of learning.

$G_{t:t+n}$: The n-step return, which is the total reward from time $t$ to time $t+n$.

$\gamma$: The discount factor, which accounts for the value of future rewards.

$a’$: The action that is not taken at time $t$.

$\pi_*$: The optimal policy.

Note: This equation is for the action-value function update in Q-learning, not the state-value function update.

Question: What is the architecture of the Deep Q-Networks?

Mistral-7b

The Deep Q-Network (DQN) architecture uses a deep neural network to approximate the Q* function in reinforce learning. The input to the network are 4 frames of 84x84 pixel screenshots from Atari 2600 games, only using the luminance values of the pixels. The network then uses three convolutional layers to learn visual representations of the game states. The output of the neural network is the Q values of all possible actions for each state, ranging from 4 to 18.

This architecture builds upon the Q-learning algorithm, which aims to find the optimal policy by iteratively improving the Q values based on observed state-action pairs and their corresponding rewards. However, directly applying this algorithm to deep neural networks with large input spaces is infeasible due to the computational demands.

To overcome this challenge, DQN uses experiences from multiple frames and applies a target network for more stable learning. The target network is updated periodically to match the weights of the online network, ensuring that the Q values in the target network do not change too frequently during training. Additionally, the authors use experience replay, where they store transitions in a buffer and sample them randomly to train the network, which helps in reducing correlation between consecutive samples and improving stability.

These architectural enhancements allowed DQN to learn complex visual representations of Atari games, achieve human-level performance on several benchmark tasks, and set a new standard for deep reinforce learning research.

Gemma-7b

The text describes the architecture of Deep Q-Networks, which uses a deep neural network to approximate the Q* function. The neural network has three convolution layers to learn visual representations of the game states.

The text does not describe the architecture of the convolutional layers, therefore I cannot answer the question.

Question: What is the difference between causal language modeling and masked language modeling?

Mistral-7b

Causal and masked language modeling are two different methods used for pre-training language models.

In causal language modeling, also known as autoregressive language modeling, the language model is trained to predict the next token in a sequence given the context of the previous tokens. The model has access to the entire sequence during training and prediction, and its task is to generate the next token based on the context provided. However, since there is no ground truth available during inference, this can lead to a discrepancy between training and inference, known as Exposure Bias. This issue can be addressed through techniques like Curriculum learning.

In masked language modeling, also known as Masked Language Modeling (MLM), tokens are randomly masked at training time and the model is trained to reconstruct the original text by predicting the masked tokens. For instance, in the sentence “the teacher teaches the student”, the word “teaches” might be masked as “". During training, the model is presented with the masked token and its context, and its task is to learn to fill in the masked token based on the context provided. BERT is an example of a masked language model. The main difference between causal and masked language modeling lies in how the models are trained and what they are optimized for: causal models predict the next token given the context, while masked models learn to fill in missing tokens based on their context.

Sources: DeepLearning.AI Natural Language Processing Specialization, Generative AI with Large Language Models.

Gemma-7b

Causal language modeling and masked language modeling are two methods for pre-training language models.

Causal language modeling allows the model to see the first tokens of the sequence and its task is to predict the next token. Masked language modeling, on the other hand, randomly masks tokens at training time and the model is trained to reconstruct the original text by predicting the masked tokens.

The main difference between causal language modeling and masked language modeling is that in causal language modeling, the model has access to the first tokens of the sequence and its task is to predict the next token, while in masked language modeling, the model is trained to reconstruct the original text by predicting the masked tokens.

Question: What is zero-shot learning?

Mistral-7b

Zero-shot learning is a concept used in machine learning, specifically in the context of Generative AI and Large Language Models (LLMs), to describe the capability of an LLM to predict classes it was not explicitly trained on without being given any explicit examples during training or at inference time. In other words, the model makes predictions based on its existing knowledge, rather than learning from new examples provided in real-time. This is in contrast to few-shot learning, where the model is given a few examples of the task it’s expected to perform before being presented with the actual task. Zero-shot learning requires a large and diverse dataset for the model to generalize effectively to unseen classes or concepts.

One limitation of zero-shot learning is that it may not be as effective as few-shot learning, especially when dealing with complex tasks or rare classes. Another limitation is that it relies heavily on the quality and representativeness of the training data for making accurate predictions. However, since zero-shot learning allows models to execute new tasks without being given any explicit examples, it can be beneficial in scenarios where acquiring labeled data is difficult, costly, or time-consuming.

Regarding your additional question: In Q-learning, importance sampling is not used because it is an off-policy reinforce learning algorithm that learns from a sequence of state-action pairs (trajectories) generated by a policy π, which may not be the optimal one. The goal is to estimate the action-value function Q(s,a), and importance sampling would introduce additional complexity without significant benefits in this context.

Gemma-7b

Zero-shot learning is the capability of a model to predict classes it was not explicitly trained on. In the context of Generative AI and LLMs, zero-shot learning refers to the capability of an LLM to execute new tasks without being given any explicit examples.

In zero-shot learning, the model is given no examples of the task it is expected to perform, and it must generalize from its previous experiences to complete the task. This contrasts with few-shot learning, which involves providing the model with a few examples of the task it is expected to perform.

For example, if you ask a model to classify a review as positive or negative, and it has never seen a review with that sentiment before, zero-shot learning would involve the model making a generalization based on its previous experiences to determine whether the review is positive or negative.

Question: Explain to me the concept of bucketing in RNNs

Mistral-7b

Bucketing in Recurrent Neural Networks (RNNs) is a technique used for batching sequences while minimizing padding. The main goal is to process sequences of varying lengths efficiently and maintain a consistent batch size during training. Here’s a step-by-step explanation:

Divide data into buckets: Divide the input data into m buckets based on sequence lengths. Each bucket represents a specific range of sequence lengths.

Sample sequences from a single bucket: For each batch, sample all sequences from only one of the buckets. This ensures that most sequences in a batch have similar lengths, reducing the need for extensive padding.

Pad sequences to the largest one: Pad all sequences in the batch to match the size of the longest sequence. This guarantees that every sequence receives the same amount of processing during each time step.

Bucketing is an essential technique when dealing with sequence data, as it helps to minimize padding and improve training efficiency. However, it might require more computational resources due to the need to process multiple buckets for generating batches over time.

It’s important to note that this method assumes that sequences within a bucket are independent of one another. If there is any dependence between sequences in a bucket, this assumption would not hold true, and other data processing techniques might be required.

Gemma-7b

Bucketing in RNNs is a technique for batching sequences in a way that minimizes padding. The idea is to group sequences of similar length into buckets and then process the buckets separately.

The process of bucketing involves dividing the data into buckets, sampling sequences from a single bucket, and padding all sequences in a batch to the size of the largest sequence in the batch. This technique is useful for RNNs because it allows for the efficient processing of sequences of varying lengths.

The technique is illustrated in the provided text with two images. The first image shows the process of dividing the data into buckets, and the second image shows the process of padding the sequences in a batch.

Question: What is a named entity?

Mistral-7b

A named entity is a real-world object that can be denoted with a proper name, examples include persons, locations, organizations, products, geographical entities, geopolitical entities, time indicators, artifacts, events, and natural phenomena. Named Entity Recognition (NER) systems are used to automatically locate named entities from unstructured text. This process is important for various applications such as improving search engine efficiency by scanning websites and storing named entities for later matching with user queries, and enhancing recommendation systems by extracting named entities from a user’s search history to identify their interests.

For instance, in the sentence “Sharon flew to Miami on Friday,” the named entities are “Sharon” (person), “Miami” (location), and “Friday” (time indicator). NER systems can accurately identify and tag these named entities within the text. The remaining tokens that do not represent named entities are typically tagged as filler words or O.

Gemma-7b

A named entity is a real-world object that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include people, locations, organizations, and products.

Named entities are extracted from text and are used to identify and classify named objects in a sentence. They are used in various applications, such as search engine efficiency, recommendation systems, and information extraction.

In the text provided, the sentence “Sharon flew to Miami on Friday” would have the named entities “Sharon,” “Miami,” and “Friday.”

Conclusions

On this post, I show you how I locally executed an LLM-powered Q&A chatbot that performs RAG on an Obsidian, using Mistral-7b and Gemma-7b instruction fine-tuned models as well as Nomic-Embed, all provided by Ollama.

I showed you how I did and you can:

Install Ollama;

Download Ollama models;

Configure prompt templates for the instruction-tuned versions of Gemma and Mistral, and;

Configure some search parameters to get better chunks and RAG results.

Answering questions from an Obsidian database with LLMs + RAG

2024-02-19T00:00:00+00:00

This notebook presents a proof-of-concept on how to create a question-answering bot powered by an LLM and with knowledge extracted from actual documents, more specifically, a collection of notes from Obsidian.

I am going to run all experiments locally, using a 7th gen i7, an NVIDIA GTX 1070 and 32 GBs of RAM.

A lot of the heavy lifting will be done by the LangChain package, which I am on the process of learning to use.

On the road to building this Q&A bot, we will be introduced to many concepts:

Document loading

Text data cleaning using regex

Splitting of Markdown documents into text chunks

Sentence embeddings

Vector stores

Similarity search using cosine similarity between embedding vectors

maximal marginal relevance search

Self-query retrieval

Contextual compression retrieval

Question-answering with retrieval augmented generation

Question-answering with retrieval augmented generation and custom templates

So that no one gets lost, the following diagram explains how the whole pipeline to get to our final Q&A bot (and how this notebook) works:

Prerequisites

You can find the environment.yml file to create a conda env with the necessary dependencies to run the notebook.

You should download a model from the GPT4All website and save it on ./models/my_little_llm.gguf. The one below is the one I used.

wget https://gpt4all.io/models/gguf/mistral-7b-openorca.Q4_0.gguf -O models/my_little_llm.gguf

Loading documents

Here I used the ObsidianLoader document loader and point it to the directory that contains all my notes in Markdown format.

We can see I have ~500 text files.

from pathlib import Path docs_path = (Path.home() / "Documents" / "Obsidian").absolute()

from langchain_community.document_loaders import ObsidianLoader loader = ObsidianLoader(docs_path, collect_metadata=True, encoding="UTF-8") docs = loader.load() print(f"Loaded {len(docs)} docs")

Encountered non-yaml frontmatter Loaded 499 docs

Let’s take a peek at one of the documents. We can see it has the textual content itself, as well as some metadata. The ObsidianLoader includes file properties from Obsidian documents, such as tags, dates and aliases, as part of the metadata.

docs[7]

Document(page_content='A method for pre-training [[language model]]s in which the model has access to the first tokens of the sequence and its task is to predict the next token.\n\nThe following examples depict how a single sequence can be turned into multiple training examples:\n\n1. `` → `the`\n1. ` the` → `teacher`\n1. ` the teacher` → `teaches`\n1. ` the teacher teaches` → `the`\n1. ` the teacher teaches the` → `student`\n1. ` the teacher teaches the student` → ``\n\nModels trained using this method have access to the full sequence of tokens at inference time, making them appropriate for non-generative tasks that revolve around processing a sequence of tokens as a whole, for example:\n\n- [[Sentiment Analysis]]\n- [[Named entity recognition]]\n- [[Word classification]]\n\n[[Bidirectional Encoder Representation from Transformers|BERT]] is an example of a masked language model. Example from [[Bidirectional Encoder Representation from Transformers|BERT]]: Choose 15% of the tokens at random: mask them 80% of the time, replace them with a random token 10% of the time, or keep as is 10% of the time.\n\n## Sources\n\n- [[DeepLearning.AI Natural Language Processing Specialization]]\n- [[Generative AI with Large Language Models]]', metadata={'source': 'Causal language modeling.md', 'path': '/home/dodo/Documents/Obsidian/Causal language modeling.md', 'created': 1700448369.2719378, 'last_modified': 1700448369.2719378, 'last_accessed': 1708267659.2105181, 'tags': 'area/ai/nlp/llm', 'date': '2023-11-19 23:41'})

Cleaning documents

Obsidian documents have some of their own Markdown flavor, like [[Graph Neural Network|GNNs]], where Graph Neural Network is the name of a document and GNNs is what appears on the text. In cases like these, we want to keep only the second part.

It also has full-on links, such as [[grid world]], in which case we want to remove the double brackets.

# !pip install google-re2 import re2 docus = [] insane_pattern = r"\[\[([^\]]*?)\|([^\[]*?)\]\]" for doc in docs: s = re2.search(insane_pattern, doc.page_content) if s is not None: new_doc = re2.sub(insane_pattern, r"\2", doc.page_content) docus.append( ( doc.page_content, new_doc, ) ) doc.page_content = new_doc doc.page_content = doc.page_content.replace("[[", "").replace("]]", "") sorted(docus, key=lambda x: len(x[1]))[0]

('- [[Intersection over Union|IoU]]', '- IoU')

Splitting documents

This step splits the documents loaded in the previous step into smaller chunks.

LangChain provides its own Markdown text splitter, which we are going to use.

from langchain.text_splitter import MarkdownTextSplitter splitter = MarkdownTextSplitter(chunk_size=400, chunk_overlap=50) splits = splitter.split_documents(docs) len(splits)

1510

Let’s take a peek at a chunk. They inherit the metadata of their parent document.

splits[542]

Document(page_content='# epsilon-soft policies\n\nAn $\\epsilon$-soft policy is a stochastic policy that always assigns a non-zero $\\frac{\\epsilon}{|A|}$ probability to all actions. These policies always perform some exploration.\n\nThe uniform random policy is an $\\epsilon$-soft policy. The epsilon-greedy policy also is.', metadata={'source': 'epsilon-soft policies.md', 'path': '/home/dodo/Documents/Obsidian/epsilon-soft policies.md', 'created': 1680669506.8282943, 'last_modified': 1680669506.8282943, 'last_accessed': 1708267660.780534, 'tags': 'area/ai/rl project/rl-spec', 'aliases': 'epsilon-soft policy', 'date': '2021-05-24 18:32'})

Computing embeddings and saving them to a vector store

To quickly search for text chunks, it is useful to precompute an embedding vector for each chunk and store it for future use.

An embedding vector is a numerical vector that represents the text chunk. It allows us to compare chunks in the embedding space. Chunks with similar semantic meaning tend to have similar embedding vectors. This similarity can be computed using e.g. cosine similarity.

My choice for embedding generator was SentenceTransformers, provided by Hugging Face, which runs locally.

# !pip install sentence_transformers from langchain.embeddings.huggingface import HuggingFaceEmbeddings embedding = HuggingFaceEmbeddings()

/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm

Computed embedding vectors can be stored in vector stores. The one we will use in this project is Chroma. It is free, runs locally and is perfect for our small document base.

# !pip install chromadb from langchain.vectorstores import Chroma persist_directory = "docs/chroma/" !rm -rf ./docs/chroma # remove old database files if any vectordb = Chroma.from_documents( documents=splits, embedding=embedding, persist_directory=persist_directory ) vectordb._collection.count()

1510

Retrieval

Retrieval is the act of retrieving text chunks from our vector store, given an input prompt.

Basic retrieval is performed by comparing the prompt embedding with those of the text chunks. More complex retrieval techniques involve calls to an LLM.

Basic retrieval

Let’s first test a retrieval technique based on similarity search in the vector store. Given a prompt, the procedure should return the most similar or relevant chunks in the vector database.

The question below will be used as a test for everything else below in the notebook. It is related to reinforcement learning, an area in which I have a few hundred documents written on Obsidian. You can find more about the question here to see if our retrieval methods actually nail the answer.

question = "What is the definition of the action value function?"

The first example of retrieval is similarity search, which will convert the prompt into an embedding vector and compute the cosine similarity between the prompt embedding and the embeddings of all chunks in the vector store, returning the k most similar chunks.

retrieved_docs = vectordb.similarity_search(question, k=8) for doc in retrieved_docs: print(doc.page_content, end="\n\n---\n\n")

The action-value function represents the expected return from a given state after taking a specific action and later following a specific policy. $$q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_t|S_t=s,A_t=a]$$ where $G_t$ is the Expected sum of future rewards. --- A value function maps states, or state-action pairs, to expected returns. - State-value function - Action-value function --- The state-value function represents the expected return from a given state, possibly under a given policy. $$v(s)=\mathbb{E}[G_t|S_t=s]$$ $$v_{\pi}(s)=\mathbb{E}_{\pi}[G_t|S_t=s]$$ where $G_t$ is the Expected sum of future rewards. --- The same goes for the Action-value function. $$\begin{align} q_*(s,a) & = \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma \sum_{a'} \pi_*(a'|s') q_*(s',a')] \\ & = \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma \max_{a'} q_*(s',a')] \end{align}$$ --- Let's say we have a policy $\pi_1$ that has a value function $v_{\pi_1}$. If we use $v_{\pi_1}$ to evaluate states but, instead of following $\pi_1$, we actually always select the actions that will take us to the future state $s'$ with highest $v_{\pi_1}(s')$, we will end up with a policy $\pi_2$ that is equal to or better than $\pi_1$. --- $$\begin{align} v_*(s) & = \sum_a \pi_*(a|s) & \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma v_*(s')] \\ & = \max_a & \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma v_*(s')] \end{align}$$ where $\pi_*$ is the Optimal policy. The same goes for the Action-value function. --- It's a function that dictates the probability the state will find itself in an arbitrary state $s'$ and the agent will receive reward $r$, given the current state the environment finds itself in, $s$, and the action chosen by the agent in $s$, depicted as $a$. It is usually denoted as $p(s',r|s,a)$. Some properties of this function: --- Policy evaluation is the task of finding the state-value function $v_{\pi}$, given the policy $\pi$. ^1b9b46 ---

Maximal marginal relevance search

Plain similarity search has a drawback. It tends to recover chunks which are very similar or even identical, diminishing the overall amount of information present in the retrieved chunks.

To solve this, LangChain provides a method called maximal marginal relevance search, which works by “[…] finding the examples with the embeddings that have the greatest cosine similarity with the inputs, and then iteratively adding them while penalizing them for closeness to already selected examples.” [source]

retrieved_docs = vectordb.max_marginal_relevance_search(question, k=8) for doc in retrieved_docs: print(doc.page_content, end="\n---\n")

The action-value function represents the expected return from a given state after taking a specific action and later following a specific policy. $$q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_t|S_t=s,A_t=a]$$ where $G_t$ is the Expected sum of future rewards. --- A value function maps states, or state-action pairs, to expected returns. - State-value function - Action-value function --- A generalization of Sarsa which employs the n-step return for the action value function, !n-step return#^205a30 ^68659e This estimate is then used in the following update rule for the action-value of the state-action pair at time $t$. $$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - \gamma^n Q_{t+n-1}(S_t, A_t)]$$ ^ca04db --- - if the agent exploits without having a good estimate of the action-value function, it will most likely be locked in suboptimal behavior, not being able to gather information from unknown transitions which might bring it more return. --- Some properties of this function: It maps states and actions to states and rewards, so its cardinality is $$p:S \times R \times S \times A \to [0;1]$$ It is a probability, so the sum over all possible combinations of states and rewards must be one, $$\sum_{s' \in S} \sum_{r \in R} p(s',r|s,a) = 1, \forall s \in S, a \in A(s)$$ --- # Factored value functions in cooperative multi-agent reinforcement learning VDN was the first one and the one I used in my Doctorate. --- - Exploitation: select the greedy action with relation to the action-value function. - Exploration: select a non-greedy action. --- Given the following MDP: !Pasted image 20210523192818.png The Bellman equation allows the value function to be expressed and solved as a system of linear equations: ^c06dd9 !Bellman equation for the state-value function#^a65ad4 ---

LLM-backed retrieval

Some retrieval techniques require an underlying language model to be performed. The LLM may be used to, e.g. summarize or make chunks more coherent before returning them.

Instantiating the LLM

The LLM I chose is Mistral-7B-OpenOrca, provided by GPT4All.

Mistral 7B is the best free and open 7 billion parameter LLM. It is also small enough to run on my GPU.

The OpenOrca dataset is a conversation dataset.

According to Mistral’s product website, this model has an 8k context window, which we should consider when retrieving chunks for it to process.

# !pip install gpt4all # !pip install lark # !wget https://gpt4all.io/models/gguf/mistral-7b-openorca.Q4_0.gguf -O models/my_little_llm.gguf # !wget https://gpt4all.io/models/gguf/nous-hermes-llama2-13b.Q4_0.gguf -O models/my_little_llm.gguf from langchain_community.llms.gpt4all import GPT4All llm = GPT4All(model="models/my_little_llm.gguf", device="gpu")

llama.cpp: using Vulkan on NVIDIA GeForce GTX 1070

Self-query retrieval

Self-query is a technique in which an LLM is specifically prompted to output a structured query. It also allows it to take document/chunk metadata into consideration, as long as we describe each attribute in the metadata with a textual description.

Under the hood, self-query performs some pretty convoluted modifications to the original prompt and I advise you look at the documentation to understand what’s going on. [Source]

As we have seen when inspecting our splits, we can see that our data includes metadata taken from the file properties of Obsidian documents. We will go ahead and described them as attributes for the self-query retriever.

from langchain.chains.query_constructor.base import AttributeInfo metadata_field_info = [ AttributeInfo( name="source", description="The name of the Markdown file that contained the chunk. If you ignore the .md extension, it is the name of the article the chunk came from.", type="string", ), AttributeInfo( name="aliases", description="Other names for the article the chunk came from, if any.", type="string", ), AttributeInfo( name="tags", description="A series of comma-separated tags that categorize the article the chunk came from. When a tags starts with 'area', it denotes a broad area of knowledge. When it starts with 'project', it describes a specific project with beginning and end.", type="string", ), AttributeInfo( name="authors", description="When the document summarizes a scientific paper, this attribute holds a comma-separated list of author names.", type="string", ), AttributeInfo( name="year", description="When the document summarizes a scientific paper, this attribute contains the year of the publication.", type="integer", ), ] document_content_description = "A collection of study notes in Markdown format written by a single author, mostly about artificial intelligence topics."

The self-query retriever can also be configured to use maximal marginal relevance search, as you can see in the base_retriever argument below.

from langchain.retrievers.self_query.base import SelfQueryRetriever retriever = SelfQueryRetriever.from_llm( llm=llm, vectorstore=vectordb, document_contents=document_content_description, metadata_field_info=metadata_field_info, base_retriever=vectordb.as_retriever(search_type="mmr", k=8), )

retriever.invoke(question)

[Document(page_content='!Pasted image 20231129031306.png', metadata={'created': 1708307272.9665868, 'date': '2023-11-29 01:34', 'last_accessed': 1708307272.9699202, 'last_modified': 1708307272.9665868, 'path': '/home/dodo/Documents/Obsidian/Single linkage.md', 'source': 'Single linkage.md', 'tags': 'area/ai/ml/clustering'}), Document(page_content='!Pasted image 20230317051147.png', metadata={'created': 1680667926.0942817, 'date': '2023-03-17 04:33', 'last_accessed': 1708267663.323892, 'last_modified': 1680667926.0942817, 'path': '/home/dodo/Documents/Obsidian/Comparing feature vectors in NLP.md', 'source': 'Comparing feature vectors in NLP.md', 'tags': 'area/ai/nlp project/nlp-spec'}), Document(page_content='!Pasted image 20230325081439.png', metadata={'created': 1679742881.9000912, 'last_accessed': 1708267661.4972079, 'last_modified': 1679742881.9000912, 'path': '/home/dodo/Documents/Obsidian/Text cleaning.md', 'source': 'Text cleaning.md'}), Document(page_content='!_attachments/Pasted image 20210523185724.png', metadata={'created': 1680669713.8148472, 'date': '2023-04-05 01:41', 'last_accessed': 1708267662.710553, 'last_modified': 1680669713.8148472, 'path': '/home/dodo/Documents/Obsidian/Iterative policy evaluation.md', 'source': 'Iterative policy evaluation.md', 'tags': 'area/ai/rl project/rl-spec'})]

As we can see, without more informative metadata (or better preprocessing of the text documents), the retrieved chunks are not very useful. It only retrieved chunks related to figures.

Contextual compression retrieval

As a final test on retrieval, we will implement a “contextual compression retriever”.

From the LangChain documentation:

The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.

In our case:

The base retriever will be a maximal marginal similarity search.

The compressor will be Mistral-7b-OpenOrca.

Our hope is that the small, irrelevant chunks returned by the self-query retriever will be dropped and more relevant chunks will be summarized and returned.

from langchain.retrievers import ContextualCompressionRetriever from langchain.retrievers.document_compressors import LLMChainExtractor compressor = LLMChainExtractor.from_llm(llm) compression_retriever = ContextualCompressionRetriever( base_compressor=compressor, base_retriever=vectordb.as_retriever(search_type="mmr") )

compressed_docs = compression_retriever.get_relevant_documents(question) compressed_docs

/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain. warnings.warn( /home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain. warnings.warn( /home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain. warnings.warn( /home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain. warnings.warn( [Document(page_content='The action-value function represents the expected return from a given state after taking a specific action and later following a specific policy.', metadata={'created': 1680669936.852964, 'date': '2023-04-05 01:45', 'last_accessed': 1708267660.0105264, 'last_modified': 1680669936.852964, 'path': '/home/dodo/Documents/Obsidian/Action-value function.md', 'source': 'Action-value function.md', 'tags': 'area/ai/rl project/rl-spec'}), Document(page_content='Action-value function', metadata={'created': 1680669672.8131003, 'date': '2023-04-05 01:41', 'last_accessed': 1708267659.7238567, 'last_modified': 1680669672.8131003, 'path': '/home/dodo/Documents/Obsidian/Value functions.md', 'source': 'Value functions.md', 'tags': 'area/ai/rl project/rl-spec'}), Document(page_content='*NO_OUTPUT*\n\nThe definition of the action value function is not mentioned in this context.', metadata={'created': 1633628586.5949209, 'date': '2021-03-02 23:01', 'last_accessed': 1708267661.2005382, 'last_modified': 1632030179.7187316, 'path': '/home/dodo/Documents/Obsidian/Factored value functions in cooperative multi-agent reinforcement learning.md', 'source': 'Factored value functions in cooperative multi-agent reinforcement learning.md', 'tags': 'None'}), Document(page_content='Action-Value Function Definition: Not mentioned in the context.', metadata={'created': 1680669515.4254303, 'date': '2023-04-05 01:38', 'last_accessed': 1708267661.453874, 'last_modified': 1680669515.4254303, 'path': '/home/dodo/Documents/Obsidian/Exploration-exploitation tradeoff.md', 'source': 'Exploration-exploitation tradeoff.md', 'tags': 'area/ai/rl project/rl-spec'})]

These results seem much better than the previous ones, but they are still just a collection of chunks. When interacting with LLMs and chatbots in general, we expect a more direct response.

Question-answering using LLMs and RAG

In this example, we will perform retrieval augmented generation for question-answering in an Obsidian document database.

To summarize what we already have for this step:

Our documents have been loaded and preprocessed.

Chunks have been split from the documents, embedded and stored in the vector store.

An LLM has been successfully loaded into memory.

Plain retrieval Q&A

This method of Q&A uses the prompt to find relevant chunks in the vector store. These chunks are called the context of the prompt and they are concatenated to the prompt, which is then passed directly to the LLM.

from langchain.chains import RetrievalQA qa_chain = RetrievalQA.from_chain_type( llm=llm, retriever=vectordb.as_retriever(search_type="mmr") )

We can see which arguments the chain expects by inspecting the input_keys list.

qa_chain.input_keys

['query']

result = qa_chain({"query": question})

/home/dodo/.anaconda3/envs/langsidian/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead. warn_deprecated(

The result of prompting the overall system can be seen below. If you remember the definition of the action-value function [source], our Q&A bot has pretty much nailed it!

result

{'query': 'What is the definition of the action value function?', 'result': ' The action-value function represents the expected return from a given state after taking a specific action and later following a specific policy.'}

Under the hood, the RetrievalQA object uses a prompt template into which it replaces the context and the question before sending the full text prompt to the LLM. We can see it by inspecting the object’s graph.

qa_chain.get_graph().nodes

{'7eac904b44594e20852d8f0519ef0c3e': Node(id='7eac904b44594e20852d8f0519ef0c3e', data=), '8e824ec8c0654d0db3b83b56bd66b619': Node(id='8e824ec8c0654d0db3b83b56bd66b619', data=RetrievalQA(combine_documents_chain=StuffDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['context', 'question'], template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"), llm=GPT4All(model='models/my_little_llm.gguf', device='gpu', client=)), document_variable_name='context'), retriever=VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=, search_type='mmr'))), '4b55a6602f9542fe8d583ac66c6ae722': Node(id='4b55a6602f9542fe8d583ac66c6ae722', data=)}

Retrieval Q&A with custom prompt template

The example below shows how to edit the prompt template used by the chain, albeit, in this case, with limited success. This is due to the limited performance of the LLM being used.

from langchain.prompts import PromptTemplate # Build prompt template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. At the end of the response, say \"over and out\". {context} Question: {question} Helpful Answer:""" qa_chain_prompt = PromptTemplate.from_template(template) # Run chain qa_chain = RetrievalQA.from_chain_type( llm, retriever=vectordb.as_retriever(search_type="mmr"), return_source_documents=True, chain_type_kwargs={"prompt": qa_chain_prompt}, )

Let’s ask a few questions to our Q&A bot and render the output as some nice Markdown.

Note that we can also output the documents that were retrieved during RAG and used to compose the answer, but that would pollute the output too much, so I left it commented out.

from IPython.display import Markdown questions = [ "Given me the equation for the action value function update.", "What is the overall architecture of the Deep Q-Networks?", "What is the difference between causal language modelling and masked language modelling?", "What is zero-shot learning?", "Explain to me the concept of bucketing in RNNs.", "What is a named entity in the concept of NLP?", ] for q in questions: result = qa_chain({"query": q}) display(Markdown(f"**Question: {result["query"]}**\n\n Answer: {result['result']}")) # source_docs = "\n\n".join(d.page_content for d in result["source_documents"]) # print( # f"Source documents\n\n{source_docs}" # )

Question: Given me the equation for the action value function update.

Answer: The equation for the action-value function update is given by:
\[q_{\pi}(s,a) = R(s,a) + \{ \gamma V_\pi (s') | s' \in S'}\]
where $R(s,a)$ is the reward received when taking action a in state s and $\gamma$ is the discount factor.

Question: What is the overall architecture of the Deep Q-Networks?

Answer: The overall architecture of a Deep Q-Network (DQN) consists of an input layer, multiple hidden layers with nonlinear activation functions, and an output layer. It uses experience replay to store past experiences for training purposes, and employs target networks to stabilize the learning process. over and out

Question: What is the difference between causal language modelling and masked language modelling?

Answer: Causal language modeling refers to a method where the model predicts the next token in a sequence based on the previous tokens. In contrast, masked language modeling involves randomly masking some tokens during training time and then trains the model to reconstruct the original text by predicting the masked tokens.

Question: What is zero-shot learning?

Answer: Zero-shot learning refers to a model’s ability to perform new tasks without being explicitly trained on those specific tasks or examples. In the context of large language models, it means that an AI can execute new tasks without needing any explicit training data for those tasks.

Question: Explain to me the concept of bucketing in RNNs.

Answer: Bucketing in RNNs refers to grouping or organizing input sequences into fixed-sized groups, called “buckets”, before processing them with an RNN model. This technique helps improve training efficiency and reduce padding by ensuring that each bucket contains a sufficient amount of randomness and variability while preventing it from being too large so as not to introduce excessive padding.

Question: What is a named entity in the concept of NLP?

Answer: In the context of Natural Language Processing (NLP), a named entity refers to a real-world object that can be denoted with a proper name. Examples are a person, location, organization, product. It can be abstract or have a physical existence.

In some answers, the model has actually followed the instructions from the new prompt, but we need a much more powerful LLM, or the employment of techniques such as few-shot learning, to get better instruction-following results.

Conclusions

This notebook presented a proof-of-concept on how to create a question-answering bot powered by an LLM and with knowledge extracted from actual documents, more specifically, a collection of notes from Obsidian.

We were able to run all experiments locally, using a 7th gen i7, an NVIDIA GTX 1070 and 32 GBs of RAM.

We were also introduced to many concepts on the road to building this Q&A bot, such as:

Document loading

Text data cleaning using regex

Splitting of Markdown documents into text chunks

Sentence embeddings

Vector stores

Similarity search using cosine similarity between embedding vectors

maximal marginal relevance search

Self-query retrieval

Contextual compression retrieval

Question-answering with retrieval augmented generation

Question-answering with retrieval augmented generation and custom templates

In future work, let’s build an actual chatbot that remembers previous answers and can keep up a lengthier conversation.

Configuring Visual Studio Code for LaTeX

2021-06-11T00:00:00+00:00

VS Code has very poor LaTeX support out-of-the-box. Here I list useful extensions and other software that will transform VS Code into a much more power LaTeX editor. They also do not conflict with each other, which is great.

Getting it out of the way: LaTeX Workshop

While I have used LaTeX Workshop for a long time, I started experiencing a few issues with it that, over time, got really annoying. The main issue is simply the Enter key not working, which was then replaced by a substantial delay between pressing the Enter key and actually inserting a new line character. Fixing one apparently introduced the other, so, after many tries, I just gave up on LaTeX Workshop.

What I use now

Code completion: TeXLab VS Code extension. It depends on another program called TeXLab, but the extension usually installs it automatically.

Syntax highlighting Rich LaTeX syntax highlighting (for use with Texlab).

Grammar checking: LTeX. Be sure to check the documentation to learn how to configure the extension to check different languages.

PDF visualization: If TeXLab has not worked out for you when opening PDF files, you can use an extension called Open, which opens a file in its default application.

The TeXLab VS Code extension also provides other useful features, as long as the appropriate programs are installed and in your Linux PATH (or whatever the equivalent in Mac/Windows is). These assorted programs are usually included in TeX Live, but can be installed through your distribution’s package manager, if you are on Linux.

LaTeX formatting (with latexindent)

Compilation via the F5 shortcut (with latexmk)

Linting (with chktex)

Opening the final PDF file with a shortcut (through the forward search option)

Resources to self-study mathematics for machine learning

2021-06-11T00:00:00+00:00

Two plus two is four, minus one, that’s three. Quick maths.

Michael Dapaah

Introduction

Study areas

Algebra and pre-calculus

Calculus

Linear Algebra

Advanced Linear Algebra

Software and apps

Online communities

My failures

Final thoughts

More resources and moving forward

Introduction

When I started my PhD, I didn’t know at first if my research would take me in a path whose foundations I was comfortable with, such as algorithms, logic and other discrete areas, or if fate would take me towards the more mathematical path. As I started to get really interested in the applications of neural networks for reinforcement learning, I realized little by little that I would not only need to remember all the mathematics I had learned before in my life, but also learn and get comfortable with some new stuff.

In this article, I go over:

what topics I’ve been studying;

in which order I study them;

what material worked for me;

where to find them online (the free ones, at least);

which software, online services and apps I use;

where to find online communities to post questions and answers and to study together in real time;

where I failed in my studies;

some final tips;

links to more resources.

Study areas

Algebra and pre-calculus

I recommend the free algebra book by James Brennan and the free precalculus book by Stitz and Zaeger, which I found out about in the very helpful /r/learnmath subreddit. My recommendation would be to evaluate which topics you remember and just skim over them, while giving yourself more time over the stuff you don’t remember or have never learned.

Personally, I made a lot of progress by just following both books until I felt I was comfortable with my knowledge. I tried not to skip anything, but I also felt like I could work fast as I had already learned most of this stuff before.

Calculus

To study calculus, I used the books by Stewart, 8th edition. I knew I was prepared to start calculus when I had studied enough algebra to fulfill the prerequisites at the start of the book. I also worked through the two review lists available in Stewart’s website [1] [2]. Those lists helped me to prioritize what I needed to focus on my aforementioned precalculus quest.

Linear Algebra

I decided to start studying linear algebra alongside calculus. I had an easier time with linear algebra as I already had some familiarity with matrix operations, not only because I was already working with neural networks for quite some time, but I also implemented my own matrix class in C++ at one point.

I started with a Brazilian undergrad text book from 1986, but also searched many concepts on YouTube, since I did not get the intuition behind some of them. I ended up ditching the book after getting acquainted with Larson’s book, which presented the material in a much nore contemporary fashion.

Advanced Linear Algebra

I haven’t gotten here yet, but after learning all the basic stuff about matrices, vector spaces, linear transforms and eigenthingies, I was recommended the book Linear Algebra Done Right, by Sheldon Axler to move on to more advanced stuff.

Software and apps

Solvers: These are services that do things like simplifying expressions, solving for $x$, finding derivatives, minima and maxima values of functions and so on. You name it, they do it. Just be careful that, sometimes, the answers are either more convoluted than they need to be, or simpler than you’d want them to be.

Wolfram Alpha is in this category. Instead of paying for a subscription to the website, a cheaper solution is to buy the smartphone app, which is a one-time purchase that gives you the coveted “step-by-step solution” functionality.

A free alternative which is equally impressive is Microsoft Math Solver, which also has a more user-friendly mobile app.

Graphical calculators: you’re gonna need these ones often, to visualize functions, derivatives, systems of equations etc. I recommend the Desmos Graphical Calculator, a very famous graphical calculator, which is freely available online and as an app. It only works in two dimensions, I don’t really know if there is anything fancier than that.

Actual calculators: On Linux, I use SpeedCrunch for quick maths or Qalculate! for more symbolic stuff (with fractions, units of measurement etc). On mobile, I’d recommend Desmos Scientific Calculator. I know some people like to use “calculator simulators” but I haven’t had the need for that.

Linear algebra special: instead on depending on MATLAB or some other obnoxious monolithic proprietary software, you can represent systems of linear equations as matrices in Python using NumPy and manipulate/solve them using numpy.linalg. You can load .mat files into Python variables using SciPy’s loadmat function.

Online communities

It came a point in which I started having the kind of questions that I couldn’t quite google my way to the answer. When I started to look for a place to post my questions, I found the extremely active and welcoming /r/learnmath subreddit. The people there are really nice and welcoming of noobs and they are a lot less strict than other sites, such as the Math Stack Exchange.

Another place I suggest people to look are Discord servers, where people gather to talk about homework. An example is Homework Help. There are also study servers on Discord in which people can keep you company in mute video calls. I highly recommend the Study Lions, but there are others.

Overall, my advice with regards to communities is to both seek and provide help. Seeing people ask questions to which I knew the answer to made me realize how much I had actually learned in my journey, which really motivated me. Also, I could help a random stranger online, which is nice.

My failures

Humble time. Since I didn’t know where to start, I decided to enroll in online university courses and let the lectures guide me. I had had a great experience with Coursera and the machine learning course by Andrew Ng, so I thought I’d have an equally positive experience with Coursera’s math courses¹.

However, unlike the previous courses I took, I couldn’t watch the lectures or complete the assignments before or after the correct dates, which really frustrated me.

Another problem I faced was that I just couldn’t follow the professors in some of the lectures. Sometimes, the exercises after a lesson seemed to have nothing to do with the lesson itself. Since I had no other material to follow, I felt stuck in some lessons and couldn’t meet the deadlines.

Final thoughts

Don’t force it: Studying mathematics ended up becoming a hobby of mine, one that I could spend an entire day doing, after I had started. There were days that I would eat up over 20 pages of a book, a pretty decent number if you ask me, since I’d be able to end a 400 page book in 20 days (not that I did that). My main problem was, and still is, consistency. I can’t really do the whole “study everyday for 30 minutes” thing. I usually studied for the whole day, twice a week.

I’ll admit I did not study any of the books from cover to cover and I am still working through them, but I never felt hindered or delayed by the method I chose to study mathematics. My only limitations are time, since I have other responsibilities, and motivation.

Find the format that is best for you: since I did very well in online AI courses [1] [2], I thought the same would be true for online math courses. However, when it came to mathematics, I felt like books were a much better medium, as they present the material in both a rigorous and linear fashion, delving into the theory and definitions and following up with lots of examples and exercises which could mostly be completed with the material presented right before. Text books are also good when I just wanted to review some old algebra, in which case I am able to quicly skip to whatever concept I am interested in, while videos would be much harder to sift through.

YouTube is your friend: I tend to forget basic stuff very often, such as how to complete the square, trigonometric identities (there are lots of them) or the binomial theorem. In these cases, looking up a YouTube video is much faster than searching for what you need in books. Math videos usually go directly to the point and you don’t waste any time.

More resources and moving forward

This video my the Math Sorcerer lists books from every area of Mathematics, starting from basic logic and algebra and going all the way up to the crazy stuff.

This video by Aleph 0 has some book recommendations for undergraduate level courses in pure mathematics, with accompanying video lectures on YouTube.

I enrolled in the UC Irvine precalculus course (which is not available on Coursera anymore, but can be found here) and later, in the University of Ohio calculus course (which also is not available anymore, but a few pointers to where the material is currently available are [link] [link] [link]). ↩

Zotero tips and tricks

2021-06-11T00:00:00+00:00

tl;dr: Zotero is great but it lacks good cloud sync support and a way to make a giant .bib file available for referencing. You can get the first with the Zotfile addon, pointing its PDF directory somewhere inside your cloud sync directory. You can get the second with the Better BibTeX addon, by exporting a .bib file which is always kept updated.

Background

Zotero is a great reference manager that has incorporated lots of much needed features along the years. Two things that I feel Zotero is still missing is

good cloud sync support, so you can keep as many PDF files in your libraries without going over the measly 300 MB quota it gives;

a central .bib file which can be referenced in all my projects.

Here, I teach you how to circumvent these problems with two addons.

Cloud sync with Zotfile

Zotfile is an addon that helps in managing PDF files inside Zotero. It allows for automatic and batch moving and renaming of file attachments.

In order to sync all your PDF files among multiple computers, you can point Zotfile’s “PDF directory” setting into a folder inside your cloud sync service of choice (Mega, Dropbox, Google Drive, OneDrive etc.) and configure the addon to automatically move all new attachments to this folder. Zotero will then only keep links to these files in its database and will synchronize only these links, while the files themselves will be synchronized by your cloud sync application.

Central up-to-date bib database with Better BibTeX

JabRef introduced me to the as-of-yet unknown habit of keeping a single giant .bib file, which I would just link all my papers to and use the autocomplete function of my TeX editor of choice to search for entry keys.

To emulate this behavior in Zotero, there is an addon called Better Bibtex. It allows me to export my entire library into a .bib file, which is then kept updated as I change information in my Zotero library. I actually kept two .bib files, one in BibTeX format for all my papers, and the other in BibLaTeX format, for my thesis.

Detecting soccer balls with reduced neural networks

2021-02-21T00:00:00+00:00

I was fortunate to have published a paper in the Journal of Intelligent & Robotic Systems. While its preprint on arxiv will remain there, I am allowed to host the post-review, pre-published version, which was greatly expanded, on my personal website . ou can find it below.

Razões para deep reinforcement learning não funcionar

2021-02-19T00:00:00+00:00

Essa é uma compilação de textos que eu encontrei na internet em momentos de frustração com meu trabalho. Eu os resumi e traduzi para compartilhar com as pessoas sempre que me perguntarem porque trabalhar com DRL é uma desgraça.

Comentários do Karpathy

Texto: Lessons Learned Reproducing a Deep Reinforcement Learning Paper

Texto: Deep Reinforcement Learning Doesn’t Work Yet

Comentários do Karpathy

O trecho abaixo é um comentário do Andrej Karpathy, da Tesla, neste link.

If it makes you feel any better, I’ve been doing this for a while and it took me last ~6 weeks to get a from-scratch policy gradients implementation to work 50% of the time on a bunch of RL problems. And I also have a GPU cluster available to me, and a number of friends I get lunch with every day who’ve been in the area for the last few years. Also, what we know about good CNN design from supervised learning land doesn’t seem to apply to reinforcement learning land, because you’re mostly bottlenecked by credit assignment / supervision bitrate, not by a lack of a powerful representation. Your ResNets, batchnorms, or very deep networks have no power here. SL wants to work. Even if you screw something up you’ll usually get something non-random back. RL must be forced to work. If you screw something up or don’t tune something well enough you’re exceedingly likely to get a policy that is even worse than random. And even if it’s all well tuned you’ll get a bad policy 30% of the time, just because. Long story short your failure is more due to the difficulty of deep RL, and much less due to the difficulty of “designing neural networks”.

Texto: Lessons Learned Reproducing a Deep Reinforcement Learning Paper

Fonte: http://amid.fish/reproducing-deep-rl

Ao trabalhar com métodos programados no computador, nós frequentemente nos deparamos com problemas cuja solução não é óbvia. Quando o método não necessita de treinamento, é mais rápido e fácil ter várias ideias do que pode solucionar um problema e testar tudo em segundos do que realmente pensar na origem do problema. No caso de um algoritmo de RL, cada novo teste leva um ou mais dias, então é necessário pensar no problema para selecionar a solução mais provável, ou encontrar uma solução melhor, mesmo que não seja óbvia.

Ao reproduzir o artigo de outra pessoa:

O que você aprende: como fazer um método de DRL funcionar, melhorando nossas habilidades de RL engineers.

O que você pensa que vai aprender, mas não aprende: a surgir com novas ideias relevantes de pesquisa. Isso é melhor feito através da leitura crítica de outros artigos e do conhecimento de vários termos-chave da área.

Algumas dicas que o autor tentou martelar nos leitores:

se você vai usar um método de RL, tente não implementá-lo, LOL

tente medir tudo o que é possível durante o treinamento. Não só métricas como recompensa ou erro da rede, mas outras métricas indiretas de sucesso

escrever um diário dos experimentos para se lembrar do que você já tentou

Outro ponto interessante do relato. O autor pensou que iria levar 3 meses pra terminar a reprodução do paper, mas levou 8 meses. A maior parte do tempo é desprendida fazendo o algoritmo funcionar num exemplo simples. Depois, a demora está em instrumentar testes. Seguem abaixo os tempos que o autor demorou para:

implementar a primeira versão do método (30 horas)

fazer ele funcionar num exemplo básico (110 horas)

fazer ele funcionar num exemplo novo (10 horas) e finalmente

conseguir rodar testes consistentes (60 horas).

Texto: Deep Reinforcement Learning Doesn’t Work Yet

Fonte: https://www.alexirpan.com/2018/02/14/rl-hard.html

Esse artigo lista diversos motivos pra métodos de DRL não funcionarem.

ineficiência de amostras: necessários milhões de exemplos pro método aprender

nunca é o método mais eficiente: métodos especializados pra cada problema quase sempre têm resultados melhores do que DRL: e.g., algoritmos de controle para robôs humanoides

criação da função de recompensa: precisa ser feita por uma pessoa e é difícil criar uma função que guia o agente de maneira óbvia para o objetivo, sem ser esparsa (+1 por vencer)

retorno esperado pode ter máximos locais: mesmo que o código esteja correto, o agente pode convergit para comportamentos inesperados, necessitando reiniciar o treinamento todo.

agente aprende por overfitting: difícil realizar transferência do que é aprendido para outro ambiente ou para uma pequena mudança no mesmo ambiente

sensibilidade a inícios aleatórios: o gráfico abaixo é um exemplo de treinar um algoritmo de DRL a equilibrar o pêndulo invertido várias vezes. Só funcionou 70% das vezes.

Algumas características de problemas nos quais aplicar DRL pode ser produtivo:

é fácil gerar experiência em quantidades ilimitadas: quanto mais dados melhor, usar um simulador rápido

é possível trabalhar num problema simplificado: ao invés de trabalhar no problema mais complexo possível (afinal, RL resolve tudo!), é melhor simplificá-lo e ver se o método resolve esse problema simplificado. E.g., trabalhar apenas em um cenário de muitos, com apenas um tipo de agente e espaço de ações limitado

é possível utilizar self-play: em um cenário competitivo, permitir que o agente controle ambos os agentes e aprenda de si próprio. Funciona bem no AlphaZero, Dota 2 e Super Smash Bros.

é fácil definir uma recompensa da qual o agente não possa se aproveitar: +1 por ganhar, -1 por perder. Nos papers de NAS, a recompensa é a acurácia da rede gerada no dataset de validação, ou seja, exatamente o que se deseja maximizar.

se a recompensa for modelada (reward shaping), tentar fazê-la ser rica: no ambiente que eu trabalhei durante o doutorado (SMAC), os agentes recebiam:

uma recompensa proporcional à quantia de dano que inferiam no adversário, num determinado ataque

uma recompensa maior por derrotar uma unidade adversária (last hit)

uma recompensa maior ainda por derrotar todas as unidades adversárias (ganhar a partida)

o autor também aponta que, quanto menor o intervalo entre uma ação e a recompensa associada àquela ação, mais fácil para o agente aprender.

Classificação da base de dados Iris - redes menores e regularização

2020-08-19T00:00:00+00:00

Este notebook continua uma série na qual foram introduzidas a base de dados Iris e o método de treinamento e avaliação de redes neurais [1] e o PCA [2].

No primeiro notebook, foram mencionadas duas coisas que serão testadas aqui.

redes neurais maiores (com mais pesos treináveis) possuem maior predisposição ao overfitting.

existe uma técnica, chamada regularização, capaz de combater o overfitting.

Neste notebook, uma rede consideravelmente menor será treinada para classificar a base de dados Iris. Não só isso, como ela usará a primeira componente principal retornada pelo PCA, trabalhando tanto com menos pesos como menos dados.

Além disso, treinaremos a rede neural original, utilizada nos notebooks anteriores, com regularização L2 e observaremos a ausência de overfitting quando a rede é treinada por longos períodos de tempo.

Importando pacotes e criando funções

As funções abaixo aplicam o PCA numa base de dados e avaliam uma rede neural, retornando as medidas para serem exibidas em gráficos posteriormente.

from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn.metrics import classification_report, confusion_matrix from sklearn.decomposition import PCA import matplotlib.pyplot as plt import tensorflow.keras as keras import numpy as np from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense def apply_PCA(X, n_components): pca = PCA(n_components=n_components) new_X = pca.fit_transform(iris_X) print('Tamanho da base de dados antes: ', X.shape) print('Tamanho da base de dados depois: ', new_X.shape) print('Porcentagem da variância explicada por cada coluna:', pca.explained_variance_ratio_) print('Variância acumulada nas colunas remanescentes:', sum(pca.explained_variance_ratio_)) return new_X def evaluate_model(model, X, y): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=123) y_train_onehot = keras.utils.to_categorical(y_train, num_classes = 3) y_test_onehot = keras.utils.to_categorical(y_test, num_classes = 3) model.save_weights('weights.hdf5') history = model.fit(X_train, y_train_onehot, validation_split=.1, epochs=2500, verbose=0) y_pred = model.predict(X_test) y_pred_onehot = y_pred.argmax(axis=1) print(classification_report(y_test, y_pred_onehot)) v1 = history.history['categorical_accuracy'] v2 = history.history['loss'] v3 = history.history['val_categorical_accuracy'] v4 = history.history['val_loss'] model.load_weights('weights.hdf5') history = model.fit(X_train, y_train_onehot, epochs=150, verbose=0) v5 = history.history['categorical_accuracy'] v6 = history.history['loss'] y_pred = model.predict(X_test) y_pred_onehot = y_pred.argmax(axis=1) print(classification_report(y_test, y_pred_onehot)) return v1, v2, v3, v4, v5, v6

Preparando os dados

Usaremos o pacote scikit-learn para carregar a base de dados Iris. Vamos geraruma versão da base com o menor número possível de componentes principais (1).

iris_X, iris_y = load_iris(return_X_y=True) iris_X_tiny = apply_PCA(iris_X, 1)

Tamanho da base de dados antes: (150, 4) Tamanho da base de dados depois: (150, 1) Porcentagem da variância explicada por cada coluna: [0.92461872] Variância acumulada nas colunas remanescentes: 0.9246187232017271

Regularização

Em termos matemáticos, o overfitting ocorre quando os pesos de uma rede neural assumem magnitudes (positivas ou negativas) muito altas, permitindo que as funções que essas redes neurais modelam se tornem demasiadamente complexas para a função de erro que tentam minimizar.

Na imagem abaixo, ambas as funções azul e verde se ajustam aos dados vermelhos, porém a função azul assumiu uma forma desnecessariamente complexa. Caso um novo ponto vermelho seja adicionado, é possível que a função verde esteja mais próxima dele do que a função azul, indicando que a função verde generaliza melhor os dados observados, mesmo sendo mais simples.

A diferença entre as duas funções está no fato da função verde ter sido regularizada. Em sua forma mais simples, a regularização é feita somando-se os pesos da rede neural à função de erro que a rede tenta minimizar. Imaginando que $L(\theta)$ é a função de erro que usamos até agora, $R(\theta)$ é o novo termo de regularização que incluímos.

\[J(\theta)=L(\theta) + \lambda R(\theta)\]

Uma regularização comumente usada é a L2, na qual somamos o quadrado dos pesos da rede neural à função de erro. Isso tem o efeito de punir demasiadamente pesos muito grandes, ignorando pesos pequenos. O termo $\lambda$ é um coeficiente que indica o quão relevante o termo de regularização é na fórmula total do erro. Ele será utilizado na declaração da rede neural que faremos em Keras.

\[J(\theta)=L(\theta) + \lambda \sum_w^{\theta}w^2\]

Com a regularização, a rede neural deve alcançar o menor erro possível com os menores pesos possíveis, evitando o overfitting.

As funções abaixo vão criar as rede neurais.

A primeira foi utilizada nos tutoriais anterior e possui 4 entradas, 4 camadas e 213 pesos treináveis.

A segunda é igual à primeira, porém com a regularização configurada nas camadas.

A última rede possui aproximadamente 10% do tamanho da primeira, 1 entrada, 3 camadas e 23 pesos treináveis.

def create_large_model(input_dim): model = Sequential() model.add(Dense(10, activation='tanh', input_dim=input_dim)) model.add(Dense(8,activation='tanh')) model.add(Dense(6,activation='tanh')) model.add(Dense(3,activation='softmax')) model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy']) return model def create_regularized_model(input_dim): model = Sequential() model.add(Dense(10, activation='tanh', kernel_regularizer='l2', bias_regularizer='l2', input_dim=input_dim)) model.add(Dense(8,activation='tanh', kernel_regularizer='l2', bias_regularizer='l2')) model.add(Dense(6,activation='tanh', kernel_regularizer='l2', bias_regularizer='l2')) model.add(Dense(3,activation='softmax', kernel_regularizer='l2', bias_regularizer='l2')) model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy']) return model def create_tiny_model(input_dim): model = Sequential() model.add(Dense(3, activation='tanh', input_dim=input_dim)) model.add(Dense(2,activation='tanh')) model.add(Dense(3,activation='softmax')) model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy']) return model large_net = create_large_model(iris_X.shape[1]) reg_net = create_regularized_model(iris_X.shape[1]) tiny_net = create_tiny_model(iris_X_tiny.shape[1]) large_net.summary() tiny_net.summary()

Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 10) 50 _________________________________________________________________ dense_1 (Dense) (None, 8) 88 _________________________________________________________________ dense_2 (Dense) (None, 6) 54 _________________________________________________________________ dense_3 (Dense) (None, 3) 21 ================================================================= Total params: 213 Trainable params: 213 Non-trainable params: 0 _________________________________________________________________ Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_8 (Dense) (None, 3) 6 _________________________________________________________________ dense_9 (Dense) (None, 2) 8 _________________________________________________________________ dense_10 (Dense) (None, 3) 9 ================================================================= Total params: 23 Trainable params: 23 Non-trainable params: 0 _________________________________________________________________

Vamos utilizar nossa função evaluate_model() para avaliar as redes. As redes maiorws serão treinadas na base original e a rede menor, na 1ª componente principal dessa base.

acc1_large, l1_large, acc_v_large, lv_large, acc2_large, l2_large = evaluate_model(large_net, iris_X, iris_y) acc1_reg, l1_reg, acc_v_reg, lv_reg, acc2_reg, l2_reg = evaluate_model(reg_net, iris_X, iris_y) acc1_tiny, l1_tiny, acc_v_tiny, lv_tiny, acc2_tiny, l2_tiny = evaluate_model(tiny_net, iris_X_tiny, iris_y)

precision recall f1-score support 0 1.00 1.00 1.00 13 1 0.75 1.00 0.86 6 2 1.00 0.82 0.90 11 accuracy 0.93 30 macro avg 0.92 0.94 0.92 30 weighted avg 0.95 0.93 0.93 30 precision recall f1-score support 0 1.00 1.00 1.00 13 1 1.00 0.83 0.91 6 2 0.92 1.00 0.96 11 accuracy 0.97 30 macro avg 0.97 0.94 0.96 30 weighted avg 0.97 0.97 0.97 30 precision recall f1-score support 0 1.00 1.00 1.00 13 1 1.00 0.83 0.91 6 2 0.92 1.00 0.96 11 accuracy 0.97 30 macro avg 0.97 0.94 0.96 30 weighted avg 0.97 0.97 0.97 30 precision recall f1-score support 0 1.00 1.00 1.00 13 1 1.00 0.83 0.91 6 2 0.92 1.00 0.96 11 accuracy 0.97 30 macro avg 0.97 0.94 0.96 30 weighted avg 0.97 0.97 0.97 30 precision recall f1-score support 0 1.00 1.00 1.00 13 1 0.67 1.00 0.80 6 2 1.00 0.73 0.84 11 accuracy 0.90 30 macro avg 0.89 0.91 0.88 30 weighted avg 0.93 0.90 0.90 30 precision recall f1-score support 0 1.00 1.00 1.00 13 1 0.83 0.83 0.83 6 2 0.91 0.91 0.91 11 accuracy 0.93 30 macro avg 0.91 0.91 0.91 30 weighted avg 0.93 0.93 0.93 30

São exibidos os valores para cada rede treinada na base de treinamento por 2500 épocas e depois por 150 eṕocas. Espera-se que o desempenho das redes neurais que sofram de overfitting seja inferior nos valores impresso primeiro.

Os valores podem mudar aleatoriamente, porém é possível perceber através dos valores de precisão, recall, F1-score e pela matriz de confusão que o desempenho de ambas as redes é comparável e, muitas vezes, idêntico.

Vamos gerar gráficos das 6 medidas coletadas dos dois modelos.

fig, axes = plt.subplots(3,2, False,figsize=(20, 15), squeeze=True) axes[0][0].plot(acc1_large) axes[0][0].plot(acc1_reg) axes[0][0].plot(acc1_tiny) axes[0][0].set_title('Acurácia Treino') axes[0][1].plot(l1_large) axes[0][1].plot(l1_reg) axes[0][1].plot(l1_tiny) axes[0][1].set_title('Erro Treino') axes[1][0].plot(acc_v_large) axes[1][0].plot(acc_v_reg) axes[1][0].plot(acc_v_tiny) axes[1][0].set_title('Acurácia Val.') axes[1][1].plot(lv_large) axes[1][1].plot(lv_reg) axes[1][1].plot(lv_tiny) axes[1][1].set_title('Erro Val.') axes[2][0].plot(acc2_large) axes[2][0].plot(acc2_reg) axes[2][0].plot(acc2_tiny) axes[2][0].set_title('Acurácia total') axes[2][1].plot(l2_large) axes[2][1].plot(l2_reg) axes[2][1].plot(l2_tiny) axes[2][1].set_title('Erro total') legenda = ['Grande', 'Regularizada', 'Pequena'] xlabel = 'Épocas' axes[0][0].legend(legenda) axes[1][0].legend(legenda) axes[2][0].legend(legenda) axes[0][1].legend(legenda) axes[1][1].legend(legenda) axes[2][1].legend(legenda) axes[0][0].set_xlabel(xlabel) axes[1][0].set_xlabel(xlabel) axes[2][0].set_xlabel(xlabel) axes[0][1].set_xlabel(xlabel) axes[1][1].set_xlabel(xlabel) axes[2][1].set_xlabel(xlabel) plt.show()

Ambas as redes foram treinadas por 2500 épocas e avaliadas no conjunto de validação para constatar a presença de overfitting (4 primeiros gráficos). Depois, foram treinadas novamente por 150 épocas na totalidade dos dados de treinamento (2 últimos gráficos).

O gráfico de erro no conjunto de treinamento (topo, direita) nos mostra que a rede menor leva mais épocas para aprender. Seu erro diminui lentamente. A rede regularizada não consegue diminuir seu erro tanto quanto as redes não regularizadas devido à restrição em seus pesos. Isso pode parecer ruim, mas veremos um efeito positivo da regularização a seguir.

O gráfico de erro no conjunto de validação (centro, direita) demonstra que o erro da rede grande não-regularizada começa a crescer linearmente após aproximadamente 200 épocas. Um período de treinamento muito extenso apenas prejudica esta rede. A rede menor, apesar de ser menos afetada pelo overfitting, ainda demonstra aumenta do erro após muitas épocas de treinamento. Porém, este aumento é mais lento que na rede neural grande. Por último, a rede grande regularizada não demonstra aumento do erro no conjunto de validação, independente da quantidade de épocas de treinamento.

Essa é uma característica interessante no mundo real, quando não sabemos exatamente a topologia exata da rede neural que solucionará nosso problema em questão, ou por quantas épocas de treinamento é necessário treinar o modelo.

Conclusão

Este notebook demonstrou que uma rede neural consideravelmente menor que a utilizada até o momento é capaz de realizar a classificação das flores na base de dados Iris, utilizando 1/4 dos valores de entrada, 3/4 do número de camadas e aproximadamente 1/10 dos pesos treináveis.

Classificação da base de dados Iris utilizando redes neurais e PCA

2020-08-16T00:00:00+00:00

Este notebook exemplifica a redução de dimensionalidades de uma base de dados e a posterior classificação de seus itens utilizando uma rede neural. A redução de dimensionalidades é executada utilizando análise de componentes principais, disponível no pacote scikit-learn e a rede neural é um perceptron multi-camadas, construído utilizando Keras.

Este notebook é uma continuação de um outro no qual a base de dados utilizada (Iris) e o método de treinamento e avaliação da rede neural já foram descritos. O notebook anterior está disponível aqui.

Importando pacotes e criando funções

Vamos utilizar scikit-learn e Keras. Também vamos criar algumas funções para simplificar o treinamento e avaliar de uma rede neural, que já foram descritos no notebook anterior.

from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn.metrics import classification_report, confusion_matrix from sklearn.decomposition import PCA import matplotlib.pyplot as plt import tensorflow.keras as keras import numpy as np from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense def create_model(input_dim): model = Sequential() model.add(Dense(10, activation='tanh', input_dim=input_dim)) model.add(Dense(8,activation='tanh')) model.add(Dense(6,activation='tanh')) model.add(Dense(3,activation='softmax')) model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy']) return model def plot_metrics(history): fig, axes = plt.subplots(2,1, True,figsize=(8,12)) # "Accuracy" axes[0].plot(history.history['categorical_accuracy']) axes[0].set_title('Acurácia') # "Loss" axes[1].plot(history.history['loss']) axes[1].set_title('Erro') if 'val_loss' in history.history.keys(): axes[0].plot(history.history['val_categorical_accuracy']) axes[1].plot(history.history['val_loss']) axes[0].legend(['Treino', 'Validação']) axes[1].legend(['Treino', 'Validação']) plt.xlabel('Épocas') plt.show() def do_everything(X, y): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=123) y_train_onehot = keras.utils.to_categorical(y_train, num_classes = 3) y_test_onehot = keras.utils.to_categorical(y_test, num_classes = 3) model=create_model(X.shape[1]) print(model.summary()) history = model.fit(X_train, y_train_onehot, validation_split=.1, epochs=500, verbose=0) plot_metrics(history) model=create_model(X.shape[1]) history = model.fit(X_train, y_train_onehot, epochs=150, verbose=0) plot_metrics(history) y_pred = model.predict(X_test) y_pred_onehot = y_pred.argmax(axis=1) print(classification_report(y_test, y_pred_onehot)) print(confusion_matrix(y_test, y_pred_onehot))

Análise de componentes principais

A análise de componentes principais (PCA, do inglês principal component analysis) é uma técnica matemática que detecta correlações nas colunas de uma matriz $X$ e realiza uma transformação linear em $X$ de forma a reduzir tais correlações.

Em linhas gerais, dados altamente correlacionadas são pouco informativos, inclusive para o treinamento de um classificador. Considere uma base de dados hipotética de cidades que possui as seguintes informações para cada cidade:

distância do Equador em quilômetros

distância do Equador em léguas

temperatura em °C

temperatura em °F

É possível perceber que os dados 1 e 2 são redundantes, por armazenarem a mesma informação, apenas em escalas diferentes, assim como os dados 3 e 4. Não só isso, como a temperatura de uma cidade é altamente correlacionada com sua distância do Equador. Uma técnica como o PCA encontraria as correlações nesta matriz de 4 colunas e transformaria as colunas de forma a reduzir (ou zerar) tais correlações.

Esta transformação pode ter como efeito colateral a concentração da informação das colunas na matriz original em menos colunas na matriz transformada. Por consequência, seria possível remover as colunas menos informativas, reduzindo a complexidade dos dados, mas retendo a informação necessária para se treinar um classificador, por exemplo.

A redução de dimensionalidade de dados proveniente do PCA também é útil para fins de visualização de dados multidimensionais.

Preparando os dados

Usaremos o pacote scikit-learn para carregar a base de dados Iris. Repare que ela possui 150 flores e 4 características por flor. No notebook anterior, utilizamos todas as 4 características para treinar nossa rede neural. Neste notebook, tentaremos realizar a mesma classificação com um menor número de características.

iris_X, iris_y = load_iris(return_X_y=True) print('Qtd. itens no conj. de treinamento:', len(iris_X)) print('Classes:', iris_y) print('\nExemplos das medidas de 5 flores:') print(iris_X[:5])

Qtd. itens no conj. de treinamento: 150 Classes: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] Exemplos das medidas de 5 flores: [[5.1 3.5 1.4 0.2] [4.9 3. 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5. 3.6 1.4 0.2]]

Aplicando o PCA na base de dados Iris

O PCA está implementado no scikit-learn e pode ser aplicado como no exemplo abaixo. Vamos criar uma função que aplica o PCA a um conjunto de dados, exibe informações de sua aplicação e retorna o conjuntos de dados transformados.

def apply_PCA(X, n_components): pca = PCA(n_components=n_components) new_X = pca.fit_transform(iris_X) print('Tamanho da base de dados antes: ', X.shape) print('Tamanho da base de dados depois: ', new_X.shape) print('Porcentagem da variância explicada por cada coluna:', pca.explained_variance_ratio_) print('Variância acumulada nas colunas remanescentes:', sum(pca.explained_variance_ratio_)) return new_X

Vamos aplicar o PCA nas medidas de nossas flores e observar a remoção da correlação das colunas e concentração das informações em menos colunas.

iris_X_4cp = apply_PCA(iris_X, 4)

Tamanho da base de dados antes: (150, 4) Tamanho da base de dados depois: (150, 4) Porcentagem da variância explicada por cada coluna: [0.92461872 0.05306648 0.01710261 0.00521218] Variância acumulada nas colunas remanescentes: 1.0

Perceba como nossa base dados ainda possui o mesmo número de características (4). Porém, 92% da informação das colunas originais (representada pela variância) está agora concentrada na primeira coluna. Isso é um forte indicativo de que podemos classificar nossos dados utilizando menos características.

O PCA costuma ordenar as colunas da nova base de dados em ordem decrescente de quantidade de informação, então, para garantir que estamos utilizando as colunas mais informativas de nossa base, basta ignorarmos as últimas colunas da matriz.

Aplicando o PCA para 3 componentes

Agora vamos aplicar o PCA e remover a coluna menos informativa. Isso tem o mesmo efeito de utilizar as 3 primeiras colunas da matriz gerada anteriormente, mas vamos aplicar o método de novo de qualquer forma.

Repare como as colunas remanescentes ainda mantêm 99,4% da informação da base de dados original.

iris_X_3cp = apply_PCA(iris_X, 3)

Tamanho da base de dados antes: (150, 4) Tamanho da base de dados depois: (150, 3) Porcentagem da variância explicada por cada coluna: [0.92461872 0.05306648 0.01710261] Variância acumulada nas colunas remanescentes: 0.9947878161267247

Vamos utilizar nossas funções de treinamento e avaliação de redes neurais para treinar um modelo nos dados reduzidos.

do_everything(iris_X_3cp, iris_y)

Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 10) 40 _________________________________________________________________ dense_1 (Dense) (None, 8) 88 _________________________________________________________________ dense_2 (Dense) (None, 6) 54 _________________________________________________________________ dense_3 (Dense) (None, 3) 21 ================================================================= Total params: 203 Trainable params: 203 Non-trainable params: 0 _________________________________________________________________ None

precision recall f1-score support 0 1.00 1.00 1.00 13 1 1.00 1.00 1.00 6 2 1.00 1.00 1.00 11 accuracy 1.00 30 macro avg 1.00 1.00 1.00 30 weighted avg 1.00 1.00 1.00 30 [[13 0 0] [ 0 6 0] [ 0 0 11]]

É possível observar que a rede neural é treinada com sucesso utilizando apenas as 3 características mais informativas provenientes do PCA. A rede ainda alcança altos valores de precisão, revocação e F1-score. Também é possível perceber que, assim como no notebook anterior, o overfitting ainda ocorre caso treinemos a rede neural por épocas demais. Isso, o PCA não resolve.

Aplicando o PCA para 2 componentes

Vamos reduzir nossa base de dados mais ainda, mantendo apenas 2 colunas.

iris_X_2cp = apply_PCA(iris_X, 2)

Tamanho da base de dados antes: (150, 4) Tamanho da base de dados depois: (150, 2) Porcentagem da variância explicada por cada coluna: [0.92461872 0.05306648] Variância acumulada nas colunas remanescentes: 0.977685206318795

Mesmo mantendo metade das colunas originais, ainda retemos 97,7% da informação de nossas flores. Vamos treinar nosso classificador usando esta nova base.

do_everything(iris_X_2cp, iris_y)

Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_8 (Dense) (None, 10) 30 _________________________________________________________________ dense_9 (Dense) (None, 8) 88 _________________________________________________________________ dense_10 (Dense) (None, 6) 54 _________________________________________________________________ dense_11 (Dense) (None, 3) 21 ================================================================= Total params: 193 Trainable params: 193 Non-trainable params: 0 _________________________________________________________________ None

precision recall f1-score support 0 1.00 1.00 1.00 13 1 0.83 0.83 0.83 6 2 0.91 0.91 0.91 11 accuracy 0.93 30 macro avg 0.91 0.91 0.91 30 weighted avg 0.93 0.93 0.93 30 [[13 0 0] [ 0 5 1] [ 0 1 10]]

Novamente, a rede neural ainda classifica os dados com sucesso, porém pode começar a demonstrar perda de desempenho, como demonstrado por valores de precisão, revocação e F1-score menores ou uma matriz de confusão com valores fora da diagonal principal.

Outra informação muito interessante de ser reparada é a redução no número de pesos treináveis da rede, exibido no resumo da rede neural no início da célula anterior (Trainable params). A rede neural treinada para classificar os dados originais (com 4 características) precisa aprender mais pesos (213) do que uma rede que classifica dados com menos características (2 características, 193 pesos). Isso significa que, ao utilizar o PCA, uma rede neural menor que processa menos dados pode potencialmente ter o mesmo desempenho que uma rede maior que processa mais dados.

E, finalmente…

Vamos treinar a rede para classificar nossa flores utilizando apenas 1 característica. A base de dados mantém 92,4% da informação, a rede neural possui 10 pesos a menos e ainda alcança um desempenho satisfatório.

iris_X_1cp = apply_PCA(iris_X, 1) do_everything(iris_X_1cp, iris_y)

Tamanho da base de dados antes: (150, 4) Tamanho da base de dados depois: (150, 1) Porcentagem da variância explicada por cada coluna: [0.92461872] Variância acumulada nas colunas remanescentes: 0.9246187232017271 Model: "sequential_4" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_16 (Dense) (None, 10) 20 _________________________________________________________________ dense_17 (Dense) (None, 8) 88 _________________________________________________________________ dense_18 (Dense) (None, 6) 54 _________________________________________________________________ dense_19 (Dense) (None, 3) 21 ================================================================= Total params: 183 Trainable params: 183 Non-trainable params: 0 _________________________________________________________________ None

precision recall f1-score support 0 1.00 1.00 1.00 13 1 0.67 1.00 0.80 6 2 1.00 0.73 0.84 11 accuracy 0.90 30 macro avg 0.89 0.91 0.88 30 weighted avg 0.93 0.90 0.90 30 [[13 0 0] [ 0 6 0] [ 0 3 8]]

Conclusão

Este notebook explicou o PCA, uma técnica que transforma uma base de dados, representada por uma matriz, em uma nova matriz cujas colunas possuem menos correlação e maior concentração de informação em um menor número ce colunas.

O PCA foi exemplificado na base de dados Iris e redes neurais foram treinadas para classificar as flores utilizando cada vez menos características.

Classificação da base de dados Iris utilizando um perceptron multi-camadas em Keras

2020-08-14T00:00:00+00:00

Este notebook exemplifica o treinamento de uma perceptron multi-camadas na classificação da base de dados Iris. O notebook utiliza o pacote scikit-learn para carregamento e separação da base de dados em treinamento e teste, o pacote keras para criação e treinamento da rede neural e o matplotlib para a geração de gráficos.

from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn.metrics import classification_report, confusion_matrix import matplotlib.pyplot as plt import tensorflow.keras as keras from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense

Sobre a base de dados

A base de dados Iris contém 4 medidas de 150 flores individuais de 3 espécies distintas (50 flores de cada espécie). Ela foi criada em 1936 por Ronald Fisher [link].

As medidas são a altura e larguras das pétalas e sépalas de cada flor e as espécies de flores são iris setosa, iris versicolor e iris virginica, apresentadas nas imagens abaixo.

Visualizando a Iris

Apenas momentaneamente, utilizaremos o pacote seaborn para exibir as características das flores e suas espécies em um gráfico. Os pontos significam flores individuais, as cores simbolizam as espécies às quais cada flor pertence e linhas/colunas exibem pares de características plotados uns contra os outros.

import seaborn as sns iris = sns.load_dataset("iris") g = sns.pairplot(iris, hue="species")

Para a tarefa de classificação, a rede neural deve conseguir diferenciar entre as 3 espécies de flores, dadas suas medidas. É possível perceber visualmente que esta é uma tarefa fácil.

Por exemplo, todas as flores da espécie iris setosa (em azul) possuem altura e largura de pétala (petal_length e petal_width) menores do que qualquer flor de qualquer outra espécie. Uma rede neural que conseguisse aprender que toda flor da espécie iris setosa possui petal_length < 2 ou petal_width < 0.8 conseguiria classificar esta classe com 100% de precisão.

Tudo o que uma rede neural faz durante o aprendizado é descobrir pontos de separação entre as classes, como o exemplificado acima, utilizando características (ou combinações de características) presentes na base de dados.

Preparando os dados

Usaremos o pacote scikit-learn para carregar a base de dados e visualizar suas informações. As medidas das flores e suas respectivas classes são carregadas em variáveis distintas.

Repare que a base possui 150 flores, que as classes são representadas pelos números inteiros 0, 1 e 2 e as características das flores são armazenadas em uma única matriz de 150 linhas (as flores) por 4 colunas (as medidas).

iris_X, iris_y = load_iris(return_X_y=True) print('Qtd. itens no conj. de treinamento:', len(iris_X)) print('Classes:', iris_y) print('\nExemplos de medida de 5 flores:') print(iris_X[:5])

Qtd. itens no conj. de treinamento: 150 Classes: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] Exemplos de medida de 5 flores: [[5.1 3.5 1.4 0.2] [4.9 3. 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5. 3.6 1.4 0.2]]

Vamos separar nossa base de dados em conjuntos de treinamento e de teste. O conjunto de treinamento será utilizado para treinar o modelo criado e o conjunto de testes, utilizado na avaliação do modelo após o treinamento.

O desempenho da rede neural no conjunto de treinamento é um indicativo da capacidade de aprendizado do modelo, enquanto o desempenho no conjunto de testes indica a capacidade de generalização da rede, ou seja, sua capacidade de classificar novos dados (nunca antes vistos) corretamente.

X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size = 0.2, random_state=123) print('Qtd. itens no conj. de treinamento:', len(y_train)) print('Qtd. itens no conj. de teste:', len(y_test))

Qtd. itens no conj. de treinamento: 120 Qtd. itens no conj. de teste: 30

One-hot encoding

A forma como as classes são representadas atualmente (0, 1 e 2) não é a melhor para que a rede aprenda a tarefa de classificação. Isso porque, caso a rede tenha apenas uma saída (um número entre 0 e 2), ao aprender a realizar a saída 2, a rede pode aos poucos “desaprender” a saída 0, por serem valores de classes opostos.

Uma forma de mitigar este problema é fazer com que o aprendizado de cada classe seja independente do aprendizado de todas as outras. Se a rede tiver 3 saídas e aprendar a maximizar o valor de uma saída independente das outras, podemos escolher o maior valor dos três como a classificação correta.

Para isso, modificaremos a representação de nossas classes utilizando utilizando one-hot encoding, uma espécie de categorização de dados que torna o aprendizado de classe linearmente independente para a rede neural.

Perceba como os números 0, 1 e 2 se tornam as sequências 100, 010 e 001, respectivamente.

y_train_onehot = keras.utils.to_categorical(y_train, num_classes = 3) y_test_onehot = keras.utils.to_categorical(y_test, num_classes = 3) # imprime os 5 primeiros valores para exemplificar print("### Antes ###") print(y_train[:5]) print(y_test[:5]) print("\n### Depois ###") print(y_train_onehot[:5]) print(y_test_onehot[:5])

### Antes ### [2 2 0 0 1] [1 2 2 1 0] ### Depois ### [[0. 0. 1.] [0. 0. 1.] [1. 0. 0.] [1. 0. 0.] [0. 1. 0.]] [[0. 1. 0.] [0. 0. 1.] [0. 0. 1.] [0. 1. 0.] [1. 0. 0.]]

Declarando a topologia da rede neural

Neste notebook, uma rede neural será utilizada para descobrir a qual das 3 espécies cada flor pertence, dadas suas 4 medidas. Em outras palavras, a rede neural será um classificador treinado para prever em qual classe (de 3) um vetor de 4 valores pertence.

Vamos construir uma rede neural utilizando Keras. Aqui, configuramos camadas, neurônios por camada, funções de ativação, otimizador e função de erro.

Após construir a rede neural, utilizamos uma função utilitária do Keras para exibir a topologia do modelo. Neste caso, a rede receberá como entrada vetores de tamanho 4 (correspondente à quantidade de medidas de nossas pétalas e utilizará 4 camadas densas para realizar a classificação.

Repare que o tamanho da entrada de uma camada equivale ao tamanho da saída da camada anterior, permitindo que camadas sejam encadeadas para aumentar a expressividade da rede.

A última camada da rede tem como saída vetores de tamanho 3, correspondentes à quantidade de espécies de flores que desejamos classificar.

Vamos criar uma função para conseguir recriar o mesmo modelo no futuro.

def create_model(): model = Sequential() model.add(Dense(10, activation='tanh', input_dim=4)) model.add(Dense(8,activation='tanh')) model.add(Dense(6,activation='tanh')) model.add(Dense(3,activation='softmax')) model.compile('adam','categorical_crossentropy', metrics=['categorical_accuracy']) return model model=create_model() keras.utils.plot_model( model, show_shapes=True, show_layer_names=True, rankdir="LR", expand_nested=True, dpi=96 )

Como uma biblioteca de redes neurais de alto nível, o Keras abstrai algumas complexidades da configuração do método de treinamento. Existem diferentes opções de otimizadores e funções de erro que podem ser utilizadas com a biblioteca, que podem ter diferentes efeitos no tempo de treinamento da rede. Opções comuns de otimizadores incluem sgd , rmsprop e adam . Já a função de erro categorical_crossentropy é a mais apropriada para tarefas de classificação multi-classes, quando as classes são representadas em one-hot encoding.

Treinando a rede

Vamos treinar nossa rede neural. Ela aprenderá a realizar a classificação dos dados de treinamento ( X_train ) para as classes categorizadas ( y_train_onehot ) por um número de épocas pré-determinado.

Durante o treinamento, o Keras armazena os valores da função de erro e outras medidas e permite acessá-los em uma variável, aqui chamada history .

É possível executar a célula abaixo repetidas vezes para que a mesma rede seja treinada por mais épocas.

Também é possível separar uma parcela do conjunto de treinamento para validação, a avaliação em tempo real da rede neural em um conjunto de dados que não é utilizado para treinamento.

history = model.fit(X_train, y_train_onehot, validation_split=.1, epochs=500, verbose=2)

Epoch 1/500 4/4 - 0s - loss: 1.3739 - categorical_accuracy: 0.0556 - val_loss: 1.5228 - val_categorical_accuracy: 0.0000e+00 Epoch 2/500 4/4 - 0s - loss: 1.3304 - categorical_accuracy: 0.0093 - val_loss: 1.4575 - val_categorical_accuracy: 0.0000e+00 [...] Epoch 499/500 4/4 - 0s - loss: 0.0104 - categorical_accuracy: 1.0000 - val_loss: 0.3896 - val_categorical_accuracy: 0.9167 Epoch 500/500 4/4 - 0s - loss: 0.0099 - categorical_accuracy: 1.0000 - val_loss: 0.3817 - val_categorical_accuracy: 0.9167

Vamos criar um gráfico com o erro e a acurácia da rede ao longo das épocas de treinamento. Caso uma parcela do conjunto de treinamento tenha sido utilizada para validação, novas linhas no gráfico exibirão o desempenho da rede neste conjunto separado de dados.

def plot_metrics(history): fig, axes = plt.subplots(2,1, True,figsize=(8,12)) # "Accuracy" axes[0].plot(history.history['categorical_accuracy']) axes[0].set_title('Acurácia') # "Loss" axes[1].plot(history.history['loss']) axes[1].set_title('Erro') if 'val_loss' in history.history.keys(): axes[0].plot(history.history['val_categorical_accuracy']) axes[1].plot(history.history['val_loss']) axes[0].legend(['Treino', 'Validação']) axes[1].legend(['Treino', 'Validação']) plt.xlabel('Épocas') plt.show() plot_metrics(history)

Overfitting

O treino da rede neural por muitas épocas no conjunto de treinamento pode fazer com que o modelo se especialize neste conjunto de dados, um fenômeno chamado overfitting.

Essa especialização prejudica o desempenho da rede neural, uma vez que, ao se especializar, ela falha em generalizar sua capacidade de classificação para dados não observados.

O overfitting pode ser visualizado quando o erro de classificação no conjunto de treinamento continua a diminuir, ao mesmo tempo em que o erro no conjunto de validação começa a aumentar. Este ponto pode ser considerado um bom momento para encerrar o treinamento da rede neural.

Quando uma rede neural possui pesos demais para a tarefa que irá realizar, ela pode sofrer de overfitting se treinada por muitas épocas, ajustando-se demais aos dados de treinamento.

Algumas formas de evitar o overfitting incluem:

utilizar uma rede neural menor;

utilizar regularização (que não será abordada aqui);

parar o treinamento antes que o overfitting ocorra.

Considerando as observações no primeiro treinamento da rede neural, vamos treiná-la novamente, dessa vez encerrando o treinamento mais cedo e evitando o overfitting. Adicionalmente, vamos utilizar o conjunto de treinamento em sua totalidade nesta rodada de treinamento, sem validação.

model = create_model() history = model.fit(X_train, y_train_onehot, epochs=150, verbose=0) plot_metrics(history)

Realizando inferências com o modelo treinado

Vamos utilizar a rede treinada para classificar os dados de teste que separamos no início e visualizar as classes preditas pelo modelo. A rede neural classifica todo o conjunto de testes de uma vez.

y_pred = model.predict(X_test) print(y_pred)

[[0.01593729 0.60561794 0.37844476] [0.00296563 0.07939946 0.9176349 ] [0.00180538 0.04658297 0.95161164] [0.03136854 0.9468003 0.0218312 ] [0.962343 0.03639156 0.00126548] [0.01097725 0.33443207 0.65459067] [0.02357773 0.9411847 0.03523754] [0.96158886 0.0371284 0.00128271] [0.95949054 0.0391942 0.00131528] [0.02218457 0.8818209 0.09599455] [0.00703657 0.19524635 0.7977171 ] [0.9629769 0.03576834 0.00125479] [0.0274128 0.92334604 0.04924115] [0.00359261 0.09982508 0.8965823 ] [0.0021769 0.05739778 0.94042534] [0.00176867 0.04559813 0.9526332 ] [0.9619265 0.0368052 0.00126826] [0.9620646 0.03666212 0.00127322] [0.02267583 0.88068783 0.0966363 ] [0.9593102 0.0393738 0.00131598] [0.9614102 0.03731174 0.00127805] [0.01361755 0.4841447 0.50223774] [0.9575784 0.04107764 0.0013439 ] [0.00956561 0.2892796 0.70115477] [0.96190244 0.03682484 0.00127265] [0.9592192 0.03946826 0.00131252] [0.96002924 0.03866893 0.0013018 ] [0.00265293 0.06909027 0.92825687] [0.00196587 0.05123645 0.9467976 ] [0.96210897 0.03662174 0.00126931]]

A saída da rede é um conjunto de 3 valores para cada vetor de entrada. Cada um dos 3 valores indica a relevância da respectiva classe para o vetor de entrada. Quanto maior essa relevância, maiores as chances do dado de entrada pertencer àquela classe.

A função de ativação na última camada da rede neural dita a natureza dos valores de saída:

tanh : tangente hiperbólica (intervalo $[-1; 1]$).

sigmoid : sigmoide (intervalo $[0; 1]$).

softmax : softmax (intervalo $[0; 1]$, a soma dos valores preditos para cada dado de entrada deve ser igual a 1).

Para transformar os valores exibidos acima nas classes preditas, escolhemos a maior saída no eixo das classes.

Compare os valores verdadeiros com os valores preditos pela rede.

y_pred_onehot = y_pred.argmax(axis=1) print('Classes reais:\t\t', y_test) print('Classes preditas:\t', y_pred_onehot)

Classes reais: [1 2 2 1 0 2 1 0 0 1 2 0 1 2 2 2 0 0 1 0 0 2 0 2 0 0 0 2 2 0] Classes preditas: [1 2 2 1 0 2 1 0 0 1 2 0 1 2 2 2 0 0 1 0 0 2 0 2 0 0 0 2 2 0]

Avaliando a rede treinada

Utilizamos o scikit-learn para produzir relatórios de classificação utilizando os valores reais e os preditos.

As medidas de desempenho da rede são precisão, recall (revocação) e F1 [link].

A precisão indica a quantidade de classificações positivas que a rede realizou que são realmente positivas. A revocação indica a fração de classificações positivas que a rede realizou, do total de classificações positivas.

Em nosso contexto, podemos imaginar que essas medidas respondem às seguintes perguntas:

Precisão: De todas as flores que meu classificador indicou como sendo da classe iris setosa, quais realmente são dessa classe?

Revocação: De todas as flores da espécie iris setosa presentes na minha base de dados, quantas o meu classificador conseguiu identificar?

print(classification_report(y_test, y_pred_onehot))

precision recall f1-score support 0 1.00 1.00 1.00 13 1 1.00 1.00 1.00 6 2 1.00 1.00 1.00 11 accuracy 1.00 30 macro avg 1.00 1.00 1.00 30 weighted avg 1.00 1.00 1.00 30

Detalhando precisão e revocação

A precisão e a revocação são valores relevantes sobre um modelo, porém suas definições podem ser um pouco confusas. Uma forma um pouco mais informativa de se pensar nestas medidas é utilizando-as na análise da eficácia de um teste médico hipotético (que toma o lugar de nosso modelo classificador) para uma doença e quais questões cada uma das medidas responde:

Precisão: De todos os testes positivos realizados, quantos pacientes realmente possuem a doença?

Revocação: De todos os pacientes que realmente possuem a doença, qual a porcentagem que eu consegui identificar utilizando meu teste?

Caso o teste médico possua alta precisão, temos certeza que todos os pacientes que informarmos como portadores da doença, certamente terão a doença. Ou seja, o teste não estará errado nas classificações positivas que realiza.

Caso o teste possua alta revocação, temos certeza que todos os portadores da doença em nossa população foram identificados.

Porém, algumas situações anômalas podem surgir. Por exemplo, se há 1000 doentes em nossa amostra e nosso classificador realiza apenas 1 classificação positiva e o paciente está realmente doente, a precisão do modelo é de 100%, pois não fez nenhuma classificação positiva errada.

Igualmente, se há 1000 doentes numa população de 1 milhão de pessoas e nosso classificador realiza 1 milhão de classificações positivas, ele possuirá uma revocação de 100%, uma vez que detectou todos os doentes.

Devido a estas anomalias, é sempre preciso ponderar qual medida é mais importante para um classificador em uma determinada tarefa ou, na melhor das hipóteses, treinar um classificador que maximize ambas as medidas.

A medida F-score (ou F-measure ou F1), também exibida acima, é uma combinação da precisão e da revocação e pode auxiliar na avaliação do classificador.

Matriz de confusão

A matriz de confusão compara, em números absolutos, as classes reais e preditas. Valores na diagonal principal indicam classificações corretas e qualquer outro valor indica erros de classificação no conjunto de testes.

print(confusion_matrix(y_test, y_pred_onehot))

[[13 0 0] [ 0 6 0] [ 0 0 11]]

Conclusão

Este colab apresentou um exemplo clássico e básico de classificação de uma base de dados utilizando redes neurais. Os seguintes conceitos foram apresentados ao longo do colab:

apresentação da base de dados Iris, uma base muito famosa na área de estatística e aprendizado de máquina;

separação da base de dados entre conjuntos de treinamento, validação e testes;

criação de uma rede neural do tipo perceptron multi-camadas utilizando Keras;

treinamento da rede neural no conjunto de treinamento;

observação e descrição do fenômeno de overfitting no conjunto de validação;

re-treinamento da rede neural de forma a prevenir o overfitting;

realização de inferenças em dados nunca obervados antes pela rede neural (o conjunto de testes);

avaliação da rede neural utilizando precisão, revocação, F1-score e a matriz de confusão.