Welcome to my blog!
Jiamin's Blog
Implementing Custom Storage Formats in Apache Hive
Background
In certain business scenarios, downstream processing systems need to handle data files directly. Although Hive officially supports formats such as text, ORC, and Parquet, knowing how to develop a custom storage format lets you cover a wider range of business scenarios. Hive offers the ROW FORMAT SERDE mechanism for this purpose.
ROW FORMAT SERDE
The ROW FORMAT SERDE in Hive is a key data formatting concept: it defines how data stored in Hive tables is parsed and mapped to columns.
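To make the idea concrete, here is a minimal conceptual sketch in Python. It is not Hive's Java SerDe API; it only illustrates the two halves of a SerDe, a deserializer that turns a stored record into typed columns and a serializer that writes a row back out. The class name `PipeDelimitedSerDe`, the `|` delimiter, and the three-column schema are hypothetical and chosen purely for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical schema: (column name, type converter) pairs.
Schema = List[Tuple[str, Callable[[str], Any]]]


class PipeDelimitedSerDe:
    """Toy stand-in for a SerDe: maps raw text records to typed rows and back."""

    def __init__(self, schema: Schema, delimiter: str = "|") -> None:
        self.schema = schema
        self.delimiter = delimiter

    def deserialize(self, record: str) -> Dict[str, Any]:
        """Parse one stored record into a column-name -> value mapping."""
        fields = record.rstrip("\n").split(self.delimiter)
        if len(fields) != len(self.schema):
            raise ValueError(
                f"expected {len(self.schema)} fields, got {len(fields)}"
            )
        return {
            name: convert(raw)
            for (name, convert), raw in zip(self.schema, fields)
        }

    def serialize(self, row: Dict[str, Any]) -> str:
        """Write a row back out in the same on-disk layout."""
        return self.delimiter.join(str(row[name]) for name, _ in self.schema)


# Assumed example schema and record, for illustration only.
serde = PipeDelimitedSerDe([("id", int), ("name", str), ("score", float)])
row = serde.deserialize("7|alice|9.5")
print(row)
print(serde.serialize(row))
```

In Hive itself the same two responsibilities live in a Java class that the table's ROW FORMAT SERDE clause points to; the sketch above only mirrors the shape of that contract.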
Harnessing the Power of OpenAI's Latest Innovations
Introduction: Embracing the Future with OpenAI’s Updates
In the fast-moving field of artificial intelligence, staying up to date with the latest advancements is not just a matter of curiosity but a necessity for anyone looking to use AI in their projects. On November 6, 2023, OpenAI introduced a set of new features alongside a significant update to its Python SDK, now at version 1.0.0. In this post, we’ll dive into these updates and explore how they change the way we interact with AI.
Langchain LLM Streaming
LangChain can process the tokens generated by an LLM in real time through a callback mechanism.
    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

    # streaming=True makes the model emit tokens as they are generated;
    # the handler prints each token to stdout as it arrives.
    chat = ChatOpenAI(
        streaming=True,
        callbacks=[StreamingStdOutCallbackHandler()],
        temperature=0,
    )
    resp = chat([HumanMessage(content="Write me a song about sparkling water.")])

LangChain supports both synchronous and asynchronous IO for token output. These correspond to StreamingStdOutCallbackHandler and AsyncIteratorCallbackHandler, respectively.
StreamingStdOutCallbackHandler
First, let’s take a look at LangChain’s official implementation of StreamingStdOutCallbackHandler, which prints LLM-generated tokens to the terminal in real time.
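Before reading the official source, it may help to see the callback pattern in isolation. The following is a minimal stand-in sketch, not LangChain's implementation: `StdoutTokenHandler` and `fake_llm` are hypothetical names, and only the hook name `on_llm_new_token` is borrowed from LangChain's callback interface. The "model" here is just a list of strings, so the example runs without any API key.

```python
import sys
from typing import Iterable, List, TextIO


class StdoutTokenHandler:
    """Hypothetical stand-in for a streaming callback handler."""

    def __init__(self, stream: TextIO = sys.stdout) -> None:
        self.stream = stream

    def on_llm_new_token(self, token: str) -> None:
        # Write each token immediately and flush so it shows up in real time.
        self.stream.write(token)
        self.stream.flush()


def fake_llm(tokens: Iterable[str], callbacks: List[StdoutTokenHandler]) -> str:
    """Simulate a model that emits tokens one by one and notifies callbacks."""
    pieces = []
    for token in tokens:
        for callback in callbacks:
            callback.on_llm_new_token(token)
        pieces.append(token)
    # The full response is still returned once generation finishes.
    return "".join(pieces)


result = fake_llm(["Sparkling ", "water ", "fizzes."], [StdoutTokenHandler()])
print()  # end the streamed line
```

The point of the design is that the caller does not wait for the full response: each token reaches the handler as soon as it exists, while the complete text is still available at the end.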
Adding Real-time Domain Knowledge to LLM with LangChain Vector Database
When using ChatGPT, we often encounter responses like this:
"Training data is up until September 2021. Therefore, I may not have information on events or updates that occurred after that time. If you have any questions regarding post-September 2021 topics, I might not be able to provide the latest information."
To give the LLM real-time domain knowledge, we need to supply it with up-to-date information.
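One common way to do this is to retrieve relevant documents at question time and place them in the prompt. Here is a minimal sketch of that idea using only the Python standard library. It is an illustration under loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and `build_prompt` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question: str, documents: List[str], top_k: int = 1) -> str:
    """Pick the top_k most similar documents and prepend them as context."""
    query_vec = embed(question)
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True
    )
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical documents that are newer than the model's training data.
docs = [
    "Release notes: version 2.0 of the billing service adds invoice export.",
    "Office hours: the cafeteria opens at 8 am on weekdays.",
]
prompt = build_prompt("What does version 2.0 of the billing service add?", docs)
print(prompt)
```

The resulting prompt would then be sent to the LLM, which can answer from the supplied context instead of relying only on what it saw during training.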
Setting Up Free HTTPS Certificates for Nginx using Let's Encrypt
Introduction
Securing your website with HTTPS not only ensures data integrity but also boosts user trust and search engine rankings. With Let’s Encrypt, you can obtain free SSL/TLS certificates for your Nginx web server. In this guide, we’ll walk you through setting up a Let’s Encrypt certificate for your Nginx server on CentOS.
    sudo yum install epel-release
    sudo yum install certbot

Step 2: Installing the Certbot Nginx Plugin
To simplify the process of obtaining and installing certificates, Certbot offers a dedicated plugin for Nginx.
Vector Database: Weaviate
Weaviate is a vector database for storing and retrieving data objects.
By using vectors to index data objects, Weaviate can store and retrieve them based on their semantic properties. Weaviate can be used on its own (bring your own vectors) or together with modules that vectorize your data and extend the core functionality for you. Its design is aimed at fast queries and efficient operation.
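To illustrate what "bring your own vectors" means in practice, here is a tiny in-memory sketch. It is not the Weaviate client API: `TinyVectorStore`, `add`, and `near_vector` are hypothetical names, the three-dimensional vectors are made up, and a real vector database would use an approximate nearest-neighbor index rather than a full scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


class TinyVectorStore:
    """Hypothetical in-memory store; not the Weaviate client API."""

    def __init__(self) -> None:
        self._items: List[Tuple[Dict[str, str], Tuple[float, ...]]] = []

    def add(self, obj: Dict[str, str], vector: Sequence[float]) -> None:
        # "Bring your own vectors": the caller supplies the embedding.
        self._items.append((obj, tuple(vector)))

    def near_vector(
        self, query: Sequence[float], limit: int = 1
    ) -> List[Dict[str, str]]:
        """Return the `limit` objects closest to `query` by cosine similarity."""

        def similarity(vector: Sequence[float]) -> float:
            dot = sum(x * y for x, y in zip(query, vector))
            norm_q = math.sqrt(sum(x * x for x in query))
            norm_v = math.sqrt(sum(y * y for y in vector))
            return dot / (norm_q * norm_v) if norm_q and norm_v else 0.0

        # Full scan for clarity; real databases use an index instead.
        ranked = sorted(
            self._items, key=lambda item: similarity(item[1]), reverse=True
        )
        return [obj for obj, _ in ranked[:limit]]


store = TinyVectorStore()
store.add({"title": "Espresso basics"}, [0.9, 0.1, 0.0])
store.add({"title": "Trail running"}, [0.0, 0.2, 0.9])
hits = store.near_vector([0.8, 0.2, 0.1])
print(hits)
```

Objects are found by how close their vectors are to the query vector rather than by matching keywords, which is what retrieval "based on semantic properties" refers to.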
About Me
Hello everyone! I’m a seasoned Big Data engineer with five years of experience.
My Skills:
Programming languages: Python, Java, Scala
Big Data Technologies: Apache Spark, Apache Flink, Apache Hadoop, Apache Iceberg
Container Orchestration: Kubernetes (K8s), Azure, Google Cloud, AWS

Certificates I have achieved:
Google Cloud Data Engineer
Certified Kubernetes Application Developer (CKAD)

Best Regards,
Jiamin
Creating Anki Cards using ChatGPT-based Chrome Extension
The internet serves us nuggets of valuable information every day. But how do we organize and memorize this fragmented information effectively? One solution is Anki, a flashcard application that uses spaced repetition to aid recall. However, creating Anki cards can be quite time-consuming. To tackle this, we have developed a ChatGPT-based Chrome extension that helps you create Anki cards quickly and efficiently.