| {%- set image_count = namespace(value=0) %} | |
| {%- set video_count = namespace(value=0) %} | |
| {%- set caption_system_prompt %} | |
| You are an expert image captioning system. Your sole purpose is to generate a single, highly detailed, and descriptive paragraph caption for any given image. You MUST strictly follow every rule below without exception. | |
| CAPTIONING RULES: | |
| 1. OUTPUT FORMAT: Always produce exactly ONE continuous paragraph. Never use bullet points, numbered lists, headings, line breaks, or any structured formatting. The entire caption must flow as a single unbroken block of natural prose. | |
| 2. BEGIN DIRECTLY: Start the caption immediately with a concrete visual description. Never start with phrases like "This image shows", "The image depicts", "In this picture", "Here we see", or any similar meta-references to the image itself. Describe the scene as if narrating what exists, not what an image contains. | |
| 3. SUBJECT DESCRIPTION: Identify and describe the primary subject or subjects first. Include their physical appearance, posture, pose, body language, facial expression, approximate age range, ethnicity if discernible, clothing, accessories, hair style and color, and any distinguishing features with precise detail. | |
| 4. SPATIAL COMPOSITION: Describe the spatial arrangement and composition including foreground, midground, and background elements, the relative positioning of subjects and objects, and the overall framing such as close-up, medium shot, wide shot, bird's-eye view, or low angle. | |
| 5. ENVIRONMENT AND SETTING: Describe the environment thoroughly including whether the scene is indoors or outdoors, the type of location, architectural details, furniture, vegetation, terrain, weather conditions, time of day, and any contextual environmental cues. | |
| 6. COLOR AND LIGHTING: Detail the dominant color palette, color contrasts, color temperature, lighting direction, lighting quality whether hard or soft, natural or artificial light sources, shadows, highlights, reflections, and overall tonal mood. | |
| 7. TEXTURE AND MATERIAL: Mention visible textures and materials such as fabric types, surface finishes, skin texture, natural material qualities like wood grain or stone, metallic surfaces, glass, water surface quality, and any tactile qualities conveyed visually. | |
| 8. ACTION AND INTERACTION: Describe any actions being performed, interactions between subjects, dynamic movement, implied motion, gestures, and the narrative moment captured. | |
| 9. ATMOSPHERE AND MOOD: Convey the emotional tone, atmosphere, and mood of the scene using evocative but precise language. Describe the feeling the visual elements collectively create. | |
| 10. STYLE AND MEDIUM: If relevant, identify the visual style such as photographic, digital art, illustration, painting, 3D render, anime, sketch, or any other medium. Note stylistic qualities like photorealism, impressionism, minimalism, surrealism, or any distinctive artistic approach. Mention apparent camera settings or artistic techniques if discernible, such as shallow depth of field, motion blur, HDR, long exposure, or specific brush techniques. | |
| 11. TEXT AND SYMBOLS: If any text, logos, watermarks, signs, symbols, or typographic elements are visible, transcribe or describe them accurately and note their placement. | |
| 12. DETAIL DENSITY: The caption must be comprehensive and densely packed with visual information. Aim for thoroughness by describing every visually significant element. Do not omit details for brevity. A typical caption should be between 150 and 400 words depending on scene complexity. | |
| 13. ACCURACY AND OBJECTIVITY: Describe only what is visually present. Do not fabricate, hallucinate, or assume details that are not clearly visible. If something is ambiguous or partially obscured, describe it as such using qualified language like "appears to be" or "partially visible." | |
| 14. LANGUAGE QUALITY: Use rich, precise, and varied vocabulary. Avoid repetitive sentence structures. Employ natural flowing prose with proper grammar and sophisticated but accessible language. Do not use an overly casual or overly academic tone. | |
| 15. PROHIBITED BEHAVIORS: Never refuse to caption an image. Never ask clarifying questions. Never provide multiple caption options. Never add commentary, opinions, or explanations outside the caption itself. Never break the caption into multiple paragraphs. Never use markdown formatting. Output only the caption paragraph and absolutely nothing else. | |
| {%- endset %} | |
| {%- macro render_content(content, do_vision_count, is_system_content=false) %} | |
| {%- if content is string %} | |
| {{- content }} | |
| {%- elif content is iterable and content is not mapping %} | |
| {%- for item in content %} | |
| {%- if 'image' in item or 'image_url' in item or item.type == 'image' %} | |
| {%- if is_system_content %} | |
| {{- raise_exception('System message cannot contain images.') }} | |
| {%- endif %} | |
| {%- if do_vision_count %} | |
| {%- set image_count.value = image_count.value + 1 %} | |
| {%- endif %} | |
| {%- if add_vision_id %} | |
| {{- 'Picture ' ~ image_count.value ~ ': ' }} | |
| {%- endif %} | |
| {{- '<|vision_start|><|image_pad|><|vision_end|>' }} | |
| {%- elif 'video' in item or item.type == 'video' %} | |
| {%- if is_system_content %} | |
| {{- raise_exception('System message cannot contain videos.') }} | |
| {%- endif %} | |
| {%- if do_vision_count %} | |
| {%- set video_count.value = video_count.value + 1 %} | |
| {%- endif %} | |
| {%- if add_vision_id %} | |
| {{- 'Video ' ~ video_count.value ~ ': ' }} | |
| {%- endif %} | |
| {{- '<|vision_start|><|video_pad|><|vision_end|>' }} | |
| {%- elif 'text' in item %} | |
| {{- item.text }} | |
| {%- else %} | |
| {{- raise_exception('Unexpected item type in content.') }} | |
| {%- endif %} | |
| {%- endfor %} | |
| {%- elif content is none or content is undefined %} | |
| {{- '' }} | |
| {%- else %} | |
| {{- raise_exception('Unexpected content type.') }} | |
| {%- endif %} | |
| {%- endmacro %} | |
| {%- if not messages %} | |
| {{- raise_exception('No messages provided.') }} | |
| {%- endif %} | |
| {%- if tools and tools is iterable and tools is not mapping %} | |
| {{- '<|im_start|>system\n' }} | |
| {{- caption_system_prompt }} | |
| {{- "\n\n# Tools\n\nYou have access to the following functions:\n\n<tools>" }} | |
| {%- for tool in tools %} | |
| {{- "\n" }} | |
| {{- tool | tojson }} | |
| {%- endfor %} | |
| {{- "\n</tools>" }} | |
| {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }} | |
| {%- if messages[0].role == 'system' %} | |
| {%- set content = render_content(messages[0].content, false, true)|trim %} | |
| {%- if content %} | |
| {{- '\n\n' + content }} | |
| {%- endif %} | |
| {%- endif %} | |
| {{- '<|im_end|>\n' }} | |
| {%- else %} | |
| {%- if messages[0].role == 'system' %} | |
| {%- set user_system = render_content(messages[0].content, false, true)|trim %} | |
| {{- '<|im_start|>system\n' + caption_system_prompt }} | |
| {%- if user_system %} | |
| {{- '\n\n' + user_system }} | |
| {%- endif %} | |
| {{- '<|im_end|>\n' }} | |
| {%- else %} | |
| {{- '<|im_start|>system\n' + caption_system_prompt + '<|im_end|>\n' }} | |
| {%- endif %} | |
| {%- endif %} | |
| {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %} | |
| {%- for message in messages[::-1] %} | |
| {%- set index = (messages|length - 1) - loop.index0 %} | |
| {%- if ns.multi_step_tool and message.role == "user" %} | |
| {%- set content = render_content(message.content, false)|trim %} | |
| {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %} | |
| {%- set ns.multi_step_tool = false %} | |
| {%- set ns.last_query_index = index %} | |
| {%- endif %} | |
| {%- endif %} | |
| {%- endfor %} | |
| {%- if ns.multi_step_tool %} | |
| {{- raise_exception('No user query found in messages.') }} | |
| {%- endif %} | |
| {%- for message in messages %} | |
| {%- set content = render_content(message.content, true)|trim %} | |
| {%- if message.role == "system" %} | |
| {%- if not loop.first %} | |
| {{- raise_exception('System message must be at the beginning.') }} | |
| {%- endif %} | |
| {%- elif message.role == "user" %} | |
| {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} | |
| {%- elif message.role == "assistant" %} | |
| {%- set reasoning_content = '' %} | |
| {%- if message.reasoning_content is string %} | |
| {%- set reasoning_content = message.reasoning_content %} | |
| {%- else %} | |
| {%- if '</think>' in content %} | |
| {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %} | |
| {%- set content = content.split('</think>')[-1].lstrip('\n') %} | |
| {%- endif %} | |
| {%- endif %} | |
| {%- set reasoning_content = reasoning_content|trim %} | |
| {%- if loop.index0 > ns.last_query_index %} | |
| {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }} | |
| {%- else %} | |
| {{- '<|im_start|>' + message.role + '\n' + content }} | |
| {%- endif %} | |
| {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %} | |
| {%- for tool_call in message.tool_calls %} | |
| {%- if tool_call.function is defined %} | |
| {%- set tool_call = tool_call.function %} | |
| {%- endif %} | |
| {%- if loop.first %} | |
| {%- if content|trim %} | |
| {{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }} | |
| {%- else %} | |
| {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }} | |
| {%- endif %} | |
| {%- else %} | |
| {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }} | |
| {%- endif %} | |
| {%- if tool_call.arguments is defined %} | |
| {%- for args_name, args_value in tool_call.arguments|items %} | |
| {{- '<parameter=' + args_name + '>\n' }} | |
| {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %} | |
| {{- args_value }} | |
| {{- '\n</parameter>\n' }} | |
| {%- endfor %} | |
| {%- endif %} | |
| {{- '</function>\n</tool_call>' }} | |
| {%- endfor %} | |
| {%- endif %} | |
| {{- '<|im_end|>\n' }} | |
| {%- elif message.role == "tool" %} | |
| {%- if loop.previtem and loop.previtem.role != "tool" %} | |
| {{- '<|im_start|>user' }} | |
| {%- endif %} | |
| {{- '\n<tool_response>\n' }} | |
| {{- content }} | |
| {{- '\n</tool_response>' }} | |
| {%- if loop.last or loop.nextitem.role != "tool" %} | |
| {{- '<|im_end|>\n' }} | |
| {%- endif %} | |
| {%- else %} | |
| {{- raise_exception('Unexpected message role.') }} | |
| {%- endif %} | |
| {%- endfor %} | |
| {%- if add_generation_prompt %} | |
| {{- '<|im_start|>assistant\n' }} | |
| {%- if enable_thinking is defined and enable_thinking is true %} | |
| {{- '<think>\n' }} | |
| {%- else %} | |
| {{- '<think>\n\n</think>\n\n' }} | |
| {%- endif %} | |
| {%- endif %} |